Skip to content

Fixed CLI errors. #268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Fixed CLI errors. #268

wants to merge 7 commits into from

Conversation

ShigrafS
Copy link

@ShigrafS ShigrafS commented Apr 20, 2025

Closes #264

Flexible and Canonicalized Column Handling for dedup and sort

Overview

This pull request enhances the flexibility and robustness of column handling in pairtools, with a primary focus on improving CLI usability, internal consistency, and resilience to variations in input column names.

Note: This PR also fixes some flake8 linting issues.

Key Enhancements

✅ 1. Unified Column Lookup via headerops

  • Introduced headerops.get_column_index() to allow CLI options like --c1, --c2, --p1, --p2, --s1, --s2, and --pt to accept both:
    • Integer indices (e.g., 1, 3) — with bounds and type checks.
    • String names (e.g., "chr1", "chrom1") — supporting canonicalization and case-insensitivity.
  • Introduced headerops.canonicalize_columns() to standardize commonly used aliases (e.g., chr1chrom1, ptpair_type) across all CLI tools and internal logic.

✅ 2. Improved dedup and sort CLI Behavior

  • Replaced static string-based column name assumptions with dynamic lookups via get_column_index().
  • Enabled seamless handling of extra_col_pair and extra_col options with warnings for missing columns instead of hard failures.
  • Made the --pt (pair_type) option optional in sort, skipping it gracefully when not present in the header.
  • Fixed incorrect help text for --c2 in dedup (was "Chrom 1 column", now corrected to "Chrom 2 column").

✅ 3. Column Defaults Remain String-Based

  • CLI defaults (--c1, --c2, etc.) are still defined using canonical string names (e.g., "chr1", "pos1"), not integer indices as initially planned.
  • However, the backend now supports either form due to get_column_index()'s flexibility.
  • Suggested follow-up improvement: update default values to integers to fully align with the original plan.

✅ 4. Code Cleanup and Readability

  • Replaced unclear variable names (e.g., lline) for better readability across modules.
  • Removed unused imports and deprecated warnings (e.g., cython backend placeholder).
  • Refactored import ordering and string formatting for consistency.

✅ 5. Comprehensive Testing

  • Added unit tests in test_headerops.py to validate:
    • Canonicalization of various column aliases.
    • Accurate index lookup from mixed input types (ints, strings, canonicalized names).
    • Proper error handling and edge case coverage (e.g., negative indices, invalid columns).
    • Integration with header extraction utilities.

Summary

This PR lays the groundwork for robust and user-friendly CLI interactions in pairtools, reducing the brittleness of column name handling and allowing greater flexibility for users working with varied input formats. It introduces modular utilities (canonicalize_columns, get_column_index) that can be reused across future tools and extensions.


Follow-Up Considerations

  • [ ] Update CLI defaults (--c1, --c2, etc.) to use integer indices as per the original plan.

@ShigrafS ShigrafS marked this pull request as ready for review April 20, 2025 13:57
…PES_PAIRSAM, informing users it’s treated as a string, which may affect sorting if numeric.

Modified test_sort.py to create a temporary .pairsam file with custom_col in the header for test_custom_column_warning, ensuring the data type warning is triggered and tested.
@ShigrafS
Copy link
Author

@agalitsyna I've added a warning when sorting by columns not defined in pairsam_format.DTYPES_PAIRSAM, defaulting them to string type, and updated the --extra-col help text to reflect this behavior. I've also added a test to verify warnings for undefined custom columns
The PR is ready to be merged.
Kindly review it.

@ShigrafS ShigrafS requested a review from agalitsyna April 30, 2025 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

chrom1, chrom2 and pair_type fields are now required in pairs file header
2 participants