Closes #264
Flexible and Canonicalized Column Handling for dedup and sort
Overview
This pull request enhances the flexibility and robustness of column handling in pairtools, with a primary focus on improving CLI usability, internal consistency, and resilience to variations in input column names.
Note: This PR also fixes some flake8 linting issues.
Key Enhancements
✅ 1. Unified Column Lookup via headerops
- Added headerops.get_column_index() to allow CLI options like --c1, --c2, --p1, --p2, --s1, --s2, and --pt to accept both:
  - Integer indices (e.g., 1, 3), with bounds and type checks.
  - Column names (e.g., "chr1", "chrom1"), supporting canonicalization and case-insensitivity.
- Added headerops.canonicalize_columns() to standardize commonly used aliases (e.g., chr1 → chrom1, pt → pair_type) across all CLI tools and internal logic.
✅ 2. Improved dedup and sort CLI Behavior
- Column options are now resolved through get_column_index().
- The extra_col_pair and extra_col options now emit warnings for missing columns instead of hard failures.
- Made the --pt (pair_type) option optional in sort, skipping it gracefully when not present in the header.
- Fixed the help text for --c2 in dedup (was "Chrom 1 column", now corrected to "Chrom 2 column").
✅ 3. Column Defaults Remain String-Based
- Defaults for the column options (--c1, --c2, etc.) are still defined using canonical string names (e.g., "chr1", "pos1"), not integer indices as initially planned.
- The string defaults rely on get_column_index()'s flexibility to be resolved against the header.
✅ 4. Code Cleanup and Readability
- Renamed terse variables (e.g., l → line) for better readability across modules.
✅ 5. Comprehensive Testing
- Tests in test_headerops.py validate the new headerops column utilities.
Summary
This PR lays the groundwork for robust and user-friendly CLI interactions in pairtools, reducing the brittleness of column name handling and allowing greater flexibility for users working with varied input formats. It introduces modular utilities (canonicalize_columns, get_column_index) that can be reused across future tools and extensions.
Follow-Up Considerations
- Consider updating the defaults of the column options (--c1, --c2, etc.) to use integer indices as per the original plan.
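
Illustrative sketch
For reviewers who want a quick mental model of the lookup described under Key Enhancements, the following is a minimal sketch and not the code from this PR. The function names mirror the ones mentioned above, but the signatures, the zero-based indexing, the exception types, and every alias other than chr1 → chrom1 and pt → pair_type are assumptions made for illustration.

```python
# Hypothetical re-implementation for illustration only; the real helpers
# live in pairtools' headerops module and may differ in signature and behavior.

# Assumed alias table. Only "chr1" -> "chrom1" and "pt" -> "pair_type"
# come from the PR description; "chr2" is added by analogy.
ALIASES = {
    "chr1": "chrom1",
    "chr2": "chrom2",
    "pt": "pair_type",
}


def canonicalize_columns(columns):
    """Lowercase each column name and map known aliases to a canonical name."""
    return [ALIASES.get(name.lower(), name.lower()) for name in columns]


def get_column_index(column_names, column):
    """Resolve `column` (an integer index or a column name) to an index.

    Indices are treated as zero-based here, which is an assumption.
    """
    # bool is a subclass of int, so reject it explicitly (type check).
    if isinstance(column, bool):
        raise TypeError("column must be an int or a str, not bool")

    if isinstance(column, int):
        # Bounds check for integer indices.
        if not 0 <= column < len(column_names):
            raise IndexError(f"column index {column} is out of range")
        return column

    if isinstance(column, str):
        # Compare canonical forms so that aliases and case differences match.
        canonical_header = canonicalize_columns(column_names)
        wanted = canonicalize_columns([column])[0]
        if wanted not in canonical_header:
            raise ValueError(f"column {column!r} not found in header")
        return canonical_header.index(wanted)

    raise TypeError("column must be an int or a str")


header = ["readID", "chr1", "pos1", "chr2", "pos2", "strand1", "strand2", "pt"]
by_alias = get_column_index(header, "chrom1")  # name given in canonical form
by_case = get_column_index(header, "PT")       # case-insensitive alias
by_index = get_column_index(header, 3)         # plain integer index
```

In this sketch a header written with chr1 still matches a request for chrom1 because both sides are canonicalized before comparison. A real CLI would also have to decide how to treat numeric strings such as "3", which this sketch deliberately leaves out.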