Description
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

class Column:
    def __init__(self, name):
        self.name = name

col = Column(name='col')  # non-string object used as a column label
df1 = pd.DataFrame({col: [1], 'X': [2]})
df2 = pd.DataFrame({col: [1], 'Y': [3]})
merged = pd.merge(left=df1, right=df2, left_index=True, right_index=True)
# Fails: the shared label has been converted to a string
assert not isinstance(merged.columns.tolist()[0], str)
Issue Description
The column labels of the merged dataframe are converted to strings (because a suffix was added to the column label shared by both frames).
> merged.columns.tolist()
['<__main__.Column object at 0x7f41edd52d50>_x',
'X',
'<__main__.Column object at 0x7f41edd52d50>_y',
'Y']
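The output above is consistent with the suffix being attached through string formatting, which calls str() on the label. The following is a minimal sketch of that effect in plain Python; add_suffix is a hypothetical helper for illustration, not pandas' actual implementation.

```python
class Column:
    def __init__(self, name):
        self.name = name


def add_suffix(label, suffix):
    # Hypothetical helper (not pandas code): formatting any object
    # into an f-string calls str() on it, so the result is always a str.
    return f"{label}{suffix}"


col = Column(name="col")
renamed = add_suffix(col, "_x")

# The original Column object is gone; only its default repr plus the suffix remains.
print(type(renamed).__name__)
print(renamed.endswith("_x"))
```

Any renaming scheme built on string concatenation loses the original label type in the same way, whatever the label class is.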
Expected Behavior
I would expect merge to keep the column label of type __main__.Column and not convert it to a string.
Regarding the duplication, IMO it's OK to have two identical columns and let the user decide how to handle them.
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.8.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.2.0
Version : Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Activity
ericman93 commented on Apr 27, 2022
I fixed this bug here: #46879
Title changed from "BUG: complex type columns converted to string in merge" to "BUG: column labels converted to string in merge"
rhshadrach commented on May 16, 2022
I'm -1 on having duplicate columns in the result. Currently
raises "ValueError: columns overlap but no suffix specified: Index(['col'], dtype='object')". Allowing duplicate columns only if they are not strings (or not integers or not ...?) is surprising to me.
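For comparison, here is a sketch of how string labels behave today. It is not necessarily the snippet referred to in the comment; it assumes DataFrame.join with its default empty suffixes as one way to reach this error, and pd.merge with its default suffixes for the renamed case.

```python
import pandas as pd

# Same frames as in the report, but with a plain string label.
df1 = pd.DataFrame({"col": [1], "X": [2]})
df2 = pd.DataFrame({"col": [1], "Y": [3]})

# join() uses empty suffixes by default, so an overlapping label raises.
try:
    df1.join(df2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# merge() defaults to suffixes=("_x", "_y"), so the overlap is renamed instead.
merged = pd.merge(left=df1, right=df2, left_index=True, right_index=True)
print(merged.columns.tolist())
```

In neither case does a string label produce duplicate columns, which is the asymmetry the comment objects to.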
simonjayhawkins commented on May 17, 2022
since pandas 1.2 we have a new mechanism to disallow duplicate column labels.
rhshadrach commented on May 17, 2022
@simonjayhawkins - not sure I follow. For this proposed feature, under what condition(s) on the column name dtypes does the snippet I posted raise or allow duplicates?
simonjayhawkins commented on May 18, 2022
needs discussion for this scenario, comment was more holistic.
going forward, it should not need to be a decision (or personal preference) on whether a method returns duplicates. duplicate column labels are a documented pandas feature https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html#duplicate-labels and therefore all methods should support them, work correctly with them and correctly propagate them.
and since pandas 1.2, https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html#optionally-disallow-duplicate-labels the mechanism for disallowing duplicate column labels now means that allowing/disallowing duplicate column labels should not need to be incorporated into the api design of individual methods.
sure. users not wanting duplicate column labels are accommodated and will be able to use .set_flags(allows_duplicate_labels=False) going forward.
rhshadrach commented on May 24, 2022
@simonjayhawkins
I do not think this is correct. Index labels and column labels are not the same. Duplicate index labels occur often in frames that I work with, yet I never allow duplicate column labels because they are almost impossible to work with. I am not able to use this setting because it forbids both duplicate index and column labels, and I don't think my usage/experience is niche.
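The point about the flag covering both axes can be checked with a small sketch. The frames and the rejects_duplicates helper are made up for illustration; set_flags and pandas.errors.DuplicateLabelError are the pandas 1.2+ API linked above.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# One frame with a duplicated index label, one with a duplicated column label.
dup_index = pd.DataFrame({"A": [1, 2]}, index=["r", "r"])
dup_columns = pd.DataFrame([[1, 2]], columns=["A", "A"])


def rejects_duplicates(df):
    # Illustrative helper: True if disallowing duplicate labels fails for df.
    try:
        df.set_flags(allows_duplicate_labels=False)
    except DuplicateLabelError:
        return True
    return False


index_rejected = rejects_duplicates(dup_index)
columns_rejected = rejects_duplicates(dup_columns)
print(index_rejected, columns_rejected)
```

Because the flag is a single per-object switch, a frame with intentionally duplicated index labels cannot use it to guard only its columns.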