BUG: unique() casts type pd.Timestamp to numpy.datetime64 #35448

Closed
@SebastianoX

Description

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample:

import pandas as pd


df = pd.DataFrame({"date": ["2019-02-10", "2019-02-10", "2019-02-11"]})
df["date"] = pd.to_datetime(df["date"])

print("Type before the for cycle:")
print(type(df["date"][0]))  # pandas._libs.tslibs.timestamps.Timestamp

for day in df["date"].unique(): 
    print("Type in the loop:")
    print(type(day))  # here is a numpy.datetime64

which returns:

Type before the for cycle:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type in the loop:
<class 'numpy.datetime64'>
Type in the loop:
<class 'numpy.datetime64'>

Problem description

unique() should not change the element type: on a datetime column its values come back as numpy.datetime64 instead of pd.Timestamp.

Expected Output

The element types of df["date"].unique() should be the same as those of set(df["date"].to_list()). E.g.

import pandas as pd


df = pd.DataFrame({"date": ["2019-02-10", "2019-02-10", "2019-02-11"]})
df["date"] = pd.to_datetime(df["date"])

print("Type before the for cycle:")
print(type(df["date"][0])) 

for day in set(df["date"].to_list()): 
    print("Type in the loop:")
    print(type(day))

Returning:

Type before the for cycle:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type in the loop:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type in the loop:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.7.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.5.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.0.5
numpy            : 1.19.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.3.1
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.16.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.1
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

Activity

jreback (Contributor) commented on Jul 29, 2020
there have been a number of discussions about this - pls look for duplicate issues before opening a new one

SebastianoX (Author) commented on Jul 29, 2020

Thanks for your answer @jreback .
Before posting I looked for duplicate issues / stackoverflow questions / google in general and I could not see any.
Please do link the discussions/issues here, so that I and other interested developers can find them.
If it is a duplicate feel free to close it.

simonjayhawkins (Member) commented on Jul 29, 2020

I'll close this since I think it is covered by #22824

Series.unique returns array, Series.drop_duplicates returns Series. Returning a plain np.ndarray is quite unusual for a Series method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
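
The difference between the two methods can be sketched as follows. This is an illustration with made-up variable names, not code from #22824; it relies only on drop_duplicates() returning a Series, whose iteration yields Timestamp objects.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2019-02-10", "2019-02-10", "2019-02-11"]))

# drop_duplicates() keeps the result as a Series.
unique_days = s.drop_duplicates()
print(type(unique_days))

# Iterating a datetime Series boxes each value as a pandas Timestamp.
element_types = [type(day) for day in unique_days]
print(element_types)

# For comparison, print what unique() returns in the installed pandas version.
print(type(s.unique()))
```

Using drop_duplicates() keeps the index of the first occurrence of each value; call reset_index(drop=True) on the result if a fresh index is needed.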

added and removed the Needs Triage label on Jul 29, 2020
added this to the No action milestone on Jul 29, 2020

jreback (Contributor) commented on Jul 29, 2020

though having a dedicated issue for this might be ok (as that catch all unique issue brings up many topics)

we cannot change this to return a DatetimeArray till 2.0 in any event (nor can we deprecate anything)
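
For context, a small sketch of what a DatetimeArray looks like, assuming the standard Series.array accessor: it is the pandas extension array behind a datetime column, and iterating it yields Timestamp objects rather than np.datetime64. Here it is obtained through drop_duplicates() rather than unique(), since the return type of unique() is exactly what is under discussion.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2019-02-10", "2019-02-10", "2019-02-11"]))

# .array exposes the pandas extension array backing the Series.
unique_array = s.drop_duplicates().array
print(type(unique_array).__name__)

# Iterating a DatetimeArray boxes each value as a pandas Timestamp.
array_element_types = [type(day) for day in unique_array]
print(array_element_types)
```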

SebastianoX (Author) commented on Jul 29, 2020

@simonjayhawkins from what I understand, #22824 is a different issue.

The problem in this issue is not that unique() returns an array. The problem is that the Timestamp objects in a column are cast to np.datetime64 objects in the NumPy array returned when unique() is called on that column.

SebastianoX (Author) commented on Jul 29, 2020

Let me add a clearer example:

import pandas as pd 
 
 
df = pd.DataFrame({"date": ["2019-02-10", "2019-02-11"]}) 
df["date"] = pd.to_datetime(df["date"]) 
 
print("Date Types in column date:") 
for day in df["date"]: 
    print(type(day))  # this is pandas._libs.tslibs.timestamps.Timestamp 
 
print("Unique date Types in column date:") 
for day in df["date"].unique():  
    print(type(day))  # this is np.datetime64 

The code returns:

Date Types in column date:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Unique date Types in column date:
<class 'numpy.datetime64'>
<class 'numpy.datetime64'>
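
A minimal sketch of why the two loops differ, assuming a tz-naive datetime column: the values are stored in a NumPy datetime64 array, pandas boxes each one as a Timestamp on element access, and the raw NumPy array yields np.datetime64 scalars. Passing the result of unique() through pd.to_datetime is one possible workaround that gives Timestamp elements back. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.to_datetime(pd.Series(["2019-02-10", "2019-02-10", "2019-02-11"]))

# Element access on the Series boxes the value as a pandas Timestamp.
boxed = s.iloc[0]

# The underlying NumPy array holds datetime64 values, not Timestamp objects.
raw = s.to_numpy()
raw_scalar = raw[0]

# Workaround: convert the unique values back so iteration yields Timestamps.
unique_days = pd.to_datetime(s.unique())
for day in unique_days:
    print(type(day))
```

A single scalar can be converted the same way with pd.Timestamp(raw_scalar).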

simonjayhawkins (Member) commented on Jul 29, 2020

> @simonjayhawkins from what I understand, #22824 is a different issue.
>
> The problem in this issue is not that unique() returns an array. The problem is that the Timestamp objects in a column are cast to np.datetime64 objects in the NumPy array returned when unique() is called on that column.

OK but I don't think that's clear from the OP. Feel free to open a new issue.

SebastianoX (Author) commented on Jul 29, 2020

You do not think it is clear as in "I think it is covered by #22824"?
Anyway, new issue is on its way.
