
BUG: Reindexing two tz-aware indices drops tz on the target index when tolerance and method is specified for only "ffill" and "bfill" #38566

@ketozhang

Description

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd

df = pd.DataFrame({'value': [0, 1, 2, 3]},
                  index=[pd.Timestamp('2020-01-01 05:00:00+0000', tz='UTC'),
                         pd.Timestamp('2020-01-01 06:00:00+0000', tz='UTC'),
                         pd.Timestamp('2020-01-01 07:00:00+0000', tz='UTC'),
                         pd.Timestamp('2020-01-01 08:00:00+0000', tz='UTC')
                         ]
                  )

new_index = pd.Series([pd.Timestamp('2020-01-01 5:30:00+0000', tz='UTC'),
                       pd.Timestamp('2020-01-01 6:30:00+0000', tz='UTC'),
                       pd.Timestamp('2020-01-01 7:30:00+0000', tz='UTC'),
                       pd.Timestamp('2020-01-01 8:30:00+0000', tz='UTC'),
                       pd.Timestamp('2020-01-01 9:30:00+0000', tz='UTC')]
                      )

new_df = df.reindex(new_index, method="ffill", tolerance=pd.Timedelta("1 hour"))

Problem description

The following exception is raised when method is "ffill" or "bfill" (but not "nearest", see #32740) and tolerance is also specified:

TypeError: DatetimeArray subtraction must have the same timezones or no timezones

I found that the timezone is dropped by the time execution reaches this function, at lines 3024 and 3036:

target_values = target._get_engine_target()
if self.is_monotonic_increasing and target.is_monotonic_increasing:
    engine_method = (
        self._engine.get_pad_indexer
        if method == "pad"
        else self._engine.get_backfill_indexer
    )
    indexer = engine_method(target_values, limit)
else:
    indexer = self._get_fill_indexer_searchsorted(target, method, limit)
if tolerance is not None:
    indexer = self._filter_indexer_tolerance(target_values, indexer, tolerance)

Here target is the tz-aware target index. Once it is converted to target_values, however, the result is a plain NumPy array, and the tz info is gone.
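
The loss of tz information can be reproduced with public API alone. The sketch below uses `DatetimeIndex.values` as a stand-in for the private `_get_engine_target()` call (an assumption made for illustration; the two are not guaranteed to return the same object), and shows that the resulting NumPy array has a plain datetime64 dtype with no timezone attached.

```python
import pandas as pd

# A tz-aware index similar to the reindex target in the report.
target = pd.date_range("2020-01-01 05:30", periods=5,
                       freq=pd.Timedelta(hours=1), tz="UTC")

# Stand-in (assumption) for target._get_engine_target():
# the underlying NumPy values of the index.
target_values = target.values

print(target.dtype)         # tz-aware pandas dtype
print(target_values.dtype)  # plain NumPy datetime64 dtype, no tz
```

Any later arithmetic between a tz-aware index and `target_values` therefore mixes tz-aware and tz-naive data.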

I found a working fix, but I am unsure whether this change affects any other functionality:

- indexer = self._filter_indexer_tolerance(target_values, indexer, tolerance)
+ indexer = self._filter_indexer_tolerance(target, indexer, tolerance)
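
To illustrate why passing the tz-aware target helps, here is a minimal sketch of a tolerance filter written against public API only. The helper `filter_indexer_tolerance` is hypothetical and is not the pandas internal of a similar name; it only assumes that the filter compares the distance between matched source labels and target labels against the tolerance.

```python
import numpy as np
import pandas as pd

def filter_indexer_tolerance(source, target, indexer, tolerance):
    """Hypothetical helper: replace matches farther than `tolerance` with -1."""
    # Both operands are tz-aware indexes, so the subtraction is well defined.
    distance = abs(source.take(indexer) - target)
    return np.where(distance <= tolerance, indexer, -1)

source = pd.date_range("2020-01-01 05:00", periods=4,
                       freq=pd.Timedelta(hours=1), tz="UTC")
target = pd.date_range("2020-01-01 05:30", periods=5,
                       freq=pd.Timedelta(hours=1), tz="UTC")

# Forward-fill positions without a tolerance, then filter them by distance.
indexer = source.get_indexer(target, method="ffill")
filtered = filter_indexer_tolerance(source, target, indexer,
                                    pd.Timedelta("1 hour"))
print(filtered)
```

The last target label (09:30) is 90 minutes past the last source label, so its position is replaced with -1, which corresponds to the NaN row in the expected output.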

Expected Output

                           value
2020-01-01 05:30:00+00:00    0.0
2020-01-01 06:30:00+00:00    1.0
2020-01-01 07:30:00+00:00    2.0
2020-01-01 08:30:00+00:00    3.0
2020-01-01 09:30:00+00:00    NaN

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.0-11-cloud-amd64
Version          : #1 SMP Debian 4.19.146-1 (2020-09-17)
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.5
numpy            : 1.19.2
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : 6.1.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.4
fastparquet      : 0.4.1
gcsfs            : 0.7.1
matplotlib       : 3.3.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.3
sqlalchemy       : 1.3.20
tables           : None
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : 0.51.2

Activity

Added the Needs Triage label on Dec 18, 2020

ketozhang (Author) commented on Dec 18, 2020

A simpler example

df = pd.DataFrame(
    {'value': [0, 1, 2, 3]}, 
    index=pd.date_range('2020-01-01 00:00:00', periods=4, freq='H', tz="UTC")
)
new_index = pd.date_range('2020-01-01 00:01:00', periods=4, freq='H', tz="UTC")
new_df = df.reindex(new_index, method="ffill", tolerance=pd.Timedelta("1 hour"))
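
The simpler example could be turned into a regression-style check along the following lines. This is only a sketch: it assumes a pandas version in which this issue is fixed (on 1.1.5 the `reindex` call raises the TypeError quoted above), and it passes the frequency as a `pd.Timedelta` rather than the `'H'` alias purely to avoid depending on alias spelling.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)
df = pd.DataFrame(
    {"value": [0, 1, 2, 3]},
    index=pd.date_range("2020-01-01 00:00:00", periods=4, freq=hour, tz="UTC"),
)
new_index = pd.date_range("2020-01-01 00:01:00", periods=4, freq=hour, tz="UTC")

# Every new label is one minute after a source label, well within tolerance.
new_df = df.reindex(new_index, method="ffill", tolerance=pd.Timedelta("1 hour"))

# The result keeps the tz-aware target index and forward-filled values.
assert new_df.index.equals(new_index)
assert str(new_df.index.tz) == "UTC"
assert new_df["value"].tolist() == [0, 1, 2, 3]
```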

simonjayhawkins (Member) commented on Dec 18, 2020

Thanks @ketozhang for the report.

pandas-0.25.3 was giving the expected output, so will label as regression pending further investigation.

Added the Regression, Reshaping, and Timezones labels and removed the Needs Triage label on Dec 18, 2020
Added this to the Contributions Welcome milestone on Dec 18, 2020
Changed the milestone from Contributions Welcome to 1.3 on Jan 11, 2021

ketozhang (Author) commented on Jan 11, 2021

Great work, thanks @phofl
