Skip to content

read_crn returns -99999 instead of NaN #1372

Closed
@wholmgren

Description

@wholmgren
Member

Describe the bug
read_crn fails to map -99999 to NaN

To Reproduce

from pvlib.iotools import read_crn
crn = read_crn('https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/2021/CRNS0101-05-2021-NY_Millbrook_3_W.txt')
crn.loc['2021-12-14 0930':'2021-12-14 1130', 'ghi']
2021-12-14 09:30:00+00:00        0.0
2021-12-14 09:35:00+00:00        0.0
2021-12-14 09:40:00+00:00        0.0
2021-12-14 09:45:00+00:00        0.0
2021-12-14 09:50:00+00:00        0.0
2021-12-14 09:55:00+00:00        0.0
2021-12-14 10:00:00+00:00        0.0
2021-12-14 10:05:00+00:00   -99999.0
2021-12-14 10:10:00+00:00   -99999.0
2021-12-14 10:15:00+00:00   -99999.0
2021-12-14 10:20:00+00:00   -99999.0
2021-12-14 10:25:00+00:00   -99999.0
2021-12-14 10:30:00+00:00   -99999.0
2021-12-14 10:35:00+00:00   -99999.0
2021-12-14 10:40:00+00:00   -99999.0
2021-12-14 10:45:00+00:00   -99999.0
2021-12-14 10:50:00+00:00   -99999.0
2021-12-14 10:55:00+00:00   -99999.0
2021-12-14 11:00:00+00:00   -99999.0
2021-12-14 11:05:00+00:00        0.0
2021-12-14 11:10:00+00:00        0.0
2021-12-14 11:15:00+00:00        0.0
2021-12-14 11:20:00+00:00        0.0
2021-12-14 11:25:00+00:00        0.0
2021-12-14 11:30:00+00:00        0.0
Name: ghi, dtype: float64

Expected behavior
Should return NaN instead of -99999

Versions:

  • pvlib.__version__: 0.9.0
  • pandas.__version__: 1.0.3 (doesn't matter)
  • python: 3.7

Additional context

Documentation here says

     C.  Missing data are indicated by the lowest possible integer for a 
        given column format, such as -9999.0 for 7-character fields with 
        one decimal place or -99.000 for 7-character fields with three
        decimal places.

So we should change

# Now we can set nans. This could be done a per column basis to be
# safer, since in principle a real -99 value could occur in a -9999
# column. Very unlikely to see that in the real world.
for val in [-99, -999, -9999]:
# consider replacing with .replace([-99, -999, -9999])
data = data.where(data != val, np.nan)

to include -99999 and perhaps -999999. Or do the smarter thing as discussed in the comment.

also SolarArbiter/solarforecastarbiter-core#773

Activity

added
solarfx2DOE SETO Solar Forecasting 2 / Solar Forecast Arbiter
on Jan 7, 2022
AdamRJensen

AdamRJensen commented on Jan 7, 2022

@AdamRJensen
Member

@wholmgren I have added -99999 and -999999 to the list of nan values in #1368.

However, I'd also be happy to implement column-specific replacement of nans. Any suggestion on how to do this?

AdamRJensen

AdamRJensen commented on Jan 9, 2022

@AdamRJensen
Member

I see that the CRN files contain the latitude and longitude of the stations. I suppose it would make sense to parse this and output it in a metadata dictionary?

I realize that outputing a tuple of (data, meta) instead of just data would be a breaking change, but it would be in line with the iotools pattern.

wholmgren

wholmgren commented on Jan 9, 2022

@wholmgren
MemberAuthor

However, I'd also be happy to implement column-specific replacement of nans. Any suggestion on how to do this?

crn.WIDTHS gets you half way there, but you'd need a corresponding list for the number of decimals allowed in each column. From there you can determine, for each column, how many 9s will fit between a - and the space required for the decimals.

I see that the CRN files contain the latitude and longitude of the stations. I suppose it would make sense to parse this and output it in a metadata dictionary? I realize that outputing a tuple of (data, meta) instead of just data would be a breaking change, but it would be in line with the iotools pattern.

Could be nice, but not sure how we'd handle the breaking change in a nice way. Let's address in a separate issue.

AdamRJensen

AdamRJensen commented on Jan 10, 2022

@AdamRJensen
Member

It is possible to pass a dictionary to the na_values argument in pd.read_fwf:

NAN_DICT = {
    'CRX_VN': -99999,
    'AIR_TEMPERATURE': -9999,
    'PRECIPITATION': -9999,
    'SOLAR_RADIATION': -99999,
    'SURFACE_TEMPERATURE': -9999,
    'RELATIVE_HUMIDITY': -9999,
    'SOIL_MOISTURE_5': -99,
    'SOIL_TEMPERATURE_5': -9999,
    'WETNESS': -9999,
    'WIND_1_5': -99}

Although, currently na_values is set equal to ['\x00\x00\x00\x00\x00\x00'].

However, they can be combined:

NAN_DICT = {k: [v, '\x00\x00\x00\x00\x00\x00'] for k, v in NAN_DICT.items()}

and then:

# read in data. set fields with NUL characters to NaN
data = pd.read_fwf(filename, header=None, names=HEADERS.split(' '),
                   widths=WIDTHS, na_values=NAN_DICT)

I tested this method and it passes all of the tests.

AdamRJensen

AdamRJensen commented on Jan 11, 2022

@AdamRJensen
Member

Fun fact: there are a few CRN stations that have longitudes in the -99 range, but none that matches exactly:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugiosolarfx2DOE SETO Solar Forecasting 2 / Solar Forecast Arbiter

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @wholmgren@AdamRJensen

      Issue actions

        `read_crn` returns -99999 instead of `NaN` · Issue #1372 · pvlib/pvlib-python