pandas.io.gbq verify_schema seems to be too strict. #11359

Closed
@FlxVctr

Description

This line seems to me to be too strict for repeated insertion, because apparently GBQ is not consistent in the order of fields in the schema (or my application screws up the order of fields; either way, I would say the verification is too strict).

So for example:

    dict1 = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'}, {'name': 'created_at', 'type': 'STRING'}]}
    dict2 = {'fields': [{'name': 'created_at', 'type': 'STRING'}, {'name': 'coordinates_0', 'type': 'FLOAT'}]}
    dict1 == dict2  # gives False

would make verification fail, though insertion would work, because inserting as JSON makes the order of fields irrelevant.

Solved that for myself for the moment with:

    def verify_schema(self, dataset_id, table_id, schema):
        from apiclient.errors import HttpError

        try:
            bq_schema = (self.service.tables().get(
                projectId=self.project_id,
                datasetId=dataset_id,
                tableId=table_id
                ).execute()['schema'])
            # json.dumps is necessary to make the field dicts hashable, so they
            # can be compared as sets (which ignores the order of fields).
            remote_fields = set([json.dumps(x) for x in bq_schema['fields']])
            local_fields = set([json.dumps(x) for x in schema['fields']])
            # This still fails if key order is different. But GBQ seems to keep key order.
            return remote_fields == local_fields

        except HttpError as ex:
            self.process_http_error(ex)
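The remaining weakness noted in the comment above (key order inside each field dict) can be avoided by serializing with `sort_keys=True`. The following is only a minimal sketch of that idea as a standalone function; `schemas_match_ignoring_order` is a hypothetical name and not part of pandas.

```python
import json


def schemas_match_ignoring_order(local_schema, remote_schema):
    """Return True if both schemas hold the same fields, in any order.

    Hypothetical helper for illustration; not part of pandas.io.gbq.
    """
    def canonical(fields):
        # sort_keys=True makes each JSON string independent of the key order
        # inside a field dict; the set makes the result independent of the
        # order of the fields themselves.
        return set(json.dumps(field, sort_keys=True) for field in fields)

    return canonical(local_schema['fields']) == canonical(remote_schema['fields'])


dict1 = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                    {'name': 'created_at', 'type': 'STRING'}]}
# Same fields, but both the field order and the key order are different.
dict2 = {'fields': [{'type': 'STRING', 'name': 'created_at'},
                    {'type': 'FLOAT', 'name': 'coordinates_0'}]}

print(dict1 == dict2)                              # list order matters here
print(schemas_match_ignoring_order(dict1, dict2))  # order is ignored here
```

Because the fields are collected into sets, duplicated field entries would be collapsed; that seems acceptable for a schema, where field names are expected to be unique.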

Activity

changed the title [-]pandas.io.gbq verify_schema is too strict.[/-] [+]pandas.io.gbq verify_schema seems to be too strict.[/+] on Oct 18, 2015
jorisvandenbossche (Member) commented on Oct 18, 2015

parthea (Contributor) commented on Oct 18, 2015

The strict ordering of fields is by design (also in 0.16.2). From the docs:

> The dataframe must match the destination table in column order, structure, and data types.

> because apparently GBQ is not consistent in the order of fields in the schema (or my application screws the order of fields up

Can you confirm whether it is your application that is changing the order of fields? If we can understand how the ordering of fields is changing, we may be better off fixing that problem instead.

My personal preference would be to fail if the column order of the DataFrame being inserted is different. It is trivial to alter the column order of a DataFrame prior to inserting.
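As a rough sketch of what reordering before insertion could look like: the helper below derives the column order from a schema dict of the shape shown earlier in this issue. `columns_in_schema_order` is a hypothetical name, not a pandas function, and it deliberately raises if the names differ so that a real schema change is not hidden by the reordering.

```python
def columns_in_schema_order(schema, columns):
    """Return the column names arranged in the order the schema lists them.

    Hypothetical helper for illustration. Raises ValueError if the schema
    and the columns do not contain exactly the same names.
    """
    schema_names = [field['name'] for field in schema['fields']]
    if set(schema_names) != set(columns):
        differing = sorted(set(schema_names) ^ set(columns))
        raise ValueError('column names differ from schema: %s' % differing)
    return schema_names


schema = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                     {'name': 'created_at', 'type': 'STRING'}]}
columns = ['created_at', 'coordinates_0']  # e.g. list(df.columns)

ordered = columns_in_schema_order(schema, columns)
print(ordered)
```

With a pandas DataFrame `df`, the reordering itself would then be `df = df[ordered]` before calling `to_gbq`.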

Certainly, the proposed change would be more flexible, but I think that the user should be aware of the column order, in case the BigQuery table schema has actually changed.

FlxVctr (Author) commented on Oct 19, 2015

Thanks for your answer. Wasn't reading the docs closely enough apparently. It's very likely that it is my application as I am transforming raw JSON from the Twitter API to a dataframe and the order there is arbitrary. I can try to verify that though. But order of columns seemed (before formatting anything for readable output) pretty unnecessary to me and only costing additional programming time as well as introducing more possibilities for errors to sneak in.

On the other hand (though now we are talking about particular project requirements and personal preferences), this use case is not that unlikely. And I don't see how the order of fields in BigQuery would matter for most applications of BigQuery in the end, as you query for fields by name anyway. For my project, reordering is just another processing step on the data (which will come in large volume) before inserting, and I need to minimise those. I have to admit that ordering is trivial, indeed. So yes, that's a design decision. I would plead for a less strict option though (maybe with an optional argument in to_gbq?).

parthea (Contributor) commented on Oct 19, 2015

The number of optional arguments in to_gbq is growing, so it may be better to decide whether to incorporate the change by default or not. The proposed change would certainly make inserting data into BigQuery easier, as long as we are ok with not maintaining column order.

I'd like @jacobschaer to comment. The note in the docs to maintain column order was added in #6937.

jreback (Contributor) commented on Oct 19, 2015

Just to note that HDF5 requires the orderings to be exactly the same, as does SQL. I don't think this should be relaxed. Ok to have a 'better' error message though.

FlxVctr (Author) commented on Oct 20, 2015

Yes, a more detailed error message would be an improvement. Because neither Python dicts, pandas DataFrames, nor GBQ really care about column order, I did not expect it to matter, and it took me quite a while to find out what was wrong with my schema.
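To illustrate what a more detailed message might contain, here is a sketch that reports how two schemas differ, including the case where only the order differs. It is an assumption about what would be helpful, not the message that pandas produces; `describe_schema_difference` is a hypothetical name, and only the `name` and `type` keys of each field are compared.

```python
def describe_schema_difference(local_schema, remote_schema):
    """Return a list of human-readable differences; empty if schemas match.

    Hypothetical helper for illustration; compares only 'name' and 'type'.
    """
    local = [(f['name'], f['type']) for f in local_schema['fields']]
    remote = [(f['name'], f['type']) for f in remote_schema['fields']]
    if local == remote:
        return []

    local_types, remote_types = dict(local), dict(remote)
    problems = []
    for name in sorted(set(local_types) - set(remote_types)):
        problems.append('field %r is missing from the destination table' % name)
    for name in sorted(set(remote_types) - set(local_types)):
        problems.append('field %r is missing from the DataFrame' % name)
    for name in sorted(set(local_types) & set(remote_types)):
        if local_types[name] != remote_types[name]:
            problems.append('field %r has type %s locally but %s remotely'
                            % (name, local_types[name], remote_types[name]))
    if not problems:
        # Same names and types, so the lists can only differ in their order.
        problems.append('same fields, different order: local %s, remote %s'
                        % ([n for n, _ in local], [n for n, _ in remote]))
    return problems


local_schema = {'fields': [{'name': 'created_at', 'type': 'STRING'},
                           {'name': 'coordinates_0', 'type': 'FLOAT'}]}
remote_schema = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                            {'name': 'created_at', 'type': 'STRING'}]}

for line in describe_schema_difference(local_schema, remote_schema):
    print(line)
```

A message of this kind would have pointed directly at column order as the cause, which is the part that was hard to discover from a plain "schema does not match" failure.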

jorisvandenbossche (Member) commented on Oct 20, 2015

Note that for the SQL functions the ordering does not need to be the same (as we use named parameters and a dict with the data to insert).
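The principle of named parameters can be shown with the standard-library `sqlite3` module. This is only an illustration of why key order stops mattering when rows are passed as dicts; it is not how the pandas SQL functions are implemented, and the `tweets` table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tweets (coordinates_0 REAL, created_at TEXT)')

# The two row dicts list their keys in different orders.
rows = [
    {'coordinates_0': 1.5, 'created_at': '2015-10-18'},
    {'created_at': '2015-10-19', 'coordinates_0': 2.5},
]

# Named placeholders (:name) are looked up by key, not by position.
conn.executemany(
    'INSERT INTO tweets (coordinates_0, created_at) '
    'VALUES (:coordinates_0, :created_at)',
    rows,
)

stored = conn.execute(
    'SELECT coordinates_0, created_at FROM tweets ORDER BY created_at'
).fetchall()
print(stored)
```

Both rows end up in the right columns because each value is matched to its column by name.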

added a commit that references this issue on Oct 21, 2015

Merge pull request #11401 from parthea/bq-improve-schema-error-msg

added a commit that references this issue on Oct 24, 2015


Metadata

Assignees: No one assigned
Labels: Error Reporting (Incorrect or improved errors from pandas)
Type: No type
Projects: No projects
Relationships: None yet
Participants: @jreback, @jorisvandenbossche, @parthea, @FlxVctr