Closed
Description
This line seems to me to be too strict for repeated insertion, because GBQ is apparently not consistent in the order of fields in the schema (or my application messes up the order of fields; either way, I would say the verification is too strict).
So for example:
dict1 = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'}, {'name': 'created_at', 'type': 'STRING'}]}
dict2 = {'fields': [{'name': 'created_at', 'type': 'STRING'}, {'name': 'coordinates_0', 'type': 'FLOAT'}]}
dict1 == dict2 # gives False
would make verification fail, though insertion would work, as the insert as JSON makes the order of fields irrelevant.
Solved that for myself for the moment with:
def verify_schema(self, dataset_id, table_id, schema):
    from apiclient.errors import HttpError
    try:
        bq_schema = (self.service.tables().get(
            projectId=self.project_id,
            datasetId=dataset_id,
            tableId=table_id
        ).execute()['schema'])
        return set(
            [json.dumps(x) for x in bq_schema['fields']]  # dump necessary to make dicts hashable
        ) == set(
            [json.dumps(x) for x in schema['fields']]
        )  # this still fails if key order is different. But GBQ seems to keep key order.
    except HttpError as ex:
        self.process_http_error(ex)
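The comment on the last comparison line notes that the check still fails when the key order inside a field dict differs. Below is a minimal sketch of one way around that, assuming flat field dicts like the ones in the example above: serialise each field with `json.dumps(..., sort_keys=True)` so that neither field order nor key order matters. The helper name `same_fields` is hypothetical and not part of pandas.

```python
import json


def same_fields(schema_a, schema_b):
    """Compare two schema dicts while ignoring field order and key order."""
    def canonical(schema):
        # sort_keys=True gives every field dict one canonical string, so
        # {'name': ..., 'type': ...} equals {'type': ..., 'name': ...}.
        # sorted() (rather than set()) keeps duplicate fields visible.
        return sorted(json.dumps(field, sort_keys=True)
                      for field in schema['fields'])
    return canonical(schema_a) == canonical(schema_b)


dict1 = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                    {'name': 'created_at', 'type': 'STRING'}]}
dict2 = {'fields': [{'type': 'STRING', 'name': 'created_at'},
                    {'name': 'coordinates_0', 'type': 'FLOAT'}]}

print(dict1 == dict2)             # False: the field lists are ordered differently
print(same_fields(dict1, dict2))  # True: same fields, order ignored
```

This only relaxes the comparison; whether a relaxed comparison is desirable is the design question discussed in the comments.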
Activity
Title changed from "pandas.io.gbq verify_schema is too strict." to "pandas.io.gbq verify_schema seems to be too strict."
jorisvandenbossche commented on Oct 18, 2015
cc @parthea
parthea commented on Oct 18, 2015
The strict ordering of fields is by design (also in 0.16.2). From the docs:
Can you confirm whether it is your application that is changing the order of fields? If we can understand how the ordering of fields is changing, we may be better off fixing that problem instead.
My personal preference would be to fail if the column order of the DataFrame being inserted is different. It is trivial to alter the column order of a DataFrame prior to inserting.
Certainly, the proposed change would be more flexible, but I think that the user should be aware of the column order, in case the BigQuery table schema has actually changed.
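As a rough illustration of reordering the columns before inserting, here is a small sketch. The schema and the sample row are made up for the example; only the column names are taken from the issue description.

```python
import pandas as pd

# Hypothetical target schema, in the field order of the BigQuery table.
schema = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                     {'name': 'created_at', 'type': 'STRING'}]}

# A DataFrame whose columns arrive in a different order.
df = pd.DataFrame({'created_at': ['2015-10-18'], 'coordinates_0': [52.5]})

# Select the columns in schema order; a missing column raises KeyError.
ordered = df[[field['name'] for field in schema['fields']]]

print(list(ordered.columns))  # ['coordinates_0', 'created_at']
```

Because a missing column raises an error here, a schema that has actually changed would still be noticed.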
FlxVctr commented on Oct 19, 2015
Thanks for your answer. Wasn't reading the docs closely enough apparently. It's very likely that it is my application as I am transforming raw JSON from the Twitter API to a dataframe and the order there is arbitrary. I can try to verify that though. But order of columns seemed (before formatting anything for readable output) pretty unnecessary to me and only costing additional programming time as well as introducing more possibilities for errors to sneak in.
On the other hand (though now we are talking about particular project requirements and personal preferences), this use case is not that unlikely. And I don't know how the order of fields in BigQuery would matter for most applications of BigQuery in the end, as you query for fields by name anyway. For my project it's just another step of processing the data (which will come in large volume) before inserting, and I need to minimise those. I have to admit that ordering is trivial, indeed. So yes, that's a design decision. I would plead for a less strict option (maybe with an optional argument in to_gbq?) though.
parthea commented on Oct 19, 2015
The number of optional arguments in to_gbq is growing, so it may be better to decide whether to incorporate the change by default or not. The proposed change would certainly make inserting data into BigQuery easier, as long as we are ok with not maintaining column order. I'd like @jacobschaer to comment. The note in the docs to maintain column order was added in #6937.
jreback commented on Oct 19, 2015
just to note that HDF5 requires the orderings to be exactly the same, as does SQL. I don't think this should be relaxed. Ok to have a 'better' error message though.
FlxVctr commented on Oct 20, 2015
Yes, a more detailed error message would be an improvement. Because neither Python dicts, nor pandas DataFrames, nor GBQ really care about column order, I did not expect it to matter, and it took me quite a while to find out what was wrong with my schema.
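To make the idea of a more detailed error message concrete, here is a hedged sketch of a helper that explains why two schemas differ. The function name `describe_schema_mismatch` and the message wording are hypothetical; this is not the message that was eventually added to to_gbq.

```python
def describe_schema_mismatch(local_schema, remote_schema):
    """Return None if the schemas match, else a message explaining why not."""
    local = [(f['name'], f['type']) for f in local_schema['fields']]
    remote = [(f['name'], f['type']) for f in remote_schema['fields']]
    if local == remote:
        return None
    if sorted(local) == sorted(remote):
        # Same fields, different order: the case reported in this issue.
        return ("Schemas contain the same fields but in a different order. "
                "Local order: {0}; remote order: {1}".format(
                    [name for name, _ in local],
                    [name for name, _ in remote]))
    missing = sorted(set(remote) - set(local))
    extra = sorted(set(local) - set(remote))
    return ("Schemas differ. Missing locally: {0}; "
            "not in remote table: {1}".format(missing, extra))


local = {'fields': [{'name': 'created_at', 'type': 'STRING'},
                    {'name': 'coordinates_0', 'type': 'FLOAT'}]}
remote = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                     {'name': 'created_at', 'type': 'STRING'}]}

print(describe_schema_mismatch(local, remote))
```

Reporting both orders in the message points the user directly at the column order instead of leaving them to compare the schemas by hand.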
jorisvandenbossche commented on Oct 20, 2015
Note that for the SQL functions the ordering does not need to be the same (as we use named parameters and a dict with the data to insert)
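The effect of named parameters can be sketched with the standard-library sqlite3 module. This is only an illustration of the general mechanism, not the code pandas uses internally; the table name and the sample row are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tweets (coordinates_0 REAL, created_at TEXT)')

# The dict keys are in a different order than the table columns.
row = {'created_at': '2015-10-18', 'coordinates_0': 52.5}

# Named placeholders (:name) are bound by key, not by position,
# so the order of the keys in `row` does not matter.
conn.execute(
    'INSERT INTO tweets (coordinates_0, created_at) '
    'VALUES (:coordinates_0, :created_at)',
    row,
)

result = conn.execute(
    'SELECT coordinates_0, created_at FROM tweets').fetchall()
print(result)  # [(52.5, '2015-10-18')]
```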
ENH: Improve the error message in to_gbq when the DataFrame schema do…
Merge pull request #11401 from parthea/bq-improve-schema-error-msg
Fix for issue pandas-dev#11317