Commit e3b4476

Coerce types on read (#76)
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema, e.g. an `INT32` column in the parquet schema cannot be read into a Postgres column of type `int64`. We can avoid this by casting the arrow array to the array type expected by the tupledesc schema, whenever the cast is possible. For this we can use the `arrow-cast` crate, which lives in the same project as `arrow`; its public API lets us check whether a cast between two arrow types is possible and then perform it.

To make sure the cast is possible, we need two checks:
1. `arrow-cast` allows the cast from the arrow type in the parquet file to the arrow type in the schema generated for the tupledesc (custom cast functions that users create in Postgres are not visible to `arrow-cast`),
2. the cast is meaningful in Postgres: we check that a cast exists from the Postgres type corresponding to the arrow type in the parquet file to the Postgres type in the tupledesc.

With that we can implicitly cast between many types, e.g.:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

**NOTE**: Struct fields must always strictly match by name and position.

Some casts are allowed but can fail at runtime, e.g. with a value overflow:
- INT64 => INT32
- TIMESTAMPTZ => TIMESTAMP

Closes #67. Closes #79.
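For context, a minimal Rust sketch of the arrow-side half of this check, using `arrow-cast`'s public `can_cast_types` / `cast_with_options` API. This is illustrative only: the names (e.g. `parquet_col`) are made up, the Postgres-side catalog check is reduced to a comment, and none of this is the commit's actual code.

```rust
// Illustrative sketch only; not pg_parquet's actual implementation.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow_cast::{can_cast_types, cast_with_options, CastOptions};

fn main() -> Result<(), ArrowError> {
    // Check 1: ask arrow-cast whether INT32 -> INT64 is castable at all.
    assert!(can_cast_types(&DataType::Int32, &DataType::Int64));

    // Check 2 (not shown): verify that Postgres itself allows a cast between
    // the corresponding Postgres types, e.g. by consulting its cast catalog.

    // Pretend this Int32 array was read from the parquet file, while the
    // tupledesc column expects int64.
    let parquet_col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    // `safe: false` makes lossy casts (e.g. an INT64 -> INT32 overflow)
    // return an error instead of silently producing nulls, matching the
    // runtime-error behavior described above.
    let options = CastOptions {
        safe: false,
        ..Default::default()
    };
    let casted = cast_with_options(&parquet_col, &DataType::Int64, &options)?;
    assert_eq!(casted.data_type(), &DataType::Int64);

    Ok(())
}
```

Setting `safe: false` is one way to surface overflow as an error; with arrow-cast's default `safe: true`, failing values would become nulls instead.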
1 parent 518a5ac · commit e3b4476

13 files changed: +1668 −402 lines


Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ pg_test = []
 
 [dependencies]
 arrow = {version = "53", default-features = false}
+arrow-cast = {version = "53", default-features = false}
 arrow-schema = {version = "53", default-features = false}
 aws-config = { version = "1.5", default-features = false, features = ["rustls"]}
 aws-credential-types = {version = "1.2", default-features = false}

README.md

Lines changed: 6 additions & 3 deletions
@@ -110,7 +110,7 @@ SELECT * FROM parquet.schema('/tmp/product_example.parquet') LIMIT 10;
  /tmp/product_example.parquet | name | BYTE_ARRAY | | OPTIONAL | | UTF8 | | | 3 | STRING
  /tmp/product_example.parquet | items | | | OPTIONAL | 1 | LIST | | | 4 | LIST
  /tmp/product_example.parquet | list | | | REPEATED | 1 | | | | |
- /tmp/product_example.parquet | items | | | OPTIONAL | 3 | | | | 5 |
+ /tmp/product_example.parquet | element | | | OPTIONAL | 3 | | | | 5 |
  /tmp/product_example.parquet | id | INT32 | | OPTIONAL | | | | | 6 |
  /tmp/product_example.parquet | name | BYTE_ARRAY | | OPTIONAL | | UTF8 | | | 7 | STRING
  (10 rows)
@@ -185,12 +185,15 @@ Alternatively, you can use the following environment variables when starting pos
 
 ## Copy Options
 `pg_parquet` supports the following options in the `COPY TO` command:
-- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension. (This is the only option that `COPY FROM` command supports.),
+- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
 - `row_group_size <int>`: the number of rows in each row group while writing Parquet files. The default row group size is `122880`,
 - `row_group_size_bytes <int>`: the total byte size of rows in each row group while writing Parquet files. The default row group size bytes is `row_group_size * 1024`,
-- `compression <string>`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw` and `zstd`. The default compression format is `snappy`. If not specified, the compression format is determined by the file extension.
+- `compression <string>`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw` and `zstd`. The default compression format is `snappy`. If not specified, the compression format is determined by the file extension,
 - `compression_level <int>`: the compression level to use while writing Parquet files. The supported compression levels are only supported for `gzip`, `zstd` and `brotli` compression formats. The default compression level is `6` for `gzip (0-10)`, `1` for `zstd (1-22)` and `1` for `brotli (0-11)`.
 
+`pg_parquet` supports the following options in the `COPY FROM` command:
+- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
+
 ## Configuration
 There is currently only one GUC parameter to enable/disable the `pg_parquet`:
 - `pg_parquet.enable_copy_hooks`: you can set this parameter to `on` or `off` to enable or disable the `pg_parquet` extension. The default value is `on`.
