Releases: delta-io/delta
Delta Lake 2.3.0
We are excited to announce the release of Delta Lake 2.3.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.3.0/
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/2.3.0/
The key features in this release are as follows:
- Zero-copy convert to Delta from Iceberg tables using `CONVERT TO DELTA`. This generates a Delta table in the same location and does not rewrite any parquet files. See the documentation for details.
- Support `SHALLOW CLONE` for Delta, Parquet, and Iceberg tables to clone a source table without copying the data files. `SHALLOW CLONE` creates a copy of the source table's definition but refers to the source table's data files.
- Support idempotent writes for DML operations. This feature adds idempotency to `INSERT`/`DELETE`/`UPDATE`/`MERGE` etc. operations using the SQL configurations `spark.databricks.delta.write.txnAppId` and `spark.databricks.delta.write.txnVersion`.
- Support "when not matched by source" clauses for the Merge command to update or delete rows in the chosen table that don't have matches in the source table based on the merge condition. This clause is supported in the Python, Scala, and Java `DeltaTable` APIs (see the sketch after this list); SQL support will be added in Spark 3.4.
- Support `CREATE TABLE LIKE` to create empty Delta tables using the definition and metadata of an existing table or view.
- Support reading Change Data Feed (CDF) in SQL queries using the `table_changes` table-valued function.
- Unblock Change Data Feed (CDF) batch reads on column mapping enabled tables when `DROP COLUMN` and `RENAME COLUMN` have been used. See the documentation for more details.
- Improved read and write performance on S3 when writing from a single cluster. Efficient file listing decreases the metadata processing time when calculating a table snapshot; this is most impactful for tables with many commits. Set the Hadoop configuration `delta.enableFastS3AListFrom` to `true` to enable it.
- Record `VACUUM` operations in the transaction log. With this feature, `VACUUM` operations and their associated metrics (e.g. `numDeletedFiles`) will now show up in table history.
- Support reading Delta tables with deletion vectors.
- Other notable changes:
  - Support schema evolution in `MERGE` for `UPDATE SET <assignments>` and `INSERT (...) VALUES (...)` actions. Previously, schema evolution was only supported for `UPDATE SET *` and `INSERT *` actions.
  - Add `.show()` support for `COUNT(*)` aggregate pushdown.
  - Enforce idempotent writes for `df.saveAsTable` in overwrite and append modes.
  - Support Table Features to selectively add individual features when upgrading the table protocol version. This enables users to add only active features, and facilitates connectivity as downstream Delta connectors can selectively implement feature support.
  - Automatically generate partition filters for additional generation expressions:
    - Support the `trunc` and `date_trunc` functions.
    - Support the `date_format` function with format `yyyy-MM-dd`.
  - Block protocol downgrades when replacing a Delta table to prevent any incorrect time-travel or CDF queries.
  - Fix `replaceWhere` with the DataFrame V2 overwrite API to correctly evaluate less-than conditions.
  - Fix dynamic partition overwrite for tables with more than one partition data type.
  - Fix schema evolution for `INSERT OVERWRITE` with complex data types when the source schema is read-incompatible.
  - Fix the Delta streaming source to correctly detect read-incompatible schema changes during backfill when there is exactly one schema change in the versions read.
  - Fix a bug in `VACUUM` where sometimes the default retention period was used to remove files instead of the retention period specified in the table properties.
  - Include the table name in the DataFrame returned by the `deltaTable.detail()` Python/Scala/Java API.
  - Improve the log message for `VACUUM table_name DRY RUN`.
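A minimal sketch of two of the features above: the new "when not matched by source" merge clause via the Python `DeltaTable` API, and the `table_changes` table-valued function. The table path, table name, column names, and `updates_df` DataFrame are all hypothetical.

```python
from delta.tables import DeltaTable

# Hypothetical target table and source DataFrame of fresh records.
target = DeltaTable.forPath(spark, "/tmp/delta/customers")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"status": "s.status"})
    .whenNotMatchedInsertAll()
    # New in 2.3.0: act on target rows that have no match in the source,
    # e.g. mark customers missing from the latest feed as inactive.
    .whenNotMatchedBySourceUpdate(set={"status": "'inactive'"})
    .execute())

# Read the Change Data Feed in SQL with the new table-valued function
# (assumes delta.enableChangeDataFeed is set on the hypothetical table).
spark.sql("SELECT * FROM table_changes('customers', 1, 5)").show()
```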
Credits
Allison Portis, Andreas Chatzistergiou, Andrew Li, Bo Zhang, Brayan Jules, Burak Yavuz, Christos Stavrakakis, Daniel Tenedorio, Dhruv Shah, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gengliang Wang, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jiaheng Tang, Jintian Liang, Johan Lasperas, Jungtaek Lim, Kam Cheung Ting, Koki Otsuka, Lars Kroll, Lin Ma, Lukas Rupprecht, Ming DAI, Mitchell Riley, Ole Sasse, Paddy Xu, Prakhar Jain, Pranav, Rahul Shivu Mahadev, Rajesh Parangi, Ryan Johnson, Scott Sandre, Serge Rielau, Shixiong Zhu, Slim Ouertani, Tobias Fabritz, Tom van Bussel, Tushar Machavolu, Tyson Condie, Venki Korukanti, Vitalii Li, Wenchen Fan, Xinyi Yu, Yaohua Zhao, Yingyi Bu
Delta Lake 2.0.2
We are excited to announce the release of Delta Lake 2.0.2 on Apache Spark 3.2. This release contains important bug fixes and a few high-demand usability improvements over 2.0.1 and it is recommended that users update to 2.0.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.0.2/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.0.2/
This release includes the following bug fixes and improvements:
- Record `VACUUM` operations in the transaction log. With this feature, `VACUUM` operations and their associated metrics (e.g. `numDeletedFiles`) will now show up in table history.
- Support idempotent writes for DML operations. This feature adds idempotency to `INSERT`/`DELETE`/`UPDATE`/`MERGE` etc. operations using the SQL configurations `spark.databricks.delta.write.txnAppId` and `spark.databricks.delta.write.txnVersion` (see the sketch after this list).
- Support passing Hadoop configurations via the DeltaTable API:

```python
from delta.tables import DeltaTable

hadoop_config = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "...",
    "fs.azure.account.oauth2.client.id": "...",
    "fs.azure.account.oauth2.client.secret": "...",
    "fs.azure.account.oauth2.client.endpoint": "..."
}
delta_table = DeltaTable.forPath(spark, <table-path>, hadoop_config)
```
- Minor convenience improvement to the `DeltaTableBuilder:executeZOrderBy` Java API, which allows users to pass in varargs instead of a `List`.
- Fail fast on malformed delta log JSON entries. Previously, Delta queries could return inaccurate results whenever JSON commits in the `_delta_log` were malformed; for example, an `add` action with a missing `}` would be skipped. Now, queries fail fast, preventing inaccurate results.
- Fix "Could not find active SparkSession" bug by passing in the SparkSession when resolving tables in the `DeltaTableBuilder`.
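A minimal sketch of the idempotent DML writes mentioned above. The application id, version number, and table names are hypothetical; re-running a write with the same (`txnAppId`, `txnVersion`) pair is skipped instead of being applied twice.

```python
# Identify the writing application and a monotonically increasing run version.
spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-etl")  # hypothetical app id
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")         # hypothetical run number

# If this exact (appId, version) pair has already committed, the write is skipped.
spark.sql("INSERT INTO target_table SELECT * FROM staged_updates")      # hypothetical tables
```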
Credits:
Helge Brügner, Jiaheng Tang, Mitchell Riley, Ryan Johnson, Scott Sandre, Venki Korukanti, Jintao Shen, Yann Byron
Delta Lake 2.2.0
We are excited to announce the release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.2.0/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.2.0/
The key features in this release are as follows:
- `LIMIT` pushdown into Delta scan. Improve the performance of queries containing `LIMIT` clauses by pushing down the `LIMIT` into Delta scan during query planning. Delta scan uses the `LIMIT` and the file-level row counts to reduce the number of files scanned, which helps queries read far fewer files and can make `LIMIT` queries 10-100x faster depending on the table size.
- Aggregate pushdown into Delta scan for `SELECT COUNT(*)`. Aggregation queries such as `SELECT COUNT(*)` on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time, as the query only needs to read the table metadata, and can make full table count queries 10-100x faster.
- Support for collecting file-level statistics as part of the `CONVERT TO DELTA` command. These statistics potentially help speed up queries on the Delta table. By default the statistics are now collected as part of the `CONVERT TO DELTA` command. To disable statistics collection, specify the `NO STATISTICS` clause in the command, for example: `CONVERT TO DELTA table_name NO STATISTICS` (see the sketch after this list).
- Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp, which was used by DynamoDB’s TTL feature to clean up expired items. This timestamp value has been fixed and the table attribute renamed from `commitTime` to `expireTime`. If you already have TTL enabled, please follow the migration steps here.
- Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.
- Remove the restrictions for using Delta tables with column mapping in certain Streaming + CDF cases. Earlier we used to block Streaming + CDF if the Delta table had column mapping enabled, even though it didn’t contain any RENAME or DROP columns.
- Other notable changes:
  - Improve the monitoring of the Delta state construction queries (additional queries run as part of planning) by making them visible in the Spark UI.
  - Support for multiple `where()` calls in the Optimize Scala/Python API.
  - Support for passing Hadoop configurations via the DeltaTable API.
  - Support partition column names starting with `.` or `_` in the CONVERT TO DELTA command.
  - Improvements to metrics in table history:
    - Fix a metric in the MERGE command.
    - Source type metric for CONVERT TO DELTA.
    - Metrics for DELETE on partitions.
    - Additional vacuum stats.
  - Fix for accidental protocol downgrades with the RESTORE command. Until now, RESTORE TABLE could downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
  - Fix a bug in `MERGE INTO` when there are multiple `UPDATE` clauses and one of the UPDATEs is with a schema evolution.
  - Fix a bug where sometimes the active `SparkSession` object is not found when using Delta APIs.
  - Fix an issue where the partition schema couldn’t be set during the initial commit.
  - Catch exceptions when writing the `last_checkpoint` file fails.
  - Fix an issue when restarting a streaming query with the `AvailableNow` trigger on a Delta table.
  - Fix an issue with CDF and Streaming where the offset is not correctly updated when there are no data changes.
Credits
Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña, Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)
Delta Lake 2.2.0rc1 (preview)
We are excited to announce the preview release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/latest/index.html
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1102
- Python artifacts: https://test.pypi.org/project/delta-spark/2.2.0rc1/
The key features and other notable changes in this release candidate are identical to those in the final Delta Lake 2.2.0 release listed above.
How to use the preview release
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
- spark-submit: Add `--repositories https://oss.sonatype.org/content/repositories/iodelta-1102/` to the command line arguments. For example:

```bash
spark-submit --packages io.delta:delta-core_2.12:2.2.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1102/ examples/examples.py
```

- Currently Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta `2.2.0rc1` by just providing the `--packages io.delta:delta-core_2.12:2.2.0rc1` argument.
- Maven project:
```xml
<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1102/</url>
  </repository>
</repositories>

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>2.2.0rc1</version>
</dependency>
```
- SBT project:

```scala
libraryDependencies += "io.delta" %% "delta-core" % "2.2.0rc1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1102/"
```
- delta-spark:

```bash
pip install -i https://test.pypi.org/simple/ delta-spark==2.2.0rc1
```
Credits
Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña, Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)
Delta Lake 2.0.1
We are excited to announce the release of Delta Lake 2.0.1 on Apache Spark 3.2. This release contains important bug fixes to 2.0.0 and it is recommended that users update to 2.0.1. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.0.1/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.0.1/
This release includes the following bug fixes and improvements:
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp, which was used by DynamoDB’s TTL feature to clean up expired items. This timestamp value has been fixed and the table attribute renamed from `commitTime` to `expireTime`. If you already have TTL enabled, please follow the migration steps here.
- Fix a duplicate CDF rows issue in some cases in the MERGE operation.
- Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Improve performance of the DELETE command by pruning the columns read in the step that searches for touched files.
- Fix for `NotSerializableException` when running the RESTORE command in Spark SQL with Hadoop 2.
- Fix incorrect stats collection issue in data skipping stats tracker.
Credits
Adam Binford, Allison Portis, Chen Shuai, Lars Kroll, Scott Sandre, Shixiong Zhu, Venki Korukanti
Delta Lake 2.1.1
We are excited to announce the release of Delta Lake 2.1.1 on Apache Spark 3.3. This release contains important bug fixes to 2.1.0 and it is recommended that users update to 2.1.1. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.1.1/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.1.1/
This release includes the following bug fixes and improvements:
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp, which was used by DynamoDB’s TTL feature to clean up expired items. This timestamp value has been fixed and the table attribute renamed from `commitTime` to `expireTime`. If you already have TTL enabled, please follow the migration steps here.
- Fix for incorrect MERGE behavior when the Delta statistics are disabled.
- Fix for accidental protocol downgrades with RESTORE command. Until now, RESTORE TABLE may downgrade the protocol version of the table, which could have resulted in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Improve performance of the DELETE command by pruning the columns read in the step that searches for affected files.
- Fix for `NotSerializableException` when running the RESTORE command in Spark SQL with Hadoop 2.
Credits
Adam Binford, Allison Portis, Chen Shuai, Felipe Pessoto, Lars Kroll, Scott Sandre, Shixiong Zhu, Venki Korukanti
Delta Lake 2.1.0
We are excited to announce the release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.1.0/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.1.0/
The key features in this release are as follows:
- Support for Apache Spark 3.3.
- Support for [TIMESTAMP | VERSION] AS OF in SQL. With Spark 3.3, Delta now supports time travel in SQL to query older data easily. With this update, time travel is now available both in SQL and through the DataFrame API (see the sketch after this list).
- Support for Trigger.AvailableNow when streaming from a Delta table. Spark 3.3 introduces Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches. This is now supported when using Delta tables as a streaming source.
- Support for SHOW COLUMNS to return the list of columns in a table.
- Support for DESCRIBE DETAIL in the Scala and Python DeltaTable API. Retrieve detailed information about a Delta table using the DeltaTable API and in SQL.
- Support for returning operation metrics from SQL Delete, Merge, and Update commands. Previously these SQL commands returned an empty DataFrame, now they return a DataFrame with useful metrics about the operation performed.
- Optimize performance improvements
  - Added a config to use `repartition(1)` instead of `coalesce(1)` in Optimize for better performance when compacting many small files.
  - Improve Optimize performance by using a queue-based approach to parallelize the compaction jobs.
- Other notable changes:
- Support for using variables in the VACUUM and OPTIMIZE SQL commands.
- Improvements for CONVERT TO DELTA with catalog tables.
- Autofill the partition schema from the catalog when it’s not provided.
- Use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
- Support for Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. See the documentation for more details.
- Improve Update performance by enabling schema pruning in the first pass.
- Fix for `DeltaTableBuilder` to preserve the table property case of non-delta properties when setting properties.
- Fix for duplicate CDF row output for delete-when-matched merges with multiple matches.
- Fix for consistent timestamps in a MERGE command.
- Fix for incorrect operation metrics for DataFrame writes with a `replaceWhere` option.
- Fix for a bug in Merge that sometimes caused empty files to be committed to the table.
- Change in log4j properties file format. Apache Spark upgraded the log4j version from 1.x to 2.x which has a different format for the log4j file. Refer to the Spark upgrade notes.
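A minimal sketch of SQL time travel and the new `DeltaTable` detail API described above; the table name and timestamp are hypothetical.

```python
from delta.tables import DeltaTable

# Time travel in SQL, newly available with Spark 3.3.
spark.sql("SELECT * FROM events VERSION AS OF 10").show()
spark.sql("SELECT * FROM events TIMESTAMP AS OF '2022-08-01'").show()

# DESCRIBE DETAIL through the Python API.
DeltaTable.forName(spark, "events").detail().show()
```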
Benchmark framework update
Improvements to the benchmark framework (initial version added in version 1.2.0) including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.
Credits
Adam Binford, Allison Portis, Andreas Chatzistergiou, Andrew Vine, Andy Lam, Carlos Peña, Chang Yong Lik, Christos Stavrakakis, David Lewis, Denis Krivenko, Denny Lee, EJ Song, Edmondo Porcu, Felipe Pessoto, Fred Liu, Fu Chen, Grzegorz Kołakowski, Hedi Bejaoui, Hussein Nagree, Ionut Boicu, Ivan Sadikov, Jackie Zhang, Jiawei Bao, Jintao Shen, Jintian Liang, Jonas Irgens Kylling, Juliusz Sompolski, Junlin Zeng, KaiFei Yi, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Lin Zhou, Lukas Rupprecht, Max Gekk, Min Yang, Ming DAI, Nick, Ole Sasse, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Terry Kim, Thomas Newton, Tom van Bussel, Tyson Condie, Venki Korukanti, Vini Jaiswal, Will Jones, Xi Liang, Yijia Cui, Yousry Mohamed, Zach Schuermann, sherlockbeard, yikf
Delta Lake 2.0.0
We are excited to announce the release of Delta Lake 2.0.0 on Apache Spark 3.2.
- Quick start guide on how to try the Delta Lake 2.0.0: https://docs.delta.io/2.0.0/quick-start.html
- Documentation: https://docs.delta.io/2.0.0/index.html
- Maven artifacts. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Python artifacts: https://pypi.org/project/delta-spark/2.0.0/
The key features in this release are as follows.
- Support Change Data Feed on Delta tables. Change Data Feed represents the row-level changes between different versions of the table. When enabled, additional information is recorded regarding row-level changes for every write operation on the table. See the documentation for more details, and the sketch after this list for an example.
- Support Z-Order clustering of data to reduce the amount of data read. Z-Ordering is a technique to colocate related information in the same set of files. This data clustering allows column stats (released in Delta 1.2) to be more effective in skipping data based on filters in a query. See the documentation for more details.
- Support for idempotent writes to Delta tables to enable fault-tolerant retry of Delta table writing jobs without writing the data multiple times to the table. See the documentation for more details.
- Support for dropping columns in a Delta table as a metadata change operation. This command drops the column from the metadata, not the column data in the underlying files. See the documentation for more details.
- Support for dynamic partition overwrite. Overwrite only the partitions with data written into them at runtime. See the documentation for details.
- Experimental support for multi-part checkpoints to split the Delta Lake checkpoint into multiple parts to speed up writing and reading the checkpoints. See the documentation for more details.
- Python and Scala API support for OPTIMIZE file compaction and Z-order by.
- Other notable changes:
  - Improve generated column data skipping by adding support for skipping by generated columns over nested columns.
  - Improve table schema validation by blocking unsupported data types in Delta Lake.
  - Support creating a Delta Lake table with an empty schema.
  - Change the behavior of DROP CONSTRAINT to throw an error when the constraint does not exist. Before this version, the command would return silently.
  - Fix the symlink manifest generation when partition values contain spaces.
  - Fix an issue where incorrect commit stats were collected.
  - Support for `SimpleAWSCredentialsProvider` or `TemporaryAWSCredentialsProvider` in the S3 multi-cluster write supported `LogStore`.
  - Fix an issue in generated columns that would not allow null columns in the insert `DataFrame` to be written even if the column was nullable.
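A minimal sketch of enabling and reading Change Data Feed, plus the new OPTIMIZE/Z-order Python API, with hypothetical paths and column names:

```python
from delta.tables import DeltaTable

# Enable CDF on an existing table; subsequent writes record row-level changes.
spark.sql("ALTER TABLE delta.`/data/events` "
          "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read the row-level changes between two versions.
(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .load("/data/events")
    .show())

# Compact files and Z-order by a column using the new Python API.
DeltaTable.forPath(spark, "/data/events").optimize().executeZOrderBy("date")
```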
Benchmark Framework Update
Independent of this release, we have improved the framework for writing large-scale performance benchmarks (initial version added in version 1.2.0) by adding support for running benchmarks on Google Cloud Platform using Google Dataproc, in addition to the existing support for EMR on AWS.
Credits
Adam Binford, Alkis Evlogimenos, Allison Portis, Ankur Dave, Bingkun Pan, Burak Yilmaz, Chang Yong Lik, Chen Qingzhi, Denny Lee, Eric Chang, Felipe Pessoto, Fred Liu, Fu Chen, Gaurav Rupnar, Grzegorz Kołakowski, Hussein Nagree, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jintao Shen, Jintian Liang, John O'Dwyer, Junyong Lee, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Liwen Sun, Lukas Rupprecht, Max Gekk, Michael Mengarelli, Min Yang, Naga Raju Bhanoori, Nick Grigoriev, Nick Karpov, Ole Sasse, Patrick Grandjean, Peng Zhong, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Ruslan Dautkhanov, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Shoumik Palkar, Tathagata Das, Terry Kim, Tyson Condie, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Xinyi, Yijia Cui, Yousry Mohamed
Delta Lake 1.2.1
We are excited to announce the release of Delta Lake 1.2.1 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/1.2.1/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/1.2.1/
Key features in this release
- Fix an issue with loading error messages in `--packages` mode. The previous release had a bug that resulted in users getting a `NullPointerException` instead of the proper error message when using Delta Lake with `--packages` mode in either `pyspark` or `spark-shell` (Fix, Test).
- Fix incorrect exception type thrown in some Python APIs. A bug caused `pyspark` to throw an incorrect type of exception instead of the expected `AnalysisException`. This issue is fixed. See issue #1086 for more details.
- Fix for S3 multi-cluster mode configuration. A bug in the S3 multi-cluster mode caused `--conf` to not work for certain configuration parameters. This issue is fixed by having these configuration parameters begin with `spark.` (see the sketch after this list). See the updated documentation.
- Make the GCS LogStore configuration simpler by automatically deriving the `LogStore` implementation class config `spark.delta.logStore.gs.impl` from the scheme in the table path. See the updated documentation.
- Make `SetAccumulator` thread-safe. The `SetAccumulator` used by Merge was not thread-safe and might cause executor heartbeat failures in rare cases. This was fixed by using a synchronized set.
Credits
Allison Portis, Chang Yong Lik, Kam Cheung Ting, Rahul Mahadev, Scott Sandre, Venki Korukanti
Delta Lake 1.2.0
We are excited to announce the release of Delta Lake 1.2.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/1.2.0/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/1.2.0/
Key features in this release
- Support multi-cluster write to Delta Lake tables stored in S3. Users now have the option of specifying a new and experimental `LogStore` implementation that supports concurrent reads and writes to a single Delta Lake table in S3 from multiple Spark drivers. See the documentation for more details.
- Support for compacting small files (optimize) into larger files in a Delta Lake table. A reduced number of data files improves read latency due to reduced metadata size and per-file overheads such as file-open and file-close overhead. See the documentation for more details, and the sketch after this list for an example.
- Support for data skipping using column statistics. Column statistics are collected for each file as part of Delta Lake table writes. These statistics can be used when reading a Delta Lake table to skip files not matching the filters in the query. See the documentation for more details.
- Support for restoring a Delta table to an earlier version. Restoring to an earlier version number, or to the version at a specific timestamp, is supported using the SQL command, Scala APIs, or Python APIs. See the documentation for more details.
- Support for column renaming in a Delta Lake table without the need to rewrite the underlying Parquet data files. See the documentation for more details.
- Support for arbitrary characters in column names in Delta tables. Before, the supported list of characters was limited by Parquet’s support for the same. Column names containing special characters such as space, tab, `,`, `{`, `(` etc. are supported now. See the documentation for more details.
- Support for automatic data skipping using generated columns. For any partition column that is a generated column, partition filters will be automatically generated from any data filters on its generating column(s), when possible.
- Support for Google Cloud Storage is now generally available. See the documentation on how to read and write Delta Lake tables in Google Cloud Storage.
- Other notable changes:
  - Create a new module `delta-storage`. This extracts the `LogStore` interface and implementations into a separate module, which is published as its own jar. This enables new implementations of `LogStore` without depending upon the complete Delta jars. See the migration guide here for more details.
  - Improve the error messages and exceptions to be better organized and queryable.
  - Support for the `gettimestamp` expression in generated columns.
  - Snapshot/checkpoint management improvements:
    - Make loading snapshots resilient to corrupt checkpoints in Delta. When reading a checkpoint fails, we try to search for an alternative checkpoint and use it to construct a snapshot.
    - Fix snapshot writing to not fail the write when a checkpoint fails due to non-fatal errors.
    - Optimization to reduce the number of `list` calls to storage.
  - Improved output metrics for the DELETE table command.
  - Improved output metrics for the UPDATE table command.
  - Optimize the merge operation in a Delta table with a large number of columns.
  - Fix a `NullPointerException` when trying to reference a `DeltaLog` created with a `SparkContext` that has stopped.
  - Fix an issue in handling null partition column values in the change data capture feature.
  - Fix an issue in adding a new column to the Delta table when the preceding column is of type `Array`.
  - Fix an issue where the file list iterator was not closed when reading large log files in the Delta streaming source.
  - Throw proper exceptions when searching for a Delta table in the catalog.
  - Fix a schema evolution issue when the column type is an array of structs.
  - Better handling of `FileNotFoundException` when reading Delta log files, to distinguish between corrupt log files and no files found.
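A minimal sketch of the new compaction and restore capabilities, using a hypothetical table path. Note that in this release compaction is a SQL command; the dedicated Python/Scala OPTIMIZE API arrived later, in 2.0.0.

```python
from delta.tables import DeltaTable

# Compact small files into larger ones.
spark.sql("OPTIMIZE delta.`/data/events`")

# Restore the table to an earlier version, or to its state at a timestamp.
dt = DeltaTable.forPath(spark, "/data/events")
dt.restoreToVersion(3)
dt.restoreToTimestamp("2022-03-01")
```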
Benchmark Framework
Independent of this release, we have also built a framework for writing large-scale performance benchmarks on Delta tables using a real cluster. Currently, the framework provides a TPC-DS inspired benchmark to measure ingestion time (e.g. time taken to create the TPC-DS tables) and query times. We encourage the community to contribute more benchmarks that measure the performance of different real-world workloads on Delta tables.
Credits
Adam Binford, Alex Liu, Allison Portis, Anton Okolnychyi, Bart Samwel, Carmen Kwan, Chang Yong Lik, Christian Williams, Christos Stavrakakis, David Lewis, Denny Lee, Fabio Badalì, Fred Liu, Gengliang Wang, Hoang Pham, Hussein Nagree, Hyukjin Kwon, Jackie Zhang, Jan Paw, John ODwyer, Junlin Zeng, Junyong Lee, Kam Cheung Ting, Kapil Sreedharan, Lars Kroll, Liwen Sun, Maksym Dovhal, Mariusz Krynski, Meng Tong, Peng Zhong, Prakhar Jain, Pranav, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Sri Tikkireddy, Tathagata Das, Tyson Condie, Vegard Stikbakke, Venkata Sai Akhil Gudesa, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Will Jones, Xinyi Yu, Yann Byron, Yaohua Zhao, Yijia Cui