
Commit 3a9762f

xzhseh and tdas authored
[Spark] Relax check for generated columns and CHECK constraints on nested struct fields (#3601)
#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Closes #3250.

This PR relaxes the check for nested struct fields so that a type change is only rejected when the changed field itself is referenced by a CHECK constraint or a generated column. This allows more valid use cases in scenarios involving type widening or schema evolution.

The core function, `checkConstraintsOrGeneratedColumnsOnStructField`, inspects the nested/inner fields of the provided `StructType` to determine whether any of them are referenced by dependent CHECK constraints or generated columns; for column types like `ArrayType` or `MapType`, the function checks these properties directly without inspecting the inner fields.

## How was this patch tested?

Unit tests in `TypeWideningConstraintsSuite` and `TypeWideningGeneratedColumnsSuite`.

## Does this PR introduce _any_ user-facing changes?

Yes. The following (valid) use case is no longer rejected by the check in [ImplicitMetadataOperation.checkDependentExpressions](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/schema/ImplicitMetadataOperation.scala#L241):

```sql
-- with `DELTA_SCHEMA_AUTO_MIGRATE` enabled
CREATE TABLE t (a STRUCT<x: BYTE, y: BYTE>) USING delta;
ALTER TABLE t ADD CONSTRAINT ck CHECK (hash(a.x) > 0);
-- changing the type of struct field `a.y`, which is not referenced
-- by the CHECK constraint, is now allowed.
INSERT INTO t (a) VALUES (named_struct('x', CAST(2 AS BYTE), 'y', 1030));
```

Co-authored-by: Tathagata Das <[email protected]>
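The same relaxation covers generated columns. A hypothetical analog of the example above, sketched with Delta's Scala table-builder API; the table name `t2` and the surrounding session setup are assumptions, with `spark.databricks.delta.schema.autoMerge.enabled` set to `true` as before:

```scala
// Hypothetical generated-column analog of the CHECK-constraint example above.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.types._

DeltaTable.create(spark)
  .tableName("t2")
  .addColumn("a", StructType(Seq(
    StructField("x", ByteType),
    StructField("y", ByteType))))
  .addColumn(
    DeltaTable.columnBuilder(spark, "gen")
      .dataType(IntegerType)
      .generatedAlwaysAs("hash(a.x)") // only `a.x` feeds the generated column
      .build())
  .execute()

// Widening `a.y` (BYTE -> INT) via automatic schema evolution is now allowed,
// since the generated column does not reference `a.y`.
spark.sql("INSERT INTO t2 (a) VALUES (named_struct('x', CAST(2 AS BYTE), 'y', 1030))")
```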
1 parent 7dfc6d9 commit 3a9762f

File tree: 4 files changed (+223, -80 lines)


spark/src/main/scala/org/apache/spark/sql/delta/schema/ImplicitMetadataOperation.scala

Lines changed: 118 additions & 39 deletions
```diff
@@ -29,7 +29,7 @@ import org.apache.spark.internal.MDC
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.expressions.FileSourceGeneratedMetadataStructField
 import org.apache.spark.sql.catalyst.types.DataTypeUtils.toAttributes
-import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.types.{DataType, StructType}

 /**
  * A trait that writers into Delta can extend to update the schema and/or partitioning of the table.
@@ -227,6 +227,104 @@ object ImplicitMetadataOperation {
     }
   }

+  /**
+   * Check whether there are dependent (CHECK) constraints on
+   * the provided `currentDt`; if so, throw an error indicating
+   * the constraint data type mismatch.
+   *
+   * @param spark the spark session used.
+   * @param path the full column path for the current field.
+   * @param metadata the metadata used for checking dependent (CHECK) constraints.
+   * @param currentDt the current data type.
+   * @param updateDt the updated data type.
+   */
+  private def checkDependentConstraints(
+      spark: SparkSession,
+      path: Seq[String],
+      metadata: Metadata,
+      currentDt: DataType,
+      updateDt: DataType): Unit = {
+    val dependentConstraints =
+      Constraints.findDependentConstraints(spark, path, metadata)
+    if (dependentConstraints.nonEmpty) {
+      throw DeltaErrors.constraintDataTypeMismatch(
+        path,
+        currentDt,
+        updateDt,
+        dependentConstraints
+      )
+    }
+  }
+
+  /**
+   * Check whether there are dependent generated columns on
+   * the provided `currentDt`; if so, throw an error indicating
+   * the generated columns data type mismatch.
+   *
+   * @param spark the spark session used.
+   * @param path the full column path for the current field.
+   * @param protocol the protocol used.
+   * @param metadata the metadata used for checking dependent generated columns.
+   * @param currentDt the current data type.
+   * @param updateDt the updated data type.
+   */
+  private def checkDependentGeneratedColumns(
+      spark: SparkSession,
+      path: Seq[String],
+      protocol: Protocol,
+      metadata: Metadata,
+      currentDt: DataType,
+      updateDt: DataType): Unit = {
+    val dependentGeneratedColumns = SchemaUtils.findDependentGeneratedColumns(
+      spark, path, protocol, metadata.schema)
+    if (dependentGeneratedColumns.nonEmpty) {
+      throw DeltaErrors.generatedColumnsDataTypeMismatch(
+        path,
+        currentDt,
+        updateDt,
+        dependentGeneratedColumns
+      )
+    }
+  }
+
+  /**
+   * Check whether the provided field is currently referenced
+   * by CHECK constraints or generated columns.
+   * Note that we explicitly skip the check for `StructType` in this
+   * function and only inspect its inner fields, to relax the check;
+   * any `StructType` will be traversed in [[checkDependentExpressions]].
+   *
+   * @param spark the spark session used.
+   * @param path the full column path for the current field.
+   * @param protocol the protocol used.
+   * @param metadata the metadata used for checking constraints and generated columns.
+   * @param currentDt the current data type.
+   * @param updateDt the updated data type.
+   */
+  private def checkConstraintsOrGeneratedColumnsOnStructField(
+      spark: SparkSession,
+      path: Seq[String],
+      protocol: Protocol,
+      metadata: Metadata,
+      currentDt: DataType,
+      updateDt: DataType): Unit = (currentDt, updateDt) match {
+    // we explicitly skip the check for `StructType` here.
+    case (StructType(_), StructType(_)) =>

+    // FIXME: we intentionally include the pattern match for `ArrayType` and `MapType`
+    // here mainly because the field paths for maps/arrays in constraints/generated
+    // columns are *NOT* consistent with regular field paths,
+    // e.g., `hash(a.arr[0].x)` vs. `hash(a.element.x)`.
+    // this makes it hard to recurse into maps/arrays and check the corresponding
+    // fields - thus we cannot actually block the operation even if the updated field
+    // is referenced by any CHECK constraints or generated columns.
+    case (from, to) =>
+      if (currentDt != updateDt) {
+        checkDependentConstraints(spark, path, metadata, from, to)
+        checkDependentGeneratedColumns(spark, path, protocol, metadata, from, to)
+      }
+  }
+
   /**
    * Finds all fields that change between the current schema and the new data schema and fail if any
    * of them are referenced by check constraints or generated columns.
@@ -236,42 +334,23 @@ object ImplicitMetadataOperation {
       protocol: Protocol,
       metadata: actions.Metadata,
       dataSchema: StructType): Unit =
-    SchemaMergingUtils.transformColumns(metadata.schema, dataSchema) {
-      case (fieldPath, currentField, Some(updateField), _)
-        // This condition is actually too strict, structs may be identified as changing because one
-        // of their field is changing even though that field isn't referenced by any constraint or
-        // generated column. This is intentional to keep the check simple and robust, esp. since it
-        // aligns with the historical behavior of this check.
-        if !SchemaMergingUtils.equalsIgnoreCaseAndCompatibleNullability(
-          currentField.dataType,
-          updateField.dataType
-        ) =>
-        val columnPath = fieldPath :+ currentField.name
-        // check if the field to change is referenced by check constraints
-        val dependentConstraints =
-          Constraints.findDependentConstraints(sparkSession, columnPath, metadata)
-        if (dependentConstraints.nonEmpty) {
-          throw DeltaErrors.constraintDataTypeMismatch(
-            columnPath,
-            currentField.dataType,
-            updateField.dataType,
-            dependentConstraints
-          )
-        }
-        // check if the field to change is referenced by any generated columns
-        val dependentGenCols = SchemaUtils.findDependentGeneratedColumns(
-          sparkSession, columnPath, protocol, metadata.schema)
-        if (dependentGenCols.nonEmpty) {
-          throw DeltaErrors.generatedColumnsDataTypeMismatch(
-            columnPath,
-            currentField.dataType,
-            updateField.dataType,
-            dependentGenCols
-          )
-        }
-        // We don't transform the schema but just perform checks, the returned field won't be used
-        // anyway.
-        updateField
-      case (_, field, _, _) => field
-    }
+    SchemaMergingUtils.transformColumns(metadata.schema, dataSchema) {
+      case (fieldPath, currentField, Some(updateField), _)
+        if !SchemaMergingUtils.equalsIgnoreCaseAndCompatibleNullability(
+          currentField.dataType,
+          updateField.dataType
+        ) =>
+        checkConstraintsOrGeneratedColumnsOnStructField(
+          spark = sparkSession,
+          path = fieldPath :+ currentField.name,
+          protocol = protocol,
+          metadata = metadata,
+          currentDt = currentField.dataType,
+          updateDt = updateField.dataType
+        )
+        // We don't transform the schema but just perform checks,
+        // the returned field won't be used anyway.
+        updateField
+      case (_, field, _, _) => field
+    }
 }
```
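In short, the new check recurses through structs and only flags changed non-struct leaves, treating arrays and maps as opaque. A minimal, self-contained sketch of that traversal (not the Delta implementation; `changedLeafPaths` is a hypothetical helper for illustration):

```scala
import org.apache.spark.sql.types._

// Collect paths of non-struct fields whose type changed between two schemas.
// Structs are recursed into rather than flagged, mirroring how the patch only
// rejects changes to leaves that constraints/generated columns can reference.
// Arrays/maps are treated as opaque leaves, echoing the FIXME above: their
// expression paths (e.g. `a.arr[0].x`) don't line up with schema paths
// (e.g. `a.element.x`), so they can't be recursed into reliably.
def changedLeafPaths(
    current: StructType,
    updated: StructType,
    prefix: Seq[String] = Nil): Seq[Seq[String]] =
  current.fields.toSeq.flatMap { field =>
    updated.fields.find(_.name.equalsIgnoreCase(field.name)).toSeq.flatMap { upd =>
      (field.dataType, upd.dataType) match {
        case (s1: StructType, s2: StructType) =>
          changedLeafPaths(s1, s2, prefix :+ field.name) // recurse, never flag
        case (d1, d2) if d1 != d2 =>
          Seq(prefix :+ field.name)                      // changed leaf
        case _ => Nil
      }
    }
  }

// Widening only `a.y` surfaces the leaf path Seq("a", "y"), not `a` itself:
val before = StructType.fromDDL("a STRUCT<x: TINYINT, y: TINYINT>")
val after  = StructType.fromDDL("a STRUCT<x: TINYINT, y: INT>")
assert(changedLeafPaths(before, after) == Seq(Seq("a", "y")))
```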

spark/src/test/scala/org/apache/spark/sql/delta/GeneratedColumnSuite.scala

Lines changed: 8 additions & 16 deletions
```diff
@@ -790,22 +790,14 @@ trait GeneratedColumnSuiteBase
     createTable(table, None, "t STRUCT<a: SMALLINT, b: SMALLINT>, gen SMALLINT",
       Map("gen" -> "CAST(HASH(t.a - 10s) AS SMALLINT)"), Nil)

-    checkError(
-      exception = intercept[AnalysisException] {
-        Seq((32767.toShort, 32767)).toDF("a", "b")
-          .selectExpr("named_struct('a', a, 'b', b) as t")
-          .write.format("delta").mode("append")
-          .option("mergeSchema", "true")
-          .saveAsTable(table)
-      },
-      errorClass = "DELTA_GENERATED_COLUMNS_DATA_TYPE_MISMATCH",
-      parameters = Map(
-        "columnName" -> "t",
-        "columnType" -> "STRUCT<a: SMALLINT, b: SMALLINT>",
-        "dataType" -> "STRUCT<a: SMALLINT, b: INT>",
-        "generatedColumns" -> "gen -> CAST(HASH(t.a - 10s) AS SMALLINT)"
-      )
-    )
+    // changing the type of `t.b` should succeed since it is not being
+    // referenced by any CHECK constraints or generated columns.
+    Seq((32767.toShort, 32767)).toDF("a", "b")
+      .selectExpr("named_struct('a', a, 'b', b) as t")
+      .write.format("delta").mode("append")
+      .option("mergeSchema", "true")
+      .saveAsTable(table)
+    checkAnswer(spark.table(table), Row(Row(32767, 32767), -22677) :: Nil)
   }
 }
```
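As an aside, the `-22677` asserted above follows directly from the generation expression applied to the inserted row; a quick sanity check, assuming a running `SparkSession` named `spark`:

```scala
// gen = CAST(HASH(t.a - 10s) AS SMALLINT) with t.a = 32767S:
spark.sql("SELECT CAST(HASH(32767S - 10S) AS SMALLINT)").show() // -22677
```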

spark/src/test/scala/org/apache/spark/sql/delta/typewidening/TypeWideningConstraintsSuite.scala

Lines changed: 45 additions & 12 deletions
```diff
@@ -133,25 +133,58 @@ trait TypeWideningConstraintsTests { self: QueryTest with TypeWideningTestMixin
         },
         errorClass = "DELTA_CONSTRAINT_DATA_TYPE_MISMATCH",
         parameters = Map(
-          "columnName" -> "a",
-          "columnType" -> "STRUCT<x: TINYINT, y: TINYINT>",
-          "dataType" -> "STRUCT<x: INT, y: TINYINT>",
+          "columnName" -> "a.x",
+          "columnType" -> "TINYINT",
+          "dataType" -> "INT",
           "constraints" -> "delta.constraints.ck -> hash ( a . x ) > 0"
-        ))
+        )
+      )
+
+      // changing the type of struct field `a.y` when it's not
+      // the field referenced by the CHECK constraint is allowed.
+      sql("INSERT INTO t (a) VALUES (named_struct('x', CAST(2 AS byte), 'y', 500))")
+      checkAnswer(sql("SELECT hash(a.x) FROM t"), Seq(Row(1765031574), Row(1765031574)))
+    }
+  }
+}

-      // We're currently too strict and reject changing the type of struct field a.y even though
-      // it's not the field referenced by the CHECK constraint.
+  test("check constraint on nested field with complex type evolution") {
+    withTable("t") {
+      sql("CREATE TABLE t (a struct<x: struct<z: byte, h: byte>, y: byte>) USING DELTA")
+      sql("ALTER TABLE t ADD CONSTRAINT ck CHECK (hash(a.x.z) > 0)")
+      sql("INSERT INTO t (a) VALUES (named_struct('x', named_struct('z', 2, 'h', 3), 'y', 4))")
+      checkAnswer(sql("SELECT hash(a.x.z) FROM t"), Row(1765031574))
+
+      withSQLConf(DeltaSQLConf.DELTA_SCHEMA_AUTO_MIGRATE.key -> "true") {
       checkError(
         exception = intercept[DeltaAnalysisException] {
-          sql("INSERT INTO t (a) VALUES (named_struct('x', CAST(2 AS byte), 'y', 500))")
+          sql(
+            s"""
+               | INSERT INTO t (a) VALUES (
+               |   named_struct('x', named_struct('z', 200, 'h', 3), 'y', 4)
+               | )
+               |""".stripMargin
+          )
         },
         errorClass = "DELTA_CONSTRAINT_DATA_TYPE_MISMATCH",
         parameters = Map(
-          "columnName" -> "a",
-          "columnType" -> "STRUCT<x: TINYINT, y: TINYINT>",
-          "dataType" -> "STRUCT<x: TINYINT, y: INT>",
-          "constraints" -> "delta.constraints.ck -> hash ( a . x ) > 0"
-        ))
+          "columnName" -> "a.x.z",
+          "columnType" -> "TINYINT",
+          "dataType" -> "INT",
+          "constraints" -> "delta.constraints.ck -> hash ( a . x . z ) > 0"
+        )
+      )
+
+      // changing the types of struct fields `a.y` and `a.x.h`, which are not
+      // referenced by the CHECK constraint, is allowed.
+      sql(
+        """
+          | INSERT INTO t (a) VALUES (
+          |   named_struct('x', named_struct('z', CAST(2 AS BYTE), 'h', 2002), 'y', 1030)
+          | )
+          |""".stripMargin
+      )
+      checkAnswer(sql("SELECT hash(a.x.z) FROM t"), Seq(Row(1765031574), Row(1765031574)))
     }
   }
 }
```
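The recurring `1765031574` in these suites is simply Spark's `hash` (Murmur3 with seed 42) of the value 2; it is unchanged by widening the sibling fields, and identical across narrow integral types since Spark hashes them as INT values (again assuming a `SparkSession` named `spark`):

```scala
// The expected value before and after type widening:
spark.sql("SELECT hash(CAST(2 AS TINYINT)), hash(2)").show() // both 1765031574
```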

spark/src/test/scala/org/apache/spark/sql/delta/typewidening/TypeWideningGeneratedColumnsSuite.scala

Lines changed: 52 additions & 13 deletions
```diff
@@ -130,7 +130,7 @@ trait TypeWideningGeneratedColumnTests extends GeneratedColumnTest {
         partitionColumns = Seq.empty
       )
       sql("INSERT INTO t (a) VALUES (named_struct('x', 2, 'y', 3))")
-      checkAnswer(sql("SELECT hash(a.x) FROM t"), Row(1765031574))
+      checkAnswer(sql("SELECT gen FROM t"), Row(1765031574))

       withSQLConf(DeltaSQLConf.DELTA_SCHEMA_AUTO_MIGRATE.key -> "true") {
         checkError(
@@ -139,25 +139,64 @@ trait TypeWideningGeneratedColumnTests extends GeneratedColumnTest {
         },
         errorClass = "DELTA_GENERATED_COLUMNS_DATA_TYPE_MISMATCH",
         parameters = Map(
-          "columnName" -> "a",
-          "columnType" -> "STRUCT<x: TINYINT, y: TINYINT>",
-          "dataType" -> "STRUCT<x: INT, y: TINYINT>",
+          "columnName" -> "a.x",
+          "columnType" -> "TINYINT",
+          "dataType" -> "INT",
           "generatedColumns" -> "gen -> hash(a.x)"
-        ))
+        )
+      )

-      // We're currently too strict and reject changing the type of struct field a.y even though
-      // it's not the field referenced by the generated column.
+      // changing the type of struct field `a.y` when it's not
+      // the field referenced by the generated column is allowed.
+      sql("INSERT INTO t (a) VALUES (named_struct('x', CAST(2 AS byte), 'y', 200))")
+      checkAnswer(sql("SELECT gen FROM t"), Seq(Row(1765031574), Row(1765031574)))
+    }
+  }
+}
+
+  test("generated column on nested field with complex type evolution") {
+    withTable("t") {
+      createTable(
+        tableName = "t",
+        path = None,
+        schemaString = "a struct<x: struct<z: byte, h: byte>, y: byte>, gen int",
+        generatedColumns = Map("gen" -> "hash(a.x.z)"),
+        partitionColumns = Seq.empty
+      )
+
+      sql("INSERT INTO t (a) VALUES (named_struct('x', named_struct('z', 2, 'h', 3), 'y', 4))")
+      checkAnswer(sql("SELECT gen FROM t"), Row(1765031574))
+
+      withSQLConf(DeltaSQLConf.DELTA_SCHEMA_AUTO_MIGRATE.key -> "true") {
       checkError(
         exception = intercept[DeltaAnalysisException] {
-          sql("INSERT INTO t (a) VALUES (named_struct('x', CAST(2 AS byte), 'y', 200))")
+          sql(
+            s"""
+               | INSERT INTO t (a) VALUES (
+               |   named_struct('x', named_struct('z', 200, 'h', 3), 'y', 4)
+               | )
+               |""".stripMargin
+          )
         },
         errorClass = "DELTA_GENERATED_COLUMNS_DATA_TYPE_MISMATCH",
         parameters = Map(
-          "columnName" -> "a",
-          "columnType" -> "STRUCT<x: TINYINT, y: TINYINT>",
-          "dataType" -> "STRUCT<x: TINYINT, y: INT>",
-          "generatedColumns" -> "gen -> hash(a.x)"
-        ))
+          "columnName" -> "a.x.z",
+          "columnType" -> "TINYINT",
+          "dataType" -> "INT",
+          "generatedColumns" -> "gen -> hash(a.x.z)"
+        )
+      )
+
+      // changing the types of struct fields `a.y` and `a.x.h`, which are not
+      // referenced by the generated column, is allowed.
+      sql(
+        """
+          | INSERT INTO t (a) VALUES (
+          |   named_struct('x', named_struct('z', CAST(2 AS BYTE), 'h', 2002), 'y', 1030)
+          | )
+          |""".stripMargin
+      )
+      checkAnswer(sql("SELECT gen FROM t"), Seq(Row(1765031574), Row(1765031574)))
     }
   }
 }
```
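Both suites can be run locally with sbt, e.g. something like `build/sbt "spark/testOnly *TypeWideningConstraintsSuite *TypeWideningGeneratedColumnsSuite"` (the `spark` project name is assumed from the repo's module layout).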
