
Conversation

@WJX20 commented Nov 11, 2025

Checklist:

  • Have you added an explanation of what your changes do and why you'd like them to be included?
  • Have you updated or added documentation for the change, as applicable?
  • Have you tested your changes on all related environments with successful results, as applicable?

Type of Changes:

  • New feature
  • Bug fix
  • Documentation
  • Testing enhancement
  • Other

What Happened

I think that when there is a very large amount of data, it is not necessary to compare all of it: comparing everything wastes a lot of time, and sampling and comparing a portion of the data is usually enough.
"batch-offset-size" & "batch-compare-size":
These two settings paginate the query used to generate the hash comparison, so that only one slice of the data is compared at a time, for example rows 1001 to 2000 or rows 5001 to 10000 (see the sketch below).
"batch-check-size":
This setting is used for the check (verification) stage. During that stage, n of the previously generated hash values are selected for checking. For example, if the compare stage generates one billion hash values and all one billion of them differ, the user does not need to check every one of them; knowing that a difference exists is enough. In general, the "batch-check-size" value should be less than or equal to the "batch-compare-size" value.
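For illustration only (PostgreSQL-flavored syntax, placeholder table and column names, not pgCompare's actual generated SQL), the kind of paginated hash query these two settings are meant to produce would look roughly like this:

```sql
-- Hypothetical example: hash-compare only rows 1001-2000 of a table.
-- batch-offset-size = 1000, batch-compare-size = 1000
SELECT id                            AS pk,
       md5(concat(id, name, amount)) AS pk_hash   -- stand-in for the real per-row hash
FROM   orders                                     -- placeholder table
ORDER  BY id
LIMIT  1000 OFFSET 1000;                          -- LIMIT/OFFSET style paging
```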

Other Information:

@cbrianpace (Collaborator)

Couple of items:

  1. Review the conversation in issue 82 (solutions for data comparison between large tables #82). The recommended method for limiting the comparison to a sample is to leverage the filter setting on each table. In issue 82, ranges were used as an example; one could also use the mod function to limit, for example by setting the filter to mod(id,100)=1. This is better than an offset because an index scan can be used to selectively fetch the appropriate rows, and the same is true for ranges. With an offset, the table has to be scanned and sorted, which does not save much in terms of database resources (see the sketch after this list).
  2. If the fix is to go forward, the code must take into account how the different database platforms implement offset and limit. Which databases have you tested the solution against? Have you looked into what would be required for all of the databases supported by pgCompare (Oracle, Postgres, MySQL, MariaDB, MSSQL, DB2, Snowflake)?
  3. The if logic is checking for isEmpty. However, the values are assigned a default value, so they will never be empty.
  4. Using offset could result in the same set of rows being compared on each run rather than a shifted window (due to sorting).
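For illustration (PostgreSQL-flavored syntax, placeholder table and column names), the difference between the two approaches looks roughly like this:

```sql
-- Sampling with a mod() filter (the filter-setting approach recommended above):
-- a deterministic ~1% sample, with no need to sort and skip unwanted rows.
SELECT id, md5(concat(id, name, amount)) AS pk_hash
FROM   orders
WHERE  mod(id, 100) = 1;

-- Sampling with LIMIT/OFFSET (the approach in this PR):
-- the database must still sort the table and scan past the first
-- 1,000,000 rows before returning anything.
SELECT id, md5(concat(id, name, amount)) AS pk_hash
FROM   orders
ORDER  BY id
LIMIT  10000 OFFSET 1000000;
```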

@WJX20 (Author) commented Nov 13, 2025

The method you mentioned does not seem to be universal. Isn't it only applicable when the table has an incrementing primary key? If the database uses a "uuid" or some other non-sortable unique field as the primary key, it seems the range can no longer be determined.
Yes, what I attempted to do with your original code follows the principle of "only adding, not modifying": it simply appends a LIMIT clause to your original primary-key-based SQL query. For example, for Oracle, DB2, and MSSQL the concatenation is "sql += OFFSET + batchOffsetSize + ROWS FETCH NEXT + batchCompareSize + ROWS ONLY", while for MySQL, PostgreSQL, and Snowflake it is "sql += LIMIT + batchCompareSize + OFFSET + batchOffsetSize". You're right that writing it this way has some issues: the offset still scans all of the preceding rows.
The reason I included the if check for isEmpty is, again, the principle of only adding new code without modifying existing code: if the user does not set these values, the code I added is skipped and only your original code runs. It also prevents a "limit null offset null" error when users do not set these parameter values.
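For illustration, the concatenated SQL described above would produce queries roughly like the following (placeholder table and column names; the actual generated SQL may differ):

```sql
-- Oracle / DB2 / MSSQL style (OFFSET ... FETCH NEXT):
SELECT pk, pk_hash
FROM   source_table          -- placeholder name
ORDER  BY pk
OFFSET 1000 ROWS FETCH NEXT 1000 ROWS ONLY;

-- MySQL / PostgreSQL / Snowflake style (LIMIT ... OFFSET):
SELECT pk, pk_hash
FROM   source_table
ORDER  BY pk
LIMIT  1000 OFFSET 1000;
```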

@WJX20 (Author) commented Nov 13, 2025

I ran some simple tests and found that the efficiency of the compare stage (the stage where the hash values are generated) is actually not bad. The main problem is the check stage: if all of the data is different, it is extremely slow.

…mpty".

Remove offset; use the method of directly specifying the initial value instead.
@WJX20 (Author) commented Nov 14, 2025

Now the generated comparison SQL is "select pk, pk_hash, ... from table where 1=1 and pk > 'batch-start-size' order by pk limit 'batch-compare-size'", with the prerequisites that 'batch-start-size' >= 0 and 'batch-compare-size' > 0. Is this the correct way to write it? A rough sketch of what I mean is below.
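For illustration only (placeholder table and column names, with 'batch-start-size' and 'batch-compare-size' substituted by example values), consecutive batches would look like this:

```sql
-- First batch: batch-start-size = 0, batch-compare-size = 1000
SELECT pk, pk_hash
FROM   source_table
WHERE  1 = 1
  AND  pk > 0                -- batch-start-size
ORDER  BY pk
LIMIT  1000;                 -- batch-compare-size

-- Next batch: start from the largest pk returned by the previous batch,
-- so no preceding rows need to be scanned and skipped.
SELECT pk, pk_hash
FROM   source_table
WHERE  1 = 1
  AND  pk > 1000             -- last pk returned by the previous batch
ORDER  BY pk
LIMIT  1000;
```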

@cbrianpace (Collaborator)

The check process will always be slow, as it is doing a row-by-row comparison. If you have thousands of rows out of sync (or millions), the bigger issue is why that many rows are out of sync.
