
Conversation

@WJX20 commented Nov 11, 2025

Checklist:

  • Have you added an explanation of what your changes do and why you'd like them to be included?
  • Have you updated or added documentation for the change, as applicable?
  • Have you tested your changes on all related environments with successful results, as applicable?

Type of Changes:

  • New feature
  • Bug fix
  • Documentation
  • Testing enhancement
  • Other

What Happened

I think that when there is a very large amount of data, it is not necessary to compare all of it: comparing everything wastes a lot of time, and sampling and comparing a portion of the data is usually enough.
"batch-offset-size" & "batch-compare-size":
These two settings paginate the query used to generate the hash comparison, so that only one slice of the data is compared at a time, for example rows 1001 to 2000 or rows 5001 to 10000 (see the sketch below).
"batch-check-size":
This setting is used for the check (verification) stage. During that stage, n of the previously generated hash values are selected for checking. For example, if the compare stage generates one billion hash values and all one billion of them differ, the user does not need to check every one of them; knowing that a difference exists is enough. In general, the "batch-check-size" value should be less than or equal to the "batch-compare-size" value.
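For illustration only (PostgreSQL-flavored syntax, placeholder table and column names, not pgCompare's actual generated SQL), the kind of paginated hash query these two settings are meant to produce would look roughly like this:

```sql
-- Hypothetical example: hash-compare only rows 1001-2000 of a table.
-- batch-offset-size = 1000, batch-compare-size = 1000
SELECT id                            AS pk,
       md5(concat(id, name, amount)) AS pk_hash   -- stand-in for the real per-row hash
FROM   orders                                     -- placeholder table
ORDER  BY id
LIMIT  1000 OFFSET 1000;                          -- LIMIT/OFFSET style paging
```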

Other Information:

@cbrianpace (Collaborator)

Couple of items:

  1. Review the conversation in issue 82 (solutions for data comparison between large tables #82). The recommended method for limiting the comparison to a sample is to leverage the filter setting on each table. In issue 82, ranges were used as an example; one could also use the mod function to limit, for example by setting the filter to mod(id,100)=1. This is better than an offset because an index scan can be used to selectively fetch the appropriate rows, and the same is true for ranges. With an offset, the table has to be scanned and sorted, which does not save much in terms of database resources (see the sketch after this list).
  2. If the fix is to go forward, the code must take into account how the different database platforms implement offset and limit. Which databases have you tested the solution against? Have you looked into what would be required for all of the databases supported by pgCompare (Oracle, Postgres, MySQL, MariaDB, MSSQL, DB2, Snowflake)?
  3. The if logic is checking for isEmpty. However, the values are assigned a default value, so they will never be empty.
  4. Using offset could result in the same set of rows being compared on each run rather than a shifted window (due to sorting).
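For illustration (PostgreSQL-flavored syntax, placeholder table and column names), the difference between the two approaches looks roughly like this:

```sql
-- Sampling with a mod() filter (the filter-setting approach recommended above):
-- a deterministic ~1% sample, with no need to sort and skip unwanted rows.
SELECT id, md5(concat(id, name, amount)) AS pk_hash
FROM   orders
WHERE  mod(id, 100) = 1;

-- Sampling with LIMIT/OFFSET (the approach in this PR):
-- the database must still sort the table and scan past the first
-- 1,000,000 rows before returning anything.
SELECT id, md5(concat(id, name, amount)) AS pk_hash
FROM   orders
ORDER  BY id
LIMIT  10000 OFFSET 1000000;
```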

@WJX20 (Author) commented Nov 13, 2025

The method you mentioned does not seem to be universal. Isn't it only applicable when the table has an incrementing primary key? If the database uses a "uuid" or some other non-sortable unique field as the primary key, it seems the range can no longer be determined.
Yes, what I attempted to do with your original code follows the principle of "only adding, not modifying": it simply appends a LIMIT clause to your original primary-key-based SQL query. For example, for Oracle, DB2, and MSSQL the concatenation is "sql += OFFSET + batchOffsetSize + ROWS FETCH NEXT + batchCompareSize + ROWS ONLY", while for MySQL, PostgreSQL, and Snowflake it is "sql += LIMIT + batchCompareSize + OFFSET + batchOffsetSize". You're right that writing it this way has some issues: the offset still scans all of the preceding rows.
The reason I included the if check for isEmpty is, again, the principle of only adding new code without modifying existing code: if the user does not set these values, the code I added is skipped and only your original code runs. It also prevents a "limit null offset null" error when users do not set these parameter values.
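For illustration, the concatenated SQL described above would produce queries roughly like the following (placeholder table and column names; the actual generated SQL may differ):

```sql
-- Oracle / DB2 / MSSQL style (OFFSET ... FETCH NEXT):
SELECT pk, pk_hash
FROM   source_table          -- placeholder name
ORDER  BY pk
OFFSET 1000 ROWS FETCH NEXT 1000 ROWS ONLY;

-- MySQL / PostgreSQL / Snowflake style (LIMIT ... OFFSET):
SELECT pk, pk_hash
FROM   source_table
ORDER  BY pk
LIMIT  1000 OFFSET 1000;
```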

@WJX20 (Author) commented Nov 13, 2025

I ran some simple tests and found that the efficiency of the compare stage (the stage where the hash values are generated) is actually not bad. The main problem is the check stage: if all of the data is different, it is extremely slow.

…mpty".

Remove offset; use the method of directly specifying the initial value instead.
@WJX20 (Author) commented Nov 14, 2025

Now the generated comparison SQL is "select pk, pk_hash, ... from table where 1=1 and pk > 'batch-start-size' order by pk limit 'batch-compare-size'", with the prerequisites that 'batch-start-size' >= 0 and 'batch-compare-size' > 0. Is this the correct way to write it? A rough sketch of what I mean is below.
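For illustration only (placeholder table and column names, with 'batch-start-size' and 'batch-compare-size' substituted by example values), consecutive batches would look like this:

```sql
-- First batch: batch-start-size = 0, batch-compare-size = 1000
SELECT pk, pk_hash
FROM   source_table
WHERE  1 = 1
  AND  pk > 0                -- batch-start-size
ORDER  BY pk
LIMIT  1000;                 -- batch-compare-size

-- Next batch: start from the largest pk returned by the previous batch,
-- so no preceding rows need to be scanned and skipped.
SELECT pk, pk_hash
FROM   source_table
WHERE  1 = 1
  AND  pk > 1000             -- last pk returned by the previous batch
ORDER  BY pk
LIMIT  1000;
```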

@cbrianpace (Collaborator)

The check process will always be slow, as it is doing a row-by-row comparison. If you have thousands of rows out of sync (or millions), the bigger issue is why that many rows are out of sync.
