solutions for data comparison between large tables #90
Conversation
"batch-compare-size", "batch-offset-size", "batch-check-size"
Added configuration options for batch processing.
Clarified the descriptions for batch-offset-size and batch-compare-size configurations in the README.
A couple of items:
The method you mentioned doesn't seem to be universal. Is it only applicable when the table has an incrementing primary key? If the table uses a "uuid" or another non-sequential unique field as its primary key, it seems the range can no longer be determined.
I ran some simple tests and found that the "compare" stage (where the hash values are generated) is actually not that slow. The main problem is the "check" stage: if all the rows are different, it becomes extremely slow.
…mpty". Remove offset,Using the method of directly specifying the initial value
Now the generated comparison SQL is "select pk, pk_hash ... from table where 1=1 and pk > 'batch-start-size' order by pk limit 'batch-compare-size'", with the prerequisite that 'batch-start-size' >= 0 and 'batch-compare-size' > 0. Is this the correct way to write it?
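For reference, a minimal sketch of that query with concrete values substituted; the table name `t`, its columns, and the MD5 hash expression are illustrative placeholders, and 1000 / 500 stand in for 'batch-start-size' and 'batch-compare-size':

```sql
-- Minimal sketch, not the tool's actual SQL: `t`, its columns, and the
-- hash expression are placeholders.
SELECT pk,
       MD5(CONCAT_WS(',', col1, col2, col3)) AS pk_hash
FROM t
WHERE 1 = 1
  AND pk > 1000            -- 'batch-start-size': resume after the last compared key
ORDER BY pk
LIMIT 500;                 -- 'batch-compare-size': rows hashed in this batch
```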
The check process will always be slow because it does a row-by-row comparison. If you have thousands (or millions) of rows out of sync, the bigger question is why that many rows are out of sync.
Checklist:
Type of Changes:
What Happened
I think that when there is too much data, it is not necessary to compare all of it: on the one hand, doing so wastes a lot of time; on the other hand, sampling and comparing a portion of the data is usually sufficient.
"batch-offset-size" & "batch-compare-size":
These two configurations paginate the data queried when generating the hash comparison, so that, for instance, only rows 1001-2000 or 5001-10000 are compared (see the first SQL sketch below).
"batch-check-size":
This configuration limits "check verification": in each round, n of the pre-generated hash values are selected for verification. For example, if "compare" generates one billion hash values and all of them differ, the user does not need to check every one of them; they only need to know that a difference exists. In general, "batch-check-size" should be less than or equal to "batch-compare-size" (see the second SQL sketch below).
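A rough illustration of how "batch-offset-size" and "batch-compare-size" could map to a paginated hash query, assuming a hypothetical table `t`; here both are set to 1000 so only rows 1001-2000 are hashed:

```sql
-- Illustrative only: hash rows 1001-2000 of a hypothetical table `t`.
SELECT pk,
       MD5(CONCAT_WS(',', col1, col2, col3)) AS pk_hash
FROM t
ORDER BY pk
LIMIT 1000            -- "batch-compare-size"
OFFSET 1000;          -- "batch-offset-size"
```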
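And a sketch of how "batch-check-size" could cap the check stage, assuming the compare stage has written the keys whose hashes differed into a hypothetical work table `mismatched_pk`; only the first 100 of them are re-read for row-by-row verification:

```sql
-- Illustrative only: verify at most "batch-check-size" (here 100) mismatched rows.
SELECT t.*
FROM t
JOIN (
    SELECT pk
    FROM mismatched_pk      -- hypothetical table holding keys whose hashes differed
    ORDER BY pk
    LIMIT 100               -- "batch-check-size"
) m ON m.pk = t.pk;
```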
Other Information: