
Commit 56a4ea3
Merge pull request #25 from cbrianpace/main
Version 0.3.0
2 parents af66949 + dfd6521


54 files changed: +3392 −1225 lines

.gitignore

Lines changed: 3 additions & 1 deletion

@@ -1,4 +1,6 @@
 confero.properties
 target/*
 database/local_test.sql
-.DS_Store
+.DS_Store
+._*
+test_local.txt

.idea/compiler.xml

Lines changed: 9 additions & 0 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 60 additions & 98 deletions
@@ -31,7 +31,7 @@ Before initiating the build and installation process, ensure the following prere
 1. Java version 21 or higher.
 2. Maven 3.9 or higher.
 3. Postgres version 15 or higher (to use for the pgCompare Data Compare repository).
-4. Necessary JDBC drivers (Postgres, MySQL, MSSQL and Oracle currently supported).
+4. Necessary JDBC drivers (DB2, Postgres, MySQL, MSSQL and Oracle currently supported).
 5. Postgres connections must not go through a connection pooler like pgBouncer.
 
 ## Limitations
@@ -42,6 +42,21 @@ The following are current limitations of the compare utility:
 2. Unsupported data types: blob, long, longraw, bytea.
 3. Limitations with data type boolean when performing cross-platform compare.
 
+## Upgrading
+
+### Version 0.3.0
+
+#### Enhancements / Fixes
+
+- Added support for DB2.
+- Case-sensitive table and column names are now supported.
+- Replaced the JSON object used for column mapping with new tables for easier management.
+- Added Projects, allowing multiple configurations to be stored in the repository instead of managing multiple properties files.
+
+#### Upgrading to 0.3.0
+
+Due to the changes required for the repository, the **repository must be dropped and recreated** to upgrade to version 0.3.0.
+
 ## Compile
 Once the prerequisites are met, begin by forking the repository and cloning it to your host machine:
 
@@ -63,6 +78,8 @@ Copy the `pgcompare.properties.sample` file to pgcompare.properties and define t
 
 By default, the application looks for the properties file in the execution directory. Use the PGCOMPARE_CONFIG environment variable to override the default and point to a file in a different location.
 
+At a minimum, the repo-xxxxx parameters are required in the properties file (or specified by environment variables). Besides the properties file and environment variables, settings may also be stored in the `project_config` column of the `dc_project` table in JSON format ({"parameter": "value"}).
+
 ## Configure Repository Database
 
 pgCompare requires a hosted Postgres repository. To configure, connect to a Postgres database and execute the provided pgCompare.sql script in the database directory. The repository may also be created using the `--init` flag.
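As a sketch of the `dc_project` option described above, settings could be stored as JSON in `project_config`. The `pid = 1` predicate assumes the default project row is keyed that way, based on the Projects description in this commit; treat the exact shape as an assumption:

```sql
-- Hypothetical sketch: store two system properties for the default
-- project (pid = 1 is assumed) as JSON in dc_project.project_config.
UPDATE dc_project
   SET project_config = '{"batch-fetch-size": "2000", "loader-threads": "4"}'
 WHERE pid = 1;
```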
@@ -79,45 +96,47 @@ In the database directory are two scripts to deploy a sample table (EMP) to Orac
 
 ## Defining Table Mapping
 
-The initial step involves defining a set of tables to compare, achieved by inserting rows into the `dc_table` within the pgCompare repository.
-
-dc_table:
-- source_schema: Schema/user that owns the table on the source database.
-- source_table: Table name on the source database.
-- target_schema: Schema/user that owns the table on the target database.
-- target_table: Table name on the target database.
-- table_filter: Specify a valid predicate that would be used in the where clause of a select sql statement.
-- parallel_degree: Data can be compared by splitting up the work among many threads. The parallel_degree determines the number of threads. To use parallel threads, the mod_column value must be specified.
-- status: Expected values are 'disabled', which is the default, and 'ready'.
-- batch_nbr: Tables can be grouped into batches and compare jobs executed a batch, or grouping of tables.
-- mod_column: Used in conjunction with the parallel_degree. The work is divided up among the threads using a mod of the specified column. Therefore, the value entered must be a single column with a numeric data type.
-- column_map: Used to review or override column mapping used by compare functions. See Column Map section for more details.
+The initial step involves defining a set of tables to compare, achieved by inserting rows into the `dc_table` and `dc_table_map` tables in the pgCompare repository. This is best done using the automated process below.
 
 ### Automated Table Registry
 
-Use pgCompare to perform a discovery against the target database and populate the dc_table with the results using the following command (where hr is the schema to be scanned).
+Use pgCompare to perform a discovery against the target database and populate the dc_table with the results using the following command. The schemas specified in the properties file will be used for the discovery process.
 
 ```shell
-java -jar pgcompare.jar --discovery hr
+java -jar pgcompare.jar --discover
 ```
 
-After automatic table registry, if there are tables that are case sensistive, those table names will need to be modified in the dc_table.source_table and dc_table.target_table columns as approriate.
-
 ### Manual Table Registry
 
-Example of loading a row into `dc_table`:
+Example of loading a row into `dc_table` and `dc_table_map`:
 
 ```sql
-INSERT INTO dc_table (source_schema, source_table, target_schema, target_table, parallel_degree, status, batch_nbr)
-VALUES ('hr','emp','hr','emp',1,'ready',1);
+INSERT INTO dc_table (table_alias)
+VALUES ('emp');
+
+INSERT INTO dc_table_map (tid, dest_type, schema_name, table_name)
+VALUES (1, 'source', 'hr', 'emp');
+
+INSERT INTO dc_table_map (tid, dest_type, schema_name, table_name)
+VALUES (1, 'target', 'HR', 'EMP');
+```
+
+After populating the list of tables, run the following to automatically map columns.
+
+```shell
+java -jar pgcompare.jar --batch=0 --maponly
 ```
 
+### Projects
+
+Projects allow the repository to maintain different mappings for different compare objectives, so a central pgCompare repository can be used for multiple compare projects. Each table has a `pid` column, which is the project id. If no project is specified, the default project (pid = 1) is used.
+
 ## Perform Data Compare
 
 With the table mapping defined, execute the comparison and provide the mandatory batch command line argument:
 
 ```shell
-java -jar pgcompare.jar --batch=0
+java -jar pgcompare.jar --batch 0
 ```
 
 Using a batch value of 0 will execute the action for all batches. The batch number may also be specified using the environment variable PGCOMPARE-BATCH. The default value for batch number is 0 (all batches).
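To illustrate the Projects idea, a hypothetical query against the repository could list registered tables per project. Only `pid`, `tid`, and `table_alias` appear in this commit, so treat the exact column set as an assumption:

```sql
-- Hypothetical sketch: list registered tables grouped by project;
-- pid = 1 is the default project.
SELECT pid, tid, table_alias
  FROM dc_table
 ORDER BY pid, table_alias;
```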
@@ -127,7 +146,7 @@ Using a batch value of 0 will execute the action for all batches. The batch num
 If discrepancies are detected, run the comparison with the 'check' option:
 
 ```shell
-java -jar pgcompare.jar --batch=0 --check
+java -jar pgcompare.jar --batch 0 --check
 ```
 
 This recheck process is useful when transactions may be in flight during the initial comparison. The recheck only checks the rows that have been flagged with a discrepancy. If the rows still do not match, details will be reported. Otherwise, the rows will be cleared and marked in-sync.
@@ -136,83 +155,14 @@ This recheck process is useful when transactions may be in flight during the ini
 
 ## Column Map
 
-The system will automatically generate a column mapping during the first execution on a table. This column mapping will be stored in the column_map column of the dc_table repository table. The column mapping is stored in the form of a JSON object. This mapping can be performed ahead of time or the generated mapping modified as needed. If a column map is present in column_map, the program will not perform a remap.
+The system will automatically generate a column mapping during the first execution on a table. This column mapping will be stored in the `dc_table_column` and `dc_table_column_map` repository tables. This mapping can be performed ahead of time, or the generated mapping modified as needed. If a column mapping is present, the program will not perform a remap unless instructed to with the `maponly` flag.
 
 To create or overwrite current column mappings, execute the following:
 
 ```shell
-java -jar pgcompare.jar --batch=0 --maponly
+java -jar pgcompare.jar --batch 0 --maponly
 ```
 
-### JSON Mapping Object
-
-Below is a sample of a column mapping.
-
-```json
-{
-  "columns": [
-    {
-      "alias": "cola",
-      "source": {
-        "dataType": "char",
-        "nullable": true,
-        "dataClass": "char",
-        "dataScale": 22,
-        "supported": true,
-        "columnName": "cola",
-        "dataLength": 2,
-        "primaryKey": false,
-        "dataPrecision": 44,
-        "valueExpression": "nvl(trim(col_char_2),' ')"
-      },
-      "status": "compare",
-      "target": {
-        "dataType": "bpchar",
-        "nullable": false,
-        "dataClass": "char",
-        "dataScale": 22,
-        "supported": true,
-        "columnName": "cola",
-        "dataLength": 2,
-        "primaryKey": false,
-        "dataPrecision": 44,
-        "valueExpression": "coalesce(col_char_2::text,' ')"
-      }
-    },
-    {
-      "alias": "id",
-      "source": {
-        "dataType": "number",
-        "nullable": false,
-        "dataClass": "numeric",
-        "dataScale": 0,
-        "supported": true,
-        "columnName": "id",
-        "dataLength": 22,
-        "primaryKey": true,
-        "dataPrecision": 8,
-        "valueExpression": "lower(nvl(trim(to_char(id,'0.9999999999EEEE')),' '))"
-      },
-      "status": "compare",
-      "target": {
-        "dataType": "int4",
-        "nullable": true,
-        "dataClass": "numeric",
-        "dataScale": 0,
-        "supported": true,
-        "columnName": "id",
-        "dataLength": 32,
-        "primaryKey": true,
-        "dataPrecision": 32,
-        "valueExpression": "coalesce(trim(to_char(id,'0.9999999999EEEE')),' ')"
-      }
-    }
-  ]
-}
-```
-
-Only Primary Key columns and columns with a status equal to 'compare' will be included in the final data compare.
-
 ## Properties
 
 Properties are categorized into four sections: system, repository, source, and target. Each section has specific properties, as described in detail in the documentation. The properties can be specified via a configuration file, environment variables, or a combination of both. To use environment variables, the environment variable will be the name of the property in upper case, prefixed with "PGCOMPARE-". For example, batch-fetch-size can be set by using the environment variable PGCOMPARE-BATCH-FETCH-SIZE.
@@ -221,13 +171,15 @@ Properties are categorized into four sections: system, repository, source, and t
 - batch-fetch-size: Sets the fetch size for retrieving rows from the source or target database.
 - batch-commit-size: The commit size controls the array size and number of rows concurrently inserted into the dc_source/dc_target staging tables.
 - batch-progress-report-size: Defines the number of rows used in mod to report progress.
+- database-source: Determines whether the rows are sorted by primary key on the source/target database. If set to true, the default, the rows will be sorted before being compared. If set to false, the sorting will take place in the repository database.
 - loader-threads: Sets the number of threads to load data into the temporary tables. Default is 4. Set to 0 to disable loader threads.
+- log-level: Controls the amount of log messages written to the log destination.
+- log-destination: Location where log messages will be written. Default is stdout.
 - message-queue-size: Size of message queue used by loader threads (nbr messages). Default is 100.
 - number-cast: Defines how numbers are cast for hash function (notation|standard). Default is notation (for scientific notation).
 - observer-throttle: Set to true or false, instructs the loader threads to pause and wait for the observer thread to catch up before continuing to load more data into the staging tables.
 - observer-throttle-size: Number of rows loaded before the loader thread will sleep and wait for clearance from the observer thread.
 - observer-vacuum: Set to true or false, instructs the observer whether to perform a vacuum on the staging tables during checkpoints.
-- stage-table-parallel: Sets the number of parallel workers for the temporary staging tables. Default is 0.
 
 ### Repository
 - repo-dbname: Repository database name.
@@ -243,9 +195,9 @@ Properties are categorized into four sections: system, repository, source, and t
 - source-database-hash: True or false, instructs the application where the hash should be computed (on the database or by the application).
 - source-dbname: Database or service name.
 - source-host: Database server name.
-- source-name: User defined name for the source.
 - source-password: Database password.
 - source-port: Database port.
+- source-schema: Name of schema that owns the tables.
 - source-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)
 - source-type: Database type: oracle, postgres
 - source-user: Database username.
@@ -255,13 +207,23 @@ Properties are categorized into four sections: system, repository, source, and t
 - target-database-hash: True or false, instructs the application where the hash should be computed (on the database or by the application).
 - target-dbname: Database or service name.
 - target-host: Database server name.
-- target-name: User defined name for the target.
 - target-password: Database password.
 - target-port: Database port.
+- target-schema: Name of schema that owns the tables.
 - target-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)
 - target-type: Database type: oracle, postgres
 - target-user: Database username.
 
+## Property Precedence
+
+The system contains default values for every parameter. These can be overridden using environment variables, a properties file, or values saved in the `dc_project` table. The following is the order of precedence used:
+
+- Default values
+- Properties file
+- Environment variables
+- Settings stored in `dc_project` table
+
+
 # Data Compare Concepts
 
 pgCompare stores a hash representation of primary key columns and other table columns, reducing row size and storage demands. The utility optimizes network traffic and speeds up the process by using hash functions when comparing similar platforms.
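The hashing concept can be sketched with a hypothetical Postgres query. The table and column names here are invented for illustration; this is not pgCompare's internal SQL:

```sql
-- Sketch: reduce each row to its primary key plus one hash over the
-- remaining columns, so only compact hashes need to be compared.
SELECT id,
       md5(concat_ws('|', name, salary::text)) AS column_hash
  FROM hr.emp;
```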
@@ -282,7 +244,7 @@ A summary is printed at the end of each run. To view results at a later time, t
 
 ```sql
 WITH mr AS (SELECT max(rid) rid FROM dc_result)
-SELECT compare_dt, table_name, status, source_cnt total_cnt, equal_cnt, not_equal_cnt, missing_source_cnt+missing_target_cnt missing_cnt
+SELECT compare_start, table_name, status, source_cnt total_cnt, equal_cnt, not_equal_cnt, missing_source_cnt+missing_target_cnt missing_cnt
 FROM dc_result r
 JOIN mr ON (mr.rid=r.rid)
 ORDER BY table_name;
