control-service: add MeterRegistry counters for DataJobsSynchronizer #2844

Merged: mrMoZ1 merged 5 commits into main from person/mzhivkov/synchronizer-telemetry on Oct 30, 2023

mrMoZ1 commented Oct 27, 2023

what: Added telemetry counters for the DataJobsSynchronizer.

why: The counters can be used to monitor whether the synchronizeDataJobs method is executing as expected.

testing: Added a unit test.
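
As a rough illustration of the counter pattern described above, here is a minimal sketch in plain Java. It is not the code from this PR: the class name `SynchronizerTelemetrySketch`, its fields, and the `synchronize` method are hypothetical, and `AtomicLong` stands in for the Micrometer counters registered with a `MeterRegistry` so that the example runs without dependencies. It also assumes that one counter counts every invocation and the other counts invocations that throw; the actual increment points in the PR may differ.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the synchronizer; all names here are illustrative only.
class SynchronizerTelemetrySketch {
    // Assumption: in the real service these would be Micrometer counters
    // obtained from a MeterRegistry rather than AtomicLong values.
    final AtomicLong invocations = new AtomicLong();
    final AtomicLong failedInvocations = new AtomicLong();

    // Runs one synchronization pass and records its outcome.
    void synchronize(Runnable syncPass) {
        // Assumption: every call is counted, whether or not it succeeds.
        invocations.incrementAndGet();
        try {
            syncPass.run();
        } catch (RuntimeException e) {
            // A failed pass only bumps the failure counter; a real
            // implementation would likely also log the exception.
            failedInvocations.incrementAndGet();
        }
    }
}
```

With counters like these, a scheduler that stops calling the method shows up as a flat invocation counter, while a pass that keeps throwing shows up as a growing failure counter.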

mrMoZ1 and others added 2 commits October 27, 2023 02:22
antoniivanov (Contributor) commented

Can you please explain how this is going to be used?

Can you copy and paste the new metrics generated from /data-jobs/debug/prometheus?

Have you thought of adding the data job name as a tag, so it's easier to recognize which job's sync is failing?

mrMoZ1 and others added 2 commits October 27, 2023 12:31
Signed-off-by: mrMoZ1 <mzhivkov@vmware.com>

mrMoZ1 commented Oct 27, 2023

> Can you please explain how this is going to be used?
>
> Can you copy and paste the new metrics generated from /data-jobs/debug/prometheus?
>
> Have you thought of adding the data job name as a tag, so it's easier to recognize which job's sync is failing?

This will be used in a similar fashion to the DataJobExecutionCleanupMonitor. We already have per-job metrics: they can be found in the DeploymentMonitor class and, as far as I know, work with the new synchronizer through the DeploymentProgress class. The purpose of this metric is to provide visibility into successful synchronizeDataJobs() invocations, rather than to monitor individual data jobs' deployment statuses, which is already covered. A possible query might look like:
increase(datajobs_failed_synchronizer_invocations_counter[1m])
increase(datajobs_synchronizer_invocations_counter[1m])
An alerting panel could then be configured to raise an alarm when the number of invocations within a time period goes above or falls below a certain threshold.
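
To make the alerting idea concrete, the following is a simplified sketch of how a threshold check over the two counters could behave. It is only an approximation under stated assumptions: `increase` here sums the deltas between consecutive samples and treats a drop as a counter reset, without the extrapolation that Prometheus applies, and the class name, `shouldAlert`, and both thresholds are hypothetical.

```java
import java.util.List;

// Hypothetical helper; not part of the control-service or of Prometheus.
class SynchronizerAlertSketch {
    // Approximates a counter's increase across the samples of one time window.
    // A sample lower than its predecessor is treated as a counter reset,
    // so the new value itself is counted as the delta.
    static double increase(List<Double> samples) {
        double total = 0.0;
        for (int i = 1; i < samples.size(); i++) {
            double prev = samples.get(i - 1);
            double cur = samples.get(i);
            total += (cur >= prev) ? cur - prev : cur;
        }
        return total;
    }

    // Alerts when too few invocations or too many failures occurred in the window.
    // Both thresholds are assumptions chosen by whoever configures the alert.
    static boolean shouldAlert(List<Double> invocationSamples,
                               List<Double> failedSamples,
                               double minInvocations,
                               double maxFailures) {
        return increase(invocationSamples) < minInvocations
                || increase(failedSamples) > maxFailures;
    }
}
```

For example, `shouldAlert(List.of(3.0, 3.0), List.of(0.0, 0.0), 1, 0)` returns `true` because the invocation counter did not move during the window, which corresponds to the "not present" case mentioned above.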

Signed-off-by: mrMoZ1 <mzhivkov@vmware.com>
mrMoZ1 merged commit 21da2df into main on Oct 30, 2023
mrMoZ1 deleted the person/mzhivkov/synchronizer-telemetry branch on October 30, 2023 08:13