
Commit f53d0e6

specs: VEP-2272 Complete Data Job Configuration Persistence Part 2 (#2302)

This change introduces VEP-2272, which proposes an improvement to Versatile Data Kit by switching the source of truth for data job deployment configuration from Kubernetes to a database.

Signed-off-by: Miroslav Ivanov <miroslavi@vmware.com>

1 parent ae69956 commit f53d0e6

File tree

3 files changed: +167 −67 lines changed

specs/vep-2272-complete-data-job-configuration-persistence/README.md

Lines changed: 123 additions & 67 deletions
# VEP-2272: Complete Data Job Configuration Persistence

* **Author(s):** Miroslav Ivanov (miroslavi@vmware.com)
* **Status:** implementable

- [Summary](#summary)
- [Glossary](#glossary)
…and provide a seamless experience for our users.
1. Implement automatic data job restore capability in case of the loss of a Kubernetes namespace, whereby the Control
   Service automatically retrieves the configuration from the database and seamlessly restores the data jobs.
2. Maintain a record of past deployments for each data job.
## High-level design

[…]

There would be no changes to the public APIs.
## Detailed design

### Kubernetes Cron Jobs Synchronizer

To enhance data integrity and synchronization between the database and Kubernetes, an asynchronous process will be
implemented. This process will leverage Spring Scheduler and [Scheduler Lock](https://github.com/lukas-krecan/ShedLock),
similar to
[DataJobMonitorCron.watchJobs()](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/monitoring/DataJobMonitorCron.java#L84).
Scheduler Lock provides an active-passive pattern across the Control Service instances: only one instance of the
Control Service will be able to synchronize Cron Jobs at any given time. The process will be executed with a fixed
period between invocations; the period will be configurable, and a suitable default will be determined during
implementation.
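The active-passive behaviour can be illustrated with a minimal Python sketch that emulates ShedLock's lock-until semantics with an in-memory lock table (the lock name, timings, and helper names are illustrative assumptions; the actual Control Service would use ShedLock's Java lock providers backed by the database):

```python
import threading
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for ShedLock's lock table: maps a lock name
# to the time until which the lock is held ("lock_until" in ShedLock terms).
_lock_table = {}
_table_guard = threading.Lock()

def try_acquire(lock_name: str, lock_for: timedelta) -> bool:
    """Atomically acquire the named lock if it is free or has expired."""
    now = datetime.now()
    with _table_guard:
        held_until = _lock_table.get(lock_name)
        if held_until is not None and held_until > now:
            return False  # another instance currently holds the lock
        _lock_table[lock_name] = now + lock_for
        return True

def synchronize_cron_jobs(instance_id: str, executions: list) -> None:
    """Runs on a schedule in every instance, but only the lock holder works."""
    if try_acquire("cron-job-synchronizer", timedelta(seconds=30)):
        executions.append(instance_id)  # placeholder for the real sync logic

# Two Control Service "instances" fire their scheduled task at the same time;
# only the first to grab the lock actually synchronizes.
executions = []
for instance in ("control-service-0", "control-service-1"):
    synchronize_cron_jobs(instance, executions)

print(executions)  # → ['control-service-0']
```

The fixed period between invocations would map to Spring's `fixedDelay`, and the lock duration to ShedLock's `lockAtMostFor`.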
The overall purpose of this process is to iterate over the data job deployment configurations stored in the database
and determine whether the corresponding Cron Job in Kubernetes is up to date. If a Cron Job is not synchronized with
the database, the process will initiate a new data job deployment, which may involve tasks such as building an image
or updating the Cron Job template.

To determine whether a Cron Job needs to be updated, the process will rely on the hash
code (`deployment_version_sha`) of the Cron Job YAML configuration, which will be stored in the database after each
deployment.

By introducing this asynchronous process, the database and Kubernetes remain in sync, maintaining data integrity and
enabling smooth operation of the system.
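The hash-based check can be sketched as follows (a Python illustration; the hash algorithm and function names are assumptions, as the real logic would live in the Java Control Service): a `deployment_version_sha` is recorded at deployment time and later compared against the live Cron Job configuration.

```python
import hashlib

def deployment_version_sha(cron_job_yaml: str) -> str:
    """Hash of the rendered Cron Job configuration, stored after each deployment."""
    return hashlib.sha1(cron_job_yaml.encode("utf-8")).hexdigest()

def needs_redeploy(stored_sha: str, live_cron_job_yaml: str) -> bool:
    """A Cron Job is out of sync when its current hash differs from the stored one."""
    return deployment_version_sha(live_cron_job_yaml) != stored_sha

# Hypothetical example: the database records the sha at deployment time ...
deployed_yaml = "schedule: '*/5 * * * *'\nimage: job:1.0\n"
stored = deployment_version_sha(deployed_yaml)

# ... later, the synchronizer compares it against the live Cron Job.
print(needs_redeploy(stored, deployed_yaml))                        # → False (in sync)
print(needs_redeploy(stored, deployed_yaml.replace("1.0", "1.1")))  # → True (drifted)
```

When `needs_redeploy` returns `True`, the synchronizer would trigger a new data job deployment for that job.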
### Database data model changes
To ensure comprehensive synchronization between Kubernetes and the database, all relevant data job configuration
properties must be replicated from Kubernetes to the database. The properties to replicate are:

* `git_commit_sha`
* `vdk_version`
* `python_version`
* `cpu_request`
* `cpu_limit`
* `memory_request`
* `memory_limit`
* `deployed_by`
* `deployed_date`
* `deployment_version_sha`

The columns above were identified by comparing the properties of the current Cron Job with the existing columns in the
database.

To accommodate the replication of these additional data job configuration properties, a new table called
`data_job_deployment` will be added to the existing database model. It will contain the columns listed above, and the
relationship between the `data_job` and `data_job_deployment` tables will be one-to-one. This approach normalizes the
database model and makes it suitable for accommodating multiple deployments per data job in the future.

With these database model changes, the system can effectively store and synchronize all the required data job
configuration details between Kubernetes and the database.

![database_data_model.png](database_data_model.png)
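
For illustration, a `data_job_deployment` record could be modelled as below (a Python sketch; the field names follow the diagram, while the types and sample values are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DataJobDeployment:
    """Illustrative mirror of the proposed data_job_deployment table."""
    data_job_name: str          # primary key, one-to-one with data_job
    deployment_version_sha: str
    vdk_version: str
    python_version: str
    git_commit_sha: str
    deployed_by: str
    deployed_date: datetime
    cpu_request: str
    cpu_limit: str
    memory_request: str
    memory_limit: str

# Hypothetical row, as the synchronizer might persist it after a deployment.
deployment = DataJobDeployment(
    data_job_name="example-job",
    deployment_version_sha="0" * 40,
    vdk_version="0.1.2",
    python_version="3.9",
    git_commit_sha="abc1234",
    deployed_by="example-user",
    deployed_date=datetime(2023, 6, 1),
    cpu_request="500m",
    cpu_limit="1000m",
    memory_request="512Mi",
    memory_limit="1Gi",
)
print(deployment.data_job_name)  # → example-job
```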

### DataJobsDeployment API and GraphQL API

To further enhance performance and reduce the load on the Kubernetes API, part of the redesign will involve reworking
the DataJobsDeployment and GraphQL APIs to interact solely with the database. This streamlines communication and makes
the database the primary source of data job configuration information.

The existing APIs, which previously interacted directly with Kubernetes for data job deployment operations, will be
refactored to retrieve and update data job deployment configurations exclusively from the database. This improves
performance, as the APIs will no longer need to make frequent and potentially resource-intensive calls to the
Kubernetes API server.

By relying on the database as the central source of truth for data job configurations, the system can achieve faster
response times and reduce the overall load on the Kubernetes API. The APIs become more lightweight, leveraging the
optimized data access and querying capabilities of the database, which provides a smoother and more efficient
experience for users managing data job deployments.

### Performance

Initial performance measurements already indicate the benefits of reworking the APIs to interact solely with the
database. The tests involved two GraphQL queries (pageSize=50): one including deployments and one without.

The average response time of the GraphQL query that includes deployments is approximately 600 ms. This query retrieves
data job configurations from the database and makes an additional call to the Kubernetes API for deployment-related
information. Prior to the rework, this query relied on Kubernetes for deployment details, resulting in longer response
times.

In contrast, the GraphQL query without deployments, which relies solely on the database, shows an average response
time of approximately 400 ms. It benefits from direct interaction with the database, eliminating the additional calls
to the Kubernetes API.

Based on these preliminary measurements, the proposed API rework brings notable performance gains: a drop from roughly
600 ms to 400 ms corresponds to a reduction of (600 - 400) / 600, or about 33%. On average, response times are
expected to improve by approximately 30-35% when deployments are not involved, allowing for faster retrieval of data
job configurations.

These measurements provide a preliminary view of the potential gains. Further testing and profiling with a larger and
more diverse dataset will be required to obtain more accurate and comprehensive performance insights.

### Troubleshooting

* Possible failure modes:
  * If the Control Service Pod restarts while the Cron Jobs are being synchronized, another Control Service Pod will
    take over and continue the synchronization process.
  * If the database stops working, the deployment configurations for data jobs will not be accessible through the
    public APIs, and the synchronization of Cron Jobs will be put on hold until the database is functioning again,
    then resumed.
  * If the Kubernetes API server stops working while the Cron Jobs are being synchronized, the active Control Service
    Pod will delay the synchronization until the API server is back online, then resume it.
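
The pause-and-resume behaviour above can be sketched as a loop that skips a synchronization cycle when a dependency is unavailable and resumes on a later tick (a simplified Python illustration; the real process is the scheduled synchronizer in the Control Service, and the function names are assumptions):

```python
def run_sync_cycles(availability, log):
    """availability: per-tick (database_up, kubernetes_up) flags.

    Each tick represents one scheduled invocation of the synchronizer; a cycle
    that cannot reach the database or the Kubernetes API is skipped entirely.
    """
    for tick, (db_up, k8s_up) in enumerate(availability):
        if not db_up:
            log.append((tick, "skipped: database unavailable"))
        elif not k8s_up:
            log.append((tick, "skipped: kubernetes api unavailable"))
        else:
            log.append((tick, "synchronized"))

log = []
run_sync_cycles([(True, True), (False, True), (True, False), (True, True)], log)
print(log[-1])  # → (3, 'synchronized')
```

Because the state lives in the database, no work is lost when a cycle is skipped; the next successful cycle simply reconciles everything that drifted in the meantime.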
## Alternatives
An alternative solution is to leverage a Kubernetes Operator for Cron Jobs management. Although an Operator-based
approach could provide a more robust implementation, it would require a significant amount of additional
implementation time.

Utilizing a Kubernetes Operator entails designing and implementing a custom controller that extends the Kubernetes API
to manage the Cron Jobs lifecycle. This allows for more declarative and automated management of data jobs, handling
tasks such as provisioning.

While the Operator-based solution offers advantages such as built-in reconciliation loops and event-driven actions, it
typically requires more effort and time to develop and deploy: knowledge of operators, substantial architectural
changes, and an additional component that must be deployed and operated. Compared to the Operator-based approach, the
solution described in this VEP will be easier and faster to implement, since it builds on the existing Control Service
component and does not require implementing and managing additional components.
database_data_model.png (binary file, 27.1 KB)
Lines changed: 44 additions & 0 deletions

@startuml
entity data_job #silver {
  **name** primary key
  team
  description
  schedule
  enabled
  name_deprecated
  generate_keytab
  enable_execution_notifications
  notified_on_job_failure_user_error
  notified_on_job_failure_platform_error
  notified_on_job_success
  notified_on_job_deploy
  notification_delay_period_minutes
  last_execution_status
  last_execution_end_time
  last_execution_duration
  latest_job_deployment_status
  latest_job_termination_status
  latest_job_execution_id
}

entity data_job_deployment #LightSkyBlue {
  **data_job_name** primary key
  deployment_version_sha
  vdk_version
  python_version
  git_commit_sha
  deployed_by
  deployed_date
  cpu_request
  cpu_limit
  memory_request
  memory_limit
}

data_job ||..|| data_job_deployment

legend bottom left
| <#LightSkyBlue> new table |
| <#silver> existing table |
endlegend
@enduml
