This project allocates billing usage of Databricks “Shared” SQL Warehouses to individual users and their respective organizational entities (e.g. cost centers, departments, business units) and saves the results as a Delta table.
Many organizations would like to adopt a "cost center" approach by invoicing DBU consumption against individual users, teams, business units/departments, or cost centers. Currently, Databricks provides DBU consumption data only for the entire cluster, lacking the granularity needed for more detailed billing. This is especially important for data warehousing use cases, where SQL Warehouses are often shared across different organizational entities to unify cluster configuration and maximize utilization.
An alternative solution is to create separate clusters for each department. However, this requires reconfiguration of the existing applications (e.g. Power BI) and does not provide per-user cost allocation.
For questions, troubleshooting or bug fixes, please submit an issue. See contributing instructions to help improve this project.
- Granular Cost Calculator for Databricks Shared Clusters
- Motivation
- Prerequisites
- Installation
- Design
- Contribution
- Limitations
- Future Work
In order to deploy the project, the following is required:
- Python 3.10 or later. See instructions.
- Databricks CLI v0.213 or later. See instructions.
- At least one Unity Catalog-enabled workspace in your Databricks account.
- Access to system tables
The project uses Databricks Assets Bundles (DAB) for deployment.
Once the project is deployed, the following Jobs are created in your Databricks Workspace:
- Granular billing usage: 1. create tables
- Granular billing usage: 2. create alerts and dashboard
- Granular billing usage: 3. calculate
- Granular billing usage: 4. define budget
- Granular billing usage: 5. fetch user info
Deploy and execute them in the given order.
- Once you have installed the Databricks CLI, authenticate your current machine to a Databricks Workspace:

  `databricks auth login --host <WORKSPACE_HOST>`

  To enable debug logs, simply add the `--debug` flag to any command. More about authentication options here.
- Update the job templates in resources to configure parameters like the output catalog, output schema, etc.
- Update the bundle configuration under databricks.yml to provide the workspace host and the user name to use for the deployment.
- Validate the bundle configuration:

  `$ databricks bundle validate --target dev`

  Note that "dev" is the default target, so the `--target` parameter is optional here.

- Deploy a development copy of this project to the workspace:

  `$ databricks bundle deploy --target dev`
- Similarly, if you want to deploy a production copy of this project to the workspace:

  `$ databricks bundle deploy --target prod`
- Run the jobs:

  a) Deploy tables:

  `$ databricks bundle run create_tables_job --target dev`

  b) Deploy Databricks SQL Alerts and the Lakeview dashboard:

  `$ databricks bundle run create_dbsql_objects_job --target dev`

  c) Run the calculation:

  `$ databricks bundle run calculate_job --target dev`

  d) Define budget limits. Provide budget spending limits in the `src/define_budget.sql` file, then run:

  `$ databricks bundle run define_budget_job --target dev`

  e) Fetch user info from your Identity Provider (IdP). Provide auth credentials in `src/fetch_user_info_from_entra_id.py`, then run:

  `$ databricks bundle run fetch_user_info_job --target dev`

  If you want to skip this step and pre-populate the `user_info` table with sample data for demo purposes, import and run this notebook. You can use it to quickly get started before fetching the user info from your IdP.
The solution provides granular cost allocation for queries run by users on "Shared" SQL Warehouses. It uses several system tables to perform the calculation and writes the granular cost allocation to a Delta table.
For the cost calculation, the following system tables are required:
- Billing usage system table (`system.billing.usage`) containing DBU consumption of SQL Warehouses per day from all UC-enabled workspaces.
- Query History system table (`system.query.history`) containing the query history of SQL Warehouses from all UC-enabled workspaces. Currently in Private Preview.
- List Prices system table (`system.billing.list_prices`) containing a historical log of SKU pricing. Should be replaced by `account_prices` in the future to get customer-specific SKU pricing.
- Cloud Infra Cost system table (`system.billing.cloud_infra_cost`) containing cloud costs (VM and cloud storage). Currently in Private Preview.
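For illustration, a query along the following lines can derive the daily DBU consumption and list-price cost per SQL Warehouse from the billing and pricing tables. This is a minimal sketch, not the project's actual calculation; the column names (`usage_date`, `usage_quantity`, `usage_metadata.warehouse_id`, `pricing.default`) follow the public system table documentation and should be verified against your workspace.

```sql
-- Sketch only: daily DBU consumption and list-price cost per SQL Warehouse.
-- Column names follow the documented system table schemas and may differ in your workspace.
SELECT
  u.usage_date,
  u.workspace_id,
  u.usage_metadata.warehouse_id               AS warehouse_id,
  SUM(u.usage_quantity)                       AS dbu,
  SUM(u.usage_quantity * p.pricing.default)   AS dbu_cost,
  MAX(p.currency_code)                        AS currency_code
FROM system.billing.usage u
JOIN system.billing.list_prices p
  ON  u.sku_name = p.sku_name
  AND u.cloud = p.cloud
  AND u.usage_start_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_metadata.warehouse_id IS NOT NULL   -- SQL Warehouse usage only
GROUP BY u.usage_date, u.workspace_id, u.usage_metadata.warehouse_id;
```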
For the dashboard and alerts, the following tables are required (deployed as part of the `create_tables_job` job):
- User Info table (`user_info`) contains the mapping of users to cost centers and departments. This table is populated by the `fetch_user_info_job` job using extension attributes from your IdP.
| Column | Type | Description | Example data | Constraints |
|---|---|---|---|---|
| user_name | string | User name in the form of an email address | [email protected] | PK |
| user_id | string | Databricks user id | 2770987888128771 | |
| display_name | string | User display name | Marcin Wojtyczka | |
| organizational_entity_name | string | Organizational entity name | department, cost center, team | PK |
| organizational_entity_value | string | Organizational entity value | R&D | |
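For example, a user mapped to both a department and a cost center is represented as two rows, one per organizational entity name. The statement below is only a sketch to illustrate the schema; the catalog/schema location and all values are placeholders.

```sql
-- Sketch only: sample rows for the user_info table (catalog, schema, and values are placeholders).
-- One row per (user_name, organizational_entity_name) pair.
INSERT INTO my_catalog.my_schema.user_info
  (user_name, user_id, display_name, organizational_entity_name, organizational_entity_value)
VALUES
  ('jane.doe@example.com', '2770987888128771', 'Jane Doe', 'department',  'R&D'),
  ('jane.doe@example.com', '2770987888128771', 'Jane Doe', 'cost center', 'CC-1042');
```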
- Budget table (`budget`) contains monthly spending limits for your organizational entity (e.g. department). This table is populated by the `define_budget_job` job.
| Column | Type | Description | Example data | Constraints |
|---|---|---|---|---|
| organizational_entity_name | string | Organizational entity name | department, cost center, team | PK, FK (user_info) |
| organizational_entity_value | string | Organizational entity value | R&D | PK, FK (user_info) |
| dbu_cost_limit | decimal(2) | DBU cost limit | 100000.00 | |
| cloud_cost_limit | decimal(2) | Cloud cost limit | 120000.00 | |
| currency_code | string | Code of the currency | EUR | |
| effective_start_date | date | The date the limit is effective from | 2024-04-01 | PK |
| effective_end_date | date | The date the limit is effective to | Null | Nullable |
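Budget limits are supplied in `src/define_budget.sql` and loaded by the `define_budget_job` job. A row consistent with the schema above could look like the following sketch (the table location and values are placeholders):

```sql
-- Sketch only: a budget entry for the R&D department (placeholder location and values).
INSERT INTO my_catalog.my_schema.budget
  (organizational_entity_name, organizational_entity_value,
   dbu_cost_limit, cloud_cost_limit, currency_code,
   effective_start_date, effective_end_date)
VALUES
  ('department', 'R&D', 100000.00, 120000.00, 'EUR', DATE'2024-04-01', NULL);
```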
The result is saved as a Delta table and gives the DBU consumption for each user and cluster.
The data is aggregated daily and stored in the Cost Agg Day table (`cost_agg_day`).
Only users that have consumed any DBUs on the given day are included.
| Column | Type | Description | Example data | Constraints |
|---|---|---|---|---|
| user_name | string | User name in the form of an email address | [email protected] | PK, FK (user_info) |
| user_id | string | Databricks user id | 2770987888128771 | |
| cloud | string | Cloud provider | AZURE | PK |
| billing_date | date | Billing period in UTC: yyyy-mm-dd | 2024-03-05 | PK |
| account_id | string | Id of the Databricks account where the cluster is located | bab96770-8d6f-4ead-9596-c4eec6de581b | PK |
| warehouse_id | string | Id of the SQL warehouse | f64784f09898be10 | PK |
| workspace_id | string | Id of the workspace where the cluster is located | 237536643179474 | PK |
| dbu | decimal(2) | Databricks DBU consumption | 85736.21 | Nullable |
| dbu_cost | decimal(2) | Databricks DBU cost | 18861.92 | Nullable |
| cloud_cost | decimal(2) | Cloud cost (e.g. VM, storage) | 15412.24 | Nullable |
| currency_code | string | Code of the currency | EUR | |
| dbu_contribution_percent | decimal(14) | % contribution to the overall DBU consumption (all contributions per day for a user should sum up to 100) | 0.003 | Nullable |
Row filters can be applied to the table to ensure that only specific rows can be viewed by specific departments or users, for example as sketched below.
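As an illustration, a Unity Catalog row filter can restrict `cost_agg_day` so that account admins see everything while everyone else only sees their own rows. This is a minimal sketch; the filter logic, group name, and table location are assumptions, not part of the project.

```sql
-- Sketch only: a Unity Catalog row filter on cost_agg_day
-- (group name and catalog/schema are assumptions).
CREATE OR REPLACE FUNCTION my_catalog.my_schema.cost_row_filter(user_name STRING)
RETURN is_account_group_member('billing-admins') OR user_name = current_user();

ALTER TABLE my_catalog.my_schema.cost_agg_day
  SET ROW FILTER my_catalog.my_schema.cost_row_filter ON (user_name);
```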
The table can be joined to the User Info table (`user_info`) to retrieve additional contextual information about the users (e.g. cost center, department), and to the Budget table (`budget`) to get information about spending limits.
Examples are included in the project as part of the alerts and the dashboard.
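For instance, a query along the following lines aggregates month-to-date DBU cost per department and compares it against the budget. This is a sketch only, not one of the shipped alerts; the table locations are placeholders.

```sql
-- Sketch only: month-to-date DBU cost per department vs. budget (placeholder table locations).
SELECT
  ui.organizational_entity_value AS department,
  SUM(c.dbu_cost)                AS dbu_cost_mtd,
  MAX(b.dbu_cost_limit)          AS dbu_cost_limit
FROM my_catalog.my_schema.cost_agg_day c
JOIN my_catalog.my_schema.user_info ui
  ON  c.user_name = ui.user_name
  AND ui.organizational_entity_name = 'department'
LEFT JOIN my_catalog.my_schema.budget b
  ON  b.organizational_entity_name  = ui.organizational_entity_name
  AND b.organizational_entity_value = ui.organizational_entity_value
  AND c.billing_date >= b.effective_start_date
  AND (b.effective_end_date IS NULL OR c.billing_date <= b.effective_end_date)
WHERE c.billing_date >= trunc(current_date(), 'MONTH')
GROUP BY ui.organizational_entity_value;
```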
Cost calculation is implemented as a Databricks job and is scheduled to run daily. The calculation uses a weighted-sum approach to determine the contribution of each user's queries to the total DBU consumption of SQL Warehouse clusters. Idle time is split according to each user's contribution to the cluster usage. The metrics and assigned weights can be found here.
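The actual metrics and weights are documented in the project. As a simplified illustration only, the sketch below uses a single metric (total task time per user, from the query history) to split a warehouse's daily DBU cost; `warehouse_cost_day` is a hypothetical intermediate result (such as the output of the earlier billing/pricing sketch), not a project table, and the query history column names should be verified against your workspace.

```sql
-- Simplified sketch of the weighted-sum idea using a single metric (total task time per user).
-- The real job combines several weighted metrics; warehouse_cost_day is a hypothetical input.
WITH user_runtime AS (
  SELECT
    DATE(end_time)              AS billing_date,
    workspace_id,
    compute.warehouse_id        AS warehouse_id,
    executed_by                 AS user_name,
    SUM(total_task_duration_ms) AS task_ms
  FROM system.query.history
  WHERE end_time < current_date()   -- complete days only
  GROUP BY DATE(end_time), workspace_id, compute.warehouse_id, executed_by
)
SELECT
  r.user_name,
  r.billing_date,
  r.warehouse_id,
  r.task_ms / SUM(r.task_ms) OVER (PARTITION BY r.billing_date, r.workspace_id, r.warehouse_id)
    AS dbu_contribution,   -- idle time is implicitly split in proportion to this share
  c.dbu_cost * r.task_ms / SUM(r.task_ms) OVER (PARTITION BY r.billing_date, r.workspace_id, r.warehouse_id)
    AS allocated_dbu_cost
FROM user_runtime r
JOIN warehouse_cost_day c
  ON  c.billing_date = r.billing_date
  AND c.workspace_id = r.workspace_id
  AND c.warehouse_id = r.warehouse_id;
```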
The cost allocation is available at daily granularity. The current date is excluded from the calculations to avoid partial results; only the query history of complete days is used.
The last processed day is persisted in a checkpoint table. The query history is processed up to the date persisted in the checkpoint, and when the job starts again it continues where it left off. This ensures that the calculation is done incrementally (only for queries not processed yet).
| Column | Type | Description | Example data | Constraints |
|---|---|---|---|---|
| last_processed_date | date | Last processed billing day | 2024-07-12 | |
If you want to re-process the data, reset the checkpoint to an earlier date. Note, however, that most system tables have a short retention period (usually up to a month), so be careful not to rewind beyond the available history.
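Resetting the checkpoint is just an update of its single row, as in the sketch below; the checkpoint table name and location are assumptions, so use the checkpoint table created by the `create_tables_job` job in your configured catalog and schema.

```sql
-- Sketch only: rewind the checkpoint so subsequent runs re-process later days
-- (table name and location are assumptions).
UPDATE my_catalog.my_schema.checkpoint
SET last_processed_date = DATE'2024-06-30';
```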
See contribution guidance here.
- Works for SQL Warehouses only
- A user can only belong to one entity (e.g. cost center, department). If a user belongs to multiple entities, it is not possible to determine in the context of which entity the user executed a given query.
- If a query spans multiple days, the cost is attributed to the end date only.
- The solution is compatible with Databricks on AWS, Azure, and GCP. However, currently, cloud cost reporting is only available for AWS. The reported cloud costs include EC2-related expenses. Future updates will include Storage and Network costs, as well as support for other cloud providers.
- More IdPs: extend the fetch user info workflow to provide examples for more IdPs (e.g. Okta, OneLogin). Currently only an example for Microsoft Entra ID is available.
- Add support for regular Databricks clusters (all-purpose and job compute).