Databricks Configuration Introduction
This page contains references to the term whitelist, which Immuta no longer uses. When the term is removed from the software, it will be removed from this page.
- Databricks instance: Premium tier workspace and Cluster access control enabled
- Databricks instance has network level access to Immuta instance
- Access to Immuta releases
- Permissions and access to download (outside Internet access) or transfer files to the host machine
Recommended Databricks Workspace Configurations:
Note: Azure Databricks authenticates users with Azure AD. Be sure to configure your Immuta instance with an IAM that uses the same user ID as does Azure AD. Immuta's Spark security plugin will look to match this user ID between the two systems. See this Azure Active Directory page for details.
Supported Databricks Runtime Versions
Use the table below to determine which version of Immuta supports your Databricks Runtime version:
|Databricks Runtime Version
|2023.1 and newer
|2022.2.x and newer
|2021.5.x and newer
Supported Databricks Cluster Configurations
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
|Unity Catalog in Databricks
|Databricks Spark integration
|Databricks Spark with Unity Catalog support
|Databricks Unity Catalog integration
- The feature or integration is enabled.
- The feature or integration is disabled.
Supported Access Mode and Languages
Immuta supports the Custom access mode.
- Supported Languages:
- R (requires advanced configuration; work with your Immuta support professional to use R)
- Scala (requires advanced configuration; work with your Immuta support professional to use Scala)
Databricks Installation Overview
Users Who Can Read Raw Tables On-Cluster
If a Databricks Admin is tied to an Immuta account, they will have the ability to read raw tables on-cluster.
If a Databricks user is listed as an "ignored" user, they will have the ability to read raw tables on-cluster. Users can be added to the
immuta.spark.acl.whitelistconfiguration to become ignored users.
The Immuta Databricks integration injects an Immuta plugin into the SparkSQL stack at cluster startup. The Immuta plugin creates an "immuta" database that is available for querying and intercepts all queries executed against it. For these queries, policy determinations will be obtained from the connected Immuta instance and applied before returning the results to the user.
The Databricks cluster init script provided by Immuta downloads the Immuta artifacts onto the target cluster and puts them in the appropriate locations on local disk for use by Spark. Once the init script runs, the Spark application running on the Databricks cluster will have the appropriate artifacts on its CLASSPATH to use Immuta for policy enforcement.
The cluster init script uses environment variables in order to
- Determine the location of the required artifacts for downloading.
- Authenticate with the service/storage containing the artifacts.
Note: Each target system/storage layer (HTTPS, for example) can only have one set of environment variables, so the cluster init script assumes that any artifact retrieved from that system uses the same environment variables.
See the Databricks Pre-Configuration Details page for known limitations.
There are two installation options for Databricks. Click a link below to navigate to a tutorial for your chosen method:
- Simplified Configuration: The steps to enable the integration with this method include
- Adding the integration on the App Settings page.
- Downloading or automatically pushing cluster policies to your Databricks workspace.
- Creating or restarting your cluster.
- Manual Configuration: The steps to enable the integration with this method include
- Downloading and configuring Immuta artifacts.
- Staging Immuta artifacts somewhere the cluster can read from during its startup procedures.
- Protecting Immuta environment variables with Databricks Secrets.
- Creating and configuring the cluster to start with the init script and load Immuta into its SparkSQL environment.
Debugging Immuta Installation Issues
For easier debugging of the Immuta Databricks installation, enable cluster init
script logging. In the cluster page in Databricks for the target cluster, under
Advanced Options -> Logging, change the Destination from
DBFS and change the path to the desired output location. Note: The unique cluster ID will be added onto
the end of the provided path.
For debugging issues between the Immuta web service and Databricks, you can view the Spark UI on your target Databricks cluster. On the cluster page, click the Spark UI tab, which shows the Spark application UI for the cluster. If you encounter issues creating Databricks data sources in Immuta, you can also view the JDBC/ODBC Server portion of the Spark UI to see the result of queries that have been sent from Immuta to Databricks.
Using the Validation and Debugging Notebook
The Validation and Debugging Notebook (
immuta-validation.ipynb) is packaged with other Databricks release artifacts
(for manual installations), or it can be downloaded from the App Settings page when configuring native Databricks
through the Immuta UI. This notebook is designed to be used by or under the guidance of an Immuta Support Professional.
- Import the notebook into a Databricks workspace by navigating to Home in your Databricks instance.
- Click the arrow next to your name and select Import.
- Once you have executed commands in the notebook and populated it with debugging information, export the notebook and its contents by opening the File menu, selecting Export, and then selecting DBC Archive.