Sparklyr
Audience: System Administrators
Content Summary: This page describes the sparklyr cluster policy.
Single-User Clusters Recommended
Like Databricks, Immuta recommends single-user clusters for sparklyr when user isolation is required. A single-user cluster can either be a job cluster or a cluster with credential passthrough enabled. Note: spark-submit jobs are not currently supported.
Two cluster types can be configured with sparklyr: Single-User Clusters (recommended) and Multi-User Clusters (discouraged).
- Single-User Clusters: Credential Passthrough (required on Databricks) allows a single-user cluster to be created. This setting automatically configures the cluster to assume the role of the attached user when reading from storage (S3). Because Immuta requires that raw data is readable by the cluster, the instance profile associated with the cluster should be used rather than a role assigned to the attached user.
- Multi-User Clusters: Because Immuta cannot guarantee user isolation in a multi-user sparklyr cluster, deploying one is not recommended. To force all users to act under the same set of attributes, groups, and purposes with respect to their data access and eliminate the risk of a data leak, all sparklyr multi-user clusters must be equalized either by convention (all users able to attach to the cluster have the same level of data access in Immuta) or by configuration (detailed below).
Single-User Cluster Configuration
1 - Enable sparklyr
In addition to the configuration for an Immuta cluster with R, add this environment variable to the Environment Variables section of the cluster:
IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true
This configuration makes changes to the iptables rules on the cluster to allow the sparklyr client to connect to the required ports on the JVM used by the sparklyr backend service.
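After the cluster restarts, a quick check from an R notebook can confirm that the variable took effect. The following is a minimal sketch, assuming that cluster environment variables are visible to the notebook's R process; it uses only base R and is not part of Immuta's documented setup.

```r
# Read the flag set in the cluster's Environment Variables section.
# Assumption: cluster environment variables are exported to the notebook's R process.
flag <- Sys.getenv("IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED", unset = NA)

if (is.na(flag)) {
  message("IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED is not set on this cluster.")
} else if (identical(tolower(flag), "true")) {
  message("sparklyr support is enabled.")
} else {
  message("Unexpected value: ", flag)
}
```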
2 - Set Up a sparklyr Connection in Databricks
- Install and load libraries into a notebook. Databricks includes the stable version of sparklyr, so library(sparklyr) in an R notebook is sufficient, but you may opt to install the latest version of sparklyr from CRAN. Additionally, loading library(DBI) will allow you to execute SQL queries.
- Set up a sparklyr connection:
sc <- spark_connect(method = "databricks")
- Pass the connection object to execute queries:
dbGetQuery(sc, "show tables in immuta")
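Put together, the steps above might look like the following in a single R notebook cell. This is a sketch rather than a definitive workflow: it assumes the notebook is attached to a cluster configured as described on this page, and that the result of "show tables" includes a tableName column, as Spark normally returns. The preview query is illustrative only.

```r
library(sparklyr)
library(DBI)

# Connect to the Spark session of the Databricks cluster this notebook is attached to.
sc <- spark_connect(method = "databricks")

# List the tables exposed in the immuta database.
tables <- dbGetQuery(sc, "show tables in immuta")
print(tables)

# Preview a few rows from the first listed table, if there is one.
# Assumption: the result has a tableName column.
if (nrow(tables) > 0) {
  first_table <- tables$tableName[1]
  preview <- dbGetQuery(sc, paste0("select * from immuta.`", first_table, "` limit 5"))
  print(preview)
}

# Close the connection when finished.
spark_disconnect(sc)
```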
3 - Configure a Single-User Cluster
Add the following items to the Spark Config section of the cluster:
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.trustedFilesystems com.databricks.s3a.S3AFileSystem,shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem,com.databricks.adl.AdlFileSystem,shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem,shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem,shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem,org.apache.hadoop.fs.ImmutaSecureFileSystemWrapper
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.InstanceProfileCredentialsProvider
The trustedFilesystems setting is required to allow Immuta’s wrapper FileSystem (used in conjunction with the ImmutaSecurityManager for data security purposes) to be used with credential passthrough. Additionally, the InstanceProfileCredentialsProvider must be configured so that the cluster continues to use its instance profile for data access, rather than a role associated with the attached user.
Multi-User Cluster Configuration
Immuta Discourages Deploying Multi-User Clusters with sparklyr Configuration
It is possible, but not recommended, to deploy a multi-user cluster with a sparklyr configuration, because Immuta cannot guarantee user isolation in that setup.
The configurations in this section enable sparklyr, require project equalization, map sparklyr sessions to the correct Immuta user, and prevent users from accessing Immuta native workspaces.
- Add the following environment variables to the Environment Variables section of your cluster configuration:
IMMUTA_DATABRICKS_SPARKLYR_SUPPORT_ENABLED=true
IMMUTA_SPARK_REQUIRE_EQUALIZATION=true
IMMUTA_SPARK_CURRENT_USER_SCIM_FALLBACK=false
- Add the following items to the Spark Config section:
immuta.spark.acl.assume.not.privileged true
immuta.api.key=<user’s API key>
Limitations
Immuta’s integration with sparklyr does not currently support
- spark-submit jobs,
- UDFs, or
- Databricks Runtimes 5, 6, or 7.