Databricks - Direct Connect and Reverse Engineer

Direct connect and reverse engineering for Databricks projects

Written by Serge Gershkovich

Databricks projects can use direct connect to reverse engineer entire schemas in seconds. Please refer to the instructions below to configure and perform a direct connection to your Databricks cluster.

Direct Connect

Using Unity Catalog API

To establish a direct connection to the Unity Catalog in Databricks using the API, you will require the following details:

  • Server (ex. xxx-myAccount-xxxx.cloud.databricks.com)

  • Personal access token (ex. asdf1234fd12s3412d3412dfa341234dsa)

Server

The Server information can be found in the cluster properties and is also available from SQL Warehouses, as described below.

Generate a Personal Access Token

Instead of a username and password, you will need to generate a personal access token (PAT) to authenticate when connecting from SqlDBM. As the name suggests, a PAT should be unique to each user and must be kept safe and guarded with the rigor of login credentials.

To generate a PAT, follow these instructions:

1. Select "User Settings" from the top right user menu

2. Click on "Access tokens"

3. Press the button to "Generate new token"

4. Give the token a name and a duration and hit "Generate".

(If the token lifespan is unspecified, the token will live indefinitely.)

Write down your generated token in a secure location. You will not be able to see it again.
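A PAT can also be created programmatically through the Databricks Tokens REST API (POST /api/2.0/token/create), which is convenient for scripted rotation. The following is only a sketch; it assumes you already hold a credential that can call the API, and the host, comment, and lifetime values are placeholders:

import requests

# Placeholder workspace host and an existing credential able to call the API
server = "xxx-myAccount-xxxx.cloud.databricks.com"
existing_token = "<existing-token>"

# Request a new PAT named "SqlDBM" that expires in 90 days (7,776,000 seconds)
response = requests.post(
    f"https://{server}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {existing_token}"},
    json={"comment": "SqlDBM", "lifetime_seconds": 7776000},
)
response.raise_for_status()

# The token value is returned only once, so record it in a secure location
print(response.json()["token_value"])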

Direct Connect from SqlDBM Using Unity Catalog API

1. Select the "Unity Catalog API" connection method

2. Enter the connection parameters (Server and personal access token)

3. Select a Catalog and Schema(s)
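Before connecting from SqlDBM, the Server and token can be sanity-checked directly against the Unity Catalog REST API. The sketch below assumes the standard Unity Catalog API 2.1 endpoints; the host, token, and catalog name are placeholder values:

import requests

# Placeholder connection details; substitute your own Server and PAT
server = "xxx-myAccount-xxxx.cloud.databricks.com"
token = "asdf1234fd12s3412d3412dfa341234dsa"
headers = {"Authorization": f"Bearer {token}"}

# List the catalogs visible to the token's owner
catalogs = requests.get(f"https://{server}/api/2.1/unity-catalog/catalogs", headers=headers)
catalogs.raise_for_status()
print([c["name"] for c in catalogs.json().get("catalogs", [])])

# List the schemas in one catalog (replace "main" with a catalog from the list above)
schemas = requests.get(
    f"https://{server}/api/2.1/unity-catalog/schemas",
    headers=headers,
    params={"catalog_name": "main"},
)
schemas.raise_for_status()
print([s["name"] for s in schemas.json().get("schemas", [])])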

Using Engineering Cluster

To direct connect using the Engineering Cluster for Databricks projects, you will need the following details:

  • Server (ex. xxx-myAccount-xxxx.cloud.databricks.com)

  • HTTP path (ex. sql/protocolv1/o/12345/123-xyz)

  • Personal access token (ex. asdf1234fd12s3412d3412dfa341234dsa)

To obtain this information, log in to your Databricks account and follow the steps below.

Server and HTTP path

To obtain the complete schema DDL, SqlDBM must loop through all schema objects to run the "SHOW CREATE TABLE" command. For this, we require an Engineering Cluster, running in "Single user" mode. Please ensure that the following parameters are set for the cluster you plan to use:

To obtain the Server and HTTP path for your Databricks cluster, follow the steps below once logged in:

1. Select the "Data Science & Engineering" workspace

2. Select "Compute" to see existing clusters

3. Start the cluster you will connect to

4. Click on the started cluster to open its properties

In the cluster options, navigate to the JDBC/ODBC details in the advanced properties like so:

5. In the "Configuration" tab, scroll down and expand the "Advanced options"

6. Select the "JDBC/ODBC" tab

7. Note the Server Hostname and HTTP Path details

Direct Connect from SqlDBM Using Engineering Cluster

Select the Engineering Cluster connection method.

With the three pieces of information in hand (Server, HTTP path, and Token), go to the Reverse Engineering screen in SqlDBM, log in, and select the database and schemas you wish to retrieve. For complete details of the RE process, use the links to the associated articles at the end of this page.
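If you want to verify the three values outside of SqlDBM first, the databricks-sql-connector Python package accepts the same parameters. This is a rough sketch using the placeholder values from above:

from databricks import sql

# Placeholder values; use the Server, HTTP path, and token gathered in the previous steps
connection = sql.connect(
    server_hostname="xxx-myAccount-xxxx.cloud.databricks.com",
    http_path="sql/protocolv1/o/12345/123-xyz",
    access_token="asdf1234fd12s3412d3412dfa341234dsa",
)

# Run the same kind of statement SqlDBM issues during reverse engineering
with connection.cursor() as cursor:
    cursor.execute("SHOW DATABASES")
    print(cursor.fetchall())

connection.close()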

Note that the direct connection and its parameters will only persist in the browser session. These credentials are not stored or recorded by SqlDBM.

Google Cloud Platform (GCP) Databricks Accounts

If Unity Catalog is not enabled, Databricks requires a "Google Service Account" to be configured at the cluster level in order to access entities hosted on Google Cloud using "SHOW CREATE TABLE".

The Google Service Account requires appropriate permissions within the Databricks project. Please refer to the following screenshot for an example.

Manual DDL generation

Reading the DDL from Databricks Files

Databricks supports using external metastores instead of the default Hive metastore.

You can export all table metadata from Hive to the external metastore.

  1. Use the Apache Spark Catalog API to list the tables in the databases contained in the metastore.

  2. Use the SHOW CREATE TABLE statement to generate the DDLs and store them in a file.

  3. Use the file to import the table DDLs into the external metastore.

The following code accomplishes the first two steps.

%python

# Loop through every database in the metastore and write each database's table DDLs to a file
dbs = spark.catalog.listDatabases()
for db in dbs:
    # One output file per database
    f = open("your_file_name_{}.ddl".format(db.name), "w")
    tables = spark.catalog.listTables(db.name)
    for t in tables:
        # Retrieve and append the CREATE TABLE statement for each table
        DDL = spark.sql("SHOW CREATE TABLE {}.{}".format(db.name, t.name))
        f.write(DDL.first()[0])
        f.write("\n")
    f.close()

Reading the Databricks file:

1. List all the temporary files.

dbutils.fs.ls("/tmp/")

2. Copy the file path and paste it in place of <file_path>:

df = spark.read.text("<file_path>")
df.display()

For more information about the Databricks File System (DBFS), please visit https://docs.databricks.com/dbfs/

Reading the DDL from Databricks User Interface

The process includes the following steps:

  1. Navigate to Compute and run the cluster - Number 1 on the screenshot.

  2. Add a Notebook to run the Python code below - Number 2 on the screenshot.

# Set your catalog name
catalog = "catalog_name"
# Provide a comma-separated list of schemas or a single schema name (whitespace is trimmed)
schemas = [s.strip() for s in "schema1, schema2".split(",")]
spark.catalog.setCurrentCatalog(catalog)

# Create an empty list to store the table metadata
table_metadata = []

# Iterate through schemas and tables
for schema in schemas:
    allTables = spark.catalog.listTables(schema)
    for t in allTables:
        # Skip temporary tables
        if not t.isTemporary:
            try:
                ddl = spark.sql("SHOW CREATE TABLE {}.{}".format(schema, t.name))
                create_table_statement = ddl.first()[0]
                # Append the table metadata to the list
                table_metadata.append([schema, t.name, t.tableType, create_table_statement])
            except Exception as error:
                # Skip objects whose DDL cannot be retrieved (e.g., insufficient privileges)
                pass

# Convert the table metadata list to a DataFrame
metadata_df = spark.createDataFrame(table_metadata, ["Schema", "Table Name", "Table Type", "Create Table Statement"])

# Display the DataFrame in the Databricks UI
display(metadata_df)

3. From the Databricks UI, download the results in CSV format and copy the “Create Table Statement” column.

4. Manually paste the DDL into the SqlDBM screen.

Upload and import the script as per SqlDBM reverse engineer instructions (see related article at the end of this page).
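Alternatively, instead of copying the column from the CSV, a short follow-up notebook cell can concatenate the "Create Table Statement" values into a single script file. This is only a sketch; the output path is an assumption and can be any writable DBFS location:

# Hypothetical output location on DBFS; change as needed
output_path = "/tmp/sqldbm_ddl.sql"

# Join every CREATE statement from the metadata_df DataFrame built above into one script
ddl_script = "\n\n".join(row["Create Table Statement"] for row in metadata_df.collect())

# Write the script to DBFS so it can be downloaded (or read back with spark.read.text) and imported into SqlDBM
dbutils.fs.put(output_path, ddl_script, True)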

Account Settings and Privileges

This section describes the privileges and settings required for connecting to and generating DDL from Databricks.

Required Entitlement and Privileges

Grant Databricks SQL access to the users or service principals that will connect to the SQL environment.

Databricks SQL access entitlement:

  • Entitlement name (UI): Databricks SQL access

  • Entitlement name (API): databricks-sql-access

  • Default: Granted by default

  • Description: When granted to a user or service principal, they can access Databricks SQL.

The users group is granted the Workspace access and Databricks SQL access entitlements by default. All workspace users and service principals are members of the users group. To assign these entitlements on a user-by-user basis, a workspace admin must remove the entitlement from the users group and assign it individually to users, service principals, and groups.

Securable objects to be reverse-engineered in SqlDBM within the Hive Metastore and Unity Catalog

The securable objects are:

  • CATALOG: controls access to the entire data catalog.

    • SCHEMA: controls access to a schema.

      • TABLE: controls access to a managed or external table.

      • VIEW: controls access to SQL views.

      • FUNCTION: controls access to a named function.

Required Privileges on Hive metastore

The following object privileges will be required for obtaining Databricks DDL.

  • SELECT: gives read access to an object.

  • USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema object.

To grant a privilege to all users in your workspace, grant the privilege to the users group. For example:

GRANT SELECT ON TABLE <schema-name>.<table-name> TO users
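Since USAGE is also needed at the schema level, a complete grant typically covers both privileges. A sketch using notebook SQL calls, where the schema and table names are placeholders:

# Placeholder schema and table names; replace with your own
spark.sql("GRANT USAGE ON SCHEMA my_schema TO `users`")
spark.sql("GRANT SELECT ON TABLE my_schema.my_table TO `users`")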

Required Privileges on Unity Catalog

The following object privileges will be required for obtaining Databricks DDL.

  • SELECT
    Query a table or view, invoke a user-defined or anonymous function, or select ANY FILE. The user needs SELECT on the table, view, or function, as well as USE CATALOG on the object’s catalog and USE SCHEMA on the object’s schema.

  • USE CATALOG
    Required, but not sufficient to reference any objects in a catalog. The principal also needs to have privileges on the individual securable objects.

  • USE SCHEMA
    Required, but not sufficient to reference any objects in a schema. The principal also needs to have privileges on the individual securable objects.

For example, to grant the SELECT privilege to a user in a workspace, use the following SQL command:

GRANT SELECT ON TABLE t TO `user`;
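Because SELECT alone is not sufficient under Unity Catalog, a grant for reverse engineering typically also covers the catalog and schema. A sketch using notebook SQL calls, where the catalog, schema, table, and principal names are placeholders:

# Placeholder principal and object names; replace with your own
principal = "`someone@example.com`"

# USE CATALOG and USE SCHEMA are prerequisites for SELECT to take effect
spark.sql(f"GRANT USE CATALOG ON CATALOG my_catalog TO {principal}")
spark.sql(f"GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO {principal}")
spark.sql(f"GRANT SELECT ON TABLE my_catalog.my_schema.t TO {principal}")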

IP allowlisting and private networking

Corporate networking and security policies may interfere with connectivity between SqlDBM and Databricks. If you are unable to connect despite following the guidelines in this article, please make sure that IP blocking or private networking is not interfering.

An easy way to diagnose this type of issue is by attempting to connect to your Databricks account from a home computer, without using a VPN or any corporate software.

For instructions on allowing SqlDBM through the firewall or private network, please see the related article at the bottom of this page.


See also:
