Databricks direct connect and reverse engineering

Direct connect and reverse engineering for Databricks projects

Written by Serge Gershkovich

Databricks projects can use direct connect to reverse engineer entire schemas in seconds. Please refer to the instructions below to configure and perform a direct connection to your Databricks cluster.

Direct Connect

1. Using Unity Catalog API

To establish a direct connection to the Unity Catalog in Databricks using the API, you will require the following details:

  • Server (ex. xxx-myAccount-xxxx.cloud.databricks.com)

  • Personal access token (ex. asdf1234fd12s3412d3412dfa341234dsa)

Server

The Server hostname can be obtained from the cluster properties and is also retrievable from SQL Warehouses, as described below.

Generate a Personal Access Token

Instead of a username and password, you will need to generate a personal access token (PAT) to authenticate when connecting from SqlDBM. As the name suggests, a PAT is unique to each user and must be kept safe and guarded with the same rigor as login credentials.

To generate a PAT, follow these instructions:

1. Select "User Settings" from the top right user menu

2. Click on "Access tokens"

3. Press the button to "Generate new token"

4. Give the token a name and a duration and hit "Generate".

(If the token lifespan is unspecified, the token will live indefinitely.)

Write down your generated token in a secure location. You will not be able to see it again.
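If you would like to confirm that the server hostname and PAT authenticate correctly before entering them in SqlDBM, a quick REST call works well. Below is a minimal Python sketch that lists the tokens owned by the current user via the Token API; the hostname and token values are placeholders, and the requests package is assumed to be available.

import requests

# Placeholders: replace with your workspace hostname and personal access token
SERVER = "xxx-myAccount-xxxx.cloud.databricks.com"
TOKEN = "asdf1234fd12s3412d3412dfa341234dsa"

# GET /api/2.0/token/list returns the tokens owned by the authenticated user
resp = requests.get(
    "https://{}/api/2.0/token/list".format(SERVER),
    headers={"Authorization": "Bearer {}".format(TOKEN)},
)
resp.raise_for_status()  # a 200 response confirms the PAT authenticates
print(resp.json())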

Direct Connect from SqlDBM Using Unity Catalog API

1. Select Unity Catalog API

2. Enter the connection parameters

3. Select a Catalog and Schema(s)
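SqlDBM performs the catalog and schema discovery for you, but if you want to preview what will be available for selection, the same Unity Catalog REST API can be queried directly. The sketch below lists each catalog and its schemas; the hostname and token are placeholders, and the requests package is assumed.

import requests

# Placeholders: replace with your workspace hostname and personal access token
SERVER = "xxx-myAccount-xxxx.cloud.databricks.com"
TOKEN = "asdf1234fd12s3412d3412dfa341234dsa"
HEADERS = {"Authorization": "Bearer {}".format(TOKEN)}
BASE = "https://{}/api/2.1/unity-catalog".format(SERVER)

# List all catalogs, then the schemas contained in each one
catalogs = requests.get(BASE + "/catalogs", headers=HEADERS).json().get("catalogs", [])
for catalog in catalogs:
    schemas = requests.get(
        BASE + "/schemas",
        headers=HEADERS,
        params={"catalog_name": catalog["name"]},
    ).json().get("schemas", [])
    print(catalog["name"], "->", [s["name"] for s in schemas])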

2. Using Engineering Cluster

To connect directly using an Engineering Cluster for Databricks projects, you will need the following details:

  • Server (ex. xxx-myAccount-xxxx.cloud.databricks.com)

  • HTTP path (ex. sql/protocolv1/o/12345/123-xyz)

  • Personal access token (ex. asdf1234fd12s3412d3412dfa341234dsa)

To obtain this information, log in to your Databricks account and follow the steps below.

Server and HTTP path

To obtain the complete schema DDL, SqlDBM must loop through all schema objects and run the "SHOW CREATE TABLE" command. For this, an Engineering Cluster running in "Single user" access mode is required. Please ensure the cluster you plan to use is configured with this access mode.

To obtain the Server and HTTP path for your Databricks cluster, follow the steps below once logged in:

1. Select the "Data Science & Engineering" workspace

2. Select "Compute" to see existing clusters

3. Start the cluster you will connect to

4. Click on the started cluster to open its properties

In the cluster options, navigate to the JDBC/ODBC details under the advanced properties:

5. In the "Configuration" tab, scroll down and expand the "Advanced options"

6. Select the "JDBC/ODBC" tab

7. Note the Server Hostname and HTTP Path details
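With the Server Hostname and HTTP Path noted (and the PAT generated earlier), you can optionally confirm that the cluster accepts SQL connections before configuring SqlDBM. Below is a minimal sketch using the databricks-sql-connector Python package; all connection values are placeholders.

# pip install databricks-sql-connector
from databricks import sql

# Placeholders: replace with the values noted from the cluster and your PAT
with sql.connect(
    server_hostname="xxx-myAccount-xxxx.cloud.databricks.com",
    http_path="sql/protocolv1/o/12345/123-xyz",
    access_token="asdf1234fd12s3412d3412dfa341234dsa",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SHOW SCHEMAS")  # a simple round trip to the cluster
        for row in cursor.fetchall():
            print(row[0])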

Direct Connect from SqlDBM Using Engineering Cluster

Select the Engineering Cluster connection method.

With the three pieces of information in hand (Server, HTTP path, and Token), go to the Reverse Engineering screen in SqlDBM, log in, and select the database and schemas you wish to retrieve. For complete details of the RE process, use the links to the associated articles at the end of this page.

Manual DDL generation

Databricks supports using external metastores instead of the default Hive metastore.

You can export all table metadata from Hive to the external metastore.

  1. Use the Apache Spark Catalog API to list the tables in the databases contained in the metastore.

  2. Use the SHOW CREATE TABLE statement to generate the DDLs and store them in a file.

  3. Use the file to import the table DDLs into the external metastore.

The following code accomplishes the first two steps.

%python

# One DDL file is written per database; each table's definition is appended
dbs = spark.catalog.listDatabases()
for db in dbs:
    f = open("your_file_name_{}.ddl".format(db.name), "w")
    tables = spark.catalog.listTables(db.name)
    for t in tables:
        # SHOW CREATE TABLE returns a single row containing the full DDL
        DDL = spark.sql("SHOW CREATE TABLE {}.{}".format(db.name, t.name))
        f.write(DDL.first()[0])
        f.write("\n")
    f.close()
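Note that open() writes the DDL files to the cluster driver's local file system. If you want them under dbfs:/tmp/ so that the next step can locate them, one option is to copy them across with dbutils.fs.cp. The sketch below assumes the driver's default working directory (/databricks/driver) and the file names used above.

%python

# Copy each generated DDL file from the driver's local file system to DBFS.
# Assumes the files were written to the driver's default working directory.
for db in spark.catalog.listDatabases():
    local_path = "file:/databricks/driver/your_file_name_{}.ddl".format(db.name)
    dbutils.fs.cp(local_path, "dbfs:/tmp/your_file_name_{}.ddl".format(db.name))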

Reading the Databricks file:

1. List all the temporary files.

dbutils.fs.ls("/tmp/")

2. Copy the file path and paste it to <file_path>

df = spark.read.text("<file_path>")
df.display()

For more information about the Databricks File System (DBFS), please visit https://docs.databricks.com/dbfs/

See also:
