Mirroring Databricks with Fabric

by Matthew Greenbank.

 

In large companies, the data function is often split into an engineering team and a reporting team. Commonly, the engineering team uses Databricks as its data platform and the reporting team uses Power BI.

 

If this is the case, accessing the Delta tables in Power BI has now become much easier. Instead of using compute in Databricks, you can mirror the data into Microsoft Fabric.

 

In this blog we’ll explore how to query Databricks Delta tables in Microsoft Fabric.

Overview

Before we delve into the solution, it’s worth highlighting that this mirroring approach can only be used where Unity Catalog is configured and the tables are external (e.g. the data is stored on ADLS).

 

Assuming we’ve got external tables with data, the solution works by creating shortcuts in Microsoft Fabric that point to the ADLS paths where the Delta tables reside. The tables can then be queried either via the SQL analytics endpoint or within a Fabric notebook.
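The sample notebook used later handles the shortcut creation for us, but as a rough illustration of what creating a single shortcut involves, a call to the Fabric shortcuts REST API looks something like the sketch below. The storage location, paths and IDs are placeholders, not values from this walkthrough.

```python
import requests

# Placeholders – substitute your own IDs, token and paths.
workspace_id = "<fabric-workspace-guid>"
lakehouse_id = "<fabric-lakehouse-guid>"
fabric_token = "<aad-token-for-the-fabric-api>"

shortcut_payload = {
    "path": "Tables",                    # create the shortcut under Tables/
    "name": "my_uc_table",               # shortcut (and table) name
    "target": {
        "adlsGen2": {
            "location": "https://<storageaccount>.dfs.core.windows.net",
            "subpath": "/<container>/path/to/delta/table",
            "connectionId": "<fabric-adls-connection-guid>",
        }
    },
}

# One POST per table; the shortcut then surfaces as a Delta table in the lakehouse.
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {fabric_token}"},
    json=shortcut_payload,
)
resp.raise_for_status()
```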

Pre-Setup

Prior to configuring the shortcuts in Fabric, we must:

  • Obtain a Databricks personal access token (PAT)
  • Obtain the Databricks workspace URL
  • Grant access to the data lake

 

In this example, we’ve granted the Storage Blob Data Reader role to a service principal on the storage account. We’ve stored the Databricks PAT and the Databricks workspace URL in a Key Vault.
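Rather than pasting the PAT into the notebook, the secrets can be read from Key Vault at runtime. A minimal sketch, assuming a Fabric notebook and illustrative vault/secret names:

```python
from notebookutils import mssparkutils

# Illustrative Key Vault URI and secret names – replace with your own.
key_vault_uri = "https://<your-key-vault>.vault.azure.net/"

databricks_url = mssparkutils.credentials.getSecret(key_vault_uri, "databricks-workspace-url")
databricks_pat = mssparkutils.credentials.getSecret(key_vault_uri, "databricks-pat")
```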

Setup

First, within Fabric we must create a workspace and configure the ADLS connection.

 

  1. Navigate to Workspaces and select New workspace – our example is uc-fabric-sync
  2. Once in the workspace, ensure the Data Engineering workload is chosen.
  3. Create a new Lakehouse. Once the lakehouse has opened, we need to obtain the lakehouse and workspace IDs.
  4. Within the lakehouse, take a copy of the URL. The first GUID in the URL is the workspace ID and the second is the lakehouse ID – make a note of both as they will be needed later.
  5. Head to the Settings tab and click Manage connections and gateways; this is where we’ll create an Azure Data Lake Storage Gen2 connection.
  6. Choose New (found in the top left corner) and specify the connection details. Note: the full path is the container name on the storage account.
  7. Once the connection has been created, its Connection ID will be displayed – keep a copy of this ID as it will be required later.
  8. Head back to the workspace and import this notebook:
    fabric-samples/docs-samples/onelake/unity-catalog/nb-sync-uc-fabric-onelake.ipynb at main · microsoft/fabric-samples · GitHub
  9. Within the imported notebook, update the configuration values to the ones we retrieved earlier (a sketch of the configuration cell is shown after this list).
  10. The notebook can now be executed. It uses the PAT to call the Databricks API, obtains the Delta table properties (e.g. the ADLS path) and creates the Fabric shortcuts. In our example, 9 new shortcuts are created as they don’t already exist.
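For reference, the configuration cell in the imported notebook looks roughly like the following. The variable names here are illustrative and may differ from the current version of the sample, so match them against the notebook itself.

```python
# Databricks workspace and authentication (retrieved from Key Vault earlier)
dbx_workspace = "<databricks-workspace-url>"   # e.g. https://adb-xxxx.azuredatabricks.net
dbx_token = "<databricks-pat>"

# Unity Catalog objects to sync
dbx_uc_catalog = "my_catalog"
dbx_uc_schemas = '["my_schema"]'

# Fabric targets – the IDs and Connection ID noted down in the earlier steps
fab_workspace_id = "<fabric-workspace-guid>"
fab_lakehouse_id = "<fabric-lakehouse-guid>"
fab_shortcut_connection_id = "<adls-connection-guid>"
```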

Query Shortcuts

Heading back to the lakehouse, we can see that the new shortcuts are now present.

 

There are two options to query the data:

  1. SQL analytics endpoint
    a- If we head back to the workspace, we can obtain the connection details needed to query from a client, e.g. SQL Server Management Studio.
    b- On the SQL analytics endpoint object, click the ellipsis and choose Copy SQL connection string.
    c- In SQL Server Management Studio, paste the connection string as the server name and enter the username.

    d- Once connected, the tables will appear and can be queried.
  2. Notebook
    a- Within the same Fabric workspace, create a new notebook.
    b- In the notebook, add a default lakehouse via the Explorer pane. The default lakehouse should be the newly created lakehouse.
    c- The Tables should now be visible. Clicking the ellipsis on one of the tables allows us to choose Copy relative path for Spark.

    d- In a code cell, paste the relative path, removing the Tables/ prefix. The code should look similar to the sketch shown after this list.

    e- On running the code, the data is read into a Spark DataFrame.
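A minimal sketch of that code cell, assuming a hypothetical shortcut named my_uc_table (the attached default lakehouse resolves the table name, and spark is predefined in a Fabric notebook):

```python
# Copied relative path was "Tables/my_uc_table" – drop the "Tables/" prefix.
df = spark.read.table("my_uc_table")

# Render the DataFrame in the notebook output.
display(df)
```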

Wrap Up

You now have a quick, simple way to mirror data from Databricks into Fabric. One caveat is that this only works with Unity Catalog external tables and bypasses any security policies applied in Unity Catalog, meaning equivalent security policies would need to be implemented in Fabric.

 

For production use, it’s recommended to use Databricks OAuth to authenticate to the Databricks API instead of a PAT.
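As a rough sketch of that swap, a Databricks service principal can exchange its OAuth client ID and secret for a short-lived token via the workspace token endpoint (client credentials flow); the values below are placeholders:

```python
import requests

# Placeholders – the service principal's OAuth client ID and secret.
workspace_url = "https://adb-xxxx.azuredatabricks.net"
client_id = "<service-principal-client-id>"
client_secret = "<service-principal-oauth-secret>"

resp = requests.post(
    f"{workspace_url}/oidc/v1/token",
    auth=(client_id, client_secret),
    data={"grant_type": "client_credentials", "scope": "all-apis"},
)
resp.raise_for_status()

# Use this as the Bearer token for Databricks API calls instead of the PAT.
access_token = resp.json()["access_token"]
```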

 

As Fabric workspaces need an active capacity, this only becomes a cost-effective solution where an existing Power BI P1 license already exists. In theory, it could reduce Databricks compute costs, as you don’t need a Databricks cluster running to read the data.

 

In the alternative scenario where you don’t hold a P1 license, a Fabric capacity would have to be provisioned to create the shortcuts and read the data, possibly leading to a higher platform running cost than using a serverless SQL warehouse in Databricks.

 

Source: Integrate Databricks Unity Catalog with OneLake – Microsoft Fabric | Microsoft Learn