Executing Azure Databricks Notebook from Azure Data Factory (ADF)

Executing or calling an Azure Databricks Notebook from an ADF pipeline is quite easy. There is a built-in activity called “Notebook” available under the Databricks category of the activities pane.
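As a rough illustration, here is a minimal sketch of how such a Notebook activity appears in the underlying ADF pipeline JSON. The activity and linked-service names, and the notebook path, are illustrative placeholders, not values from any real pipeline.

```python
# Minimal sketch of a Databricks "Notebook" activity as it appears in an ADF
# pipeline definition. "RunMyNotebook", the linked-service name, and the
# notebook path are all hypothetical placeholders.
notebook_activity = {
    "name": "RunMyNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",  # placeholder name
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "notebookPath": "/Shared/my-notebook",  # path inside the workspace
    },
}
```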

Just like any other data source, you need to create a linked service for Azure Databricks to establish a connection with it. The Databricks workspace can easily be selected under your Azure subscription, and you can authenticate either by storing the access token in Azure Key Vault or by entering the token directly on the linked-service creation screen. An access token can be generated within the Databricks workspace by going to User Settings.
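For context, the Key Vault option described above translates into a linked-service definition along these lines. This is a sketch assuming the ADF `AzureDatabricks` linked-service schema; the workspace URL, Key Vault linked-service name, secret name, and cluster ID are all illustrative placeholders.

```python
# Sketch of an Azure Databricks linked service whose access token is stored
# in Azure Key Vault. All names and IDs below are hypothetical placeholders.
linked_service = {
    "name": "AzureDatabricksLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-1111111111111111.1.azuredatabricks.net",
            "accessToken": {
                "type": "AzureKeyVaultSecretReference",
                "store": {
                    "referenceName": "MyKeyVaultLinkedService",
                    "type": "LinkedServiceReference",
                },
                "secretName": "databricks-access-token",
            },
            "existingClusterId": "0123-456789-abcde123",  # existing-cluster mode
        },
    },
}
```

Storing the token as a Key Vault secret keeps it out of the factory definition itself, which is generally preferable to pasting it directly into the linked service.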

As a best practice, give the token a meaningful name so that you can easily recognize the purpose or service it was created for at any later point in time. As always, the token value is not stored anywhere in the Databricks workspace, so you need to copy it before you click OK.
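Tokens can also be created programmatically through the Databricks Token API (`POST /api/2.0/token/create`), whose `comment` field is where that meaningful name goes. A minimal sketch of building such a request follows; the workspace URL and comment are illustrative, and the actual call would additionally need an existing token for bearer authentication.

```python
import json

def build_token_request(workspace_url, comment, lifetime_seconds=7776000):
    """Return the URL and JSON body for a Databricks token-creation request.

    The comment doubles as the token's display name in User Settings, so a
    descriptive value (e.g. naming the consuming service) is recommended.
    """
    url = f"{workspace_url}/api/2.0/token/create"
    payload = {"comment": comment, "lifetime_seconds": lifetime_seconds}
    return url, json.dumps(payload)

# Hypothetical workspace URL; the comment names the consuming service (ADF).
url, body = build_token_request(
    "https://adb-1111111111111111.1.azuredatabricks.net",
    "adf-pipeline-token",
)
```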

You can select whether to run the Notebook on a New Cluster, created each time the pipeline/activity is called, or against an Existing Cluster. This is an important consideration. A new cluster has the benefit of being terminated automatically as soon as execution completes, whereas an existing cluster keeps running once the ADF execution is over; in that case, termination is driven by the idle-timeout setting configured on the cluster within the Azure Databricks workspace. However, in both cases, whether you choose a New Cluster or an Existing Cluster, if the cluster is not running the activity execution starts it automatically.

Most of the settings you would specify when creating a cluster within the Azure Databricks workspace are also available here when choosing the New Cluster option during linked-service creation, such as the node type, Python version and autoscaling.
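Those New Cluster options live in the same `typeProperties` section of the linked service. The sketch below assumes the ADF `newCluster*` property names; the runtime version, node type, and worker counts are illustrative values, not recommendations.

```python
# Sketch of New Cluster options in an Azure Databricks linked service's
# typeProperties. All values are illustrative placeholders.
new_cluster_properties = {
    "newClusterVersion": "13.3.x-scala2.12",   # Databricks runtime version
    "newClusterNodeType": "Standard_DS3_v2",   # worker VM size
    "newClusterNumOfWorker": "1:4",            # "min:max" enables autoscaling
    "newClusterSparkEnvVars": {
        # e.g. pin the Python interpreter used by PySpark
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
}
```

A single number such as `"2"` would request a fixed-size cluster, while the `"min:max"` form shown here turns on autoscaling between those bounds.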

Once the connection details are defined through the linked service, the next step is to select the Notebook to be executed, under the activity’s Settings tab. You don’t need to type in the exact path; you can simply browse to the Notebook and the path is captured automatically. If the Notebook takes any parameters, you can map them here as well. The only point you need to ensure is that the parameter names are the same here and in the Notebook.
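The parameter mapping described above can be sketched as follows: in the activity, parameters go under `baseParameters`, and inside the Notebook each one is read back with `dbutils.widgets.get` using exactly the same name. The parameter name `input_date` and the pipeline parameter `RunDate` are hypothetical examples.

```python
# Sketch of mapping ADF pipeline parameters to Notebook parameters.
# "input_date" and "RunDate" are illustrative names; the ADF expression is
# resolved at run time before being passed to the Notebook.
activity_settings = {
    "notebookPath": "/Shared/my-notebook",  # placeholder path
    "baseParameters": {
        "input_date": "@pipeline().parameters.RunDate"
    },
}

# Inside the Notebook (Databricks-only API, shown here as a comment because
# dbutils exists only on a Databricks cluster):
#   input_date = dbutils.widgets.get("input_date")   # name must match exactly
```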

Once all this is configured, you can test it by running the pipeline in Debug mode. However, if the cluster is not running, or if your linked service is configured to create a new cluster, it takes about 4-5 minutes for the cluster to start before the actual code in the Notebook is executed.
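Debug runs happen in the ADF authoring UI, but once published, the same pipeline can also be triggered programmatically through the ADF REST API’s `createRun` endpoint. A sketch of building that request URL follows; the subscription, resource group, factory, and pipeline names are placeholders, and the real call needs an Azure AD bearer token.

```python
# Sketch of the ADF REST API createRun URL for triggering a published
# pipeline. All resource names below are hypothetical placeholders.
def build_create_run_url(subscription_id, resource_group, factory, pipeline):
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory}"
        f"/pipelines/{pipeline}/createRun"
        "?api-version=2018-06-01"
    )

url = build_create_run_url("sub-id", "my-rg", "my-adf", "RunNotebookPipeline")
```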

The Notebook is executed as a job in the Azure Databricks workspace, which is more cost-effective than interactive executions within the workspace, and this is a big advantage of running the Notebook through an ADF pipeline activity. However, you need to factor in a bit of ADF cost here too.

So, overall, executing Azure Databricks Notebooks through ADF pipelines is a fairly easy process.