In production environments, Databricks Pools might provide a better alternative to native clusters by creating idle instances that linked clusters could utilise. This decreases time required for clusters to start or auto-scale, and hence notebook could initiate execution quicker without having to wait for the cluster spin up. And whilst the Pool is idle there won’t be any processing costs. Pools can be utilised by multiple clusters, as they become available and depending on the Pool/Cluster link.

Similar to Clusters, Pools can be created from either the Portal or through a Command Line Interface.

When Continuous Integration/ Continuous Delivery (CICD) pipelines are in place for continuous deployments to be done as new releases are pushed, scripts could be utilised to create or edit pools or clusters automatically during the deployment. There might be different ways of enabling such scripting approach, including but not limited to using Azure Devops for the CICD pipelines alongside YAML, Terraform or PowerShell.

In this example we will look at a PowerShell script that can be utilised to facilitate Databricks Pools creation.

The script can work by providing either a JSON string with the Pool configuration details directly within the PowerShell script as a parameter. Else the PowerShell script can pick such configuration details from a .JSON file in a specified path.

The script is based on the following JSON Pool Configuration layout, which can accept one to multiple instance pools to be created or edited. In this case we have supplied two Pools – POOL0 and POOL1:

[
  {
    "instance_pool_name": "POOL0",
    "min_idle_instances": 5,
    "node_type_id": "Standard_DS3_v2",
    "max_capacity": 100,
    "idle_instance_autotermination_minutes": 15,
    "preloaded_spark_versions": "6.6.x-scala2.11"
  },
  {
    "instance_pool_name": "POOL1",
    "min_idle_instances": 5,
    "node_type_id": "Standard_DS4_v2",
    "max_capacity": 200,
    "idle_instance_autotermination_minutes": 0,
    "preloaded_spark_versions": "4.0.x-scala2.11"
  }
] 

The actual PowerShell script will then start with Parameters, accepting the access token created in Databricks for authentication. This can be set-up from the Databricks portal, by clicking on the User Profile and then load User Settings. In the Access Tokens, generate a New Token. Ensure this is then copied and kept inside the Azure Key Vault or any other password wallet as after closing the window it could not be retrieved again.

There is also a parameter for Azure Region to be used in the Databricks URI, alongside the Databricks JSON configuration script – which can be left blank as long as the path is supplied as well in the other parameter.

Some settings are set afterwards – like headers to be used in the API request.

The script will start by getting a list of the existing pools, which will be used to determine if to be update or create the pool.

A different action is defined depending on this to send the specific URI method and body.

The resulting message will then be outputted to highlight success or failure.

Param
  (

    #[Parameter(Mandatory=$true)]  
    [string] $accessToken = "da**************************",
    
    #[Parameter(Mandatory=$true)]  
    [string] $Region = "uksouth",
    
    #[Parameter(Mandatory=$false)] 
    [string] $DatabricksPoolConfigurationJson = '',

    # Accepts a JSON-formatted string of Databricks Pools.
    # If omitted, the default permission structure defined in ./DatabricksPoolConfig.json will be used
  [Parameter(Mandatory=$false)] 
  [string] $DatabricksPoolConfigurationFile = './DatabricksPoolConfig.json'
  
  )

# Convert a JSON-String into a PSArray.
# Used if DatabricksPoolConfigurationJson is specified
Function ConvertFrom-JsonArray
{
  Param(
    [Parameter(Mandatory = $true)]
    [ValidateNotNullOrEmpty()]
    [string] $Json
    )
    
    $jsonWrapper= '{"Array":' + $Json + '}';
    $array = ConvertFrom-Json -InputObject $jsonWrapper; 
    return [array]$array.Array;
}

# Get pool configuration list from JSON file or from provided string
if($DatabricksPoolConfigurationJson -ne '')
{
  $DatabricksPoolConfiguration = ConvertFrom-JsonArray $DatabricksPoolConfigurationJson
}
else
{
  # Use default Permissions structure defined in ./PermissionsConfiguration.json
  $DatabricksPoolConfiguration = (Get-Content -Raw -Path $DatabricksPoolConfigurationFile | ConvertFrom-Json)
}

#Set environment and API URL
Set-DatabricksEnvironment -AccessToken $accessToken -ApiRootUrl $apiUrl
$apiUrl = "https://"+$Region+".azuredatabricks.net"

#Set Header details to authenticate through the API Rest Call
$headers = @{
    "Authorization" = "Bearer " + $accessToken
    "Content-Type" = "application/json"
    "Host" = $Region+".azuredatabricks.net"
    "Accept" = "*/*"
    "Accept-Coding" = "gzip, deflate, br"
}

#List existing pools to determine whether to update or create
$URI = $apiUrl+"/api/2.0/instance-pools/list"	
try
{
   $Response = Invoke-RestMethod -method "GET" -Uri $URI -Headers $headers
    
   $existing = "True"
}
catch
{
    $existing = "False"
}

#Save obtained existing pool list in Hashtable
$ExistingPoolHashTable = @()

if ($existing -eq "True")
{
    foreach ($instancepool in $Response.instance_pools)
    {
         $ExistingPoolHashTable += [pscustomobject]@{
                PoolName  = $instancepool.instance_pool_name
                PoolID    = $instancepool.instance_pool_id
                NodeType  = $instancepool.node_type_id
                                                    }
    }
}

#Get every configuration line in Pool Config file
foreach ($configuration in $DatabricksPoolConfiguration)
{
    #Initialize variables
    $GetNodeType = ""
    $GetPoolId = ""
    $result= ""

    #Store configuration line details for each pool in variables
    $PoolName = $configuration.instance_pool_name
    $PoolTerminationTime = $configuration.idle_instance_autotermination_minutes
    $PoolNodeType = $configuration.node_type_id
    $PoolMinIdleInstances = $configuration.min_idle_instances
    $PoolMaxCapacity = $configuration.max_capacity
    $PoolSparkVersion = $configuration.preloaded_spark_versions

    #Check if this particular Pool existis already
    $CheckExists = $ExistingPoolHashTable | Where-Object {$_.PoolName -eq $PoolName}

    #Determine Body and Action to be used in Rest API call depending on whether Pool already exists 
    if ($CheckExists -ne $null)
    {
            
        $GetNodeType = $ExistingPoolHashTable | Where-Object {$_.PoolName -eq $PoolName} | select -Property NodeType
        $GetNodeType = $GetNodeType.NodeType  
            
        $GetPoolId = $ExistingPoolHashTable | Where-Object {$_.PoolName -eq $PoolName} | select -Property PoolID
        $GetPoolId = $GetPoolId.PoolID

        if($GetNodeType -eq $PoolNodeType)
        {
            $action = "Edit"
       
            $Body = '{"instance_pool_id":"' +$GetPoolId+'",
                        "instance_pool_name":"' +$PoolName+'",
                        "node_type_id":"' +$PoolNodeType+'",
                        "max_capacity":' +$PoolMaxCapacity+',
                        "min_idle_instances":' +$PoolMinIdleInstances+',
                        "idle_instance_autotermination_minutes":' +$PoolTerminationTime+'}'

            
            $URI = $apiUrl+"/api/2.0/instance-pools/edit"
        }
        else
        {
            $action = "Skip"
        }
    } else {
        $action = "Create"
        
        $Body = '{"instance_pool_name":"' +$PoolName+'",
                    "node_type_id":"' +$PoolNodeType+'",
                    "max_capacity":' +$PoolMaxCapacity+',
                    "min_idle_instances":' +$PoolMinIdleInstances+',
                    "preloaded_spark_versions":"' +$PoolSparkVersion+'",
                    "idle_instance_autotermination_minutes":' +$PoolTerminationTime+'}'

        
        $URI = $apiUrl+"/api/2.0/instance-pools/create"
    }
    
    
    #If the node of the existing pool mismatches the provided update details of the same pool, skip update and return message  
    #Do the API Rest call with the variable Body and URI and save output result message  
    if($action -eq "Skip")
    {
        $message = "Fail"
        $result = "Pool $PoolName : Provided Node Type $GetNodeType mismatches $PoolNodeType for the existing same pool. Both nodes should match for update to be done."
    }
    else
    {
        try
        {
            Invoke-RestMethod -method "POST" -Uri $URI -Headers $headers -Body $Body | Out-Null
            $message = "Success"
            if ($action -eq "Edit")
            {
                $result = "Pool $PoolName : Updated successfully"
            }
            else
            {
                $result = "Pool $PoolName : Created successfully"
            }
        }
        catch
        {
            $ErrorMessage =$_.Exception.Message
            $message = "Fail"
            $result = "Pool $PoolName : $ErrorMessage"
        }
    }
    
    #Return result message      
    if ($message -eq "Success")
      {
        Write-Host $result -fore green
      }
    else 
      {
        Write-Host $result -fore red
      }
 }

Two very useful packages and modules that have beneficial functions are DatabricksPS authored by Gerhard Brueckl and Azure.databricks.cicd.tools by Simon D’Morias – which can be accessed here:

Both have some great functions which allow many actions on Databricks environments done with ease.

Please get in touch if you have any questions or queries about this blog or our services.