Prerequisites (Databricks)
Platform
Each version of Unravel has specific platform requirements. See Unravel's Databricks compatibility matrix to confirm that your Databricks platform meets the requirements for the version of Unravel that you are installing.
Hardware
Azure instance type:
Minimum: Standard_E8s_v3
EC2 instance type:
Minimum: r4.2xlarge (61 GiB RAM)
Maximum: r4.8xlarge (244 GiB RAM)
Recommended: r4.4xlarge (122 GiB RAM)
Virtualization type: HVM
Ports
GNU Compiler Collection (GCC)
GNU Compiler Collection (GCC) version 4.9.3, which includes compilers and libraries for C, C++, and other languages, should be installed on the Unravel node for Cost > Budget estimation to function. Refer to Install GNU Compiler Collection (GCC) version 4.9.3.
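As a quick way to confirm the GCC requirement on the Unravel node, the following is a minimal sketch that parses the first line of `gcc --version`. The helper names (`parse_gcc_version`, `installed_gcc_version`) are hypothetical and not part of Unravel. The version is picked out with a simple regular expression, which is a heuristic. The sketch compares against exactly 4.9.3 because that is the only version this section names.

```python
import re
import shutil
import subprocess

REQUIRED_GCC = (4, 9, 3)  # version named in this section


def parse_gcc_version(version_output):
    """Return (major, minor, patch) from `gcc --version` output, or None.

    Heuristic: takes the first x.y.z pattern on the first line.
    """
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_gcc_version():
    """Run `gcc --version` if gcc is on PATH and return the parsed version."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_gcc_version(result.stdout)


if __name__ == "__main__":
    found = installed_gcc_version()
    if found is None:
        print("gcc was not found on PATH")
    elif found == REQUIRED_GCC:
        print("gcc 4.9.3 is installed")
    else:
        print("gcc %d.%d.%d found; this section names 4.9.3" % found)
```

Run the script on the Unravel node itself, since that is where the section says GCC is needed.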
The following prerequisites apply only to Databricks on Azure:
You must already have an Azure account.
You must already have a resource group assigned to a region to group your policies, VMs, and storage blobs/lakes/drives.
A resource group is a container that holds related resources for an Azure solution. In Azure, you logically group related resources such as storage accounts, virtual networks, and virtual machines (VMs) to deploy, manage, and maintain them as a single entity.
You must have root privilege to run commands on the VM.
Unravel recommends deploying Azure Databricks workspaces in secure cluster mode (NPIP) with virtual network (VNET) injection. Such a deployment gives you better control over networking and security, especially if you want to lock down the workspace egress IP addresses.
The expected traffic between Azure Databricks and the Unravel server is as follows:
Inbound - Azure Databricks workspace egress IP addresses.
Outbound - Azure Databricks Access IP addresses.
If you are concerned about locking down the inbound traffic on the Unravel server (same as the egress traffic on the Azure Databricks workspaces), you can consider the following options:
If your workspaces are deployed in secure mode, a NAT gateway with a static IP address is already created automatically. Refer to Secure cluster connectivity (No Public IP / NPIP) for more information.
If your existing workspaces are deployed in insecure mode, you cannot turn on secure mode (NPIP) without rebuilding the workspaces. The workaround is to migrate the clusters from the existing insecure workspaces to secure workspaces.
Refer to Regional disaster recovery for Azure Databricks clusters for more information.
If your existing workspace is deployed with VNET injection in insecure mode, you cannot turn on secure mode (NPIP) without rebuilding the workspace. However, you can mitigate this by adding a firewall. Refer to the following:
Your virtual network and subnet(s) must be big enough to be shared by the Unravel VM and the target Databricks cluster(s).
You can use an existing virtual network or create a new one, but the virtual network must be in the same region and same subscription as the Azure Databricks workspace that you plan to create.
A CIDR range between /16 and /24 is required for the virtual network.
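The CIDR size requirement can be checked before you create the virtual network. The following is a minimal sketch using Python's standard `ipaddress` module. The function name `vnet_prefix_ok` and the sample address ranges are illustrative assumptions, not values from Unravel or Azure.

```python
import ipaddress


def vnet_prefix_ok(cidr, shortest_prefix=16, longest_prefix=24):
    """Return True if the CIDR block's prefix length is between /16 and /24.

    Raises ValueError if `cidr` is not a valid network address.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return shortest_prefix <= network.prefixlen <= longest_prefix


if __name__ == "__main__":
    # Sample ranges are placeholders; substitute your planned VNET range.
    for candidate in ["10.0.0.0/16", "10.1.2.0/24", "10.0.0.0/8", "10.1.2.0/26"]:
        print(candidate, vnet_prefix_ok(candidate))
```

A /8 block is rejected as too large and a /26 block as too small, while /16 through /24 pass.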
There are two options to enable the communication between the Unravel server and the Databricks Data Plane:
Assign a public IP address to the Unravel Azure VM and open port 4043 for non-SSL and port 4443 for unsecured SSL.
Assign the Unravel server no public IP address (NPIP), so that Unravel sensors installed on the Databricks Data Plane communicate (one-way) with the Unravel server via VNET peering or Virtual WAN.
You must ensure that there is no overlap in VNET IP ranges and that the traffic is private.
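To illustrate the no-overlap requirement for peered VNETs, here is a minimal sketch that reports every pair of overlapping address ranges. It uses the standard `ipaddress` module. The function name `overlapping_ranges` and the sample ranges are hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_ranges(cidrs):
    """Return a list of (cidr_a, cidr_b) pairs whose address ranges overlap."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


if __name__ == "__main__":
    # Placeholder ranges for the Unravel VNET and a Databricks workspace VNET.
    planned = ["10.0.0.0/16", "10.0.5.0/24", "10.2.0.0/16"]
    for pair in overlapping_ranges(planned):
        print("Overlap:", pair)
```

An empty result means none of the listed ranges collide. The check covers address overlap only and says nothing about whether the traffic stays private.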
Assign a public IP address to the Unravel Azure VM and open port 4043 for non-SSL and 4443 for unsecured SSL.
Allow inbound SSH connections to the Unravel VM.
You must allow outbound Internet access and all traffic within the subnet (VNET).
For firewall rules, refer to the Azure IP Ranges and Service Tags - Public Cloud list published by Microsoft.
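Once the network rules are in place, you can check whether the Unravel ports named above (4043 and 4443) accept TCP connections. The following is a minimal sketch using the standard `socket` module. The host name `unravel.example.com` is a placeholder, and the function names are hypothetical. A successful TCP connection shows only that the port is reachable, not that the Unravel service behind it is healthy.

```python
import socket

# Ports and labels as given in this section.
UNRAVEL_PORTS = {4043: "non-SSL", 4443: "unsecured SSL"}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_unravel_ports(host, ports=UNRAVEL_PORTS):
    """Return {port: reachable} for each port in `ports`."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    host = "unravel.example.com"  # placeholder; use your Unravel VM address
    for port, reachable in check_unravel_ports(host).items():
        state = "open" if reachable else "unreachable"
        print(f"{host}:{port} ({UNRAVEL_PORTS[port]}) is {state}")
```

Run the check from a machine inside the Databricks Data Plane network, since that is the path the Unravel sensors use.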