Azure Databricks concepts

03/03/2023

This article introduces the set of fundamental concepts you need to understand in order to use Azure Databricks effectively.

Some concepts are general to Azure Databricks, and others are specific to the persona-based Azure Databricks environment you are using:

Databricks Data Science & Engineering
Databricks Machine Learning
Databricks SQL

General concepts

This section describes concepts and terms that apply across all Azure Databricks persona-based environments.

Accounts and workspaces

In Azure Databricks, a workspace is an Azure Databricks deployment in the cloud that functions as an environment for your team to access Databricks assets. Your organization can choose to have either multiple workspaces or just one, depending on its needs.

An Azure Databricks account represents a single entity that can include multiple workspaces. Accounts enabled for Unity Catalog can be used to manage users and their access to data centrally across all of the workspaces in the account.

Billing

DBU

Azure Databricks bills based on Databricks units (DBUs), units of processing capability per hour based on VM instance type.

See the Azure Databricks pricing page.
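As a rough illustration of how DBU-based billing adds up, the following minimal sketch multiplies a per-node DBU rate by run time, node count, and a price per DBU. The function name, its parameters, and all of the numbers are hypothetical placeholders, not real rates; it covers only the DBU portion of a bill, and actual rates come from the pricing page.

```python
def estimate_dbu_cost(dbu_per_hour: float, hours: float,
                      price_per_dbu: float, nodes: int = 1) -> float:
    """Estimate the DBU charge for a cluster run (hypothetical helper).

    dbu_per_hour  -- DBUs one node of the chosen VM instance type consumes per hour
    hours         -- how long the cluster runs
    price_per_dbu -- price of one DBU for the workload type (placeholder value)
    nodes         -- number of nodes in the cluster
    """
    if dbu_per_hour < 0 or hours < 0 or price_per_dbu < 0 or nodes < 1:
        raise ValueError("rates and hours must be non-negative; nodes must be >= 1")
    dbus_consumed = dbu_per_hour * hours * nodes
    return dbus_consumed * price_per_dbu


# Placeholder numbers only: 4 nodes at 0.75 DBU/hour for 2 hours, 0.50 per DBU.
cost = estimate_dbu_cost(dbu_per_hour=0.75, hours=2.0, price_per_dbu=0.5, nodes=4)
print(cost)  # 3.0
```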

Authentication and authorization

This section describes concepts that you need to know when you manage Azure Databricks identities and their access to Azure Databricks assets.

User

A unique individual who has access to the system. User identities are represented by email addresses.

Service principal

A service identity for use with jobs, automated tools, and systems such as scripts, apps, and CI/CD platforms. Service principals are represented by an application ID.

Group

A collection of identities. Groups simplify identity management, making it easier to assign access to workspaces, data, and other securable objects. All Databricks identities can be assigned as members of groups.

Access control list (ACL)

A list of permissions attached to a workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to the object, as well as which operations they are allowed to perform on it. Each entry in a typical ACL specifies a subject and an operation.
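The subject-and-operation structure of an ACL entry can be sketched as a small data structure. This is a conceptual model only, not the Azure Databricks permissions API; the class name, the function name, and the operation names ("read", "edit") are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AclEntry:
    subject: str    # a user, group, or service principal (illustrative names)
    operation: str  # an operation the subject is allowed to perform


def is_allowed(acl: Iterable[AclEntry], subject: str, operation: str) -> bool:
    """Return True if the ACL contains an entry granting `operation` to `subject`."""
    return AclEntry(subject, operation) in set(acl)


# Hypothetical ACL attached to a single notebook.
notebook_acl = [
    AclEntry("alice@example.com", "read"),
    AclEntry("alice@example.com", "edit"),
    AclEntry("data-engineers", "read"),
]

print(is_allowed(notebook_acl, "alice@example.com", "edit"))  # True
print(is_allowed(notebook_acl, "data-engineers", "edit"))     # False
```

Anything not listed is denied in this sketch, which is the usual reading of an ACL: access exists only where an entry grants it.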

Personal access token

An opaque string used to authenticate to the REST API and used by tools in the Databricks integrations to connect to SQL warehouses.

Azure Active Directory tokens can also be used to authenticate to the REST API.
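To show where a token fits in a REST API call, the sketch below builds (but does not send) a request that carries the token in an Authorization header as a bearer token. The workspace URL and token value are placeholders, and build_request is a hypothetical helper; the clusters list endpoint is used only as an example path.

```python
import urllib.request


def build_request(workspace_url: str, token: str, path: str) -> urllib.request.Request:
    """Build an authenticated GET request for a workspace REST endpoint.

    The token is sent as a bearer token; nothing is transmitted until the
    request is passed to urllib.request.urlopen().
    """
    url = workspace_url.rstrip("/") + path
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder workspace URL and token; substitute your own values.
req = build_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",
    "dapi-EXAMPLE-TOKEN",
    "/api/2.0/clusters/list",
)
print(req.get_method(), req.full_url)
```

In practice, read the token from a secret store or an environment variable rather than hard-coding it in source.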

Databricks Data Science & Engineering

Databricks Data Science & Engineering is the classic Azure Databricks environment for collaboration among data scientists, data engineers, and data analysts. This section describes the fundamental concepts you need to understand in order to work effectively in the Databricks Data Science & Engineering environment.

Workspace

A workspace is an environment for accessing all of your Azure Databricks assets. A workspace organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.

This section describes the objects contained in the Azure Databricks workspace folders.

Notebook

A web-based interface to documents that contain runnable commands, visualizations, and narrative text.

Dashboard

An interface that provides organized access to visualizations.

Library

A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries and you can add your own.

Repo

A folder whose contents are co-versioned together by syncing them to a remote Git repository.

Experiment

A collection of MLflow runs for training a machine learning model.

Data Science & Engineering interface

This section describes the interfaces that Azure Databricks supports for accessing your assets: UI, API, and command-line (CLI).

The Azure Databricks UI provides an easy-to-use graphical interface to workspace folders and their contained objects, data objects, and computational resources.

REST API

There are three versions of the REST API: 2.1, 2.0, and 1.2. Versions 2.1 and 2.0 support most of the functionality of version 1.2, add further functionality, and are preferred.

CLI

An open source project hosted on GitHub. The CLI is built on top of the latest version of the REST API.

Data management in Data Science & Engineering

This section describes the objects that hold the data that you analyze and feed into machine learning algorithms.

Databricks File System (DBFS)

A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images), and other directories. DBFS is automatically populated with some datasets that you can use to learn Azure Databricks.

Database

A collection of information that is organized so that it can be easily accessed, managed, and updated.

Table

A representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs.

Metastore

The component that stores all the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. Every Azure Databricks deployment has a central Hive metastore, accessible by all clusters, to persist table metadata. You also have the option to use an existing external Hive metastore.

Visualization

A graphical presentation of the result of running a query.

Computation management in Data Science & Engineering

This section describes concepts that you need to know to run computations in Databricks Data Science & Engineering.

Cluster

A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job.

You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.

Pool

A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
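The allocation behavior described above can be modeled with a small simulation. This is a conceptual sketch, not Azure Databricks code: the InstancePool class and its attribute names are hypothetical, and "provisioning" here only increments a counter.

```python
class InstancePool:
    """Toy model of a pool of idle, ready-to-use instances."""

    def __init__(self, idle_instances: int = 0) -> None:
        self.idle = idle_instances  # instances waiting in the pool
        self.in_use = 0             # instances currently held by clusters
        self.provisioned = 0        # instances requested from the instance provider

    def acquire(self, count: int) -> tuple:
        """Allocate `count` nodes; return (taken_from_idle, newly_provisioned)."""
        from_idle = min(count, self.idle)
        newly_provisioned = count - from_idle  # pool expands if idle is insufficient
        self.idle -= from_idle
        self.provisioned += newly_provisioned
        self.in_use += count
        return from_idle, newly_provisioned

    def release(self, count: int) -> None:
        """Return instances to the pool when an attached cluster terminates."""
        if count > self.in_use:
            raise ValueError("cannot release more instances than are in use")
        self.in_use -= count
        self.idle += count


pool = InstancePool(idle_instances=2)
print(pool.acquire(3))  # (2, 1): two idle instances reused, one newly provisioned
pool.release(3)         # cluster terminated; all three instances become idle
print(pool.acquire(2))  # (2, 0): a different cluster reuses idle instances
```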

Databricks runtime

The set of core components that run on the clusters managed by Azure Databricks. Azure Databricks offers several types of runtimes:

Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.

Workflows

Frameworks to develop and run data processing pipelines:

Jobs: A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis. See Create, run, and manage Azure Databricks Jobs.
Delta Live Tables: A framework for building reliable, maintainable, and testable data processing pipelines. See Delta Live Tables introduction.

Workload

Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job) and data analytics (all-purpose).

Data engineering: An (automated) workload that runs on a job cluster, which the Azure Databricks job scheduler creates for each workload.
Data analytics: An (interactive) workload that runs on an all-purpose cluster. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
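The rule that the cluster type, not the kind of work, determines the workload category can be written as a short sketch. The function name and the string labels are assumptions chosen for illustration.

```python
def classify_workload(cluster_type: str) -> str:
    """Map a cluster type to its workload category.

    Whether the work is a notebook command or a job run does not matter here:
    a job that runs on an existing all-purpose cluster is still classified as
    data analytics (interactive).
    """
    if cluster_type == "job":
        return "data engineering"
    if cluster_type == "all-purpose":
        return "data analytics"
    raise ValueError(f"unknown cluster type: {cluster_type!r}")


print(classify_workload("job"))          # data engineering
print(classify_workload("all-purpose"))  # data analytics
```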

Execution context

The state for a REPL environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.

Databricks Machine Learning

The Databricks Machine Learning environment starts with the features provided in the Data Science & Engineering workspace and adds functionality. Important concepts include:

Experiments

The main unit of organization for tracking machine learning model development. Experiments organize, display, and control access to individual logged runs of model training code.

Feature Store

A centralized repository of features. Databricks Feature Store enables feature sharing and discovery across your organization and also ensures that the same feature computation code is used for model training and inference.

Models

A trained machine learning or deep learning model that has been registered in Model Registry.

Databricks SQL

Databricks SQL is geared toward data analysts who work primarily with SQL queries and BI tools. It provides an intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. Its UI is quite different from that of the Data Science & Engineering and Databricks Machine Learning environments. This section describes the fundamental concepts you need to understand in order to use Databricks SQL effectively.

Databricks SQL interface

This section describes the interfaces that Azure Databricks supports for accessing your Databricks SQL assets: UI and API.

UI: A graphical interface to dashboards and queries, SQL warehouses, query history, and alerts.

REST API: An interface that allows you to automate tasks on Databricks SQL objects.

Data management in Databricks SQL

Dashboard: A presentation of query visualizations and commentary.

Alert: A notification that a field returned by a query has reached a threshold.
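The idea of an alert, a field in a query result compared against a threshold, can be sketched as follows. This is a conceptual model rather than the Databricks SQL alerts implementation; the function name, the operator strings, and the sample row are hypothetical.

```python
import operator

# Comparison operators an alert condition might use (illustrative set).
_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def alert_triggered(row: dict, field: str, op: str, threshold) -> bool:
    """Return True if row[field] satisfies the comparison against the threshold."""
    if op not in _COMPARATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    return _COMPARATORS[op](row[field], threshold)


# Hypothetical first row of a query result.
result_row = {"error_count": 12}
print(alert_triggered(result_row, "error_count", ">", 10))  # True
```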

Computation management in Databricks SQL

This section describes concepts that you need to know to run SQL queries in Databricks SQL.

Query: A valid SQL statement.

SQL warehouse: A computation resource on which you execute SQL queries.

Query history: A list of executed queries and their performance characteristics.
