Overview
Databricks offers the Lakehouse Platform, a cloud-native architecture that combines the data management capabilities of data warehouses with the flexibility and scalability of data lakes. The platform aims to unify data, analytics, and AI workloads on a single system (see the Databricks Lakehouse Platform overview). It is built on open-source technologies: Apache Spark for distributed data processing, Delta Lake for ACID transactions and data reliability on data lakes, and MLflow for managing the machine learning lifecycle.
The platform is suited for technical users such as data engineers, data scientists, and machine learning engineers who work with large datasets and require a collaborative environment for complex analytical and AI development tasks. Data engineers utilize Databricks for building and managing ETL/ELT pipelines, ensuring data quality, and preparing data for subsequent analysis. Data scientists can perform exploratory data analysis, build and train machine learning models, and manage experiments using integrated tools like MLflow.
Databricks' architecture addresses challenges associated with traditional data silos, where separate systems are often used for data warehousing and data science. By integrating these functions, the Lakehouse Platform seeks to reduce data movement, simplify data governance, and accelerate the development and deployment of data-intensive applications and AI models. Its serverless options and pay-as-you-go pricing model offer flexibility for varying workload demands, from ad-hoc queries to continuous data processing. The platform runs on AWS, Azure, and Google Cloud, so organizations can deploy and manage their data and AI workloads on their preferred cloud provider (see the Databricks pricing page).
The integration of open-source components is central to Databricks' approach. Apache Spark, originally created at UC Berkeley's AMPLab by the people who later founded Databricks, provides the distributed processing engine. Delta Lake, an open-source storage layer, brings data reliability, schema enforcement, and time travel capabilities to data lakes. MLflow offers a standardized way to track experiments, package models, and manage model lifecycle stages (see the MLflow documentation). This combination is designed to provide a cohesive environment for developing and operationalizing data and AI solutions, supporting use cases from real-time analytics to MLOps.
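As a small illustration of the time travel capability mentioned above, the following sketch builds a Delta Lake time travel query in SQL. The helper name `time_travel_query` and the table name `main.default.events` are hypothetical and used only for illustration; `VERSION AS OF` and `TIMESTAMP AS OF` are the Delta Lake SQL clauses for reading an earlier snapshot of a table. In a Databricks notebook, the resulting string could be passed to `spark.sql(...)`.

```python
from typing import Optional


def time_travel_query(table: str,
                      version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads an earlier snapshot of a Delta table.

    Exactly one of `version` or `timestamp` must be given. The table name and
    timestamp are interpolated as-is, so they must come from trusted input.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("Provide exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Hypothetical table name, for illustration only.
by_version = time_travel_query("main.default.events", version=3)
by_time = time_travel_query("main.default.events", timestamp="2024-01-01")
print(by_version)
print(by_time)

# In a Databricks notebook, where `spark` is predefined:
# spark.sql(by_version).show()
```

Building the statement separately from executing it keeps the sketch runnable outside a Spark environment; which versions are available depends on the table's retention settings.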
Key features
- Unified Data & Analytics Platform: Integrates data warehousing, data lakes, and machine learning on a single platform to eliminate data silos and simplify data governance.
- Delta Lake: Provides ACID transactions, schema enforcement, scalable metadata handling, and data versioning on data lakes, enhancing data reliability and quality (see the Delta Lake documentation).
- MLflow: An open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, reproducible runs, model packaging, and model serving (see the MLflow documentation).
- Apache Spark Integration: Leverages Spark's distributed processing capabilities for large-scale data engineering and analytics workloads, supporting batch and streaming data.
- Interactive Notebooks: Offers a collaborative, web-based environment for data exploration, code development (Python, SQL, Scala, R), and visualization.
- Databricks SQL: Provides a SQL-native interface for data analysts to run high-performance queries on lakehouse data, with built-in BI tool connectors.
- Photon Engine: A vectorized query engine designed for faster data processing, particularly for SQL workloads, enhancing performance and cost efficiency.
- Unity Catalog: A unified metadata and governance layer for data, analytics, and AI on the lakehouse, enabling centralized access control, auditing, and data discovery (see the Unity Catalog documentation).
- Serverless Compute: Managed compute infrastructure that automatically scales resources, reducing operational overhead for users.
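The experiment tracking described in the MLflow item above can be sketched as follows. This is a minimal illustration, not a full training workflow: the function `log_training_run`, the `RecordingTracker` stand-in, and all parameter and metric values are hypothetical. The function only relies on `start_run`, `log_params`, and `log_metrics`, which exist in MLflow's tracking API, so passing the `mlflow` module as `tracker` should work where MLflow is installed.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Tuple


def log_training_run(tracker: Any, run_name: str,
                     params: Dict[str, Any],
                     metrics: Dict[str, float]) -> None:
    """Record one run's parameters and metrics through an MLflow-style API."""
    with tracker.start_run(run_name=run_name):
        tracker.log_params(params)
        tracker.log_metrics(metrics)


class RecordingTracker:
    """Stand-in that records calls, so the sketch runs without MLflow installed."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    @contextmanager
    def start_run(self, run_name: str = "") -> Iterator[None]:
        self.calls.append(("start_run", run_name))
        yield

    def log_params(self, params: Dict[str, Any]) -> None:
        self.calls.append(("log_params", dict(params)))

    def log_metrics(self, metrics: Dict[str, float]) -> None:
        self.calls.append(("log_metrics", dict(metrics)))


# Hypothetical values, for illustration only.
tracker = RecordingTracker()
log_training_run(tracker, "baseline",
                 params={"max_depth": 3},
                 metrics={"rmse": 0.5})
print(tracker.calls)

# With MLflow installed (for example in a Databricks notebook):
# import mlflow
# log_training_run(mlflow, "baseline", {"max_depth": 3}, {"rmse": 0.5})
```

Passing the tracker in as an argument is a design choice for this sketch: it keeps the logging logic testable without a tracking server.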
Pricing
Databricks pricing follows a pay-as-you-go model, measured primarily in Databricks Units (DBUs) consumed. A DBU is a normalized unit of processing capability; its price varies by workload type (e.g., jobs, SQL, DLT), plan tier, region, and cloud provider (AWS, Azure, GCP). Plan tiers such as Premium and Enterprise differ in included features and per-DBU rates, and serverless options carry their own per-DBU rates. Additional costs may apply for cloud infrastructure resources (compute, storage, networking) billed by the underlying cloud provider.
Databricks DBU Pricing (as of 2026-05-07)
| Workload Type | Description | Example Price per DBU (USD; varies by region and cloud) | Typical Use Cases |
|---|---|---|---|
| Jobs Light | Optimized for simple ETL and data ingestion. | $0.07 - $0.10 | Batch data processing, ETL pipelines. |
| Jobs Compute | General-purpose compute for data engineering and machine learning. | $0.15 - $0.20 | Complex ETL, ML model training, data preparation. |
| Serverless SQL | High-performance serverless compute for SQL analytics. | $0.22 - $0.28 | Business intelligence queries, interactive SQL dashboards. |
| All-Purpose Compute | Interactive notebooks for data science and ML development. | $0.40 - $0.50 | Ad-hoc analysis, collaborative data science, model development. |
| Delta Live Tables (DLT) | Simplified ETL pipelines with built-in data quality and monitoring. | $0.15 - $0.25 | Streaming data ingestion, automated ETL. |
For detailed and region-specific pricing, refer to the official Databricks pricing page.
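To make the DBU arithmetic concrete, the sketch below estimates DBU charges as DBUs per hour, multiplied by hours, multiplied by the per-DBU rate. The per-DBU rates are taken from the example ranges in the table above; the DBU-per-hour figures, the hours, and the names `WorkloadEstimate` and `total_dbu_cost` are assumptions made for illustration. Cloud infrastructure charges are not included.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkloadEstimate:
    name: str
    dbus_per_hour: float  # assumed DBU consumption rate of the chosen compute
    hours: float          # expected runtime over the billing period
    price_per_dbu: float  # USD per DBU for this workload type

    def dbu_cost(self) -> float:
        """DBU charge in USD: DBUs consumed multiplied by the per-DBU rate."""
        return self.dbus_per_hour * self.hours * self.price_per_dbu


def total_dbu_cost(estimates: Iterable[WorkloadEstimate]) -> float:
    """Sum the DBU charges of all workloads (cloud infrastructure excluded)."""
    return sum(e.dbu_cost() for e in estimates)


# Hypothetical monthly workloads; only the per-DBU rates come from the table.
estimates = [
    WorkloadEstimate("nightly ETL (Jobs Compute)",
                     dbus_per_hour=8, hours=60, price_per_dbu=0.15),
    WorkloadEstimate("notebooks (All-Purpose Compute)",
                     dbus_per_hour=4, hours=40, price_per_dbu=0.40),
]

for e in estimates:
    print(f"{e.name}: ${e.dbu_cost():.2f}")
print(f"Total DBU charges: ${total_dbu_cost(estimates):.2f}")
```

With these assumed inputs, the ETL workload consumes 480 DBUs and the notebook workload 160 DBUs, so the lower per-DBU rate of job compute largely offsets its higher consumption.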
Common integrations
- Cloud Storage: Integrates with Amazon S3 (Databricks S3 documentation), Azure Data Lake Storage Gen2 (Databricks ADLS Gen2 documentation), and Google Cloud Storage (Databricks GCS documentation) for data persistence.
- Business Intelligence Tools: Connects with Tableau, Power BI, Qlik Sense, and other BI tools via standard ODBC/JDBC drivers and Databricks SQL connectors (Databricks BI documentation).
- Data Ingestion & Streaming: Integrates with Kafka (Databricks Kafka documentation), Azure Event Hubs, and Amazon Kinesis for real-time data streaming and ingestion.
- ML/AI Frameworks: Supports popular machine learning libraries and frameworks like TensorFlow, PyTorch, scikit-learn, and XGBoost (Databricks ML documentation).
- Version Control Systems: Integrates with Git (GitHub, GitLab, Azure DevOps, Bitbucket) for collaborative code development and version control of notebooks (Databricks Git integration).
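For the streaming integrations listed above, a read from a Kafka topic with Spark Structured Streaming might start from an options dictionary like the one below. The keys `kafka.bootstrap.servers`, `subscribe`, and `startingOffsets` are standard options of Spark's Kafka source; the helper name `kafka_source_options`, the broker address, and the topic name are hypothetical.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str,
                         starting_offsets: str = "latest") -> Dict[str, str]:
    """Collect the options Spark's Kafka source needs to subscribe to one topic."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


# Hypothetical broker and topic, for illustration only.
opts = kafka_source_options("broker-1.example.com:9092", "clickstream",
                            starting_offsets="earliest")
print(opts)

# In a Databricks notebook, where `spark` is predefined:
# stream_df = spark.readStream.format("kafka").options(**opts).load()
# Kafka delivers keys and values as binary, so cast them before use:
# events = stream_df.selectExpr("CAST(value AS STRING) AS value")
```

Authentication and TLS settings, which most production clusters require, are omitted from this sketch.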
Alternatives
- Snowflake: A cloud data warehousing service offering a platform for data storage, processing, and analytics with a strong focus on SQL workloads.
- Google Cloud Dataproc: A managed Spark and Hadoop service on Google Cloud, providing an environment for big data processing and analytics.
- Amazon EMR: A managed cluster platform that simplifies running big data frameworks like Apache Spark and Hadoop on AWS.
- IBM Cloud Pak for Data: An integrated data and AI platform for collecting, organizing, and analyzing data, and infusing AI across businesses.
- DataRobot: An enterprise AI platform focused on automated machine learning, MLOps, and AI governance.
Getting started
To begin using Databricks, users typically create a workspace in their preferred cloud provider (AWS, Azure, or GCP). This involves setting up a Databricks account and configuring the necessary cloud resources. Once the workspace is provisioned, users can create clusters, which are sets of computation resources, and attach notebooks to these clusters for executing code. The following Python example demonstrates a basic data loading and transformation operation using PySpark within a Databricks notebook.
```python
from pyspark.sql.functions import col

# In a Databricks notebook, a SparkSession is already available as `spark`.
# Outside Databricks, create one explicitly:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("DatabricksGettingStarted").getOrCreate()

# Sample data as a Python list of tuples
data = [
    ("Alice", 1, "New York"),
    ("Bob", 2, "London"),
    ("Charlie", 3, "Paris"),
    ("David", 4, "New York"),
    ("Eve", 5, "London"),
]

# Column names for the DataFrame (column types are inferred from the data)
columns = ["Name", "ID", "City"]

# Create a Spark DataFrame
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

# Keep only the New York rows and add a boolean column
# that is true when the ID is even
df_filtered = (
    df.filter(col("City") == "New York")
      .withColumn("Status", col("ID") % 2 == 0)
)

print("\nFiltered and Transformed DataFrame:")
df_filtered.show()

# Example of writing the DataFrame to a Delta Lake table.
# This requires a configured Unity Catalog or external metastore
# and appropriate write permissions.
# df_filtered.write.format("delta").mode("overwrite").saveAsTable("my_database.filtered_users")

# To read a Delta Lake table by name:
# delta_df = spark.read.table("my_database.filtered_users")
# Or by storage path:
# delta_df = spark.read.format("delta").load("/path/to/delta/table")
# delta_df.show()

# Stopping the session is not usually needed in Databricks notebooks,
# because the environment is managed.
# spark.stop()
```
This Python code snippet can be executed directly in a Databricks notebook. It creates a Spark DataFrame, filters it, and adds a column, which covers the first steps of working with data. For persistent storage and features such as ACID transactions, users write data to Delta tables, often managed through Unity Catalog (see the Unity Catalog overview).