Databricks Unity Catalog is a unified governance solution for data and AI assets on the Lakehouse, and it is Databricks’ answer to data governance challenges. It helps organizations keep control over all their data assets by managing data access, supporting data quality, and tracking data lineage.
// WHAT
Databricks Unity Catalog offers the following capabilities:
- Administer data access policies that apply across all workspaces:
Define the policies in a single place with Unity Catalog. Enable secure data sharing while adhering to governance guidelines. Manage access for data and AI assets, including tables, files, notebooks, machine learning models, and dashboards.
- Govern and secure data effectively:
Use ANSI SQL or the web UI for data governance and security management. Administrators can grant permissions in the data lake at different levels: catalogs, schemas, tables, and views.
- Seamless auditing and data lineage integration:
Unity Catalog automatically captures detailed data lineage, providing insight into how data assets are created and used across programming languages. Automatic user-level audit logs record data access, which supports security and accountability. It also provides a clear lineage view across jobs and notebooks in Databricks.
- Effortless data exploration:
Tag and document data assets to organize and categorize them, then find them through the search interface.
- Easy access to operational data:
Query operational data via system tables, including audit logs, billable usage, and lineage.
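The system tables mentioned in the last bullet can be queried with ordinary SQL. The sketch below only builds the query strings, so it runs without a workspace. It assumes the `system.access.audit` and `system.billing.usage` tables are enabled for the account and that their column names match the ones used here; the function names are hypothetical.

```python
# Sketch: build SQL queries against Unity Catalog system tables.
# Assumption: system.access.audit and system.billing.usage are enabled and
# expose the columns referenced below; verify against your own account.


def recent_audit_events_sql(days: int = 7, limit: int = 100) -> str:
    """Return a query listing recent audit events, newest first."""
    if days <= 0 or limit <= 0:
        raise ValueError("days and limit must be positive")
    return (
        "SELECT event_time, user_identity.email AS user_email, "
        "service_name, action_name\n"
        "FROM system.access.audit\n"
        f"WHERE event_date >= date_sub(current_date(), {int(days)})\n"
        "ORDER BY event_time DESC\n"
        f"LIMIT {int(limit)}"
    )


def usage_by_sku_sql(days: int = 30) -> str:
    """Return a query summing billable usage per SKU over recent days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        "SELECT sku_name, usage_unit, SUM(usage_quantity) AS total_usage\n"
        "FROM system.billing.usage\n"
        f"WHERE usage_date >= date_sub(current_date(), {int(days)})\n"
        "GROUP BY sku_name, usage_unit\n"
        "ORDER BY total_usage DESC"
    )


if __name__ == "__main__":
    print(recent_audit_events_sql(days=3, limit=20))
    print(usage_by_sku_sql(days=30))
```

In a Databricks notebook, the returned string could be passed to `spark.sql(...)`, provided the caller has been granted access to the system schemas.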
Control access to data and other objects in Unity Catalog
Access to data and other objects in Unity Catalog can be granted by a metastore administrator, the owner of an object, or the owner of the catalog or schema that contains the object. Privileges can be managed using Catalog Explorer, SQL statements in notebooks or Databricks SQL queries, the Unity Catalog REST API, or Terraform.
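For the SQL route, privileges are granted and revoked with `GRANT` and `REVOKE` statements. The following is a minimal sketch that builds such statements as strings. The helper names, the small set of privileges it accepts, and the example table and group names are assumptions made for illustration, not a complete list of what Unity Catalog supports.

```python
import re

# Privileges and securable types covered by this sketch only;
# Unity Catalog defines more than these.
KNOWN_PRIVILEGES = {"USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY", "ALL PRIVILEGES"}
SECURABLE_TYPES = {"CATALOG", "SCHEMA", "TABLE"}
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_name(full_name: str) -> str:
    """Backtick-quote each part of a one- to three-level object name."""
    parts = full_name.split(".")
    if not 1 <= len(parts) <= 3 or not all(_IDENT.match(p) for p in parts):
        raise ValueError(f"unexpected object name: {full_name!r}")
    return ".".join(f"`{p}`" for p in parts)


def _validate(privilege: str, securable_type: str, principal: str) -> None:
    if privilege not in KNOWN_PRIVILEGES:
        raise ValueError(f"privilege not covered by this sketch: {privilege!r}")
    if securable_type not in SECURABLE_TYPES:
        raise ValueError(f"securable type not covered: {securable_type!r}")
    if not principal or "`" in principal:
        raise ValueError("principal must be non-empty and contain no backticks")


def grant_sql(privilege: str, securable_type: str, name: str, principal: str) -> str:
    """Build a GRANT statement for a user, group, or service principal."""
    _validate(privilege, securable_type, principal)
    return f"GRANT {privilege} ON {securable_type} {quote_name(name)} TO `{principal}`"


def revoke_sql(privilege: str, securable_type: str, name: str, principal: str) -> str:
    """Build the matching REVOKE statement."""
    _validate(privilege, securable_type, principal)
    return f"REVOKE {privilege} ON {securable_type} {quote_name(name)} FROM `{principal}`"


if __name__ == "__main__":
    # "main.sales.orders" and "analysts" are hypothetical names.
    print(grant_sql("SELECT", "TABLE", "main.sales.orders", "analysts"))
    print(revoke_sql("SELECT", "TABLE", "main.sales.orders", "analysts"))
```

Validating names before formatting them into SQL is a design choice of this sketch: it keeps arbitrary input out of the statement at the cost of rejecting names with unusual characters.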
The Unity Catalog object model
This is a simplified representation of securable Unity Catalog objects.
Data file formats supported by Unity Catalog
- Managed tables: must use the Delta table format.
- External tables: can use Delta, CSV, JSON, Avro, Parquet, ORC, or text.
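The format rule above can be sketched as a small check that runs before a `CREATE TABLE` statement is built. The sketch treats a table as external when a storage location is given and as managed otherwise. The function name and the example table names and path are hypothetical, and names are inserted into the statement unescaped.

```python
# Sketch: enforce the managed/external format rule listed above.
MANAGED_FORMATS = {"DELTA"}
EXTERNAL_FORMATS = {"DELTA", "CSV", "JSON", "AVRO", "PARQUET", "ORC", "TEXT"}


def create_table_sql(name: str, columns: str, fmt: str = "DELTA",
                     location: str = "") -> str:
    """Build a CREATE TABLE statement.

    An empty location means a managed table; a non-empty location
    means an external table stored at that path.
    """
    fmt = fmt.upper()
    if not location:
        if fmt not in MANAGED_FORMATS:
            raise ValueError(f"managed tables must use Delta, got {fmt}")
        return f"CREATE TABLE {name} ({columns}) USING DELTA"
    if fmt not in EXTERNAL_FORMATS:
        raise ValueError(f"unsupported external table format: {fmt}")
    return f"CREATE TABLE {name} ({columns}) USING {fmt} LOCATION '{location}'"


if __name__ == "__main__":
    # Hypothetical table names and storage path.
    print(create_table_sql("main.sales.orders", "order_id BIGINT"))
    print(create_table_sql("main.raw.events", "payload STRING",
                           fmt="json", location="s3://example-bucket/events"))
```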
// HOW
This part explains two setups: setting up Unity Catalog itself and setting up data access for users.
Steps to set up Unity Catalog:
- Configure a bucket or storage container that Unity Catalog can use to store and access managed table data.
- Create a metastore for each region in which your organization operates. The metastore serves as the top-level container for all the data in Unity Catalog. To create a metastore in Azure Databricks, you need to be an Azure Databricks account admin. If applicable, give Unity Catalog access to the bucket.
- Assign the workspaces to the metastore.
- Add users, groups, and service principals to your Databricks account.
- Optionally, transfer the metastore admin role to a group.
To set up data access for users:
- Create a compute resource in a workspace: a cluster or a SQL warehouse. This compute resource is used to run queries and commands, including grant statements on data objects.
- Create a catalog, schema, and table.
- Grant privileges to users, groups, or service principals at each level in the data hierarchy. In addition, you can control access at the row or column level with dynamic views.
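The last two steps can be sketched as a sequence of SQL statements: create the objects, grant a privilege at each level of the hierarchy, then add a dynamic view for column-level control. This is an illustration under assumptions. The catalog, schema, table, column, and group names are hypothetical, names are inserted unescaped, and the view relies on the Databricks SQL function `is_account_group_member`.

```python
# Sketch: statements for creating objects and granting read access.
# All object, column, and group names here are hypothetical examples.


def bootstrap_statements(catalog: str, schema: str, table: str, group: str) -> list:
    """Create a catalog, schema, and table, then grant read access.

    Reading a table requires a privilege at each level of the hierarchy:
    USE CATALOG, USE SCHEMA, and SELECT.
    """
    full_schema = f"{catalog}.{schema}"
    full_table = f"{full_schema}.{table}"
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {full_schema}",
        f"CREATE TABLE IF NOT EXISTS {full_table} "
        "(order_id BIGINT, region STRING, email STRING)",
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {full_schema} TO `{group}`",
        f"GRANT SELECT ON TABLE {full_table} TO `{group}`",
    ]


def masked_view_sql(view: str, table: str, privileged_group: str) -> str:
    """Dynamic view that shows the email column only to one group."""
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        "SELECT order_id, region,\n"
        f"  CASE WHEN is_account_group_member('{privileged_group}')\n"
        "       THEN email ELSE 'REDACTED' END AS email\n"
        f"FROM {table}"
    )


if __name__ == "__main__":
    for stmt in bootstrap_statements("main", "sales", "orders", "analysts"):
        print(stmt)
    print(masked_view_sql("main.sales.orders_masked", "main.sales.orders", "auditors"))
```

In a notebook attached to the compute resource from the first step, each string could be run with `spark.sql(stmt)`. A row-level variant would put a similar group check in the view's `WHERE` clause.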
// BEST PRACTICES
- Develop data governance and data isolation in building blocks:
In Unity Catalog, models and functions are also securable objects.
- Plan your data isolation model:
- Users can only gain access to data based on specified access rules
- Only assigned people or teams can manage data
- Data is physically separated in storage
- Audit access to data:
Monitor the lakehouse platform with audit logs captured by Unity Catalog.
- Use Delta Sharing to share data securely:
Delta Sharing is an open protocol for secure data sharing with other departments or organizations, regardless of the computing platform they use.
- Configure external locations and storage credentials:
A storage credential encapsulates a long-term cloud credential that provides access to cloud storage. An external location combines a storage credential with a cloud storage path.
- Organize data:
Use catalogs to create separation in your organization’s information architecture. Catalogs often correspond to a scope, team, or business unit.
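Two of the practices above, external locations and Delta Sharing, map directly onto SQL statements. The sketch below builds those statements as strings. It assumes a storage credential already exists, and every name and URL in it is hypothetical. Names are inserted unescaped.

```python
# Sketch: SQL for an external location and for a Delta Sharing share.
# Assumption: the storage credential named here already exists.


def external_location_sql(name: str, url: str, credential: str) -> str:
    """Bind a cloud storage path to an existing storage credential."""
    if "://" not in url:
        raise ValueError(f"expected a cloud storage URL, got {url!r}")
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' WITH (STORAGE CREDENTIAL {credential})"
    )


def share_statements(share: str, table: str, recipient: str) -> list:
    """Create a share, add one table to it, and grant it to a recipient."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


if __name__ == "__main__":
    # Hypothetical names and bucket path.
    print(external_location_sql("raw_landing", "s3://example-bucket/landing",
                                "example_credential"))
    for stmt in share_statements("sales_share", "main.sales.orders", "partner_org"):
        print(stmt)
```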