Unity Catalog (Databricks data governance)

Category: Data-Governance · Areas: data, api, infra

Description

Platform

Platform-specific (Databricks). Unity Catalog is Databricks’ unified governance layer for the lakehouse. This concern is the specific Databricks realization of the generic data-governance discipline, not a generic data-modeling or access-control concern (see Boundary).

Boundary

This concern owns how data and AI assets are governed on Databricks — the catalog namespace, the grant model, lineage, and governed external storage. It is Databricks’ concrete realization of data governance.

For the auth family (where app-layer authorization-model and catalog grants compose — neither substitutes for the other), see README-auth-family.md. For the logical domain model, defer to domain-driven-design: model entities/aggregates there; register and govern the physical catalog.schema.object namespace here. databricks-declarative-pipelines produces governed datasets; this concern owns the grants, ownership, and lineage on the result. databricks-apps consumes governed data; this concern owns the rule that an app reads through Unity Catalog grants, not around them.

Components

Unity Catalog organizes every governed asset under a metastore (the top-level container, one per region) exposing a three-level namespace: catalog.schema.object.

Namespace — where assets are registered

  • Metastore — the top-level container; holds catalogs, plus storage credentials and external locations directly beneath it.
  • Catalog — first level; the primary unit of data isolation. Organize by environment (dev/staging/prod) and/or business unit.
  • Schema (database) — second level; groups related objects.
  • Objects (third level) — tables and views (tabular), volumes (governance for non-tabular files), models (registered ML models), functions (UDFs). Tables and volumes are managed (Unity Catalog owns governance and storage lifecycle — preferred) or external (governance only; data lives at an external location).
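As a minimal sketch of the three-level namespace, the helper below splits a reference into catalog, schema, and object, and rejects references that look like paths or legacy names. The function names are hypothetical. The `hive_metastore.` prefix check and the "contains a slash" test are illustrative assumptions, not an exhaustive detection rule, and backtick-quoted identifiers containing dots are not handled.

```python
from typing import NamedTuple


class ObjectName(NamedTuple):
    catalog: str
    schema: str
    obj: str


def parse_object_name(name: str) -> ObjectName:
    """Split 'catalog.schema.object' into its three levels.

    Simplification: quoted identifiers that contain dots are not supported.
    """
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.object, got {name!r}")
    return ObjectName(*parts)


def is_governed_reference(ref: str) -> bool:
    """True if ref looks like a three-level name, not a path or legacy name."""
    # Assumption: anything with a slash is a storage path, and the
    # hive_metastore prefix marks a legacy (ungoverned) table.
    if "/" in ref or ref.startswith("hive_metastore."):
        return False
    try:
        parse_object_name(ref)
    except ValueError:
        return False
    return True
```

For example, `is_governed_reference("prod.sales.orders")` is true, while a two-level name such as `sales.orders` or a path such as `dbfs:/mnt/raw/orders` is rejected.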

Securable objects & the grant model

Every governed asset is a securable object on which privileges are granted to users, service principals, or groups. Privileges are inherited downward: a grant at the catalog level applies to current and future schemas and objects within it; a schema grant applies to its objects.

Key privileges:

  • USE CATALOG / USE SCHEMA — traversal prerequisites; required before any data access on objects beneath.
  • SELECT — read a table/view/materialized view.
  • MODIFY — insert/update/delete table data.
  • READ VOLUME / WRITE VOLUME — read/write files in a volume.
  • EXECUTE — invoke a function or load a registered model for inference.
  • BROWSE — discover an object and view its metadata (and explore its lineage) without data access.
  • CREATE TABLE / CREATE SCHEMA / CREATE CATALOG / CREATE VOLUME / CREATE FUNCTION / CREATE MODEL — creation rights at each level.
  • MANAGE — manage privileges, transfer ownership, delete (close to ownership, but does not auto-grant data privileges).
  • ALL PRIVILEGES — every applicable privilege (broad; avoid by default).
  • EXTERNAL USE SCHEMA — access tables via external engines over open APIs.
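To illustrate how the traversal privileges combine with a data privilege, the sketch below builds the three GRANT statements a read-only consumer group would need for one table. It only produces SQL text. The helper names and the example group and table are hypothetical, and the statements should be checked against the Databricks SQL reference before being run.

```python
def quote_principal(name: str) -> str:
    """Backtick-quote a principal name, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def read_only_grants(table: str, group: str) -> list[str]:
    """Return the GRANT statements for read access to one table.

    Expects table as catalog.schema.object (unquoted identifiers only).
    """
    catalog, schema, _obj = table.split(".")
    principal = quote_principal(group)
    return [
        # Traversal prerequisites come first, then the data privilege.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}",
        f"GRANT SELECT ON TABLE {table} TO {principal}",
    ]


if __name__ == "__main__":
    for statement in read_only_grants("prod.sales.orders", "data-analysts"):
        print(statement)
```

Granting all three explicitly keeps the grant narrow. A catalog-level SELECT would also work through inheritance, but it would apply to every current and future table in the catalog.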

Every securable object has an owner (a user, service principal, or — for production assets — a group) who can grant/revoke on it.

Governed external storage

  • Storage credential — the cloud identity Unity Catalog uses to reach external cloud storage.
  • External location — a governed path (a credential + a cloud URI) over which READ FILES / WRITE FILES are granted; external tables and external volumes are created beneath governed external locations, never at their root.
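The "beneath, never at the root" rule can be checked with a simple path comparison. The sketch below is a plain string check under the assumption that both values are normalized cloud URIs with the same scheme. The function name and the example bucket are hypothetical.

```python
def is_beneath_location(path: str, location_url: str) -> bool:
    """True when path is strictly below location_url, not the root itself."""
    root = location_url.rstrip("/")
    candidate = path.rstrip("/")
    # Requiring the "/" separator rejects both the root itself and
    # sibling prefixes such as ".../landing2" for a root of ".../landing".
    return candidate.startswith(root + "/")


if __name__ == "__main__":
    location = "s3://example-bucket/landing/"
    print(is_beneath_location("s3://example-bucket/landing/orders", location))
    print(is_beneath_location("s3://example-bucket/landing", location))
```

A check like this could run in review tooling before an external table or volume is created. It does not replace the platform's own validation.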

Fine-grained access & lineage

  • Row filters and column masks apply row-level and column-level security at query time.
  • Data lineage is captured automatically (table-, column-, and notebook/job-level) and aggregated across every workspace attached to the metastore; visible to principals with at least BROWSE/SELECT.
  • Audit logging records access automatically.
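As a rough illustration of attaching fine-grained controls, the sketch below generates the ALTER TABLE statements that bind an existing filter or mask function to a table. It assumes the functions have already been created as SQL UDFs. All table, column, and function names are hypothetical, and the generated SQL should be verified against the Databricks documentation.

```python
def set_row_filter(table: str, filter_fn: str, columns: list[str]) -> str:
    """Statement binding a row-filter function to the given input columns."""
    return f"ALTER TABLE {table} SET ROW FILTER {filter_fn} ON ({', '.join(columns)})"


def set_column_mask(table: str, column: str, mask_fn: str) -> str:
    """Statement binding a mask function to a single column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_fn}"


if __name__ == "__main__":
    print(set_row_filter("prod.sales.orders", "prod.sec.region_filter", ["region"]))
    print(set_column_mask("prod.hr.people", "ssn", "prod.sec.ssn_mask"))
```

Because the filter and mask are evaluated at query time, every audience reads the same table, and no per-audience copy is needed.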

Constraints

Every governed asset is registered and explicitly granted

  • All tables, views, volumes, models, and functions live in the catalog.schema.object namespace under a Unity Catalog metastore — no unmanaged or anonymous data access (no legacy Hive-metastore / no-isolation tables, no DBFS mounts standing in for governed data).
  • Access is granted explicitly via the privilege model; there is no implicit open access. USE CATALOG/USE SCHEMA plus the specific data privilege (SELECT, MODIFY, READ VOLUME, …) are all required.
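The "all required" rule and downward inheritance can be pictured with a toy model. The sketch below is not how Databricks evaluates access. It is a simplified illustration that treats a grant as a (principal, privilege, securable) tuple, ignores group membership and ownership, and uses hypothetical names throughout.

```python
Grant = tuple[str, str, str]  # (principal, privilege, securable)


def holds(grants: set[Grant], principal: str, privilege: str, securable: str) -> bool:
    """True if the privilege is granted on the securable or on an ancestor."""
    parts = securable.split(".")
    # For "prod.sales.orders" the scopes are: prod, prod.sales, prod.sales.orders.
    scopes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any((principal, privilege, scope) in grants for scope in scopes)


def can_select(grants: set[Grant], principal: str, table: str) -> bool:
    """Reading needs USE CATALOG, USE SCHEMA, and SELECT together."""
    catalog, schema, _obj = table.split(".")
    return (
        holds(grants, principal, "USE CATALOG", catalog)
        and holds(grants, principal, "USE SCHEMA", f"{catalog}.{schema}")
        and holds(grants, principal, "SELECT", table)
    )


if __name__ == "__main__":
    grants = {
        ("analysts", "USE CATALOG", "prod"),
        ("analysts", "USE SCHEMA", "prod.sales"),
        ("analysts", "SELECT", "prod.sales.orders"),
    }
    print(can_select(grants, "analysts", "prod.sales.orders"))
    print(can_select(grants, "interns", "prod.sales.orders"))
```

In this model, removing any one of the three grants makes `can_select` return false, which mirrors the constraint that a data privilege alone is not enough.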

Grant to groups, least-privilege, group ownership for production

  • Grant privileges to groups, not individual users (groups provisioned from the IdP via account-level SCIM).
  • Grant the narrowest privilege that satisfies the need; avoid ALL PRIVILEGES and broad READ FILES/WRITE FILES on external locations to end users.
  • Production catalogs and schemas are owned by a group, never an individual — and production jobs run under a service principal, not a personal identity.
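Group ownership for production assets might be applied with statements like the ones generated below. This is a sketch that only builds SQL text. The function name, the catalog and schema names, and the owning group are hypothetical, and the statements should be confirmed against the Databricks SQL reference.

```python
def ownership_statements(catalog: str, schemas: list[str], owner_group: str) -> list[str]:
    """Statements transferring a catalog and its schemas to an owning group."""
    owner = "`" + owner_group.replace("`", "``") + "`"
    statements = [f"ALTER CATALOG {catalog} OWNER TO {owner}"]
    statements += [f"ALTER SCHEMA {catalog}.{schema} OWNER TO {owner}" for schema in schemas]
    return statements


if __name__ == "__main__":
    for statement in ownership_statements("prod", ["sales", "hr"], "prod-data-owners"):
        print(statement)
```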

Prefer managed; govern external storage through external locations

  • Prefer managed tables and volumes (full governance + storage lifecycle).
  • External data is reached only through a governed external location built on a storage credential; do not create external tables/volumes at an external location’s root, and do not mount storage to DBFS that is also used as an external location.

Isolate by catalog; fine-grained access at the data layer

  • Use the catalog as the primary isolation boundary (per environment / business unit).
  • Apply row filters and column masks for row-/column-level access rather than forking copies of data per audience.

Data-layer governance is not replaced by app-layer authz

  • Unity Catalog grants are the data-layer control and compose with — never substitute for — application-layer authentication/authorization (security-owasp). An app reading lakehouse data does so through Unity Catalog (see databricks-apps), not around it.

Drift Signals (anti-patterns to reject in review)

  • A table/view/volume/model accessed outside the catalog.schema.object namespace — legacy Hive metastore, raw DBFS mount, anonymous path → register it in Unity Catalog and grant explicitly
  • Privileges granted to individual users instead of groups → grant to IdP-provisioned groups
  • ALL PRIVILEGES (or broad READ FILES/WRITE FILES on an external location) handed to end users → grant the narrowest privilege that satisfies the need
  • A production catalog/schema owned by an individual, or a production job running under a personal identity → group ownership + service-principal job identity
  • External table/volume created at an external location root, or DBFS mount doubling as an external location → create beneath a governed external location; remove the overlapping mount
  • Row-/column-level access solved by forking per-audience data copies → use row filters / column masks
  • App- or pipeline-side code reaching data around the catalog (hardcoded cloud path, direct credential) instead of through Unity Catalog grants → route data access through the catalog
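Some of these signals can be checked mechanically. The sketch below scans a list of grant records for two of them: grants to individual users and ALL PRIVILEGES. The record shape and the `principal_type` values are assumptions made for illustration. In practice the records would come from whatever grant inventory the team exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrantRecord:
    principal: str
    principal_type: str  # assumed values: "user", "group", "service_principal"
    privilege: str
    securable: str


def drift_findings(records: list[GrantRecord]) -> list[str]:
    """Return one human-readable finding per detected drift signal."""
    findings = []
    for record in records:
        if record.principal_type == "user":
            findings.append(
                f"{record.securable}: {record.privilege} granted to individual user {record.principal}"
            )
        if record.privilege == "ALL PRIVILEGES":
            findings.append(
                f"{record.securable}: ALL PRIVILEGES granted to {record.principal}"
            )
    return findings


if __name__ == "__main__":
    sample = [
        GrantRecord("data-analysts", "group", "SELECT", "prod.sales.orders"),
        GrantRecord("alice@example.com", "user", "SELECT", "prod.sales.orders"),
    ]
    for finding in drift_findings(sample):
        print(finding)
```

The other signals (ownership, external location roots, hardcoded paths) need different inputs and are left out of this sketch.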

When to use

Any product whose data and AI assets live in the Databricks lakehouse and must be governed: registered, access-controlled, and lineage-tracked through Unity Catalog. This is the data-governance member of the Databricks platform family; select it together with databricks-apps (when the product is a Databricks-hosted app) and/or databricks-declarative-pipelines (when ETL runs as declarative pipelines). It is composable (no slot); its areas (data, api, infra) scope its practices to the data, service, and infrastructure work items.

Do not select it for a product that does not store data in Databricks — use the generic data-governance / data-modeling concerns there instead.

Artifact Impact

Selecting this concern requires these artifacts to change (a selected concern absent from them is drift):

  • ADR: metastore/catalog layout (isolation boundaries), managed-vs-external, group/ownership model for production
  • TD: assets registered in catalog.schema.object; data access through Unity Catalog grants, not around them
  • DATA_DESIGN: namespace placement, grant model, row filters/column masks, governed external locations

ADR References

Record an ADR for the metastore/catalog layout (isolation boundaries: per-environment vs per-business-unit), the managed-vs-external decision for the product’s data, and the group/ownership model for production assets. A material uncertainty (workspace/metastore topology, external-storage credentials, cross-region constraints) is a tech-spike, not a silent assumption (see workflows/references/concern-resolution.md).

Practices by activity

Agents working in any of these activities inherit the practices below via the bead’s context digest.

These practices govern how data and AI assets are registered, granted, and lineage-tracked on Databricks. They are the Databricks realization of data governance. For the boundary (composition with authorization-model / security-owasp, domain-driven-design, databricks-apps, databricks-declarative-pipelines) see concern.md and the auth family ownership table at README-auth-family.md.

Requirements (Frame activity)

  • Decide the catalog isolation boundary up front: catalogs per environment (dev/staging/prod), per business unit, or both.
  • Identify every dataset, volume, model, and function the product reads or writes, and the groups that need access to each.

Design

  • Lay out the three-level namespace (catalog.schema.object) for the product’s assets; name the metastore/catalog topology in an ADR.
  • Decide managed vs external per asset — prefer managed; for external data, design the external location + storage credential rather than raw cloud paths or DBFS mounts.
  • Design grants as group-based, least-privilege: list USE CATALOG / USE SCHEMA plus the specific data privilege (SELECT, MODIFY, READ VOLUME, EXECUTE, …) each consumer group needs.
  • Design production ownership as group ownership, and production job/app identity as a service principal.
  • Where row-/column-level access differs by audience, design row filters and column masks instead of forked data copies.

Implementation

  • Register every asset in Unity Catalog — no Hive-metastore / no-isolation tables, no DBFS-mount-as-data, no anonymous cloud paths.
  • GRANT to groups (IdP-provisioned via account-level SCIM), never to individual users; grant the narrowest privilege (no reflexive ALL PRIVILEGES, no broad READ FILES/WRITE FILES to end users).
  • Create external tables/volumes beneath a governed external location, never at its root.
  • Assign group ownership to production catalogs/schemas; run production jobs and apps under a service principal.
  • Apply row filters / column masks for fine-grained access.
  • Let consuming pipelines and apps read through the catalog — no hardcoded cloud paths or embedded credentials that bypass grants.

Testing / Verification

  • Verify the namespace: every product asset resolves as catalog.schema.object (no legacy/Hive/DBFS path).
  • Verify access control: a principal without the required grant is denied (negative control), and a principal with the grant succeeds — observed, not assumed.
  • Verify grants are group-scoped (no individual-user grants) and production assets are group-owned.
  • Verify lineage is captured for the product’s key tables (upstream → downstream visible in Unity Catalog lineage).
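The positive and negative access controls could be structured as in the sketch below. The `read_table` callable is hypothetical. In a real workspace it would run a query as the given principal and raise `AccessDenied` when the platform reports a permission error. Here a fake reader stands in so that the control logic itself can be exercised.

```python
from typing import Callable


class AccessDenied(Exception):
    """Raised by the (hypothetical) reader when a principal lacks a grant."""


def verify_access(
    read_table: Callable[[str, str], object],
    table: str,
    granted_principal: str,
    ungranted_principal: str,
) -> dict[str, bool]:
    """Run a positive and a negative control; True means the control passed."""
    results = {}
    try:
        read_table(granted_principal, table)
        results["positive"] = True
    except AccessDenied:
        results["positive"] = False
    try:
        read_table(ungranted_principal, table)
        results["negative"] = False  # the read succeeded, so the control failed
    except AccessDenied:
        results["negative"] = True
    return results


def fake_reader(principal: str, table: str) -> str:
    """Stand-in for a real query runner; only 'svc-reporting' is granted."""
    if principal != "svc-reporting":
        raise AccessDenied(f"{principal} cannot read {table}")
    return "rows"


if __name__ == "__main__":
    print(verify_access(fake_reader, "prod.sales.orders", "svc-reporting", "svc-unrelated"))
```

Both controls are reported separately, so a missing grant and an overly broad grant show up as distinct failures.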

Quality Gates

  • All tables, views, volumes, and models the product uses are registered in Unity Catalog under catalog.schema.object; no unmanaged or anonymous data access (no Hive metastore, no DBFS-mount data, no raw cloud paths).
  • Access is granted explicitly and to groups (not individual users), least-privilege (no reflexive ALL PRIVILEGES); verified by a negative control — an ungranted principal is denied.
  • Production catalogs/schemas are group-owned and production jobs/apps run under a service principal, not a personal identity.
  • External data is reached through a governed external location (+ storage credential), not a DBFS mount or raw cloud path; nothing is created at an external location root.
  • Fine-grained access (where required) uses row filters / column masks, and lineage is captured for the product’s key datasets.