Process

At a high level, the steps in the migration process are:

  1. assessment
  2. group migration
  3. table migration
  4. HMS Federation
  5. data reconciliation
  6. code migration
  7. delta live table pipeline migration
  8. final details

The migration process can be visualized schematically as:

Table migration process

You are required to complete the assessment workflow before starting the table migration workflow.

This section explains how to migrate Hive metastore data objects to Unity Catalog. The table migration process consists of more steps than just a workflow; these steps are:

  1. Table mapping : Create a file that maps Hive metastore data objects to Unity Catalog locations.
  2. Data access : Set up identities to access table data.
  3. New Unity Catalog resources : Create new Unity Catalog resources that do not require touching existing Hive metastore resources.
  4. Migrate Hive metastore data objects : Migrate Hive metastore data objects to UC.

Table mapping

This section details how to create the table mapping file. This file points existing Hive metastore tables and views to Unity Catalog locations. When migrating the tables and views, the file is read to decide where to migrate the tables and views to.

Step 1 : Create the mapping file

Create the mapping file in the UCX installation folder by running the create-table-mapping command. By default, the file contains all the Hive metastore tables and views mapped to a single UC catalog, while maintaining the original schema and table names.
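
For example, the command can be run from the Databricks CLI:

databricks labs ucx create-table-mapping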

Step 2: Update the mapping file

Edit the mapping file from the previous step to:

  • Exclude tables and/or views by removing the lines
  • Change UC location by editing the destination catalog and/or schema

The mapping file is in CSV format and can be edited using any text editor or Excel. If using Excel, save the file in CSV format.

Example changes:

Before editing

| workspace_name | catalog_name | src_schema | dst_schema | src_table | dst_table |
|---|---|---|---|---|---|
| data_engineering_ws | 123333333 | sales_analysis | sales_analysis | ytd_sales | ytd_sales |

After editing

| workspace_name | catalog_name | src_schema | dst_schema | src_table | dst_table |
|---|---|---|---|---|---|
| data_engineering_ws | data_engineering | sales_analysis | sales | ytd_sales | ytd_analysis |

Data access

Throughout this guide, we refer to IAM roles/instance profiles in AWS and service principals/managed identities in Azure as "cloud principals".

This section creates the cloud principals used to access data in Unity Catalog and to access table data during the migration. To understand the motivation for this step, read how Databricks accesses cloud locations:

Step 1 : Map cloud principals to cloud storage locations

Map the cloud principals to cloud storage by running the principal-prefix-access command.
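
For example (cloud-specific flags, such as Azure subscription IDs or an AWS profile, may be required depending on your environment and UCX version):

databricks labs ucx principal-prefix-access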

Step 2 : Create or modify cloud principals and credentials

Manually create the cloud principals to access data from Unity Catalog:

Then, run the migrate-credentials command to migrate the cloud principal credentials to Unity Catalog.
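
For example:

databricks labs ucx migrate-credentials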

Step 3: Create the "uber" Principal

Create the "uber" principal by running the create-uber-principal command. The table migration requires a cloud principal with access to table data stored in cloud storage. These tables are known as external tables. UCX name this cloud principal the "uber" principal as it has access to all in-scope cloud storage. This principal is only required for the table migration process and not for ongoing UC usage. Once the upgrade is completed, this principal should be deleted.

New Unity Catalog resources

This section details the new Unity Catalog resources required for migrating the data objects. These resources can be created without touching the existing Hive metastore objects.

Step 0: Attach a metastore

If skipping the group migration, then a metastore should be attached to the workspace by following these instructions or running the assign-metastore command.
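
For example:

databricks labs ucx assign-metastore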

Step 1: Create external locations

Create UC external locations by running the migrate-locations command. The command creates a location for each location found during the assessment. It uses the credentials created in the previous steps.
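
For example:

databricks labs ucx migrate-locations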

Alternatively, manually create the external locations.

Step 2: Create Catalogs and Schemas

Create Unity Catalog catalogs and schemas to organize the destination tables and views in by running the create-catalogs-schemas command. The command creates the UC catalogs and schemas based on the table mapping file. Additionally, it migrates Hive metastore database permissions if present.

This step requires considering how to physically separate data in storage within UC. As Databricks recommends storing managed data at the catalog level, we advise preparing the external locations for the to-be-created catalogs before running the create-catalogs-schemas command. Either reuse previously created external locations or create additional external locations outside of UCX if data separation restrictions require it. Note that external locations can be reused when using subpaths; for example, a folder in cloud storage (abfss://container@storage.dfs.core.windows.net/folder) can reuse the external location of the cloud storage (abfss://container@storage.dfs.core.windows.net/). (The previous example also holds for other clouds.)
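
For example:

databricks labs ucx create-catalogs-schemas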

Migrate Hive metastore data objects

In this step, tables and views are migrated using the following workflows:

The table migration workflows can be triggered using the migrate-tables command or by starting the workflows manually from the workspace UI. The table migration workflows are designed to minimize data copying and to maintain metadata while allowing users to choose preferred strategies where possible. Choose the strategies for migrating tables using the table below:

| Object Type | Description | Upgrade Method |
|---|---|---|
| DBFS_ROOT_DELTA | Delta tables persisted on the Databricks file system (dbfs). | Create a copy of the table with the DEEP CLONE command. |
| DBFS_ROOT_NON_DELTA | Non-delta tables persisted on the Databricks file system (dbfs). | Create a copy of the table with a CREATE TABLE AS SELECT * FROM command. The UC table will be a Delta table. |
| MANAGED | Managed Hive metastore tables. | Depending on the managed table migration strategy chosen during installation:<br/>1. CLONE: Create a copy of the table with a CREATE TABLE LOCATION '<location>' AS SELECT * FROM command.<br/>2. SYNC_AS_EXTERNAL: Synchronize the table metadata to UC with the SYNC command. Warning: If the managed Hive metastore table is dropped, the drop deletes the underlying data, affecting the synchronized UC table as well.<br/>3. CONVERT_TO_EXTERNAL: First, convert the managed Hive metastore table to a non-managed table in place. Then, synchronize the table metadata to UC with the SYNC command. Warning: This strategy has the advantage over SYNC_AS_EXTERNAL that dropping the Hive metastore table does not delete the underlying data; however, the impact should be carefully validated in existing workloads, as the strategy converts the managed Hive metastore table to an external table in place. |
| EXTERNAL_SYNC | External tables that the SYNC command supports: Delta, Parquet, CSV, JSON, ORC, text or Avro tables. | Synchronize the table metadata to UC with the SYNC command. |
| EXTERNAL_NO_SYNC | External tables that the SYNC command does not support. | Create a copy of the table with a CREATE TABLE AS SELECT * FROM command. The UC table will be a Delta table. |
| EXTERNAL_HIVESERDE | External Hive SerDe tables that the SYNC command does not support. | Depending on the migration workflow chosen:<br/>1. migrate-external-tables-ctas (officially supported): Create a copy of the table with a CREATE TABLE AS SELECT * FROM command. The UC table will be a Delta table.<br/>2. migrate-external-hiveserde-tables-in-place-experimental (experimental): Recreate the Hive SerDe tables using the serialization and deserialization protocols. Warning: Although this strategy is supported, there is a risk that the old files created by Hive SerDe may not be processed correctly by the Spark datasource in corner cases. |
| VIEW | Views | Recreate views using their definitions after repointing their dependencies to UC objects. The migration process supports views depending on other views. |

The workflows may be executed multiple times. Each step is designed as a standalone operation that migrates all in-scope tables. After migration, each object is marked with an upgraded_to property containing the UC identifier to which the object is migrated. This property signals that the object is out-of-scope for other migration operations and that the view dependency exists within UC. The created UC objects are marked with an upgraded_from property containing the Hive metastore identifier from which the object was migrated.

Finally, the table migration workflows also migrate Hive metastore permissions to Unity Catalog.
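
For example, the workflows can be triggered from the Databricks CLI:

databricks labs ucx migrate-tables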

Considerations:

  • You may want to run the workflows multiple times to migrate tables in phases.
  • If your Delta tables in DBFS root have a large number of files, consider:
    • Setting higher Min and Max workers for auto-scale when being asked during the UCX installation. More cores in the cluster means more concurrency for calling cloud storage API to copy files when deep cloning the Delta tables.
    • Setting higher Parallelism for migrating DBFS root Delta tables with deep clone (default 200) when being asked during the UCX installation. This controls the number of Spark tasks/partitions to be created for deep clone.
  • Consider creating an instance pool, and setting its id when prompted during the UCX installation. This instance pool will be specified in the cluster policy used by all UCX workflows job clusters.
  • You may also manually edit the job cluster configuration per job or per task after the workflows are deployed.

Additional references:

Odds and Ends

The following sections detail how to repair/amend the UC metastore after the upgrade process.

Skip migrating schemas, tables or views

databricks labs ucx skip --schema X [--table Y] [--view Z]

The skip command marks a schema, table or view as to-be skipped during the migration processes.

Move data objects

databricks labs ucx move --from-catalog A --from-schema B --from-table C --to-catalog D --to-schema E

The move command moves the object from the source to the target location. The upgraded_to property on the source object is updated to reflect the new location. Use this command if the data object was migrated to the wrong destination.

Alias data objects

databricks labs ucx alias --from-catalog A --from-schema B --from-table C --to-catalog D --to-schema E

The alias command creates an alias for the object in the target location by creating a view that reads from the table that needs aliasing. It creates a mirror view of the view that is marked as an alias. Use this command if multiple Hive metastore tables point to the same location, as UC does not support tables with overlapping data locations.

Revert migrated data objects

databricks labs ucx revert-migrated-tables --schema X --table Y [--delete-managed]

The revert-migrated-tables command drops the Unity Catalog table or view and resets the upgraded_to property on the source object. Use this command to allow migrating a table or view again.

Hive Metastore Federation Process

Hive Metastore (HMS) Federation is a feature that allows HMS to be federated to a catalog. More information is available in the Databricks documentation. HMS Federation is used in two scenarios:

  1. A step to ease the migration to Unity Catalog.
  2. A hybrid solution where access to the data from both HMS and UC is required.

You are required to complete the assessment workflow before starting the table migration workflow. HMS Federation is an alternative to table migration. You are required to complete all the steps documented in the table migration process except for:

There are two steps to enable HMS Federation:

  1. enable-hms-federation
  2. create-federated-catalog
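
Both can be run from the Databricks CLI:

databricks labs ucx enable-hms-federation
databricks labs ucx create-federated-catalog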

Code Migration

After you're done with the table migration and data reconciliation, you can proceed with code migration.

Before migrating code, use the linter to investigate what should be changed to become Unity Catalog compatible:
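
For example, code on your local machine can be checked with the lint-local-code command (assuming it is available in your UCX version):

databricks labs ucx lint-local-code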

The linter advice shows codes and messages for the detected compatibility issues and how to resolve them.

After investigating the code linter advice, the code can be migrated. We recommend to:

Delta Live Table Pipeline Migration Process

You are required to complete the assessment workflow before starting the pipeline migration workflow.

The pipeline migration process is a workflow that clones the Hive Metastore Delta Live Table (DLT) pipelines to the Unity Catalog. Upon the first update, the cloned pipeline will copy over all the data and checkpoints, and then run normally thereafter. After the cloned pipeline reaches ‘RUNNING’, both the original and the cloned pipeline can run independently.

Example:

An existing HMS DLT pipeline called "dlt_pipeline" will be stopped and renamed to "dlt_pipeline [OLD]". The new cloned pipeline will be named "dlt_pipeline".

Known issues and Limitations:

  • Only clones from HMS to UC are supported.
  • Pipelines may only be cloned within the same workspace.
  • HMS pipelines must currently be publishing tables to some target schema.
  • Only the following streaming sources are supported:
    • Delta
    • Autoloader
      • If your pipeline uses Autoloader with file notification events, do not run the original HMS pipeline after cloning as this will cause some file notification events to be dropped from the UC clone. If the HMS original was started accidentally, missed files can be backfilled by using the cloudFiles.backfillInterval option in Autoloader.
    • Kafka where kafka.group.id is not set
    • Kinesis where consumerMode is not "efo"
  • Maintenance is automatically paused (for both pipelines) while migration is in progress
  • If an Autoloader source specifies an explicit cloudFiles.schemaLocation, mergeSchema needs to be set to true for the HMS original and UC clone to operate concurrently.
  • Pipelines that publish tables to custom schemas are not supported.
  • On tables cloned to UC, time travel queries are undefined when querying by timestamp to versions originally written on HMS. Time travel queries by version will work correctly, as will time travel queries by timestamp to versions written on UC.
  • All existing limitations of using DLT on UC.
  • Existing UC limitations
    • If tables in the HMS pipeline specify storage locations (using the "path" parameter in Python or the LOCATION clause in SQL), the configuration "pipelines.migration.ignoreExplicitPath" can be set to "true" to ignore the parameter in the cloned pipeline.

Considerations

  • Do not edit the notebooks that define the pipeline during cloning.
  • The original pipeline should not be running when requesting the clone.
  • When a clone is requested, DLT will automatically start an update to migrate the existing data and metadata for Streaming Tables, allowing them to pick up where the original pipeline left off.
  • It is expected that the update metrics do not include the migrated data.
  • Make sure all name-based references in the HMS pipeline are fully qualified, e.g. hive_metastore.schema.table
  • After the UC clone reaches RUNNING, both the original pipeline and the cloned pipeline may run independently.

Final details

Once you're done with the code migration, you can run the: