Checkpointing for raster operations
Mosaic can checkpoint raster operations to disk. This is useful when working with large rasters and complex operations that require multiple stages of computation. Checkpointing saves intermediate results to the cloud object store, from which they can be loaded back into memory at a later stage. This reduces the memory needed to hold intermediate results, and can also improve performance by reducing the amount of data transferred between nodes during wide transformations.
Checkpointing is enabled by setting the spark.databricks.labs.mosaic.raster.use.checkpoint configuration to true.
By default, checkpointing is disabled. When checkpointing is enabled, Mosaic will save intermediate results to the checkpoint directory specified by the spark.databricks.labs.mosaic.raster.checkpoint configuration.
The checkpoint directory must be a valid DBFS path (by default it is set to /dbfs/tmp/mosaic/raster/checkpoint).
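As a rough illustration, the two configuration keys above can be written through Spark's runtime configuration API (spark.conf.set). The sketch below is minimal and rests on assumptions: the function name configure_raster_checkpoint is hypothetical, and this page does not say whether Mosaic picks up values changed after GDAL has already been enabled, so treat it as a way to see which keys are involved rather than as the recommended route.

```python
# Configuration keys named in this section.
CHECKPOINT_ENABLED_KEY = "spark.databricks.labs.mosaic.raster.use.checkpoint"
CHECKPOINT_PATH_KEY = "spark.databricks.labs.mosaic.raster.checkpoint"


def configure_raster_checkpoint(spark, path, enabled=True):
    """Hypothetical helper: write the checkpoint settings to the Spark conf.

    `spark` is expected to be a SparkSession (anything exposing
    `conf.set(key, value)` works). Assumption: Mosaic reads these keys
    from the session configuration.
    """
    # Spark configuration values are stored as strings.
    spark.conf.set(CHECKPOINT_ENABLED_KEY, "true" if enabled else "false")
    spark.conf.set(CHECKPOINT_PATH_KEY, path)


# Example call, inside a notebook where `spark` already exists:
# configure_raster_checkpoint(spark, "/dbfs/tmp/mosaic/raster/checkpoint")
```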
There are also a number of helper functions that can be used to manage checkpointing in Mosaic.
The simplest way to enable checkpointing is to specify a checkpoint directory when calling enable_gdal() from the Python interface…
    import mosaic as mos

    mos.enable_mosaic(spark, dbutils)
    mos.enable_gdal(spark, checkpoint_dir="/dbfs/tmp/mosaic/raster/checkpoint")
… or directly from Scala using MosaicGDAL.enableGDALWithCheckpoint():
    import com.databricks.labs.mosaic.functions.MosaicContext
    import com.databricks.labs.mosaic.gdal.MosaicGDAL
    import com.databricks.labs.mosaic.H3
    import com.databricks.labs.mosaic.JTS

    val mosaicContext = MosaicContext.build(H3, JTS)
    import mosaicContext.functions._

    MosaicGDAL.enableGDALWithCheckpoint(session, "/dbfs/tmp/mosaic/raster/checkpoint")
Checkpoint settings can be changed from the Python interface using the following functions:
update_checkpoint_path(spark: SparkSession, path: str)
set_checkpoint_on(spark: SparkSession)
set_checkpoint_off(spark: SparkSession)
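One way these three functions might be combined is to switch checkpointing on only for a heavy stage of a pipeline. The sketch below assumes the helpers are reachable on the imported mosaic module (as mos.update_checkpoint_path and so on); the raster_checkpointing wrapper itself is hypothetical and not part of Mosaic. The module is passed in as an argument so the wrapper does not depend on how Mosaic was imported.

```python
from contextlib import contextmanager


@contextmanager
def raster_checkpointing(mos, spark, path):
    """Hypothetical wrapper: enable checkpointing at `path` for one block.

    `mos` is the imported mosaic module and `spark` the active SparkSession.
    """
    # Point Mosaic at the desired checkpoint directory, then switch it on.
    mos.update_checkpoint_path(spark, path)
    mos.set_checkpoint_on(spark)
    try:
        yield
    finally:
        # Always switch checkpointing off again, even if the block fails.
        mos.set_checkpoint_off(spark)


# Example use in a notebook where mosaic and GDAL are already enabled:
# import mosaic as mos
# with raster_checkpointing(mos, spark, "/dbfs/tmp/mosaic/raster/checkpoint"):
#     ...  # raster operations that benefit from checkpointing
```

Note that the wrapper always turns checkpointing off on exit instead of restoring the previous state; that is a simplification consistent with the default of checkpointing being disabled.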