Tile Structure

Understanding the internal structure of GeoBrix tiles is essential for advanced use cases like custom UDFs, direct data manipulation, and performance optimization.

Overview

In GeoBrix, a tile is not a simple binary column. It is a structured type (struct) with three fields that together represent a raster dataset, its metadata, and optional grid cell information.

Tile Schema

A tile has the following structure:

struct<
  cellid: bigint,               -- Grid cell ID (nullable)
  raster: string|binary,        -- File path or binary content
  metadata: map<string,string>  -- Driver, extension, size, etc.
>

Field Descriptions

| Field    | Type               | Nullable | Description |
|----------|--------------------|----------|-------------|
| cellid   | bigint (Long)      | Yes      | Grid cell identifier for tessellated rasters. null for non-tessellated rasters. |
| raster   | string or binary   | No       | Either a file path (string) or binary raster content (bytes). |
| metadata | map<string,string> | Yes      | Key-value map containing driver name, file extension, size, and other metadata. |
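
When unit-testing UDF logic outside Spark, it can help to mirror the three fields in plain Python. The sketch below is illustrative only: the `Tile` class and its two helper properties are hypothetical names, not part of the GeoBrix API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Tile:
    """Plain-Python mirror of the tile struct (hypothetical helper, not a GeoBrix class)."""

    cellid: Optional[int]               # null for non-tessellated rasters
    raster: Union[str, bytes]           # file path or binary raster content
    metadata: Optional[Dict[str, str]]  # driver, extension, size, ...

    @property
    def is_tessellated(self) -> bool:
        # A tile carries a grid cell ID only after tessellation.
        return self.cellid is not None

    @property
    def is_file_based(self) -> bool:
        # A string raster is a path; bytes are in-memory raster content.
        return isinstance(self.raster, str)


file_tile = Tile(
    cellid=None,
    raster="/path/to/raster.tif",
    metadata={"driver": "GTiff", "extension": ".tif", "size": "-1"},
)
print(file_tile.is_tessellated, file_tile.is_file_based)  # False True
```

The two properties correspond to the distinctions this page draws: tessellated versus non-tessellated tiles, and path-based versus binary tiles.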

Field Details

1. cellid

The cellid field identifies which grid cell a tile belongs to when using tessellation (e.g., rst_h3_tessellate).

Properties:

  • Type: bigint (64-bit integer)
  • Nullable: Yes
  • Purpose: Enables spatial indexing and joining of tessellated rasters

Values:

  • null - For non-tessellated rasters (e.g., from rst_fromfile)
  • > 0 - For tessellated rasters (H3 cell ID)

Example:

SQL_CELLID_NON_TESSELLATED = f"""-- Non-tessellated: cellid is null
SELECT tile.cellid
FROM gdal.`{SAMPLE_NYC_RASTER}`;
-- Returns: null"""

Example output
+------+
|cellid|
+------+
|null  |
+------+

SQL_CELLID_TESSELLATED = f"""-- Tessellated: cellid contains H3 cell ID
SELECT tile.cellid
FROM (
    SELECT explode(gbx_rst_h3_tessellate(tile, 7)) AS tile
    FROM gdal.`{SAMPLE_NYC_RASTER}`
);
-- Returns: 604189641255419903, 604189641255420159, ..."""

Example output
+-------------------+
|cellid             |
+-------------------+
|604189641255419903 |
+-------------------+
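
H3 libraries often exchange cell IDs as hexadecimal strings, while the cellid field stores the same index as a 64-bit integer. The following is a minimal sketch of converting between the two forms and reading the resolution from the index bits. It assumes the standard H3 index layout (resolution in bits 52-55). The function names are illustrative, and the sample index is a widely used H3 example value rather than one taken from the data above.

```python
def cellid_to_hex(cellid: int) -> str:
    """Format a 64-bit H3 index as the lowercase hex string used by H3 libraries."""
    return format(cellid, "x")


def hex_to_cellid(h3_hex: str) -> int:
    """Parse an H3 hex string back into the integer stored in tile.cellid."""
    return int(h3_hex, 16)


def h3_resolution(cellid: int) -> int:
    """Read the resolution (0-15) from bits 52-55 of the index (assumed H3 layout)."""
    return (cellid >> 52) & 0xF


sample = hex_to_cellid("8928308280fffff")  # resolution-9 example index
print(cellid_to_hex(sample))  # 8928308280fffff
print(h3_resolution(sample))  # 9
```

Converting to the hex form is mainly useful when joining cellid against data produced by tools that store H3 indexes as strings.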

2. raster

The raster field contains the actual raster data, either as a file path or binary content.

Type depends on constructor:

  • rst_fromfile() → string (file path)
  • rst_fromcontent() → binary (raster bytes)
  • GDAL reader → binary (raster bytes)

Properties:

  • Type: string or binary
  • Nullable: No
  • Purpose: Provides access to the underlying raster data

String (Path) Format:

file:///path/to/raster.tif
/Volumes/<catalog>/<schema>/<volume>/data/raster.tif
s3://bucket/path/to/raster.tif

Binary Format:

  • Complete GeoTIFF file in memory
  • Can be deserialized with GDAL/rasterio
  • Typically compressed (LZW, DEFLATE, etc.)

Example:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

# Access path from rst_fromfile (sample data path)
df = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)

path_df = df.select(f.col("tile.raster").alias("raster_path"))
# Returns: path string to the raster

# Access binary from GDAL reader (sample data from mounted Volumes)
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
binary_df = df.select(f.col("tile.raster").alias("raster_binary"))

Example output
+----------------------------------------------------------+
|raster_path / raster_binary                               |
+----------------------------------------------------------+
|.../nyc/sentinel2/nyc_sentinel2_red.tif or [BINARY]       |
+----------------------------------------------------------+
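
Because the raster field can hold either a path or raw bytes, code that receives it (for example inside a Python UDF) usually needs to branch on the value's type. The sketch below shows one way to do that in plain Python. `describe_raster` is a hypothetical helper, and the magic-number check covers only TIFF and BigTIFF headers, assuming the binary content is a GeoTIFF as described above.

```python
from typing import Union

# Magic numbers for classic TIFF and BigTIFF, little- and big-endian.
_TIFF_MAGIC = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")


def describe_raster(raster: Union[str, bytes, bytearray, None]) -> str:
    """Classify a tile.raster value as 'path', 'tiff-bytes', 'bytes', or 'null'."""
    if raster is None:
        return "null"
    if isinstance(raster, str):
        return "path"
    # Binary columns may arrive in Python as bytearray; normalise the header first.
    header = bytes(raster[:4])
    return "tiff-bytes" if header in _TIFF_MAGIC else "bytes"


print(describe_raster("/path/to/raster.tif"))      # path
print(describe_raster(b"II*\x00" + b"\x00" * 16))  # tiff-bytes
```

The None branch is defensive only, since the field is documented as non-nullable.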

3. metadata

The metadata field contains key-value pairs describing the raster format and properties.

Properties:

  • Type: map<string,string>
  • Nullable: Yes
  • Purpose: Provides format information needed for GDAL operations

Common Keys:

  • driver - GDAL driver name (e.g., "GTiff", "NetCDF", "HDF4")
  • extension - File extension (e.g., ".tif", ".nc")
  • size - Size in bytes (as string) or "-1" for file-based tiles
  • Other format-specific metadata

Example:

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

metadata_df = df.select(
    f.col("tile.metadata").alias("metadata"),
    f.col("tile.metadata.driver").alias("driver"),
    f.col("tile.metadata.extension").alias("extension"),
    f.col("tile.metadata.size").alias("size")
)

Example output
+------------------+-------+----------+------+
|metadata          |driver |extension |size  |
+------------------+-------+----------+------+
|{driver=GTiff,...}|GTiff  |.tif      |...   |
+------------------+-------+----------+------+
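
Since every metadata value is a string, downstream code typically has to convert fields such as size before using them. The following is a minimal sketch of that conversion in plain Python. `TileInfo` and `parse_tile_metadata` are hypothetical names, and treating any negative size as "unknown" is an assumption based on the "-1" convention for file-based tiles listed above.

```python
from typing import Dict, NamedTuple, Optional


class TileInfo(NamedTuple):
    driver: Optional[str]
    extension: Optional[str]
    size_bytes: Optional[int]  # None when unknown or when the tile is file-based


def parse_tile_metadata(metadata: Optional[Dict[str, str]]) -> TileInfo:
    """Convert the string-valued metadata map into typed fields."""
    if not metadata:
        return TileInfo(None, None, None)
    raw_size = metadata.get("size")
    try:
        size = int(raw_size) if raw_size is not None else None
    except ValueError:
        size = None  # non-numeric size string
    if size is not None and size < 0:
        size = None  # "-1" marks a file-based tile
    return TileInfo(metadata.get("driver"), metadata.get("extension"), size)


print(parse_tile_metadata({"driver": "GTiff", "extension": ".tif", "size": "-1"}))
# TileInfo(driver='GTiff', extension='.tif', size_bytes=None)
```

Returning None rather than -1 keeps a sentinel value from leaking into sums or averages computed over tile sizes.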

Working with Tiles

Accessing Tile Fields

Use dot notation to access tile struct fields:

Python:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

# Access individual fields
df.select(
    f.col("tile.cellid"),
    f.col("tile.raster"),
    f.col("tile.metadata"),
    f.col("tile.metadata.driver")
)

Example output
+------+--------+------------------+-------+
|cellid|raster  |metadata          |driver |
+------+--------+------------------+-------+
|null  |[BINARY]|{driver=GTiff,...}|GTiff  |
+------+--------+------------------+-------+

Scala:

import org.apache.spark.sql.functions._
import com.databricks.labs.gbx.rasterx.{functions => rx}

val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")

// Access individual fields
df.select(
  col("tile.cellid"),
  col("tile.raster"),
  col("tile.metadata"),
  col("tile.metadata.driver")
).show()

Example output
+------+--------+------------------+-------+
|cellid|raster  |metadata          |driver |
+------+--------+------------------+-------+
|null  |[BINARY]|{driver=GTiff,...}|GTiff  |
+------+--------+------------------+-------+

SQL:

SQL_ACCESSING_TILE_FIELDS = f"""SELECT
    tile.cellid,
    tile.raster,
    tile.metadata,
    tile.metadata['driver'] AS driver
FROM gdal.`{SAMPLE_NYC_RASTER}`;"""

Example output
+------+--------+------------------+-------+
|cellid|raster  |metadata          |driver |
+------+--------+------------------+-------+
|null  |[BINARY]|{driver=GTiff,...}|GTiff  |
+------+--------+------------------+-------+

Filtering by Metadata

Filter tiles based on driver or other metadata:

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

# Filter by driver
gtiff_only = df.filter(f.col("tile.metadata.driver") == "GTiff")

# Filter by file extension
tif_files = df.filter(f.col("tile.metadata.extension") == ".tif")

Example output
Filtered DataFrame (e.g. driver = GTiff or extension = .tif).

Using Tiles in Custom UDFs

Access tile components for custom processing:

from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def get_raster_size(raster_binary, metadata):
    """Get size of raster data in bytes."""
    # Prefer the size recorded in the tile metadata
    if metadata and "size" in metadata:
        return int(metadata["size"])
    # Otherwise fall back to the length of the raster field itself
    elif raster_binary:
        return len(raster_binary)
    return 0

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
df_with_size = df.withColumn(
    "data_size",
    get_raster_size(f.col("tile.raster"), f.col("tile.metadata"))
)

Example output
+----+---------+
|path|data_size|
+----+---------+
|... |12345678 |
+----+---------+

Processing Binary Raster Data

When the raster field contains binary data, use it with rasterio or GDAL:

from rasterio.io import MemoryFile
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def compute_mean_from_tile(raster_binary):
    """Compute mean of the first band from binary raster data"""
    import numpy as np

    if raster_binary is None:
        return None

    # Convert to bytes if needed
    tile_data = bytes(raster_binary)

    # Open with rasterio
    with MemoryFile(tile_data) as memfile:
        with memfile.open() as src:
            data = src.read(1)  # read band 1
            return float(np.mean(data))

# Use with tiles from content or GDAL reader (sample data)
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
stats_df = df.withColumn(
    "mean_value",
    compute_mean_from_tile(f.col("tile.raster"))
)

Example output
+----+----------+
|path|mean_value|
+----+----------+
|... |0.42      |
+----+----------+

Comparing File-Based vs Binary Tiles

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

# File-based tile (sample data path)
file_tile = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)

file_tile.select(f.col("tile.raster")).show(truncate=False)
# +----------------------------------------------------------+
# |raster                                                    |
# +----------------------------------------------------------+
# |/Volumes/.../nyc/sentinel2/nyc_sentinel2_red.tif          |
# +----------------------------------------------------------+

# Binary tile (same file read as binary)
binary_tile = spark.read.format("binaryFile").load(SAMPLE_NYC_RASTER).select(
    rx.rst_fromcontent(f.col("content"), f.lit("GTiff")).alias("tile")
)

binary_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()

Example output
+----------------------------------------------------------+
|raster (path) or size_bytes                               |
+----------------------------------------------------------+
|/Volumes/.../nyc_sentinel2_red.tif or 2345678             |
+----------------------------------------------------------+

Tessellated vs Non-Tessellated Tiles

Non-Tessellated Tiles

Created by constructors (rst_fromfile, rst_fromcontent) or readers:

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)

df.select(
    f.col("tile.cellid"),    # null
    f.col("tile.raster"),    # binary data
    f.col("tile.metadata")   # {driver: "GTiff", ...}
).show()

Example output
+------+--------+------------------+
|cellid|raster  |metadata          |
+------+--------+------------------+
|null  |[BINARY]|{driver=GTiff,...}|
+------+--------+------------------+

Characteristics:

  • cellid is null
  • Represents an entire raster or a tile produced by a tiling operation
  • Suitable for processing complete rasters

Tessellated Tiles

Created by rst_h3_tessellate:

from databricks.labs.gbx.rasterx import functions as rx

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER).select(
    f.explode(rx.rst_h3_tessellate(f.col("tile"), f.lit(7))).alias("tile")
)

df.select(
    f.col("tile.cellid"),    # H3 cell ID (e.g., 604189641255419903)
    f.col("tile.raster"),    # binary data (clipped to cell)
    f.col("tile.metadata")   # {driver: "GTiff", RASTERX_CELL_ID: "604...", ...}
).show()

Example output
+-------------------+--------+------------------+
|cellid             |raster  |metadata          |
+-------------------+--------+------------------+
|604189641255419903 |[BINARY]|{RASTERX_CELL_ID..|
+-------------------+--------+------------------+

Characteristics:

  • cellid contains H3 cell ID
  • Raster clipped to cell bounds
  • Enables spatial joins and grid-based processing
  • Metadata includes RASTERX_CELL_ID key
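
To illustrate why a shared cellid enables joins, the sketch below pairs tiles from two collections that fall in the same grid cell, using plain Python dictionaries in place of Spark rows. `join_on_cellid` is a hypothetical helper and the sample values are made up. In Spark, the equivalent operation would be an equi-join on tile.cellid.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TileDict = Dict[str, object]  # e.g. {"cellid": 2, "raster": b"..."}


def join_on_cellid(
    left: Iterable[TileDict], right: Iterable[TileDict]
) -> List[Tuple[TileDict, TileDict]]:
    """Inner-join two tile collections on cellid, skipping non-tessellated tiles."""
    by_cell = defaultdict(list)
    for tile in right:
        if tile["cellid"] is not None:
            by_cell[tile["cellid"]].append(tile)
    return [
        (a, b)
        for a in left
        if a["cellid"] is not None
        for b in by_cell.get(a["cellid"], [])
    ]


red = [{"cellid": 1, "raster": b"r1"}, {"cellid": 2, "raster": b"r2"}]
nir = [{"cellid": 2, "raster": b"n2"}, {"cellid": None, "raster": b"n0"}]
pairs = join_on_cellid(red, nir)
print([(a["raster"], b["raster"]) for a, b in pairs])  # [(b'r2', b'n2')]
```

Tiles with a null cellid never match, which mirrors how null keys behave in a SQL inner join.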

Next Steps