Tile Structure
Understanding the internal structure of GeoBrix tiles is essential for advanced use cases like custom UDFs, direct data manipulation, and performance optimization.
Overview
In GeoBrix, a tile is not a simple binary column. It is a structured type (struct) with three fields that together represent a raster dataset, its metadata, and optional grid cell information.
Tile Schema
A tile has the following structure:

struct<
  cellid: bigint,               -- Grid cell ID (nullable)
  raster: binary,               -- Raster bytes
  metadata: map<string,string>  -- Driver, extension, size, etc.
>
Field Descriptions
| Field | Type | Nullable | Description |
|---|---|---|---|
| cellid | bigint (Long) | Yes | Grid cell identifier for tessellated rasters. null for non-tessellated rasters. |
| raster | binary | No | Binary raster content (bytes). |
| metadata | map<string,string> | Yes | Key-value map containing driver name, file extension, size, and other metadata. |
Field Details
1. cellid
The cellid field identifies which grid cell a tile belongs to when using tessellation (e.g., rst_h3_tessellate).
Properties:
- Type: bigint (64-bit integer)
- Nullable: Yes
- Purpose: Enables spatial indexing and joining of tessellated rasters
Values:
- null - For non-tessellated rasters (e.g., from rst_fromfile)
- > 0 - For tessellated rasters (H3 cell ID)
Example:
-- Non-tessellated: cellid is null
SELECT tile.cellid
FROM gdal.`{SAMPLE_NYC_RASTER}`;
-- Returns: null
+------+
|cellid|
+------+
|null |
+------+
-- Tessellated: cellid contains H3 cell ID
SELECT tile.cellid
FROM (
  SELECT explode(gbx_rst_h3_tessellate(tile, 7)) as tile
  FROM gdal.`{SAMPLE_NYC_RASTER}`
);
-- Returns: 604189641255419903, 604189641255420159, ...
+-------------------+
|cellid |
+-------------------+
|604189641255419903 |
+-------------------+
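When a custom UDF receives a cellid, it can be useful to sanity-check that the value really is an H3 cell index. The sketch below decodes the header bits of a 64-bit H3 index using the bit layout published in the H3 documentation (mode, resolution, base cell). The helper name describe_h3_cellid is hypothetical and not part of GeoBrix; the sample value is the well-known resolution 9 index from the H3 documentation, not output from GeoBrix.

```python
def describe_h3_cellid(cellid: int) -> dict:
    """Decode the header bits of a 64-bit H3 index (layout per the H3 docs)."""
    if cellid is None:
        raise ValueError("cellid is null: the tile is not tessellated")
    return {
        "hex": format(cellid, "x"),          # H3 string form of the index
        "mode": (cellid >> 59) & 0xF,        # 1 means a cell index
        "resolution": (cellid >> 52) & 0xF,  # 0 (coarsest) to 15 (finest)
        "base_cell": (cellid >> 45) & 0x7F,  # one of the 122 base cells
    }


# Sample index from the H3 documentation (an assumption for illustration).
info = describe_h3_cellid(0x8928308280FFFFF)
print(info["hex"], info["mode"], info["resolution"])
# 8928308280fffff 1 9
```

The top bit of an H3 index is reserved and always 0, which is why a valid cellid is positive when stored as a signed bigint.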
2. raster
The raster field contains the actual raster bytes: the full file content, held in memory.
All tile constructors and readers produce binary content:
- rst_fromfile(path, driver) → reads the file at path into binary bytes
- rst_fromcontent(content, driver) → embeds the given binary bytes
- GDAL reader → binary (raster bytes)
Properties:
- Type: binary
- Nullable: No
- Purpose: Self-contained raster payload carried through the plan; downstream operators (rst_clip, rst_transform, ...) read and produce bytes, so there is no orphan-path risk.
Binary Format:
- Complete raster file (e.g. GeoTIFF) in memory
- Can be deserialized with GDAL/rasterio
- Typically compressed (LZW, DEFLATE, etc.)
Example:
from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx
# rst_fromfile loads the file into the tile as binary content
df = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)
from_file_df = df.select(f.col("tile.raster").alias("raster_binary"))
# Returns: b'\x4d\x4d\x00\x2a...' (binary GeoTIFF data)
# The GDAL reader also produces binary tiles
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
from_reader_df = df.select(f.col("tile.raster").alias("raster_binary"))
+--------------+
|raster_binary |
+--------------+
|[BINARY] |
+--------------+
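Because the raster field is a complete file, its leading bytes identify the container format. As a minimal sketch, the hypothetical helper sniff_tiff below (not part of GeoBrix) checks the standard TIFF header: a byte-order mark ("II" or "MM") followed by the magic number 42 for classic TIFF or 43 for BigTIFF. It only inspects the first four bytes and says nothing about whether the rest of the payload is valid; payloads from other drivers such as NetCDF need their own checks.

```python
import struct


def sniff_tiff(raster):
    """Return (byte_order, variant) if the payload starts with a TIFF header, else None."""
    if raster is None or len(raster) < 4:
        return None
    # "II" = little-endian (Intel), "MM" = big-endian (Motorola)
    order = {b"II": "<", b"MM": ">"}.get(bytes(raster[:2]))
    if order is None:
        return None
    (magic,) = struct.unpack(order + "H", bytes(raster[2:4]))
    variant = {42: "TIFF", 43: "BigTIFF"}.get(magic)
    if variant is None:
        return None
    return ("little-endian" if order == "<" else "big-endian", variant)


print(sniff_tiff(b"MM\x00\x2a" + b"\x00" * 4))  # ('big-endian', 'TIFF')
print(sniff_tiff(b"II\x2b\x00" + b"\x00" * 4))  # ('little-endian', 'BigTIFF')
print(sniff_tiff(b"\x89HDF\r\n\x1a\n"))          # None
```

A check like this can run inside a UDF on tile.raster before handing the bytes to rasterio or GDAL, so malformed rows can be filtered out early.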
3. metadata
The metadata field contains key-value pairs describing the raster format and properties.
Properties:
- Type: map<string,string>
- Nullable: Yes
- Purpose: Provides format information needed for GDAL operations
Common Keys:
- driver - GDAL driver name (e.g., "GTiff", "NetCDF", "HDF4")
- extension - File extension (e.g., ".tif", ".nc")
- size - Size in bytes (as string) matching the length of the raster payload
- Other format-specific metadata
Example:
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
metadata_df = df.select(
    f.col("tile.metadata").alias("metadata"),
    f.col("tile.metadata.driver").alias("driver"),
    f.col("tile.metadata.extension").alias("extension"),
    f.col("tile.metadata.size").alias("size")
)
+------------------+-------+----------+------+
|metadata |driver |extension |size |
+------------------+-------+----------+------+
|{driver=GTiff,...}|GTiff |.tif |... |
+------------------+-------+----------+------+
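Since size is stored as a string and the metadata map itself is nullable, code that relies on it should guard against missing or non-numeric values. The following sketch, using the hypothetical helper check_declared_size (not part of GeoBrix), compares the declared size with the actual payload length and reports when no comparison is possible.

```python
from typing import Mapping, Optional


def check_declared_size(raster: bytes, metadata: Optional[Mapping[str, str]]) -> Optional[bool]:
    """Compare metadata['size'] with the payload length.

    Returns True or False when a comparison is possible, and None when the
    metadata map or the size key is missing or not numeric.
    """
    if not metadata or "size" not in metadata:
        return None
    try:
        declared = int(metadata["size"])  # map values are strings
    except (TypeError, ValueError):
        return None
    return declared == len(raster)


payload = b"\x00" * 16  # stand-in for tile.raster
print(check_declared_size(payload, {"driver": "GTiff", "extension": ".tif", "size": "16"}))  # True
print(check_declared_size(payload, {"driver": "GTiff", "size": "99"}))  # False
print(check_declared_size(payload, None))  # None
```

Returning None rather than False for missing metadata keeps "cannot tell" distinct from "mismatch", which matters when the result is used to filter rows.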
Working with Tiles
Accessing Tile Fields
Use dot notation to access tile struct fields:
Python:
from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
# Access individual fields
df.select(
    f.col("tile.cellid"),
    f.col("tile.raster"),
    f.col("tile.metadata"),
    f.col("tile.metadata.driver")
)
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+
Scala:
import org.apache.spark.sql.functions._
import com.databricks.labs.gbx.rasterx.{functions => rx}
val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
// Access individual fields
df.select(
  col("tile.cellid"),
  col("tile.raster"),
  col("tile.metadata"),
  col("tile.metadata.driver")
).show()
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+
SQL:
SELECT
  tile.cellid,
  tile.raster,
  tile.metadata,
  tile.metadata['driver'] as driver
FROM gdal.`{SAMPLE_NYC_RASTER}`;
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+
Filtering by Metadata
Filter tiles based on driver or other metadata:
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
# Filter by driver
gtiff_only = df.filter(f.col("tile.metadata.driver") == "GTiff")
# Filter by file extension
tif_files = df.filter(f.col("tile.metadata.extension") == ".tif")
Each filter returns a DataFrame containing only the matching tiles (here, driver = GTiff or extension = .tif).
Using Tiles in Custom UDFs
Access tile components for custom processing:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
@udf(IntegerType())
def get_raster_size(raster_binary, metadata):
    """Return the raster size in bytes, preferring the metadata value."""
    if metadata and "size" in metadata:
        return int(metadata["size"])
    elif raster_binary:
        return len(raster_binary)
    return 0
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
df_with_size = df.withColumn(
    "data_size",
    get_raster_size(f.col("tile.raster"), f.col("tile.metadata"))
)
+----+---------+
|path|data_size|
+----+---------+
|... |12345678 |
+----+---------+
Processing Binary Raster Data
Because the raster field holds a complete raster file as bytes, you can open it in memory with rasterio or GDAL:
from rasterio.io import MemoryFile
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
@udf(DoubleType())
def compute_mean_from_tile(raster_binary):
    """Compute the mean of band 1 from binary raster data."""
    import numpy as np
    if raster_binary is None:
        return None
    # Convert to bytes if needed (Spark may pass a bytearray)
    tile_data = bytes(raster_binary)
    # Open the in-memory file with rasterio
    with MemoryFile(tile_data) as memfile:
        with memfile.open() as src:
            # Reads band 1 only; nodata pixels are included in the mean
            data = src.read(1)
            return float(np.mean(data))
# Use with tiles from content or GDAL reader (sample data)
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
stats_df = df.withColumn(
    "mean_value",
    compute_mean_from_tile(f.col("tile.raster"))
)
+----+----------+
|path|mean_value|
+----+----------+
|... |0.42 |
+----+----------+
Comparing rst_fromfile vs rst_fromcontent
Both produce tiles whose raster field is binary. Use rst_fromfile when you have a path,
and rst_fromcontent when you already have bytes (e.g. from spark.read.format("binaryFile")).
from databricks.labs.gbx.rasterx import functions as rx
# rst_fromfile reads the file off a path and embeds its bytes in the tile
fromfile_tile = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)
fromfile_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()
# +-----------+
# |size_bytes |
# +-----------+
# |2345678 |
# +-----------+
# rst_fromcontent takes bytes you already have in a column (e.g. from binaryFile)
fromcontent_tile = spark.read.format("binaryFile").load(SAMPLE_NYC_RASTER).select(
    rx.rst_fromcontent(f.col("content"), f.lit("GTiff")).alias("tile")
)
fromcontent_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()
+-----------+
|size_bytes |
+-----------+
|2345678 |
+-----------+
Tessellated vs Non-Tessellated Tiles
Non-Tessellated Tiles
Created by constructors (rst_fromfile, rst_fromcontent) or readers:
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
df.select(
    f.col("tile.cellid"),   # null
    f.col("tile.raster"),   # binary data
    f.col("tile.metadata")  # {driver: "GTiff", ...}
).show()
+------+--------+------------------+
|cellid|raster  |metadata          |
+------+--------+------------------+
|null  |[BINARY]|{driver=GTiff,...}|
+------+--------+------------------+
Characteristics:
- cellid is null
- Represents entire raster or a tile from tiling operations
- Suitable for processing complete rasters
Tessellated Tiles
Created by rst_h3_tessellate:
from databricks.labs.gbx.rasterx import functions as rx
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER).select(
    f.explode(rx.rst_h3_tessellate(f.col("tile"), f.lit(7))).alias("tile")
)
df.select(
    f.col("tile.cellid"),   # H3 cell ID (e.g., 604189641255419903)
    f.col("tile.raster"),   # binary data (clipped to cell)
    f.col("tile.metadata")  # {driver: "GTiff", RASTERX_CELL_ID: "604...", ...}
).show()
+-------------------+--------+------------------+
|cellid |raster |metadata |
+-------------------+--------+------------------+
|604189641255419903 |[BINARY]|{RASTERX_CELL_ID..|
+-------------------+--------+------------------+
Characteristics:
- cellid contains H3 cell ID
- Raster clipped to cell bounds
- Enables spatial joins and grid-based processing
- Metadata includes RASTERX_CELL_ID key
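To illustrate why a shared cellid enables joins, here is a small pure-Python sketch that pairs tiles from two collections when they fall in the same grid cell. The tiles are plain dicts standing in for collected rows, the helper join_on_cellid is hypothetical, and the small integer cell IDs are placeholders rather than real H3 indexes. On a cluster the same idea would normally be expressed as a DataFrame join on tile.cellid.

```python
def join_on_cellid(left, right):
    """Inner-join two tile collections on cellid.

    Tiles are dicts with the keys cellid, raster and metadata.
    Tiles with a null cellid (non-tessellated) are skipped.
    """
    right_by_cell = {}
    for tile in right:
        if tile["cellid"] is not None:
            right_by_cell.setdefault(tile["cellid"], []).append(tile)
    return [
        (l_tile, r_tile)
        for l_tile in left
        if l_tile["cellid"] is not None
        for r_tile in right_by_cell.get(l_tile["cellid"], [])
    ]


# Hypothetical tiles: small integers stand in for real H3 cell IDs.
red = [
    {"cellid": 101, "raster": b"r1", "metadata": {"driver": "GTiff"}},
    {"cellid": 102, "raster": b"r2", "metadata": {"driver": "GTiff"}},
    {"cellid": None, "raster": b"r0", "metadata": {"driver": "GTiff"}},
]
nir = [
    {"cellid": 102, "raster": b"n2", "metadata": {"driver": "GTiff"}},
    {"cellid": 103, "raster": b"n3", "metadata": {"driver": "GTiff"}},
]
pairs = join_on_cellid(red, nir)
print([(l["raster"], r["raster"]) for l, r in pairs])  # [(b'r2', b'n2')]
```

Both inputs need to be tessellated at the same resolution for their cell IDs to line up; tiles from different resolutions would simply never match.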
Next Steps
- RasterX Functions - Functions that work with tiles
- Custom UDFs - Build custom tile processing
- Library Integration - Use tiles with rasterio/xarray