Tile Structure

Understanding the internal structure of GeoBrix tiles is essential for advanced use cases like custom UDFs, direct data manipulation, and performance optimization.

Overview

In GeoBrix, a tile is not a simple binary column. It is a structured type (struct) with three fields that together represent a raster dataset, its metadata, and optional grid cell information.

Tile Schema

A tile has the following structure:

[Figure: GeoBrix tile schema, showing cellid (bigint, nullable), raster (binary, required), and metadata (map of string to string), with non-tessellated vs tessellated examples]

struct<
  cellid: bigint,                -- Grid cell ID (nullable)
  raster: binary,                -- Raster bytes
  metadata: map<string,string>   -- Driver, extension, size, etc.
>

Field Descriptions

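| Field    | Type               | Nullable | Description                                                                      |
|----------|--------------------|----------|----------------------------------------------------------------------------------|
| cellid   | bigint (Long)      | Yes      | Grid cell identifier for tessellated rasters. null for non-tessellated rasters.  |
| raster   | binary             | No       | Binary raster content (bytes).                                                   |
| metadata | map<string,string> | Yes      | Key-value map containing driver name, file extension, size, and other metadata.  |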
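
As a rough illustration of the rules in the table, the sketch below checks a tile that has already been converted to a plain Python dict (for example via `Row.asDict()` inside a UDF). The `validate_tile` helper is hypothetical and not part of GeoBrix; it only mirrors the types and nullability listed above.

```python
from typing import Any, List, Mapping


def validate_tile(tile: Mapping[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the tile looks valid.

    Hypothetical helper: it mirrors the documented struct, not a GeoBrix API.
    """
    problems = []

    # cellid: bigint, nullable. bool is a subclass of int in Python, so reject it.
    cellid = tile.get("cellid")
    if cellid is not None and (isinstance(cellid, bool) or not isinstance(cellid, int)):
        problems.append("cellid must be an int or None")

    # raster: binary, not nullable.
    raster = tile.get("raster")
    if not isinstance(raster, (bytes, bytearray)):
        problems.append("raster must be non-null bytes")

    # metadata: map<string,string>, nullable.
    metadata = tile.get("metadata")
    if metadata is not None:
        is_str_map = isinstance(metadata, Mapping) and all(
            isinstance(k, str) and isinstance(v, str) for k, v in metadata.items()
        )
        if not is_str_map:
            problems.append("metadata must be a map of string to string or None")

    return problems
```

An empty result means all three fields match the documented types; each string in a non-empty result names one violated rule.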

Field Details

1. cellid

The cellid field identifies which grid cell a tile belongs to when using tessellation (e.g., rst_h3_tessellate).

Properties:

  • Type: bigint (64-bit integer)
  • Nullable: Yes
  • Purpose: Enables spatial indexing and joining of tessellated rasters

Values:

  • null - For non-tessellated rasters (e.g., from rst_fromfile)
  • > 0 - For tessellated rasters (H3 cell ID)
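
Many H3 tools exchange cell IDs as hexadecimal strings rather than 64-bit integers. The sketch below converts between the two forms using only the standard library, assuming the bigint in cellid is a standard H3 index; the helper names are illustrative and not part of GeoBrix.

```python
def cellid_to_h3_string(cellid: int) -> str:
    """Render a 64-bit H3 index as the lowercase hex string used by H3 tooling."""
    return format(cellid, "x")


def h3_string_to_cellid(h3_string: str) -> int:
    """Parse a hex H3 string back into the integer form stored in tile.cellid."""
    return int(h3_string, 16)


# Round trip using the sample cell ID from this page.
sample_cellid = 604189641255419903
sample_h3_string = cellid_to_h3_string(sample_cellid)
round_tripped = h3_string_to_cellid(sample_h3_string)
```

This can help when joining tessellated tiles against datasets that store H3 cells as strings.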

Example:

-- Non-tessellated: cellid is null
SELECT tile.cellid
FROM gdal.`{SAMPLE_NYC_RASTER}`;
-- Returns: null
Example output
+------+
|cellid|
+------+
|null |
+------+
-- Tessellated: cellid contains H3 cell ID
SELECT tile.cellid
FROM (
  SELECT explode(gbx_rst_h3_tessellate(tile, 7)) as tile
  FROM gdal.`{SAMPLE_NYC_RASTER}`
);
-- Returns: 604189641255419903, 604189641255420159, ...
Example output
+-------------------+
|cellid |
+-------------------+
|604189641255419903 |
+-------------------+

2. raster

The raster field contains the actual raster bytes — the full file content in memory.

All tile constructors and readers produce binary content:

  • rst_fromfile(path, driver) → reads the file at path into binary bytes
  • rst_fromcontent(content, driver) → embeds the given binary bytes
  • GDAL reader → binary (raster bytes)

Properties:

  • Type: binary
  • Nullable: No
  • Purpose: Self-contained raster payload carried through the plan. Downstream operators (rst_clip, rst_transform, ...) read and produce bytes, so a tile never depends on an external file path that could go missing.

Binary Format:

  • Complete raster file (e.g. GeoTIFF) in memory
  • Can be deserialized with GDAL/rasterio
  • Typically compressed (LZW, DEFLATE, etc.)
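
Because the payload is a complete file, its leading bytes carry the format signature. The sketch below inspects those bytes to guess the container format without opening the raster. `sniff_raster_format` is a hypothetical helper, and it recognizes only a few well-known signatures (TIFF, BigTIFF, HDF5, classic NetCDF).

```python
def sniff_raster_format(raster: bytes) -> str:
    """Guess the container format of a raster payload from its magic bytes."""
    header = bytes(raster[:8])

    # TIFF: byte-order mark ("II" little-endian or "MM" big-endian) then 42.
    if header[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    # BigTIFF: same byte-order marks, version number 43.
    if header[:4] in (b"II+\x00", b"MM\x00+"):
        return "BigTIFF"
    # HDF5 (also used by NetCDF-4 files): fixed 8-byte signature.
    if header == b"\x89HDF\r\n\x1a\n":
        return "HDF5"
    # Classic NetCDF: "CDF" followed by a version byte (1 or 2).
    if header[:3] == b"CDF" and header[3:4] in (b"\x01", b"\x02"):
        return "NetCDF-classic"
    return "unknown"
```

A check like this is cheap enough to run before handing the bytes to GDAL or rasterio, for instance to skip rows whose payload does not match the driver recorded in metadata.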

Example:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

# rst_fromfile loads the file into the tile as binary content
df = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)
from_file_df = df.select(f.col("tile.raster").alias("raster_binary"))
# raster_binary holds the raw file bytes, e.g. b'\x4d\x4d\x00\x2a...' (binary GeoTIFF data)

# The GDAL reader also produces binary tiles
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
from_reader_df = df.select(f.col("tile.raster").alias("raster_binary"))
Example output
+--------------+
|raster_binary |
+--------------+
|[BINARY] |
+--------------+

3. metadata

The metadata field contains key-value pairs describing the raster format and properties.

Properties:

  • Type: map<string,string>
  • Nullable: Yes
  • Purpose: Provides format information needed for GDAL operations

Common Keys:

  • driver - GDAL driver name (e.g., "GTiff", "NetCDF", "HDF4")
  • extension - File extension (e.g., ".tif", ".nc")
  • size - Size in bytes (as string) matching the length of the raster payload
  • Other format-specific metadata
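
Since size is stored as a string and is expected to match the length of the raster payload, a small consistency check can flag tiles whose metadata has drifted from their bytes. This is a minimal sketch in plain Python; `size_matches` is a hypothetical helper, not a GeoBrix function.

```python
from typing import Mapping, Optional


def size_matches(raster: Optional[bytes], metadata: Optional[Mapping[str, str]]) -> Optional[bool]:
    """Compare metadata["size"] with the actual payload length.

    Returns None when the comparison cannot be made (missing raster,
    missing metadata, missing key, or a non-numeric size string).
    """
    if raster is None or not metadata or "size" not in metadata:
        return None
    try:
        declared = int(metadata["size"])  # size is stored as a string
    except ValueError:
        return None
    return declared == len(raster)
```

The three-valued result keeps "cannot tell" separate from "mismatch", which matters because metadata itself is nullable.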

Example:

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

metadata_df = df.select(
f.col("tile.metadata").alias("metadata"),
f.col("tile.metadata.driver").alias("driver"),
f.col("tile.metadata.extension").alias("extension"),
f.col("tile.metadata.size").alias("size")
)
Example output
+------------------+-------+----------+------+
|metadata |driver |extension |size |
+------------------+-------+----------+------+
|{driver=GTiff,...}|GTiff |.tif |... |
+------------------+-------+----------+------+

Working with Tiles

Accessing Tile Fields

Use dot notation to access tile struct fields:

Python:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

# Access individual fields
df.select(
f.col("tile.cellid"),
f.col("tile.raster"),
f.col("tile.metadata"),
f.col("tile.metadata.driver")
)
Example output
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+

Scala:

import org.apache.spark.sql.functions._
import com.databricks.labs.gbx.rasterx.{functions => rx}

val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")

// Access individual fields
df.select(
col("tile.cellid"),
col("tile.raster"),
col("tile.metadata"),
col("tile.metadata.driver")
).show()
Example output
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+

SQL:

SELECT 
tile.cellid,
tile.raster,
tile.metadata,
tile.metadata['driver'] as driver
FROM gdal.`{SAMPLE_NYC_RASTER}`;
Example output
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+

Filtering by Metadata

Filter tiles based on driver or other metadata:

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

# Filter by driver
gtiff_only = df.filter(f.col("tile.metadata.driver") == "GTiff")

# Filter by file extension
tif_files = df.filter(f.col("tile.metadata.extension") == ".tif")
Example output
Filtered DataFrame (e.g. driver = GTiff or extension = .tif).

Using Tiles in Custom UDFs

Access tile components for custom processing:

from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def get_raster_size(raster_binary, metadata):
    """Get size of raster data"""
    # Prefer the size recorded in metadata; fall back to the payload length
    if metadata and "size" in metadata:
        return int(metadata["size"])
    elif raster_binary:
        return len(raster_binary)
    return 0

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
df_with_size = df.withColumn(
    "data_size",
    get_raster_size(f.col("tile.raster"), f.col("tile.metadata"))
)
Example output
+----+---------+
|path|data_size|
+----+---------+
|... |12345678 |
+----+---------+

Processing Binary Raster Data

Because the raster field holds a complete raster file in memory, you can open it directly with rasterio or GDAL:

from rasterio.io import MemoryFile
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def compute_mean_from_tile(raster_binary):
    """Compute mean of band 1 from binary raster data"""
    import numpy as np

    if raster_binary is None:
        return None

    # Convert to bytes if needed (Spark may pass a bytearray)
    tile_data = bytes(raster_binary)

    # Open the in-memory file with rasterio and read the first band
    with MemoryFile(tile_data) as memfile:
        with memfile.open() as src:
            data = src.read(1)
            return float(np.mean(data))

# Use with tiles from content or GDAL reader (sample data)
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
stats_df = df.withColumn(
    "mean_value",
    compute_mean_from_tile(f.col("tile.raster"))
)
Example output
+----+----------+
|path|mean_value|
+----+----------+
|... |0.42 |
+----+----------+

Comparing rst_fromfile vs rst_fromcontent

Both produce tiles whose raster field is binary. Use rst_fromfile when you have a path, and rst_fromcontent when you already have bytes (e.g. from spark.read.format("binaryFile")).

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

# rst_fromfile reads the file off a path and embeds its bytes in the tile
fromfile_tile = spark.range(1).select(
rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)

fromfile_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()
# +-----------+
# |size_bytes |
# +-----------+
# |2345678 |
# +-----------+

# rst_fromcontent takes bytes you already have in a column (e.g. from binaryFile)
fromcontent_tile = spark.read.format("binaryFile").load(SAMPLE_NYC_RASTER).select(
rx.rst_fromcontent(f.col("content"), f.lit("GTiff")).alias("tile")
)

fromcontent_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()
Example output
+-----------+
|size_bytes |
+-----------+
|2345678 |
+-----------+

Tessellated vs Non-Tessellated Tiles

Non-Tessellated Tiles

Created by constructors (rst_fromfile, rst_fromcontent) or readers:

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)

df.select(
f.col("tile.cellid"), # null
f.col("tile.raster"), # binary data
f.col("tile.metadata") # {driver: "GTiff", ...}
).show()
Example output
+------+--------+------------------+
|cellid|raster  |metadata          |
+------+--------+------------------+
|null  |[BINARY]|{driver=GTiff,...}|
+------+--------+------------------+

Characteristics:

  • cellid is null
  • Represents entire raster or a tile from tiling operations
  • Suitable for processing complete rasters

Tessellated Tiles

Created by rst_h3_tessellate:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER).select(
f.explode(rx.rst_h3_tessellate(f.col("tile"), f.lit(7))).alias("tile")
)

df.select(
f.col("tile.cellid"), # H3 cell ID (e.g., 604189641255419903)
f.col("tile.raster"), # binary data (clipped to cell)
f.col("tile.metadata") # {driver: "GTiff", RASTERX_CELL_ID: "604...", ...}
).show()
Example output
+-------------------+--------+------------------+
|cellid |raster |metadata |
+-------------------+--------+------------------+
|604189641255419903 |[BINARY]|{RASTERX_CELL_ID..|
+-------------------+--------+------------------+

Characteristics:

  • cellid contains H3 cell ID
  • Raster clipped to cell bounds
  • Enables spatial joins and grid-based processing
  • Metadata includes RASTERX_CELL_ID key
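
The spatial join mentioned above reduces to an equality join on cellid once both sides are indexed to the same H3 resolution. The sketch below shows that idea in plain Python on lists of dicts. In Spark the equivalent would presumably be a DataFrame join on tile.cellid, which is not shown here. The function name and the row layout are illustrative assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def join_on_cellid(tiles: List[Row], observations: List[Row]) -> List[Tuple[Row, Row]]:
    """Inner-join tile rows and observation rows that share the same cellid."""
    # Index tessellated tiles by cell; non-tessellated tiles (cellid is None)
    # have no grid cell and therefore cannot take part in the join.
    tiles_by_cell = defaultdict(list)
    for tile in tiles:
        if tile["cellid"] is not None:
            tiles_by_cell[tile["cellid"]].append(tile)

    # Pair every observation with each tile in the same cell.
    return [
        (tile, obs)
        for obs in observations
        for tile in tiles_by_cell.get(obs["cellid"], [])
    ]
```

Observations whose cell has no matching tile are dropped, as in an inner join.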

Next Steps