Tile Structure

Understanding the internal structure of GeoBrix tiles is essential for advanced use cases like custom UDFs, direct data manipulation, and performance optimization.

Overview

In GeoBrix, a tile is not a simple binary column. It is a structured type (struct) with three fields that together represent a raster dataset, its metadata, and optional grid cell information.

Tile Schema

A tile has the following structure:

[Figure: GeoBrix tile schema showing cellid (bigint, nullable), raster (binary, required), and metadata (map of string to string), with non-tessellated and tessellated examples]

struct<
  cellid: bigint,                 -- Grid cell ID (nullable)
  raster: binary,                 -- Raster bytes
  metadata: map<string,string>    -- Driver, extension, size, etc.
>

Field Descriptions

Field	Type	Nullable	Description
`cellid`	`bigint` (Long)	Yes	Grid cell identifier for tessellated rasters. `null` for non-tessellated rasters.
`raster`	`binary`	No	Binary raster content (bytes).
`metadata`	`map<string,string>`	Yes	Key-value map containing driver name, file extension, size, and other metadata.
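
To make the three fields concrete outside Spark, the schema can be mirrored with a small Python dataclass. This is only an illustrative sketch: the `Tile` class and its `is_tessellated` property are hypothetical helpers, not part of the GeoBrix API, and the byte strings are placeholders rather than real rasters.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Tile:
    """Hypothetical in-memory mirror of the tile struct (not a GeoBrix class)."""

    raster: bytes                              # required: full raster file bytes
    cellid: Optional[int] = None               # null for non-tessellated tiles
    metadata: Optional[Dict[str, str]] = None  # nullable string-to-string map

    def __post_init__(self) -> None:
        # The schema marks raster as non-nullable, so reject a missing payload.
        if self.raster is None:
            raise ValueError("raster must not be None")

    @property
    def is_tessellated(self) -> bool:
        # A tile belongs to a grid cell only when cellid is set.
        return self.cellid is not None


# Placeholder payloads: only the field layout matters here.
whole = Tile(raster=b"II*\x00", metadata={"driver": "GTiff", "extension": ".tif"})
cell = Tile(raster=b"II*\x00", cellid=604189641255419903, metadata={"driver": "GTiff"})
print(whole.is_tessellated, cell.is_tessellated)  # False True
```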

Field Details

1. cellid

The cellid field identifies which grid cell a tile belongs to when using tessellation (e.g., rst_h3_tessellate in Python, gbx_rst_h3_tessellate in SQL).

Properties:

Type: bigint (64-bit integer)
Nullable: Yes
Purpose: Enables spatial indexing and joining of tessellated rasters

Values:

null - For non-tessellated rasters (e.g., from rst_fromfile)
> 0 - For tessellated rasters (H3 cell ID)

Example:

-- Non-tessellated: cellid is null
SELECT tile.cellid 
FROM gdal.`{SAMPLE_NYC_RASTER}`;
-- Returns: null

Example output
+------+
|cellid|
+------+
|null  |
+------+

-- Tessellated: cellid contains H3 cell ID
SELECT tile.cellid 
FROM (
  SELECT explode(gbx_rst_h3_tessellate(tile, 7)) as tile
  FROM gdal.`{SAMPLE_NYC_RASTER}`
);
-- Returns: 604189641255419903, 604189641255420159, ...

Example output
+-------------------+
|cellid             |
+-------------------+
|604189641255419903 |
+-------------------+
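
Many H3 tools exchange cell IDs as lowercase hexadecimal strings rather than 64-bit integers. The sketch below converts between the two forms in plain Python, which can help when matching cellid against data indexed elsewhere. The helper names are hypothetical, and the sketch assumes cellid holds a standard H3 index.

```python
def cellid_to_h3_string(cellid: int) -> str:
    """Render a 64-bit H3 index as the hex string form used by many H3 tools."""
    return format(cellid, "x")


def h3_string_to_cellid(h3_string: str) -> int:
    """Parse an H3 hex string back into the integer stored in tile.cellid."""
    return int(h3_string, 16)


sample = 604189641255419903  # sample cellid taken from the example output
as_string = cellid_to_h3_string(sample)
print(as_string)

# The conversion is lossless in both directions.
assert h3_string_to_cellid(as_string) == sample
```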

2. raster

The raster field contains the actual raster bytes: the full file content, held in memory.

All tile constructors and readers produce binary content:

rst_fromfile(path, driver) → reads the file at path into binary bytes
rst_fromcontent(content, driver) → embeds the given binary bytes
GDAL reader → binary (raster bytes)

Properties:

Type: binary
Nullable: No
Purpose: Self-contained raster payload carried through the plan. Downstream operators (rst_clip, rst_transform, ...) read and produce bytes, so a tile never depends on a file path that could later go missing.

Binary Format:

Complete raster file (e.g. GeoTIFF) in memory
Can be deserialized with GDAL/rasterio
Typically compressed (LZW, DEFLATE, etc.)

Example:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

# rst_fromfile loads the file into the tile as binary content
df = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)
from_file_df = df.select(f.col("tile.raster").alias("raster_binary"))
# The raster_binary column holds the GeoTIFF bytes, e.g. b'\x4d\x4d\x00\x2a...'

# The GDAL reader also produces binary tiles
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
from_reader_df = df.select(f.col("tile.raster").alias("raster_binary"))

Example output
+--------------+
|raster_binary |
+--------------+
|[BINARY]      |
+--------------+
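
Because the payload is a complete file, its leading bytes carry the format signature. As a lightweight sanity check that avoids GDAL entirely, the sketch below inspects the first four bytes for a TIFF signature. The `tiff_byte_order` helper is a hypothetical name, and the check only covers classic TIFF and BigTIFF headers.

```python
from typing import Optional


def tiff_byte_order(raster: bytes) -> Optional[str]:
    """Return 'little' or 'big' for a TIFF/BigTIFF payload, else None."""
    header = bytes(raster[:4])
    # Classic TIFF uses magic number 42 ('*'); BigTIFF uses 43 ('+').
    if header in (b"II*\x00", b"II+\x00"):
        return "little"
    if header in (b"MM\x00*", b"MM\x00+"):
        return "big"
    return None


print(tiff_byte_order(b"\x4d\x4d\x00\x2a" + b"\x00" * 4))  # big
print(tiff_byte_order(b"not a tiff"))                       # None
```

A UDF could apply the same check to tile.raster before handing the bytes to a heavier library.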

3. metadata

The metadata field contains key-value pairs describing the raster format and properties.

Properties:

Type: map<string,string>
Nullable: Yes
Purpose: Provides format information needed for GDAL operations

Common Keys:

driver - GDAL driver name (e.g., "GTiff", "netCDF", "HDF4")
extension - File extension (e.g., ".tif", ".nc")
size - Size in bytes (as string) matching the length of the raster payload
Other format-specific metadata

Example:

from pyspark.sql import functions as f

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

metadata_df = df.select(
    f.col("tile.metadata").alias("metadata"),
    f.col("tile.metadata.driver").alias("driver"),
    f.col("tile.metadata.extension").alias("extension"),
    f.col("tile.metadata.size").alias("size")
)

Example output
+------------------+-------+----------+------+
|metadata          |driver |extension |size  |
+------------------+-------+----------+------+
|{driver=GTiff,...}|GTiff  |.tif      |...   |
+------------------+-------+----------+------+
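
Since size is stored as a string, comparing it with the payload length requires a cast. The following sketch shows one way such a consistency check could look in plain Python. `size_matches` is a hypothetical helper, and treating a missing or non-numeric size as a mismatch is an assumption made for this example.

```python
from typing import Mapping, Optional


def size_matches(raster: bytes, metadata: Optional[Mapping[str, str]]) -> bool:
    """Check that metadata['size'] equals the raster payload length."""
    if not metadata or "size" not in metadata:
        return False  # assumption: no recorded size counts as a mismatch
    try:
        return int(metadata["size"]) == len(raster)
    except ValueError:
        return False  # size was not a valid integer string


payload = b"\x00" * 16  # placeholder bytes standing in for a raster
print(size_matches(payload, {"driver": "GTiff", "size": "16"}))  # True
print(size_matches(payload, {"driver": "GTiff", "size": "17"}))  # False
print(size_matches(payload, None))                                # False
```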

Working with Tiles

Accessing Tile Fields

Use dot notation to access tile struct fields:

Python:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

# Access individual fields
df.select(
    f.col("tile.cellid"),
    f.col("tile.raster"),
    f.col("tile.metadata"),
    f.col("tile.metadata.driver")
)

Example output
+------+--------+------------------+-------+
|cellid|raster  |metadata          |driver |
+------+--------+------------------+-------+
|null  |[BINARY]|{driver=GTiff,...}|GTiff  |
+------+--------+------------------+-------+

Scala:

import org.apache.spark.sql.functions._
import com.databricks.labs.gbx.rasterx.{functions => rx}

val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")

// Access individual fields
df.select(
  col("tile.cellid"),
  col("tile.raster"),
  col("tile.metadata"),
  col("tile.metadata.driver")
).show()

Example output
+------+--------+------------------+-------+
|cellid|raster  |metadata          |driver |
+------+--------+------------------+-------+
|null  |[BINARY]|{driver=GTiff,...}|GTiff  |
+------+--------+------------------+-------+

SQL:

SELECT 
    tile.cellid,
    tile.raster,
    tile.metadata,
    tile.metadata['driver'] as driver
FROM gdal.`{SAMPLE_NYC_RASTER}`;

Example output
+------+--------+------------------+-------+
|cellid|raster  |metadata          |driver |
+------+--------+------------------+-------+
|null  |[BINARY]|{driver=GTiff,...}|GTiff  |
+------+--------+------------------+-------+

Filtering by Metadata

Filter tiles based on driver or other metadata:

from pyspark.sql import functions as f

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)

# Filter by driver
gtiff_only = df.filter(f.col("tile.metadata.driver") == "GTiff")

# Filter by file extension
tif_files = df.filter(f.col("tile.metadata.extension") == ".tif")

Example output
Filtered DataFrame (e.g. driver = GTiff or extension = .tif).

Using Tiles in Custom UDFs

Access tile components for custom processing:

from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def get_raster_size(raster_binary, metadata):
    """Get size of raster data"""
    if metadata and "size" in metadata:
        return int(metadata["size"])
    elif raster_binary:
        return len(raster_binary)
    return 0

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
df_with_size = df.withColumn(
    "data_size",
    get_raster_size(f.col("tile.raster"), f.col("tile.metadata"))
)

Example output
+----+---------+
|path|data_size|
+----+---------+
|... |12345678 |
+----+---------+

Processing Binary Raster Data

Because the raster field always holds a complete raster file as bytes, you can open it directly with rasterio or GDAL:

from rasterio.io import MemoryFile
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def compute_mean_from_tile(raster_binary):
    """Compute mean from binary raster data"""
    import numpy as np
    
    if raster_binary is None:
        return None
    
    # Convert to bytes if needed
    tile_data = bytes(raster_binary)
    
    # Open with rasterio
    with MemoryFile(tile_data) as memfile:
        with memfile.open() as src:
            data = src.read(1)
            return float(np.mean(data))

# Works with tiles from any source; here the GDAL reader loads the sample raster
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
stats_df = df.withColumn(
    "mean_value",
    compute_mean_from_tile(f.col("tile.raster"))
)

Example output
+----+----------+
|path|mean_value|
+----+----------+
|... |0.42      |
+----+----------+

Comparing `rst_fromfile` vs `rst_fromcontent`

Both produce tiles whose raster field is binary. Use rst_fromfile when you have a path, and rst_fromcontent when you already have bytes (e.g. from spark.read.format("binaryFile")).

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

# rst_fromfile reads the file off a path and embeds its bytes in the tile
fromfile_tile = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)

fromfile_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()
# +-----------+
# |size_bytes |
# +-----------+
# |2345678    |
# +-----------+

# rst_fromcontent takes bytes you already have in a column (e.g. from binaryFile)
fromcontent_tile = spark.read.format("binaryFile").load(SAMPLE_NYC_RASTER).select(
    rx.rst_fromcontent(f.col("content"), f.lit("GTiff")).alias("tile")
)

fromcontent_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()

Example output
+-----------+
|size_bytes |
+-----------+
|2345678    |
+-----------+

Tessellated vs Non-Tessellated Tiles

Non-Tessellated Tiles

Created by constructors (rst_fromfile, rst_fromcontent) or readers:

from pyspark.sql import functions as f

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)

df.select(
    f.col("tile.cellid"),      # null
    f.col("tile.raster"),      # binary data
    f.col("tile.metadata")     # {driver: "GTiff", ...}
).show()

Example output
+------+--------+------------------+
|cellid|raster  |metadata          |
+------+--------+------------------+
|null  |[BINARY]|{driver=GTiff,...}|
+------+--------+------------------+

Characteristics:

cellid is null
Represents an entire raster or a tile produced by a tiling operation
Suitable for processing complete rasters

Tessellated Tiles

Created by rst_h3_tessellate:

from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx

df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER).select(
    f.explode(rx.rst_h3_tessellate(f.col("tile"), f.lit(7))).alias("tile")
)

df.select(
    f.col("tile.cellid"),      # H3 cell ID (e.g., 604189641255419903)
    f.col("tile.raster"),      # binary data (clipped to cell)
    f.col("tile.metadata")     # {driver: "GTiff", RASTERX_CELL_ID: "604...", ...}
).show()

Example output
+-------------------+--------+------------------+
|cellid             |raster  |metadata          |
+-------------------+--------+------------------+
|604189641255419903 |[BINARY]|{RASTERX_CELL_ID..|
+-------------------+--------+------------------+

Characteristics:

cellid contains H3 cell ID
Raster clipped to cell bounds
Enables spatial joins and grid-based processing
Metadata includes RASTERX_CELL_ID key
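
Grid-based processing usually comes down to matching rows that share a cellid. The plain-Python sketch below imitates an inner equi-join on cellid between two tessellated datasets. The row tuples, labels, and the `join_by_cellid` name are made up for illustration. In Spark the same idea would be a DataFrame join on tile.cellid.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (cellid, label): illustrative stand-in for a tile row


def join_by_cellid(left: Iterable[Row], right: Iterable[Row]) -> List[Tuple[int, str, str]]:
    """Inner-join two row collections on their cellid."""
    # Index the right side by cellid so each left row is matched in one lookup.
    index: Dict[int, List[str]] = defaultdict(list)
    for cellid, label in right:
        index[cellid].append(label)
    return [
        (cellid, label, other)
        for cellid, label in left
        for other in index.get(cellid, [])
    ]


red = [(604189641255419903, "red_a"), (604189641255420159, "red_b")]
nir = [(604189641255419903, "nir_a")]
print(join_by_cellid(red, nir))
# [(604189641255419903, 'red_a', 'nir_a')]
```

Only cells present on both sides survive the join, which is why both rasters need to be tessellated at the same resolution before they are combined.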

Next Steps

Raster Functions - Functions that work with tiles
Custom UDFs - Build custom tile processing
Library Integration - Use tiles with rasterio/xarray
