Tile Structure
Understanding the internal structure of GeoBrix tiles is essential for advanced use cases like custom UDFs, direct data manipulation, and performance optimization.
Overview
In GeoBrix, a tile is not a simple binary column. It is a structured type (struct) with three fields that together represent a raster dataset, its metadata, and optional grid cell information.
Tile Schema
A tile has the following structure:
struct<
    cellid: bigint,                 -- Grid cell ID (nullable)
    raster: string|binary,          -- File path or binary content
    metadata: map<string,string>    -- Driver, extension, size, etc.
>
Field Descriptions
| Field | Type | Nullable | Description |
|---|---|---|---|
| cellid | bigint (Long) | Yes | Grid cell identifier for tessellated rasters. null for non-tessellated rasters. |
| raster | string or binary | No | Either a file path (string) or binary raster content (bytes). |
| metadata | map<string,string> | Yes | Key-value map containing driver name, file extension, size, and other metadata. |
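To make the schema concrete, here is a minimal plain-Python sketch of the same three fields. The `Tile` class and its `is_tessellated` property are hypothetical names used for illustration only; they are not part of the GeoBrix API, and the sketch only mirrors the nullability rules listed in the table above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Tile:
    """Hypothetical in-memory mirror of the tile struct (not a GeoBrix class)."""

    cellid: Optional[int]               # grid cell ID, None when not tessellated
    raster: Union[str, bytes]           # file path (str) or raster content (bytes)
    metadata: Optional[Dict[str, str]]  # driver, extension, size, ...

    def __post_init__(self) -> None:
        # The raster field is documented as non-nullable.
        if self.raster is None:
            raise ValueError("raster must not be None")

    @property
    def is_tessellated(self) -> bool:
        # A null cellid marks a non-tessellated tile.
        return self.cellid is not None


whole = Tile(cellid=None, raster="/path/to/raster.tif", metadata={"driver": "GTiff"})
cell = Tile(cellid=604189641255419903, raster=b"...", metadata={"driver": "GTiff"})
print(whole.is_tessellated, cell.is_tessellated)  # False True
```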
Field Details
1. cellid
The cellid field identifies which grid cell a tile belongs to when using tessellation (e.g., rst_h3_tessellate).
Properties:
- Type: bigint (64-bit integer)
- Nullable: Yes
- Purpose: Enables spatial indexing and joining of tessellated rasters
Values:
- null - For non-tessellated rasters (e.g., from rst_fromfile)
- > 0 - For tessellated rasters (H3 cell ID)
Example:
SQL_CELLID_NON_TESSELLATED = f"""-- Non-tessellated: cellid is null
SELECT tile.cellid
FROM gdal.`{SAMPLE_NYC_RASTER}`;
-- Returns: null"""
+------+
|cellid|
+------+
|null |
+------+
SQL_CELLID_TESSELLATED = f"""-- Tessellated: cellid contains H3 cell ID
SELECT tile.cellid
FROM (
    SELECT explode(gbx_rst_h3_tessellate(tile, 7)) as tile
    FROM gdal.`{SAMPLE_NYC_RASTER}`
);
-- Returns: 604189641255419903, 604189641255420159, ..."""
+-------------------+
|cellid |
+-------------------+
|604189641255419903 |
+-------------------+
2. raster
The raster field contains the actual raster data, either as a file path or binary content.
Type depends on constructor:
- rst_fromfile() → string (file path)
- rst_fromcontent() → binary (raster bytes)
- GDAL reader → binary (raster bytes)
Properties:
- Type: string or binary
- Nullable: No
- Purpose: Provides access to the underlying raster data
String (Path) Format:
file:///path/to/raster.tif
/Volumes/<catalog>/<schema>/<volume>/data/raster.tif
s3://bucket/path/to/raster.tif
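The three path styles above can be told apart with the standard library alone. The following is a small sketch, assuming only the formats listed here; `classify_raster_path` and its return labels are hypothetical and do not correspond to any GeoBrix function.

```python
from urllib.parse import urlparse


def classify_raster_path(path: str) -> str:
    """Return a coarse label for a tile path (hypothetical helper)."""
    scheme = urlparse(path).scheme
    if scheme == "file":
        return "local-uri"
    if scheme:
        # Any other scheme, e.g. "s3", is treated as remote storage.
        return f"remote:{scheme}"
    if path.startswith("/Volumes/"):
        return "volume"
    return "local-path"


for p in (
    "file:///path/to/raster.tif",
    "/Volumes/<catalog>/<schema>/<volume>/data/raster.tif",
    "s3://bucket/path/to/raster.tif",
):
    print(classify_raster_path(p))
```

A check like this can be useful in a UDF that needs to decide whether a path is readable from the local filesystem before opening it.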
Binary Format:
- Complete GeoTIFF file in memory
- Can be deserialized with GDAL/rasterio
- Typically compressed (LZW, DEFLATE, etc.)
Example:
from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx
# Access path from rst_fromfile (sample data path)
df = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)
path_df = df.select(f.col("tile.raster").alias("raster_path"))
# Returns: path string to the raster
# Access binary from GDAL reader (sample data from mounted Volumes)
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
binary_df = df.select(f.col("tile.raster").alias("raster_binary"))
+----------------------------------------------------------+
|raster_path / raster_binary |
+----------------------------------------------------------+
|.../nyc/sentinel2/nyc_sentinel2_red.tif or [BINARY] |
+----------------------------------------------------------+
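Because a binary tile holds a complete file, a quick sanity check on GTiff tiles is to look at the first four bytes, which carry the TIFF signature. This is a sketch only: `looks_like_tiff` is a hypothetical helper, it applies only when the driver is GTiff, and it does not validate the rest of the file.

```python
def looks_like_tiff(content: bytes) -> bool:
    """Check the 4-byte TIFF/BigTIFF signature at the start of the content."""
    signatures = (
        b"II*\x00",  # classic TIFF, little-endian
        b"MM\x00*",  # classic TIFF, big-endian
        b"II+\x00",  # BigTIFF, little-endian
        b"MM\x00+",  # BigTIFF, big-endian
    )
    # bytes(...) also accepts the bytearray that a Python UDF may receive.
    return bytes(content[:4]) in signatures


print(looks_like_tiff(b"II*\x00" + b"\x00" * 4))  # True
print(looks_like_tiff(b"CDF\x01" + b"\x00" * 4))  # False (classic NetCDF header)
```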
3. metadata
The metadata field contains key-value pairs describing the raster format and properties.
Properties:
- Type: map<string,string>
- Nullable: Yes
- Purpose: Provides format information needed for GDAL operations
Common Keys:
- driver - GDAL driver name (e.g., "GTiff", "NetCDF", "HDF4")
- extension - File extension (e.g., ".tif", ".nc")
- size - Size in bytes (as string) or "-1" for file-based tiles
- Other format-specific metadata
Example:
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
metadata_df = df.select(
    f.col("tile.metadata").alias("metadata"),
    f.col("tile.metadata.driver").alias("driver"),
    f.col("tile.metadata.extension").alias("extension"),
    f.col("tile.metadata.size").alias("size")
)
+------------------+-------+----------+------+
|metadata |driver |extension |size |
+------------------+-------+----------+------+
|{driver=GTiff,...}|GTiff |.tif |... |
+------------------+-------+----------+------+
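Since every metadata value is a string and the map itself is nullable, code that reads the size key has to handle a missing map, a missing key, and the "-1" marker for file-based tiles. A minimal sketch under those assumptions follows; `declared_size_bytes` is a hypothetical helper, not a GeoBrix function.

```python
from typing import Mapping, Optional


def declared_size_bytes(metadata: Optional[Mapping[str, str]]) -> Optional[int]:
    """Return the size recorded in tile metadata, or None when unavailable."""
    if not metadata:
        return None
    raw = metadata.get("size")
    if raw is None:
        return None
    try:
        size = int(raw)
    except ValueError:
        return None
    # "-1" is the documented marker for file-based tiles.
    return size if size >= 0 else None


print(declared_size_bytes({"driver": "GTiff", "size": "2345678"}))  # 2345678
print(declared_size_bytes({"driver": "GTiff", "size": "-1"}))       # None
```

Returning None rather than -1 keeps downstream aggregations such as sums from silently absorbing the sentinel value.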
Working with Tiles
Accessing Tile Fields
Use dot notation to access tile struct fields:
Python:
from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
# Access individual fields
df.select(
    f.col("tile.cellid"),
    f.col("tile.raster"),
    f.col("tile.metadata"),
    f.col("tile.metadata.driver")
)
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+
Scala:
import org.apache.spark.sql.functions._
import com.databricks.labs.gbx.rasterx.{functions => rx}
val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
// Access individual fields
df.select(
  col("tile.cellid"),
  col("tile.raster"),
  col("tile.metadata"),
  col("tile.metadata.driver")
).show()
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+
SQL:
SQL_ACCESSING_TILE_FIELDS = f"""SELECT
    tile.cellid,
    tile.raster,
    tile.metadata,
    tile.metadata['driver'] as driver
FROM gdal.`{SAMPLE_NYC_RASTER}`;"""
+------+--------+------------------+-------+
|cellid|raster |metadata |driver |
+------+--------+------------------+-------+
|null |[BINARY]|{driver=GTiff,...}|GTiff |
+------+--------+------------------+-------+
Filtering by Metadata
Filter tiles based on driver or other metadata:
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
# Filter by driver
gtiff_only = df.filter(f.col("tile.metadata.driver") == "GTiff")
# Filter by file extension
tif_files = df.filter(f.col("tile.metadata.extension") == ".tif")
Each filter returns a DataFrame containing only the matching tiles (driver = GTiff, or extension = .tif).
Using Tiles in Custom UDFs
Access tile components for custom processing:
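from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
# LongType avoids overflow for rasters larger than 2 GiB
@udf(LongType())
def get_raster_size(raster_binary, metadata):
    """Get size of raster data in bytes"""
    # Prefer the size recorded in metadata; "-1" marks file-based tiles
    if metadata and "size" in metadata and int(metadata["size"]) >= 0:
        return int(metadata["size"])
    # Fall back to the payload length, for binary tiles only
    if isinstance(raster_binary, (bytes, bytearray)):
        return len(raster_binary)
    return 0
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTERS)
df_with_size = df.withColumn(
    "data_size",
    get_raster_size(f.col("tile.raster"), f.col("tile.metadata"))
)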
+----+---------+
|path|data_size|
+----+---------+
|... |12345678 |
+----+---------+
Processing Binary Raster Data
When the raster field contains binary data, use it with rasterio or GDAL:
from rasterio.io import MemoryFile
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
@udf(DoubleType())
def compute_mean_from_tile(raster_binary):
    """Compute mean from binary raster data"""
    import numpy as np
    if raster_binary is None:
        return None
    # Convert to bytes if needed
    tile_data = bytes(raster_binary)
    # Open with rasterio
    with MemoryFile(tile_data) as memfile:
        with memfile.open() as src:
            # Read band 1 as a plain array (nodata pixels are not masked)
            data = src.read(1)
            return float(np.mean(data))
# Use with tiles from content or GDAL reader (sample data)
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
stats_df = df.withColumn(
    "mean_value",
    compute_mean_from_tile(f.col("tile.raster"))
)
+----+----------+
|path|mean_value|
+----+----------+
|... |0.42 |
+----+----------+
Comparing File-Based vs Binary Tiles
from pyspark.sql import functions as f
from databricks.labs.gbx.rasterx import functions as rx
# File-based tile (sample data path)
file_tile = spark.range(1).select(
    rx.rst_fromfile(f.lit(SAMPLE_NYC_RASTER), f.lit("GTiff")).alias("tile")
)
file_tile.select(f.col("tile.raster")).show(truncate=False)
# +----------------------------------------------------------+
# |raster                                                    |
# +----------------------------------------------------------+
# |/Volumes/.../nyc/sentinel2/nyc_sentinel2_red.tif          |
# +----------------------------------------------------------+
# Binary tile (same file read as binary)
binary_tile = spark.read.format("binaryFile").load(SAMPLE_NYC_RASTER).select(
    rx.rst_fromcontent(f.col("content"), f.lit("GTiff")).alias("tile")
)
binary_tile.select(f.length(f.col("tile.raster")).alias("size_bytes")).show()
+----------------------------------------------------------+
|raster (path) or size_bytes |
+----------------------------------------------------------+
|/Volumes/.../nyc_sentinel2_red.tif or 2345678 |
+----------------------------------------------------------+
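Custom code that must accept both kinds of tile can normalize the raster field to bytes before processing. The sketch below assumes POSIX-style local paths or file:// URIs only; `raster_bytes` is a hypothetical helper, and remote schemes such as s3:// are deliberately rejected rather than guessed at.

```python
from pathlib import Path
from typing import Union
from urllib.parse import unquote, urlparse


def raster_bytes(raster: Union[str, bytes, bytearray]) -> bytes:
    """Return raster content for a binary tile or a locally readable path."""
    if isinstance(raster, (bytes, bytearray)):
        # Binary tile: the payload is already the full file content.
        return bytes(raster)
    parsed = urlparse(raster)
    if parsed.scheme == "file":
        return Path(unquote(parsed.path)).read_bytes()
    if parsed.scheme == "":
        return Path(raster).read_bytes()
    # Object stores and other schemes need their own client; not handled here.
    raise ValueError(f"unsupported scheme: {parsed.scheme!r}")


print(len(raster_bytes(b"\x00\x01\x02")))  # 3
```

Reading a file-based tile this way pulls the whole file into memory, so it trades the small footprint of a path for the convenience of a uniform input type.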
Tessellated vs Non-Tessellated Tiles
Non-Tessellated Tiles
Created by constructors (rst_fromfile, rst_fromcontent) or readers:
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER)
df.select(
    f.col("tile.cellid"),    # null
    f.col("tile.raster"),    # binary data
    f.col("tile.metadata")   # {driver: "GTiff", ...}
).show()
+------+--------+------------------+
|cellid|raster  |metadata          |
+------+--------+------------------+
|null  |[BINARY]|{driver=GTiff,...}|
+------+--------+------------------+
Characteristics:
- cellid is null
- Represents entire raster or a tile from tiling operations
- Suitable for processing complete rasters
Tessellated Tiles
Created by rst_h3_tessellate:
from databricks.labs.gbx.rasterx import functions as rx
df = spark.read.format("gdal").load(SAMPLE_NYC_RASTER).select(
    f.explode(rx.rst_h3_tessellate(f.col("tile"), f.lit(7))).alias("tile")
)
df.select(
    f.col("tile.cellid"),    # H3 cell ID (e.g., 604189641255419903)
    f.col("tile.raster"),    # binary data (clipped to cell)
    f.col("tile.metadata")   # {driver: "GTiff", RASTERX_CELL_ID: "604...", ...}
).show()
+-------------------+--------+------------------+
|cellid |raster |metadata |
+-------------------+--------+------------------+
|604189641255419903 |[BINARY]|{RASTERX_CELL_ID..|
+-------------------+--------+------------------+
Characteristics:
- cellid contains H3 cell ID
- Raster clipped to cell bounds
- Enables spatial joins and grid-based processing
- Metadata includes RASTERX_CELL_ID key
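The grid-based processing that cellid enables comes down to grouping or joining on the cell ID. In Spark that would be a join or groupBy on tile.cellid; the plain-Python sketch below shows the same idea on hypothetical (cellid, source raster) pairs. The `group_by_cell` helper and the "red" and "nir" labels are illustrative assumptions, not GeoBrix names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_by_cell(rows: Iterable[Tuple[Optional[int], str]]) -> Dict[int, List[str]]:
    """Group (cellid, label) pairs by cell, skipping non-tessellated rows."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for cellid, label in rows:
        if cellid is None:
            continue  # non-tessellated tiles carry no grid cell
        groups[cellid].append(label)
    return dict(groups)


# Hypothetical rows: (tile.cellid, name of the source raster)
rows = [
    (604189641255419903, "red"),
    (604189641255419903, "nir"),
    (604189641255420159, "red"),
    (None, "whole_scene"),
]
grouped = group_by_cell(rows)
# Cells covered by both rasters: the grid-based equivalent of an inner join
shared = sorted(c for c, labels in grouped.items() if {"red", "nir"} <= set(labels))
print(shared)  # [604189641255419903]
```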
Next Steps
- RasterX Functions - Functions that work with tiles
- Custom UDFs - Build custom tile processing
- Library Integration - Use tiles with rasterio/xarray