
File GeoDatabase Reader

Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, but your compute usually decides the tier: the lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster (JAR + GDAL init script); where it is available it uses native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.
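The tier rule above can be written down as a small helper. This is only a sketch: the function name, the compute labels ("serverless", "standard", "classic"), and the heavy_installed flag are hypothetical conventions for illustration, not part of the library.

```python
LIGHT = "file_gdb_gbx"   # lightweight reader: no JAR or init script
HEAVY = "file_gdb_ogr"   # heavyweight reader: classic x86 cluster with JAR + GDAL init script


def choose_gdb_format(compute: str, arch: str = "x86_64", heavy_installed: bool = False) -> str:
    """Pick a reader format name from a description of the compute.

    compute         -- hypothetical label: "serverless", "standard" (shared), or "classic"
    arch            -- CPU architecture string, e.g. "x86_64" or "aarch64"
    heavy_installed -- True only if the JAR and GDAL init script are set up (assumed known by the caller)
    """
    compute = compute.lower()
    if compute not in {"serverless", "standard", "classic"}:
        raise ValueError(f"unknown compute type: {compute!r}")
    is_x86 = arch.lower() in {"x86_64", "amd64"}
    # Heavyweight is only possible on a classic x86 cluster with the JAR + init script in place.
    if compute == "classic" and is_x86 and heavy_installed:
        return HEAVY
    # Everywhere else (Serverless, standard/shared, ARM) the lightweight tier is the only option.
    return LIGHT
```

Because both tiers emit the same schema, downstream code does not need to change when the chosen format name does. Whether the heavyweight tier is actually faster for a given workload is a separate question that the Benchmarking page addresses.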

Options

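File Geodatabases contain multiple feature classes (layers). Both tiers (lightweight file_gdb_gbx, heavyweight file_gdb_ogr) preset the OpenFileGDB driver and expose a feature-class selector. They take the same options, except that the feature-class index option is named layerNumber on the lightweight tier and layerN on the heavyweight tier.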

| Option | Default | Description |
|---|---|---|
| chunkSize | "10000" | Records per read batch. This is Arrow in-memory batching on the single per-file read, not partition splitting. |
| layerNumber / layerN | "0" | Feature-class index to read (0-based; 0 is the first feature class). Named layerNumber on the lightweight tier and layerN on the heavyweight tier. |
| layerName | "" (first feature class) | Feature-class (layer) name to read; takes precedence over the index when set. |
| asWKB | "true" | Output geometry as WKB (binary) when true, or WKT (text) when false. |

All other OGR reader options (driverName, …) are also available.
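Since the index option is spelled differently on each tier, a thin wrapper can keep calling code tier-agnostic. The sketch below is an illustration under assumptions: gdb_reader_options and read_gdb are hypothetical helper names, and only the option names and defaults listed above come from this page.

```python
from typing import Dict, Optional

# The feature-class index option is named differently per tier (see the options above).
_INDEX_OPTION = {"file_gdb_gbx": "layerNumber", "file_gdb_ogr": "layerN"}


def gdb_reader_options(
    fmt: str,
    layer_name: Optional[str] = None,
    layer_index: Optional[int] = None,
    chunk_size: int = 10000,
    as_wkb: bool = True,
) -> Dict[str, str]:
    """Build a string-valued option dict for the given reader format."""
    if fmt not in _INDEX_OPTION:
        raise ValueError(f"unsupported format: {fmt!r}")
    opts = {"chunkSize": str(chunk_size), "asWKB": "true" if as_wkb else "false"}
    if layer_name:
        # layerName takes precedence over the index, so the index is not sent at all.
        opts["layerName"] = layer_name
    elif layer_index is not None:
        opts[_INDEX_OPTION[fmt]] = str(layer_index)
    return opts


def read_gdb(spark, fmt: str, path: str, **kwargs):
    """Read a File Geodatabase with either tier (the lightweight tier must be registered first)."""
    return spark.read.format(fmt).options(**gdb_reader_options(fmt, **kwargs)).load(path)
```

For example, read_gdb(spark, "file_gdb_gbx", path, layer_index=1) would send layerNumber="1", while the same call with "file_gdb_ogr" would send layerN="1".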

Zipped input: a .gdb.zip (or a plain .zip wrapping a single .gdb) is read transparently — GDAL opens it via /vsizip/, so just point .load() at the archive (or a directory of them); no option is required. This is the layout the file_gdb_gbx writer produces with zip=true.
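Before pointing .load() at a plain .zip, it can be useful to confirm that the archive really wraps a single .gdb. The following is a minimal pre-flight sketch using only the standard library; the helper names are hypothetical, and it only looks for top-level .gdb folders (an archive with the geodatabase files at the zip root would return an empty list).

```python
import zipfile
from typing import Iterable, List


def top_level_gdb_dirs(names: Iterable[str]) -> List[str]:
    """Return the distinct top-level '*.gdb' folder names among zip member names."""
    found = set()
    for name in names:
        if "/" not in name:
            continue  # a file at the zip root, not inside a folder
        first = name.split("/", 1)[0]
        if first.lower().endswith(".gdb"):
            found.add(first)
    return sorted(found)


def gdb_dirs_in_zip(zip_file) -> List[str]:
    """List top-level .gdb folders in an archive (accepts a path or a file-like object)."""
    with zipfile.ZipFile(zip_file) as zf:
        return top_level_gdb_dirs(zf.namelist())
```

A result with exactly one entry matches the "plain .zip wrapping a single .gdb" layout described above; anything else is worth inspecting by hand before ingesting.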

Example — reading a specific feature class by name:

# Read a specific feature class (sample-data Volumes path)
df = (
    spark.read.format("file_gdb_ogr")
    .option("layerName", "NYC_Boroughs")
    .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/filegdb/NYC_Sample.gdb.zip")
)
df.show()
Example output
+--------------------+--------------+---------+
|SHAPE |SHAPE_srid |BoroName |
+--------------------+--------------+---------+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+--------------+---------+
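With asWKB left at "true", the geometry column shows up as opaque binary. As a quick sanity check, the WKB header can be decoded with the standard library alone. This is a sketch, not something the reader provides: wkb_geometry_type is a hypothetical helper, and it reads only the byte-order flag and geometry type code defined by the WKB format.

```python
import struct

# Base geometry type codes from the WKB format.
_WKB_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}


def wkb_geometry_type(wkb: bytes) -> str:
    """Return the geometry type name encoded in a WKB header."""
    if len(wkb) < 5:
        raise ValueError("WKB must be at least 5 bytes (byte order + type code)")
    endian = "<" if wkb[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
    (code,) = struct.unpack(endian + "I", wkb[1:5])
    # Strip EWKB flag bits (top nibble) and ISO Z/M offsets (+1000/+2000/+3000).
    base = (code & 0x0FFFFFFF) % 1000
    return _WKB_TYPES.get(base, f"Unknown({code})")
```

On a collected row, something like wkb_geometry_type(bytes(row["SHAPE"])) would apply, with the column name taken from the example output above.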

file_gdb_gbx is the lightweight File Geodatabase reader (pyogrio-backed, no JAR). It reads ESRI File Geodatabases and emits the same schema as the heavyweight file_gdb_ogr reader.

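# Lightweight File Geodatabase reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register
register(spark)
# Sample-data Volumes path (same archive as the heavyweight example above)
SAMPLE = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/filegdb/NYC_Sample.gdb.zip"
df = spark.read.format("file_gdb_gbx").load(SAMPLE) # (attrs..., <geom>, <geom>_srid, <geom>_srid_proj)
df.show()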

file_gdb_gbx supports Python and SQL bindings, but not Scala.

Typical pipeline: ingest into a table

A common pattern is to land File Geodatabase directories in a table for downstream analytics. On Databricks, a managed table is a Delta table:

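from databricks.labs.gbx.ds.register import register
register(spark)  # the lightweight reader must be registered before use
df = spark.read.format("file_gdb_gbx").load("/Volumes/main/geo/raw/")  # a folder of .gdb directories
df.write.mode("overwrite").saveAsTable("main.geo.parcels") # Delta table on Databricks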

Reading a folder fans the files across the cluster (one partition per file), so ingest parallelism grows with the number of files. A single-node pyogrio.read_* call, by contrast, parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.
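Because parallelism follows the file count, it can help to check how many geodatabases a folder holds before ingesting it. The sketch below is a rough estimate only: list_gdb_inputs is a hypothetical helper, it does not recurse, and it assumes each .gdb directory or .zip archive directly under the folder counts as one file for the reader.

```python
import os
from typing import List


def list_gdb_inputs(folder: str) -> List[str]:
    """Return .gdb directories and .zip archives found directly under `folder`."""
    found = []
    for entry in sorted(os.listdir(folder)):
        full = os.path.join(folder, entry)
        lower = entry.lower()
        if os.path.isdir(full) and lower.endswith(".gdb"):
            found.append(full)   # unzipped File Geodatabase directory
        elif os.path.isfile(full) and lower.endswith(".zip"):
            found.append(full)   # .gdb.zip or plain .zip archive
    return found
```

If len(list_gdb_inputs("/Volumes/main/geo/raw/")) comes back as 1, the read will not spread across the cluster no matter how large that one geodatabase is.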

Next Steps