Shapefile Reader

Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, but your compute usually decides the tier: the lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster (JAR + GDAL init script); where it is available it uses native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.
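The compute constraints above can be sketched as a small lookup. This is only an illustration: `available_formats` and its parameters are hypothetical names, not part of the library, and the accepted architecture strings (`x86_64`, `amd64`) are an assumption about how your environment reports its CPU.

```python
from __future__ import annotations


def available_formats(serverless: bool, shared_access: bool, cpu_arch: str) -> list[str]:
    """Return the reader formats usable on a cluster, lightweight first.

    Hypothetical helper: it only encodes the rule described in the text.
    """
    # The lightweight tier needs no JAR or init script, so it is always listed.
    formats = ["shapefile_gbx"]
    # The heavyweight tier requires a classic (non-serverless, non-shared) x86 cluster.
    classic_x86 = (
        not serverless
        and not shared_access
        and cpu_arch.lower() in {"x86_64", "amd64"}
    )
    if classic_x86:
        formats.append("shapefile_ogr")
    return formats


if __name__ == "__main__":
    print(available_formats(serverless=False, shared_access=False, cpu_arch="x86_64"))
```

Because both tiers emit the same schema, downstream code does not need to change when the format name does.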

What you can point `.load()` at

The table below covers all three accepted forms. The behavior is identical in both tiers (shapefile_gbx and shapefile_ogr).

Path passed to `.load()`	What happens
`/path/to/x.shp`	Reads that single shapefile. Side-car files (`.shx`, `.dbf`, `.prj`, `.cpg`) are picked up automatically from the same directory.
`/path/to/dir/`	Reads all shapefiles found under `dir` recursively and unions the results into one DataFrame. The directory may contain unzipped `.shp` bundles, zipped `.shp.zip` archives, or a mix of both. All shapefiles must share the same schema; if their schemas diverge, the reader raises an error. In that case, load them separately or point at a single-stem directory.
`/path/to/x.shp.zip`	Reads a zipped shapefile archive (the `.zip` must contain the `.shp` and its side-cars at the top level).
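Since a zipped archive must hold the `.shp` and its side-cars at the top level, it can help to check an archive before loading it. The following is a minimal sketch using only the Python standard library; `missing_top_level_members` is a hypothetical helper, and treating `.shx` and `.dbf` as the required side-cars follows the Supported Files list on this page.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions that must sit next to each other at the archive's top level.
REQUIRED_EXTS = {".shp", ".shx", ".dbf"}


def missing_top_level_members(zip_source) -> dict:
    """Map each top-level .shp stem to the required extensions it lacks.

    An empty dict means the archive has no .shp at the top level at all.
    `zip_source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        # Skip directory entries; keep only members with no folder component.
        names = [n for n in zf.namelist() if not n.endswith("/")]
    top_level = [PurePosixPath(n) for n in names if "/" not in n]
    stems = {p.stem for p in top_level if p.suffix.lower() == ".shp"}
    report = {}
    for stem in stems:
        present = {p.suffix.lower() for p in top_level if p.stem == stem}
        report[stem] = sorted(REQUIRED_EXTS - present)
    return report
```

A result such as `{"roads": []}` indicates a complete top-level bundle; a non-empty list names the missing side-cars.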

Schema divergence

When loading a directory that contains shapefiles with different column layouts, the reader raises:

shapefile reader: shapefiles under <path> have differing schemas;
load them separately or use a single-stem directory.
Stems: <a>, <b>.

Either pass a path to a specific .shp file, or ensure all shapefiles in the directory share the same schema before loading them together.
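One way to load divergent shapefiles separately is to enumerate them first and read each path on its own. The sketch below is an assumption-laden illustration: `list_shapefiles` and `load_separately` are hypothetical helpers, the match on `.shp` and `.shp.zip` names mirrors the table above, and it presumes the directory is reachable through the local filesystem (as Volumes paths typically are on the driver).

```python
from pathlib import Path


def list_shapefiles(root) -> list:
    """Return sorted paths of .shp and .shp.zip files under root (recursive)."""
    found = []
    for p in Path(root).rglob("*"):
        name = p.name.lower()
        if p.is_file() and (name.endswith(".shp") or name.endswith(".shp.zip")):
            found.append(str(p))
    return sorted(found)


def load_separately(spark, root, fmt="shapefile_gbx") -> dict:
    """Read every shapefile under root into its own DataFrame, keyed by path.

    `spark` is an existing SparkSession; nothing is unioned, so differing
    schemas do not conflict.
    """
    return {path: spark.read.format(fmt).load(path) for path in list_shapefiles(root)}
```

Each DataFrame can then be inspected with `df.columns` to decide which ones are safe to union.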

Options

Both tiers (lightweight shapefile_gbx, heavyweight shapefile_ogr) preset the ESRI Shapefile driver and take the same options.

Option	Default	Description
`chunkSize`	`"10000"`	Records per read batch — Arrow in-memory batching on the single per-file read, not partition splitting.
`layerNumber` / `layerN`	`"0"`	Layer index for multi-layer formats (0-based) — `layerNumber` (lightweight) / `layerN` (heavyweight).
`layerName`	`""`	Layer name (overrides the layer index when set).
`asWKB`	`"true"`	Output geometry as WKB (binary) vs WKT (text).

All other OGR reader options (driverName, …) are also available.
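Because the layer-index option is named differently in each tier, a small helper can keep calling code tier-agnostic. This is a sketch only: `shapefile_reader_options` is a hypothetical function, and the option names and defaults are taken from the table above.

```python
def shapefile_reader_options(fmt, layer_index=0, chunk_size=10000, as_wkb=True) -> dict:
    """Build the string-valued option dict for the given reader format."""
    layer_keys = {"shapefile_gbx": "layerNumber", "shapefile_ogr": "layerN"}
    if fmt not in layer_keys:
        raise ValueError(f"unknown shapefile reader format: {fmt!r}")
    return {
        "chunkSize": str(chunk_size),          # records per read batch
        layer_keys[fmt]: str(layer_index),     # 0-based layer index
        "asWKB": "true" if as_wkb else "false",
    }


if __name__ == "__main__":
    opts = shapefile_reader_options("shapefile_ogr", chunk_size=50000)
    print(opts)
    # With an existing SparkSession named `spark`, the dict can be passed as:
    # df = spark.read.format("shapefile_ogr").options(**opts).load(path)
```

Values are passed as strings to match the defaults shown in the table.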

Example — tuning the chunk size for performance:

# Raise the chunk size above the default of 10000 (sample-data Volumes path)
df = (
    spark.read.format("shapefile_ogr")
    .option("chunkSize", "50000")  # records per in-memory read batch
    .load("{SAMPLE_SHAPEFILE_PATH}")
)
df.show()

Example output
+--------------------+-----------+----+
|geom_0              |geom_0_srid|name|
+--------------------+-----------+----+
|[BINARY]            |4326       |... |
|...                 |...        |... |
+--------------------+-----------+----+
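With `asWKB` left at `"true"`, `geom_0` holds binary geometry, which is why it prints as `[BINARY]`. As a rough illustration of what those bytes contain, the sketch below reads only the WKB header with the standard library. It assumes the column holds standard WKB; `wkb_geometry_type` is a hypothetical helper, and the masking of the high flag bits is a precaution for the older OGC 2.5D and EWKB variants rather than something this page specifies.

```python
import struct

# Base geometry type codes shared by the common WKB variants.
WKB_TYPES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
    7: "GeometryCollection",
}


def wkb_geometry_type(wkb) -> str:
    """Return the geometry type name encoded in a WKB header."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    byte_order = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type as an unsigned 32-bit integer.
    (type_code,) = struct.unpack_from(byte_order + "I", wkb, 1)
    # Clear the three high flag bits, then drop ISO Z/M offsets (1000, 2000, 3000).
    base = (type_code & 0x1FFFFFFF) % 1000
    return WKB_TYPES.get(base, f"Unknown({type_code})")


if __name__ == "__main__":
    point = struct.pack("<BIdd", 1, 1, 1.0, 2.0)  # little-endian POINT(1 2)
    print(wkb_geometry_type(point))
```

For real work, a geometry library is the better choice; the point here is only that the column is ordinary bytes that can be collected and inspected.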

Lightweight · shapefile_gbx

shapefile_gbx is the lightweight Shapefile reader (pyogrio-backed, no JAR). It reads .shp and zipped shapefiles and emits the same schema as the heavyweight shapefile_ogr reader.

# Lightweight Shapefile reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register
register(spark)
# SAMPLE must be set beforehand to a .shp, .shp.zip, or directory path
df = spark.read.format("shapefile_gbx").load(SAMPLE)   # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()

shapefile_gbx supports Python and SQL bindings, but not Scala.

Typical pipeline: ingest into a table

The common pattern is to land Shapefile archives in a table for downstream analytics — on Databricks a managed table is Delta:

df = spark.read.format("shapefile_gbx").load("/Volumes/main/geo/raw/")  # a folder of .shp/.zip files
df.write.mode("overwrite").saveAsTable("main.geo.parcels")               # Delta table on Databricks

Reading a folder fans the files across the cluster (one partition per file), so ingest scales with the data — unlike a single-node pyogrio.read_* that parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.

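Heavyweight · shapefile_ogr

shapefile_ogr is the heavyweight Shapefile reader. It reads the ESRI Shapefile format through native GDAL on the JVM and requires a classic x86 cluster with the JAR and GDAL init script.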

Format Name

shapefile_ogr

Supported Files

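.shp - Standard shapefile (requires the .shx and .dbf side-car files in the same directory)
.zip - Zipped shapefile archive (for example x.shp.zip) with the .shp and its side-cars at the top level
Directories containing multiple shapefiles that share the same schema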
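Before reading an unzipped `.shp`, it can be worth confirming that its required side-cars are present. A minimal sketch, assuming the path is reachable through the local filesystem; `missing_sidecars` is a hypothetical helper, and the check is case-sensitive on case-sensitive filesystems.

```python
from pathlib import Path


def missing_sidecars(shp_path, required=(".shx", ".dbf")) -> list:
    """Return the required side-car extensions not found next to the .shp."""
    shp = Path(shp_path)
    # with_suffix swaps ".shp" for each side-car extension in the same directory.
    return [ext for ext in required if not shp.with_suffix(ext).exists()]
```

An empty list means both required side-cars sit alongside the `.shp`; optional files such as `.prj` and `.cpg` are not checked here.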

Basic Usage

Python

# Read shapefile (sample-data Volumes path)
df = spark.read.format("shapefile_ogr").load("{SAMPLE_SHAPEFILE_PATH}")
df.show()

Example output
+--------------------+-----------+----+
|geom_0              |geom_0_srid|name|
+--------------------+-----------+----+
|[BINARY]            |4326       |... |
|...                 |...        |... |
+--------------------+-----------+----+

Scala

val df = spark.read.format("shapefile_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")

Example output
+--------------------+-----------+----+
|geom_0              |geom_0_srid|name|
+--------------------+-----------+----+
|[BINARY]            |4326       |... |
|...                 |...        |... |
+--------------------+-----------+----+

SQL

-- Read shapefile in SQL (sample-data Volumes path)
SELECT * FROM shapefile_ogr.`{SAMPLE_SHAPEFILE_PATH}`;

Example output
+--------------------+-----------+----+
|geom_0              |geom_0_srid|name|
+--------------------+-----------+----+
|[BINARY]            |4326       |... |
|...                 |...        |... |
+--------------------+-----------+----+
