{ "cells": [ { "cell_type": "markdown", "source": [ "# Using grid index systems in Mosaic" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "1deecb01-8ded-4750-a112-5847e410846b" } } }, { "cell_type": "code", "source": [ "from pyspark.sql.functions import *\nfrom mosaic import enable_mosaic\nenable_mosaic(spark, dbutils)" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "f6311e22-f0a8-4025-b612-4367b5f1cac2" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "Set operations over big geospatial datasets become very expensive without some form of spatial indexing.\n\nSpatial indexes not only allow operations like point-in-polygon joins to be partitioned but, if only approximate results are required, can be used to reduce these to deterministic SQL joins directly on the indexes." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "e5603057-662b-4bf3-9517-84b35ad7a8b5" } } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "c8199fdd-1372-43a4-8edf-f96129f4195a" } } }, { "cell_type": "markdown", "source": [ "The workflow for a point-in-poly spatial join might look like the following:" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "39e3cf64-61c9-4930-be5e-91b254cf0984" } } }, { "cell_type": "markdown", "source": [ "## 1. Read the source point and polygon datasets." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "947d6e33-06b5-4ea0-bc55-0fdb5b8e8524" } } }, { "cell_type": "code", "source": [ "drop_cols = [\n \"rate_code_id\", \"store_and_fwd_flag\", \"dropoff_longitude\",\n \"dropoff_latitude\", \"payment_type\", \"fare_amount\",\n \"extra\", \"mta_tax\", \"tip_amount\", \"tolls_amount\",\n \"total_amount\"\n]\n\ntrips = (\n spark.table(\"delta.`/databricks-datasets/nyctaxi/tables/nyctaxi_yellow`\")\n .drop(*drop_cols)\n .limit(5_000_000)\n .repartition(sc.defaultParallelism * 20)\n)\n\ntrips.show()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "f1474a53-da53-4955-b944-083fdf3a22a1" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "