{ "cells": [ { "cell_type": "markdown", "source": [ "# Using grid index systems in Mosaic" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "1deecb01-8ded-4750-a112-5847e410846b" } } }, { "cell_type": "code", "source": [ "from pyspark.sql.functions import *\nfrom mosaic import enable_mosaic\nenable_mosaic(spark, dbutils)" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "f6311e22-f0a8-4025-b612-4367b5f1cac2" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "Set operations over big geospatial datasets become very expensive without some form of spatial indexing.\n\nSpatial indexes not only allow operations like point-in-polygon joins to be partitioned but, if only approximate results are required, can be used to reduce these to deterministic SQL joins directly on the indexes." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "e5603057-662b-4bf3-9517-84b35ad7a8b5" } } }, { "cell_type": "markdown", "source": [ "![example h3 point-in-poly image](https://databricks.com/wp-content/uploads/2021/01/blog-geospatial-3.jpg)" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "c8199fdd-1372-43a4-8edf-f96129f4195a" } } }, { "cell_type": "markdown", "source": [ "The workflow for a point-in-poly spatial join might look like the following:" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "39e3cf64-61c9-4930-be5e-91b254cf0984" } } }, { "cell_type": "markdown", "source": [ "## 1. Read the source point and polygon datasets." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "947d6e33-06b5-4ea0-bc55-0fdb5b8e8524" } } }, { "cell_type": "code", "source": [ "drop_cols = [\n \"rate_code_id\", \"store_and_fwd_flag\", \"dropoff_longitude\",\n \"dropoff_latitude\", \"payment_type\", \"fare_amount\",\n \"extra\", \"mta_tax\", \"tip_amount\", \"tolls_amount\",\n \"total_amount\"\n]\n\ntrips = (\n spark.table(\"delta.`/databricks-datasets/nyctaxi/tables/nyctaxi_yellow`\")\n .drop(*drop_cols)\n .limit(5_000_000)\n .repartition(sc.defaultParallelism * 20)\n)\n\ntrips.show()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "f1474a53-da53-4955-b944-083fdf3a22a1" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+\n|vendor_id| pickup_datetime| dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+\n| CMT|2009-01-01 20:07:33|2009-01-01 20:12:28| 1| 0.8| -74.001041| 40.731|\n| CMT|2009-01-06 15:29:12|2009-01-06 15:51:57| 2| 3.3| -73.996489| 40.725742|\n| CMT|2010-02-14 17:42:16|2010-02-14 17:55:03| 1| 3.4| -74.002949| 40.734254|\n| CMT|2010-02-11 18:19:01|2010-02-11 18:27:54| 1| 1.5| -73.998133| 40.682463|\n| VTS|2009-04-29 12:26:00|2009-04-29 12:35:00| 3| 2.05| -74.001332| 40.72006|\n| VTS|2009-04-24 15:03:00|2009-04-24 15:23:00| 2| 2.89| -73.989952| 40.734625|\n| CMT|2010-02-28 13:55:44|2010-02-28 14:02:37| 1| 1.2| -74.006015| 40.735279|\n| VTS|2009-09-27 08:46:00|2009-09-27 08:59:00| 1| 3.97| -74.000148| 40.717468|\n| CMT|2010-02-18 09:48:52|2010-02-18 10:08:38| 1| 3.0| -73.995177| 40.725297|\n| CMT|2009-04-09 20:33:44|2009-04-09 20:39:33| 2| 0.6| -73.990133| 40.729321|\n| CMT|2010-02-13 22:41:10|2010-02-13 23:07:04| 1| 4.2| -74.009175| 40.706284|\n| CMT|2009-01-25 20:06:51|2009-01-25 20:12:37| 1| 1.3| -74.007384| 40.717929|\n| VTS|2010-02-27 18:19:00|2010-02-27 18:38:00| 1| 4.2| -74.011512| 40.710588|\n| VTS|2010-02-15 10:17:00|2010-02-15 10:24:00| 1| 1.74| -74.016442| 40.711617|\n| CMT|2009-12-26 18:45:49|2009-12-26 18:59:08| 1| 4.8| -74.01014| 40.712263|\n| CMT|2009-12-06 01:00:07|2009-12-06 01:11:41| 2| 4.2| -74.002505| 40.729001|\n| VTS|2009-10-04 14:36:00|2009-10-04 14:42:00| 1| 1.13| -74.006767| 40.718942|\n| CMT|2009-01-18 00:20:50|2009-01-18 00:36:29| 3| 2.1| -73.993258| 40.721401|\n| VTS|2009-05-18 13:24:00|2009-05-18 13:33:00| 1| 1.91| -73.992785| 40.730412|\n| VTS|2009-11-11 21:51:00|2009-11-11 22:13:00| 5| 4.71| -74.010065| 40.733383|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+\nonly showing top 20 rows\n\n
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+\nvendor_id| pickup_datetime| dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+\n CMT|2009-01-01 20:07:33|2009-01-01 20:12:28| 1| 0.8| -74.001041| 40.731|\n CMT|2009-01-06 15:29:12|2009-01-06 15:51:57| 2| 3.3| -73.996489| 40.725742|\n CMT|2010-02-14 17:42:16|2010-02-14 17:55:03| 1| 3.4| -74.002949| 40.734254|\n CMT|2010-02-11 18:19:01|2010-02-11 18:27:54| 1| 1.5| -73.998133| 40.682463|\n VTS|2009-04-29 12:26:00|2009-04-29 12:35:00| 3| 2.05| -74.001332| 40.72006|\n VTS|2009-04-24 15:03:00|2009-04-24 15:23:00| 2| 2.89| -73.989952| 40.734625|\n CMT|2010-02-28 13:55:44|2010-02-28 14:02:37| 1| 1.2| -74.006015| 40.735279|\n VTS|2009-09-27 08:46:00|2009-09-27 08:59:00| 1| 3.97| -74.000148| 40.717468|\n CMT|2010-02-18 09:48:52|2010-02-18 10:08:38| 1| 3.0| -73.995177| 40.725297|\n CMT|2009-04-09 20:33:44|2009-04-09 20:39:33| 2| 0.6| -73.990133| 40.729321|\n CMT|2010-02-13 22:41:10|2010-02-13 23:07:04| 1| 4.2| -74.009175| 40.706284|\n CMT|2009-01-25 20:06:51|2009-01-25 20:12:37| 1| 1.3| -74.007384| 40.717929|\n VTS|2010-02-27 18:19:00|2010-02-27 18:38:00| 1| 4.2| -74.011512| 40.710588|\n VTS|2010-02-15 10:17:00|2010-02-15 10:24:00| 1| 1.74| -74.016442| 40.711617|\n CMT|2009-12-26 18:45:49|2009-12-26 18:59:08| 1| 4.8| -74.01014| 40.712263|\n CMT|2009-12-06 01:00:07|2009-12-06 01:11:41| 2| 4.2| -74.002505| 40.729001|\n VTS|2009-10-04 14:36:00|2009-10-04 14:42:00| 1| 1.13| -74.006767| 40.718942|\n CMT|2009-01-18 00:20:50|2009-01-18 00:36:29| 3| 2.1| -73.993258| 40.721401|\n VTS|2009-05-18 13:24:00|2009-05-18 13:33:00| 1| 1.91| -73.992785| 40.730412|\n VTS|2009-11-11 21:51:00|2009-11-11 22:13:00| 5| 4.71| -74.010065| 40.733383|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+\nonly showing top 20 rows\n\n
" ] } } ], "execution_count": 0 }, { "cell_type": "code", "source": [ "from mosaic import st_geomfromgeojson\n\nuser = spark.sql(\"select current_user() as user\").collect()[0][\"user\"]\n\nneighbourhoods = (\n spark.read.format(\"json\")\n .load(f\"dbfs:/FileStore/shared_uploads/{user}/NYC_Taxi_Zones.geojson\")\n .repartition(sc.defaultParallelism)\n .withColumn(\"geometry\", st_geomfromgeojson(to_json(col(\"geometry\"))))\n .select(\"properties.*\", \"geometry\")\n .drop(\"shape_area\", \"shape_leng\")\n)\n\nneighbourhoods.show()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "e3a366bf-4d60-4356-8b7b-eedb0beabc8d" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
+-------------+-----------+--------+-------------------+--------------------+\n| borough|location_id|objectid| zone| geometry|\n+-------------+-----------+--------+-------------------+--------------------+\n| Brooklyn| 123| 123| Homecrest|{6, 4326, [[[-73....|\n| Manhattan| 153| 153| Marble Hill|{6, 4326, [[[-73....|\n| Brooklyn| 112| 112| Greenpoint|{6, 4326, [[[-73....|\n| Manhattan| 233| 233|UN/Turtle Bay South|{6, 4326, [[[-73....|\n| Manhattan| 43| 43| Central Park|{6, 4326, [[[-73....|\n| Queens| 201| 201| Rockaway Park|{6, 4326, [[[-73....|\n| Queens| 131| 131| Jamaica Estates|{6, 4326, [[[-73....|\n| Brooklyn| 111| 111|Green-Wood Cemetery|{6, 4326, [[[-73....|\n| Queens| 226| 226| Sunnyside|{6, 4326, [[[-73....|\n| Queens| 129| 129| Jackson Heights|{6, 4326, [[[-73....|\n| Manhattan| 120| 120| Highbridge Park|{6, 4326, [[[-73....|\n| Brooklyn| 76| 76| East New York|{6, 4326, [[[-73....|\n| Manhattan| 24| 24| Bloomingdale|{6, 4326, [[[-73....|\n| Manhattan| 202| 202| Roosevelt Island|{6, 4326, [[[-73....|\n| Manhattan| 100| 100| Garment District|{6, 4326, [[[-73....|\n|Staten Island| 251| 251| Westerleigh|{6, 4326, [[[-74....|\n| Manhattan| 74| 74| East Harlem North|{6, 4326, [[[-73....|\n| Queens| 98| 98| Fresh Meadows|{6, 4326, [[[-73....|\n| Manhattan| 211| 211| SoHo|{6, 4326, [[[-74....|\n| Bronx| 174| 174| Norwood|{6, 4326, [[[-73....|\n+-------------+-----------+--------+-------------------+--------------------+\nonly showing top 20 rows\n\n
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
+-------------+-----------+--------+-------------------+--------------------+\n borough|location_id|objectid| zone| geometry|\n+-------------+-----------+--------+-------------------+--------------------+\n Brooklyn| 123| 123| Homecrest|{6, 4326, [[[-73....|\n Manhattan| 153| 153| Marble Hill|{6, 4326, [[[-73....|\n Brooklyn| 112| 112| Greenpoint|{6, 4326, [[[-73....|\n Manhattan| 233| 233|UN/Turtle Bay South|{6, 4326, [[[-73....|\n Manhattan| 43| 43| Central Park|{6, 4326, [[[-73....|\n Queens| 201| 201| Rockaway Park|{6, 4326, [[[-73....|\n Queens| 131| 131| Jamaica Estates|{6, 4326, [[[-73....|\n Brooklyn| 111| 111|Green-Wood Cemetery|{6, 4326, [[[-73....|\n Queens| 226| 226| Sunnyside|{6, 4326, [[[-73....|\n Queens| 129| 129| Jackson Heights|{6, 4326, [[[-73....|\n Manhattan| 120| 120| Highbridge Park|{6, 4326, [[[-73....|\n Brooklyn| 76| 76| East New York|{6, 4326, [[[-73....|\n Manhattan| 24| 24| Bloomingdale|{6, 4326, [[[-73....|\n Manhattan| 202| 202| Roosevelt Island|{6, 4326, [[[-73....|\n Manhattan| 100| 100| Garment District|{6, 4326, [[[-73....|\nStaten Island| 251| 251| Westerleigh|{6, 4326, [[[-74....|\n Manhattan| 74| 74| East Harlem North|{6, 4326, [[[-73....|\n Queens| 98| 98| Fresh Meadows|{6, 4326, [[[-73....|\n Manhattan| 211| 211| SoHo|{6, 4326, [[[-74....|\n Bronx| 174| 174| Norwood|{6, 4326, [[[-73....|\n+-------------+-----------+--------+-------------------+--------------------+\nonly showing top 20 rows\n\n
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "## 2. Compute the resolution of index required to optimize the join." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "ae8cc184-574e-458f-9613-d2225b7a73e8" } } }, { "cell_type": "code", "source": [ "from mosaic import MosaicFrame\n\nneighbourhoods_mdf = MosaicFrame(neighbourhoods, \"geometry\")\nhelp(neighbourhoods_mdf.get_optimal_resolution)" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "08dc48c1-9da2-4f8a-a233-aae456cac972" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
Help on method get_optimal_resolution in module mosaic.core.mosaic_frame:\n\nget_optimal_resolution(sample_rows: Union[int, NoneType] = None, sample_fraction: Union[float, NoneType] = None) -> int method of mosaic.core.mosaic_frame.MosaicFrame instance\n Analyzes the geometries in the currently selected geometry column and proposes an optimal\n grid-index resolution.\n \n Provide either `sample_rows` or `sample_fraction` parameters to control how much data is passed to the analyzer.\n (Providing too little data to the analyzer may result in a `NotEnoughGeometriesException`)\n \n Parameters\n ----------\n sample_rows: int, optional\n The number of rows to sample.\n sample_fraction: float, optional\n The proportion of rows to sample.\n \n Returns\n -------\n int\n The recommended grid-index resolution to apply to this MosaicFrame.\n\n
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
Help on method get_optimal_resolution in module mosaic.core.mosaic_frame:\n\nget_optimal_resolution(sample_rows: Union[int, NoneType] = None, sample_fraction: Union[float, NoneType] = None) -> int method of mosaic.core.mosaic_frame.MosaicFrame instance\n Analyzes the geometries in the currently selected geometry column and proposes an optimal\n grid-index resolution.\n \n Provide either `sample_rows` or `sample_fraction` parameters to control how much data is passed to the analyzer.\n (Providing too little data to the analyzer may result in a `NotEnoughGeometriesException`)\n \n Parameters\n ----------\n sample_rows: int, optional\n The number of rows to sample.\n sample_fraction: float, optional\n The proportion of rows to sample.\n \n Returns\n -------\n int\n The recommended grid-index resolution to apply to this MosaicFrame.\n\n
" ] } } ], "execution_count": 0 }, { "cell_type": "code", "source": [ "(resolution := neighbourhoods_mdf.get_optimal_resolution(sample_fraction=1.))" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "50ed9fb7-3521-40fe-856c-8d5ed9b182d3" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
Out[15]: 9
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
Out[15]: 9
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "## 3. Apply the index to the set of points in your left-hand dataframe.\nThis will generate an index value that corresponds to the grid ‘cell’ that this point occupies." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "fccff6d5-da76-44c6-b015-749d3dcad3ca" } } }, { "cell_type": "code", "source": [ "from mosaic import grid_longlatascellid\nindexed_trips = trips.withColumn(\"ix\", grid_longlatascellid(lon=\"pickup_longitude\", lat=\"pickup_latitude\", resolution=lit(resolution)))\nindexed_trips.show()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "aad98fce-a22b-47d0-961d-62007acc23dd" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------------+\n|vendor_id| pickup_datetime| dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude| ix|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------------+\n| DDS|2009-01-17 18:49:57|2009-01-17 18:56:29| 3| 1.2| -74.004043| 40.733409|617733151092113407|\n| DDS|2009-12-01 00:47:52|2009-12-01 01:00:16| 1| 3.4| -73.991702| 40.726342|617733151087132671|\n| CMT|2009-02-09 16:50:21|2009-02-09 17:02:47| 1| 2.6| -73.999673| 40.733586|617733123805806591|\n| CMT|2009-12-07 07:15:47|2009-12-07 07:32:07| 1| 3.8| -74.01211| 40.716893|617733151084773375|\n| VTS|2009-10-16 22:02:00|2009-10-16 22:08:00| 1| 1.1| -74.010903| 40.71624|617733151084773375|\n| VTS|2009-12-23 22:13:00|2009-12-23 22:18:00| 1| 0.37| -74.002343| 40.73366|617733151092113407|\n| VTS|2009-12-12 01:24:00|2009-12-12 01:38:00| 2| 3.55| -74.002565| 40.728188|617733151091326975|\n| CMT|2009-12-07 13:10:37|2009-12-07 13:13:45| 1| 0.5| -73.999184| 40.73428|617733123805806591|\n| CMT|2009-11-08 22:20:44|2009-11-08 22:31:23| 1| 1.9| -74.003029| 40.733385|617733151092113407|\n| VTS|2009-12-27 20:01:00|2009-12-27 20:04:00| 1| 1.04| -74.000227| 40.732603|617733151092375551|\n| VTS|2009-02-13 14:33:00|2009-02-13 14:50:00| 3| 1.59| -74.006535| 40.732303|617733151092637695|\n| CMT|2009-11-15 21:13:32|2009-11-15 21:25:56| 3| 3.0| -73.998795| 40.730621|617733151092375551|\n| VTS|2009-01-08 18:13:00|2009-01-08 18:33:00| 2| 4.18| -74.0079| 40.712012|617733151021334527|\n| CMT|2009-11-30 13:30:13|2009-11-30 13:41:55| 1| 1.6| -74.004487| 40.734072|617733151092637695|\n| CMT|2009-01-11 20:02:22|2009-01-11 20:08:15| 1| 1.0| -74.004493| 40.713349|617733151020810239|\n| CMT|2009-12-30 18:46:08|2009-12-30 19:02:23| 1| 2.3| -74.010798| 40.716717|617733151084773375|\n| CMT|2009-11-18 21:50:12|2009-11-18 22:05:19| 1| 5.8| -73.992515| 40.694106|617733151038111743|\n| VTS|2009-11-21 12:51:00|2009-11-21 13:27:00| 1| 14.18| -73.9923| 40.715218|617733151109414911|\n| CMT|2009-01-20 09:34:49|2009-01-20 09:37:15| 1| 0.4| -74.0027| 40.733479|617733151092113407|\n| VTS|2009-01-03 07:07:00|2009-01-03 07:18:00| 1| 7.81| -73.994358| 40.690345|617733151037325311|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------------+\nonly showing top 20 rows\n\n
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------------+\nvendor_id| pickup_datetime| dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude| ix|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------------+\n DDS|2009-01-17 18:49:57|2009-01-17 18:56:29| 3| 1.2| -74.004043| 40.733409|617733151092113407|\n DDS|2009-12-01 00:47:52|2009-12-01 01:00:16| 1| 3.4| -73.991702| 40.726342|617733151087132671|\n CMT|2009-02-09 16:50:21|2009-02-09 17:02:47| 1| 2.6| -73.999673| 40.733586|617733123805806591|\n CMT|2009-12-07 07:15:47|2009-12-07 07:32:07| 1| 3.8| -74.01211| 40.716893|617733151084773375|\n VTS|2009-10-16 22:02:00|2009-10-16 22:08:00| 1| 1.1| -74.010903| 40.71624|617733151084773375|\n VTS|2009-12-23 22:13:00|2009-12-23 22:18:00| 1| 0.37| -74.002343| 40.73366|617733151092113407|\n VTS|2009-12-12 01:24:00|2009-12-12 01:38:00| 2| 3.55| -74.002565| 40.728188|617733151091326975|\n CMT|2009-12-07 13:10:37|2009-12-07 13:13:45| 1| 0.5| -73.999184| 40.73428|617733123805806591|\n CMT|2009-11-08 22:20:44|2009-11-08 22:31:23| 1| 1.9| -74.003029| 40.733385|617733151092113407|\n VTS|2009-12-27 20:01:00|2009-12-27 20:04:00| 1| 1.04| -74.000227| 40.732603|617733151092375551|\n VTS|2009-02-13 14:33:00|2009-02-13 14:50:00| 3| 1.59| -74.006535| 40.732303|617733151092637695|\n CMT|2009-11-15 21:13:32|2009-11-15 21:25:56| 3| 3.0| -73.998795| 40.730621|617733151092375551|\n VTS|2009-01-08 18:13:00|2009-01-08 18:33:00| 2| 4.18| -74.0079| 40.712012|617733151021334527|\n CMT|2009-11-30 13:30:13|2009-11-30 13:41:55| 1| 1.6| -74.004487| 40.734072|617733151092637695|\n CMT|2009-01-11 20:02:22|2009-01-11 20:08:15| 1| 1.0| -74.004493| 40.713349|617733151020810239|\n CMT|2009-12-30 18:46:08|2009-12-30 19:02:23| 1| 2.3| -74.010798| 40.716717|617733151084773375|\n CMT|2009-11-18 21:50:12|2009-11-18 22:05:19| 1| 5.8| -73.992515| 40.694106|617733151038111743|\n VTS|2009-11-21 12:51:00|2009-11-21 13:27:00| 1| 14.18| -73.9923| 40.715218|617733151109414911|\n CMT|2009-01-20 09:34:49|2009-01-20 09:37:15| 1| 0.4| -74.0027| 40.733479|617733151092113407|\n VTS|2009-01-03 07:07:00|2009-01-03 07:18:00| 1| 7.81| -73.994358| 40.690345|617733151037325311|\n+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------------+\nonly showing top 20 rows\n\n
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "## 4. Compute the set of indices that fully covers each polygon in the right-hand dataframe\nThis is commonly referred to as a polyfill operation." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "e127e5be-30e4-4507-8aa9-59399290b8f7" } } }, { "cell_type": "code", "source": [ "from mosaic import grid_polyfill\n\nindexed_neighbourhoods = (\n neighbourhoods\n .select(\"*\", grid_polyfill(\"geometry\", lit(resolution)).alias(\"ix_set\"))\n .drop(\"geometry\")\n)\n\nindexed_neighbourhoods.show()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "06e1f722-8530-47c2-aec5-f0a6cb4f9301" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
+-------------+-----------+--------+-------------------+--------------------+\n| borough|location_id|objectid| zone| ix_set|\n+-------------+-----------+--------+-------------------+--------------------+\n| Brooklyn| 123| 123| Homecrest|[6177331514226769...|\n| Manhattan| 153| 153| Marble Hill|[6177331229858201...|\n| Brooklyn| 112| 112| Greenpoint|[6177331237832622...|\n| Manhattan| 233| 233|UN/Turtle Bay South|[6177331238679347...|\n| Manhattan| 43| 43| Central Park|[6177331225792348...|\n| Queens| 201| 201| Rockaway Park|[6177331357831659...|\n| Queens| 131| 131| Jamaica Estates|[6177331242658693...|\n| Brooklyn| 111| 111|Green-Wood Cemetery|[6177331522277212...|\n| Queens| 226| 226| Sunnyside|[6177331238566625...|\n| Queens| 129| 129| Jackson Heights|[6177331243222302...|\n| Manhattan| 120| 120| Highbridge Park|[6177331231976325...|\n| Brooklyn| 76| 76| East New York|[6177331236938711...|\n| Manhattan| 24| 24| Bloomingdale|[6177331226458193...|\n| Manhattan| 202| 202| Roosevelt Island|[6177331237777571...|\n| Manhattan| 100| 100| Garment District|[6177331509717893...|\n|Staten Island| 251| 251| Westerleigh|[6177331466128588...|\n| Manhattan| 74| 74| East Harlem North|[6177331226508001...|\n| Queens| 98| 98| Fresh Meadows|[6177331242448977...|\n| Manhattan| 211| 211| SoHo|[6177331510784819...|\n| Bronx| 174| 174| Norwood|[6177331205497159...|\n+-------------+-----------+--------+-------------------+--------------------+\nonly showing top 20 rows\n\n
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
+-------------+-----------+--------+-------------------+--------------------+\n borough|location_id|objectid| zone| ix_set|\n+-------------+-----------+--------+-------------------+--------------------+\n Brooklyn| 123| 123| Homecrest|[6177331514226769...|\n Manhattan| 153| 153| Marble Hill|[6177331229858201...|\n Brooklyn| 112| 112| Greenpoint|[6177331237832622...|\n Manhattan| 233| 233|UN/Turtle Bay South|[6177331238679347...|\n Manhattan| 43| 43| Central Park|[6177331225792348...|\n Queens| 201| 201| Rockaway Park|[6177331357831659...|\n Queens| 131| 131| Jamaica Estates|[6177331242658693...|\n Brooklyn| 111| 111|Green-Wood Cemetery|[6177331522277212...|\n Queens| 226| 226| Sunnyside|[6177331238566625...|\n Queens| 129| 129| Jackson Heights|[6177331243222302...|\n Manhattan| 120| 120| Highbridge Park|[6177331231976325...|\n Brooklyn| 76| 76| East New York|[6177331236938711...|\n Manhattan| 24| 24| Bloomingdale|[6177331226458193...|\n Manhattan| 202| 202| Roosevelt Island|[6177331237777571...|\n Manhattan| 100| 100| Garment District|[6177331509717893...|\nStaten Island| 251| 251| Westerleigh|[6177331466128588...|\n Manhattan| 74| 74| East Harlem North|[6177331226508001...|\n Queens| 98| 98| Fresh Meadows|[6177331242448977...|\n Manhattan| 211| 211| SoHo|[6177331510784819...|\n Bronx| 174| 174| Norwood|[6177331205497159...|\n+-------------+-----------+--------+-------------------+--------------------+\nonly showing top 20 rows\n\n
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "## 5. ‘Explode’ the polygon index dataframe, such that each polygon index becomes a row in a new dataframe." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "be2c88fd-9307-4b9a-ada8-fd714142d2d0" } } }, { "cell_type": "code", "source": [ "exploded_indexed_neighbourhoods = (\n indexed_neighbourhoods\n .withColumn(\"ix\", explode(\"ix_set\"))\n .drop(\"ix_set\")\n)\n\nexploded_indexed_neighbourhoods.show()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "132de00e-753f-49d6-ab3f-82517eef296a" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
+--------+-----------+--------+---------+------------------+\n| borough|location_id|objectid| zone| ix|\n+--------+-----------+--------+---------+------------------+\n|Brooklyn| 123| 123|Homecrest|617733151422676991|\n|Brooklyn| 123| 123|Homecrest|617733151503417343|\n|Brooklyn| 123| 123|Homecrest|617733151502893055|\n|Brooklyn| 123| 123|Homecrest|617733151502368767|\n|Brooklyn| 123| 123|Homecrest|617733151492407295|\n|Brooklyn| 123| 123|Homecrest|617733151488737279|\n|Brooklyn| 123| 123|Homecrest|617733151484542975|\n|Brooklyn| 123| 123|Homecrest|617733151484018687|\n|Brooklyn| 123| 123|Homecrest|617733151483494399|\n|Brooklyn| 123| 123|Homecrest|617733151425560575|\n|Brooklyn| 123| 123|Homecrest|617733151424511999|\n|Brooklyn| 123| 123|Homecrest|617733151423463423|\n|Brooklyn| 123| 123|Homecrest|617733151511019519|\n|Brooklyn| 123| 123|Homecrest|617733151505776639|\n|Brooklyn| 123| 123|Homecrest|617733151505252351|\n|Brooklyn| 123| 123|Homecrest|617733151504203775|\n|Brooklyn| 123| 123|Homecrest|617733151503679487|\n|Brooklyn| 123| 123|Homecrest|617733151503155199|\n|Brooklyn| 123| 123|Homecrest|617733151502630911|\n|Brooklyn| 123| 123|Homecrest|617733151502106623|\n+--------+-----------+--------+---------+------------------+\nonly showing top 20 rows\n\n
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
+--------+-----------+--------+---------+------------------+\n borough|location_id|objectid| zone| ix|\n+--------+-----------+--------+---------+------------------+\nBrooklyn| 123| 123|Homecrest|617733151422676991|\nBrooklyn| 123| 123|Homecrest|617733151503417343|\nBrooklyn| 123| 123|Homecrest|617733151502893055|\nBrooklyn| 123| 123|Homecrest|617733151502368767|\nBrooklyn| 123| 123|Homecrest|617733151492407295|\nBrooklyn| 123| 123|Homecrest|617733151488737279|\nBrooklyn| 123| 123|Homecrest|617733151484542975|\nBrooklyn| 123| 123|Homecrest|617733151484018687|\nBrooklyn| 123| 123|Homecrest|617733151483494399|\nBrooklyn| 123| 123|Homecrest|617733151425560575|\nBrooklyn| 123| 123|Homecrest|617733151424511999|\nBrooklyn| 123| 123|Homecrest|617733151423463423|\nBrooklyn| 123| 123|Homecrest|617733151511019519|\nBrooklyn| 123| 123|Homecrest|617733151505776639|\nBrooklyn| 123| 123|Homecrest|617733151505252351|\nBrooklyn| 123| 123|Homecrest|617733151504203775|\nBrooklyn| 123| 123|Homecrest|617733151503679487|\nBrooklyn| 123| 123|Homecrest|617733151503155199|\nBrooklyn| 123| 123|Homecrest|617733151502630911|\nBrooklyn| 123| 123|Homecrest|617733151502106623|\n+--------+-----------+--------+---------+------------------+\nonly showing top 20 rows\n\n
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "## 6. Join the new left- and right-hand dataframes directly on the index." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "a369d19f-2917-4320-ab4a-e701953db7b5" } } }, { "cell_type": "code", "source": [ "joined_df = (\n indexed_trips.alias(\"t\")\n .join(exploded_indexed_neighbourhoods.alias(\"n\"), on=\"ix\", how=\"inner\"))\njoined_df.count()" ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "09a92bb5-4c72-4757-8176-01e14dbc3504" } }, "outputs": [ { "output_type": "display_data", "metadata": { "application/vnd.databricks.v1+output": { "datasetInfos": [], "data": "
Out[25]: 4934937
", "removedWidgets": [], "addedWidgets": {}, "metadata": {}, "type": "html", "arguments": {} } }, "data": { "text/html": [ "\n
Out[25]: 4934937
" ] } } ], "execution_count": 0 }, { "cell_type": "markdown", "source": [ "## Final notes\nMosaic provides support for Uber’s H3 spatial indexing library as a core part of the API, but we plan to add support for other index systems, including S2 and British National Grid in due course." ], "metadata": { "application/vnd.databricks.v1+cell": { "title": "", "showTitle": false, "inputWidgets": {}, "nuid": "9825374a-d5d3-407f-9c0d-94c7514edacf" } } } ], "metadata": { "application/vnd.databricks.v1+notebook": { "notebookName": "spatial-indexes", "dashboards": [], "notebookMetadata": { "pythonIndentUnit": 2 }, "language": "python", "widgets": {}, "notebookOrigID": 4066942804256643 } }, "nbformat": 4, "nbformat_minor": 0 }