Spatial Data

📖 Ahead of time…

This block is all about understanding spatial data, both conceptually and practically. Before your fingers get on the keyboard, the following readings will help you get going and familiar with core ideas:

  • Chapter 1 of the GDS Book [RABWng], which provides a conceptual overview of representing Geography in data

  • Chapter 3 of the GDS Book [RABWng], a sister chapter with a more applied perspective on how concepts are implemented in computer data structures

Additionally, parts of this block are based and source from Block C in the GDS Course [AB19].

💻 Hands-on coding

(Geographic) tables

import pandas
import geopandas
import xarray, rioxarray
import contextily
import matplotlib.pyplot as plt

Points

Assuming you have the file locally on the path ../data/:

pts = geopandas.read_file("../data/madrid_abb.gpkg")

If you’re online, you can do:

pts = geopandas.read_file(
    "https://github.com/GDSL-UL/san/raw/v0.1.0/data/assignment_1_madrid/madrid_abb.gpkg"
)
pts.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 18399 entries, 0 to 18398
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   price            18399 non-null  object  
 1   price_usd        18399 non-null  float64 
 2   log1p_price_usd  18399 non-null  float64 
 3   accommodates     18399 non-null  int64   
 4   bathrooms        18399 non-null  object  
 5   bedrooms         18399 non-null  float64 
 6   beds             18399 non-null  float64 
 7   neighbourhood    18399 non-null  object  
 8   room_type        18399 non-null  object  
 9   property_type    18399 non-null  object  
 10  WiFi             18399 non-null  object  
 11  Coffee           18399 non-null  object  
 12  Gym              18399 non-null  object  
 13  Parking          18399 non-null  object  
 14  km_to_retiro     18399 non-null  float64 
 15  geometry         18399 non-null  geometry
dtypes: float64(5), geometry(1), int64(1), object(9)
memory usage: 2.2+ MB
pts.head()
price price_usd log1p_price_usd accommodates bathrooms bedrooms beds neighbourhood room_type property_type WiFi Coffee Gym Parking km_to_retiro geometry
0 $60.00 60.0 4.110874 2 1 shared bath 1.0 1.0 Hispanoamérica Private room Private room in apartment 1 0 0 0 5.116664 POINT (-3.67688 40.45724)
1 $31.00 31.0 3.465736 1 1 bath 1.0 1.0 Cármenes Private room Private room in apartment 1 1 0 1 5.563869 POINT (-3.74084 40.40341)
2 $60.00 60.0 4.110874 6 2 baths 3.0 5.0 Legazpi Entire home/apt Entire apartment 1 1 0 1 3.048442 POINT (-3.69304 40.38695)
3 $115.00 115.0 4.753590 4 1.5 baths 2.0 3.0 Justicia Entire home/apt Entire apartment 1 1 0 1 2.075484 POINT (-3.69764 40.41995)
4 $26.00 26.0 3.295837 1 1 private bath 1.0 1.0 Legazpi Private room Private room in house 1 0 0 0 2.648058 POINT (-3.69011 40.38985)

Challenge

Show the top ten values of of price and neighbourhood

Lines

Assuming you have the file locally on the path ../data/:

pts = geopandas.read_file("../data/arturo_streets.gpkg")

If you’re online, you can do:

pts = geopandas.read_file(
    "https://darribas.org/gds4ae/_downloads/67d5480f98453027d59bf49606a7ad92/arturo_streets.gpkg"
)
lines.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 66499 entries, 0 to 66498
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   OGC_FID             66499 non-null  object  
 1   dm_id               66499 non-null  object  
 2   dist_barri          66483 non-null  object  
 3   average_quality     66499 non-null  float64 
 4   population_density  66499 non-null  float64 
 5   X                   66499 non-null  float64 
 6   Y                   66499 non-null  float64 
 7   value               5465 non-null   float64 
 8   geometry            66499 non-null  geometry
dtypes: float64(5), geometry(1), object(3)
memory usage: 4.6+ MB
lines.loc[0, "geometry"]
../../_images/02-Spatial_data_17_0.svg

Challenge

Print descriptive statistics for population_density and average_quality

Polygons

Assuming you have the file locally on the path ../data/:

polys = geopandas.read_file("../data/neighbourhoods.geojson")

If you’re online, you can do:

polys = geopandas.read_file(
    "https://darribas.org/gds4ae/_downloads/44b4bc22c042386c2c0f8dc6685ef17c/neighbourhoods.geojson"
)
polys.head()
neighbourhood neighbourhood_group geometry
0 Palacio Centro MULTIPOLYGON (((-3.70584 40.42030, -3.70625 40...
1 Embajadores Centro MULTIPOLYGON (((-3.70384 40.41432, -3.70277 40...
2 Cortes Centro MULTIPOLYGON (((-3.69796 40.41929, -3.69645 40...
3 Justicia Centro MULTIPOLYGON (((-3.69546 40.41898, -3.69645 40...
4 Universidad Centro MULTIPOLYGON (((-3.70107 40.42134, -3.70155 40...
polys.query("neighbourhood_group == 'Retiro'")
neighbourhood neighbourhood_group geometry
13 Pacífico Retiro MULTIPOLYGON (((-3.67015 40.40654, -3.67017 40...
14 Adelfas Retiro MULTIPOLYGON (((-3.67283 40.39468, -3.67343 40...
15 Estrella Retiro MULTIPOLYGON (((-3.66506 40.40647, -3.66512 40...
16 Ibiza Retiro MULTIPOLYGON (((-3.66916 40.41796, -3.66927 40...
17 Jerónimos Retiro MULTIPOLYGON (((-3.67874 40.40751, -3.67992 40...
18 Niño Jesús Retiro MULTIPOLYGON (((-3.66994 40.40850, -3.67012 40...
polys.neighbourhood_group.unique()
array(['Centro', 'Arganzuela', 'Retiro', 'Salamanca', 'Chamartín',
       'Moratalaz', 'Tetuán', 'Chamberí', 'Fuencarral - El Pardo',
       'Moncloa - Aravaca', 'Puente de Vallecas', 'Latina', 'Carabanchel',
       'Usera', 'Ciudad Lineal', 'Hortaleza', 'Villaverde',
       'Villa de Vallecas', 'Vicálvaro', 'San Blas - Canillejas',
       'Barajas'], dtype=object)

Challenge

Print the neighborhoods within the “Latina” group

Surfaces

Assuming you have the file locally on the path ../data/:

sat = xarray.open_rasterio("../data/madrid_scene_s2_10_tc.tif")

If you’re online, you can do:

sat = xarray.open_rasterio(
    "https://darribas.org/gds4ae/_downloads/cafed4de0cfde63e6d2ffcb92264b431/madrid_scene_s2_10_tc.tif"
)
sat
<xarray.DataArray (band: 3, y: 3681, x: 3129)>
[34553547 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 4.248e+05 4.248e+05 4.248e+05 ... 4.56e+05 4.56e+05
  * y            (y) float64 4.499e+06 4.499e+06 ... 4.463e+06 4.463e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
sat.sel(band=1)
<xarray.DataArray (y: 3681, x: 3129)>
[11517849 values with dtype=uint8]
Coordinates:
    band         int64 1
  * x            (x) float64 4.248e+05 4.248e+05 4.248e+05 ... 4.56e+05 4.56e+05
  * y            (y) float64 4.499e+06 4.499e+06 ... 4.463e+06 4.463e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
sat.sel(
    x=slice(430000, 440000),  # x is ascending
    y=slice(4480000, 4470000) # y is descending
)
<xarray.DataArray (band: 3, y: 1000, x: 1000)>
[3000000 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 4.3e+05 4.3e+05 4.3e+05 ... 4.4e+05 4.4e+05 4.4e+05
  * y            (y) float64 4.48e+06 4.48e+06 4.48e+06 ... 4.47e+06 4.47e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

Challenge

Subset sat to band 2 and the section within [444444, 455555] of Easting and [4470000, 4480000] of Northing.

  • How many pixels does it contain?

  • What if you used bands 1 and 3 instead?

Visualisation

polys.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook
polys.plot()
<Axes: >
../../_images/02-Spatial_data_36_1.png
ax = lines.plot(linewidth=0.1, color="black")
contextily.add_basemap(ax, crs=lines.crs)
../../_images/02-Spatial_data_37_0.png
ax = pts.plot(color="red", figsize=(12, 12), markersize=0.1)
contextily.add_basemap(
    ax,
    crs = pts.crs,
    source = contextily.providers.CartoDB.DarkMatter
);
../../_images/02-Spatial_data_39_0.png
sat.plot.imshow(figsize=(12, 12))
<matplotlib.image.AxesImage at 0x7f7803bc36d0>
../../_images/02-Spatial_data_40_1.png
f, ax = plt.subplots(1, figsize=(12, 12))
sat.plot.imshow(ax=ax)
contextily.add_basemap(
    ax,
    crs=sat.rio.crs,
    source=contextily.providers.CartoDB.VoyagerOnlyLabels,
    zoom=11,
);
../../_images/02-Spatial_data_42_0.png

Challenge

Make three plots of sat, plotting one single band in each

Spatial operations

(Re-)Projections

pts.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
sat.rio.crs
CRS.from_epsg(32630)
pts.to_crs(sat.rio.crs).crs
<Projected CRS: EPSG:32630>
Name: WGS 84 / UTM zone 30N
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: UTM zone 30N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
sat.rio.reproject(pts.crs).rio.crs
CRS.from_epsg(4326)
# All into Web Mercator (EPSG:3857)
f, ax = plt.subplots(1, figsize=(12, 12))
## Satellite image
sat.rio.reproject(
    "EPSG:3857"
).plot.imshow(
    ax=ax
)
## Neighbourhoods
polys.to_crs(epsg=3857).plot(
    linewidth=2, 
    edgecolor="xkcd:lime", 
    facecolor="none",
    ax=ax
)
## Labels
contextily.add_basemap( # No need to reproject
    ax,
    source=contextily.providers.CartoDB.VoyagerOnlyLabels,
);
../../_images/02-Spatial_data_50_0.png

Centroids

polys.centroid
/tmp/ipykernel_2058/2101097851.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  polys.centroid
0      POINT (-3.71398 40.41543)
1      POINT (-3.70237 40.40925)
2      POINT (-3.69674 40.41485)
3      POINT (-3.69657 40.42367)
4      POINT (-3.70698 40.42568)
                 ...            
123    POINT (-3.59135 40.45656)
124    POINT (-3.59723 40.48441)
125    POINT (-3.55847 40.47613)
126    POINT (-3.57889 40.47471)
127    POINT (-3.60718 40.46415)
Length: 128, dtype: geometry
lines.centroid
0        POINT (444133.737 4482808.936)
1        POINT (444192.064 4482878.034)
2        POINT (444134.563 4482885.414)
3        POINT (445612.661 4479335.686)
4        POINT (445606.311 4479354.437)
                      ...              
66494    POINT (451980.378 4478407.920)
66495    POINT (436975.438 4473143.749)
66496    POINT (442218.600 4478415.561)
66497    POINT (442213.869 4478346.700)
66498    POINT (442233.760 4478278.748)
Length: 66499, dtype: geometry
ax = polys.plot(color="purple")
polys.centroid.plot(
    ax=ax, color="lime", markersize=1
)
/tmp/ipykernel_2058/1054587808.py:2: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  polys.centroid.plot(
<Axes: >
../../_images/02-Spatial_data_55_2.png

Spatial joins

sj = geopandas.sjoin(
    lines,
    polys.to_crs(lines.crs)
)
# Subset of lines
ax = sj.query(
    "neighbourhood == 'Jerónimos'"
).plot(color="xkcd:bright turquoise")

# Subset of line centroids
ax = sj.query(
    "neighbourhood == 'Jerónimos'"
).centroid.plot(
    color="xkcd:bright violet", markersize=7, ax=ax
)

# Local basemap
contextily.add_basemap(
    ax,
    crs=sj.crs,
    source="../data/madrid_scene_s2_10_tc.tif",
    alpha=0.5
)
../../_images/02-Spatial_data_59_0.png
sj.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 69420 entries, 0 to 66438
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   OGC_FID              69420 non-null  object  
 1   dm_id                69420 non-null  object  
 2   dist_barri           69414 non-null  object  
 3   average_quality      69420 non-null  float64 
 4   population_density   69420 non-null  float64 
 5   X                    69420 non-null  float64 
 6   Y                    69420 non-null  float64 
 7   value                5769 non-null   float64 
 8   geometry             69420 non-null  geometry
 9   index_right          69420 non-null  int64   
 10  neighbourhood        69420 non-null  object  
 11  neighbourhood_group  69420 non-null  object  
dtypes: float64(5), geometry(1), int64(1), object(5)
memory usage: 6.9+ MB

Areas

areas = polys.to_crs(
    epsg=25830
).area * 1e-6 # Km2
areas.head()
0    1.471037
1    1.033253
2    0.592049
3    0.742031
4    0.947616
dtype: float64

Distances

cemfi = geopandas.tools.geocode(
    "Calle Casado del Alisal, 5, Madrid"
).to_crs(epsg=25830)
cemfi
geometry address
0 POINT (441477.245 4473939.537) 5, Calle Casado del Alisal, 28014, Calle Casad...
polys.to_crs(
    cemfi.crs
).distance(
    cemfi.geometry
)
/tmp/ipykernel_2058/176561454.py:1: UserWarning: The indices of the two GeoSeries are different.
  polys.to_crs(
0      1491.338749
1              NaN
2              NaN
3              NaN
4              NaN
          ...     
123            NaN
124            NaN
125            NaN
126            NaN
127            NaN
Length: 128, dtype: float64
d2cemfi = polys.to_crs(
    cemfi.crs
).distance(
    cemfi.geometry[0] # NO index
)
d2cemfi.head()
0    1491.338749
1     565.418135
2     278.121017
3     650.926572
4    1196.771601
dtype: float64
ax = polys.assign(
    dist=d2cemfi/1000
).plot("dist", legend=True)

cemfi.to_crs(
    polys.crs
).plot(
    marker="*", 
    markersize=15, 
    color="r", 
    label="CEMFI", 
    ax=ax
)

ax.legend()
ax.set_title(
    "Distance to CEMFI"
);
../../_images/02-Spatial_data_67_0.png

Challenge

Give Task III in this block of the GDS course a go

🐾 Next steps

If you are interested in following up on some of the topics explored in this block, the following pointers might be useful:

  • Although we have seen here geopandas only, all non-geographic operations on geo-tables are really thanks to pandas, the workhorse for tabular data in Python. Their official documentation is an excellent first stop. If you prefer a book, McKinney (2012) [McK12] is a great one.

  • For more detail on geographic operations on geo-tables, the Geopandas official documentation is a great place to continue the journey.

  • Surfaces, as covered here, are really an example of multi-dimensional labelled arrays. The library we use, xarray represents the cutting edge for working with these data structures in Python, and their documentation is a great place to wrap your head around how data of this type can be manipulated. For geographic extensions (CRS handling, reprojections, etc.), we have used rioxarray under the hood, and its documentation is also well worth checking.