London AirBnb dataset prep

In this notebook we put together the dataset of London properties on AirBnb.

import pandas, geopandas

Source data

AirBnb listings

We use the listings recorded on 9 January 2020:

url = ("http://data.insideairbnb.com/united-kingdom/"
       "england/london/2020-01-09/data/listings.csv.gz")
url
'http://data.insideairbnb.com/united-kingdom/england/london/2020-01-09/data/listings.csv.gz'
# Accessed on Sep. 16th 2020
listings = pandas.read_csv(url)
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (61,62,94,95) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
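
The DtypeWarning above is pandas reporting that some columns contain values of more than one type. The message itself names two remedies: passing dtype or setting low_memory=False. A minimal sketch on a made-up two-row CSV (the zipcode column and its values are assumptions for illustration, not taken from the listings file):

```python
import io

import pandas

# Made-up CSV where one column mixes text-like and number-like values
csv_text = "id,zipcode\n1,E1 6AN\n2,12345\n"

# Declaring the column as text removes the ambiguity;
# low_memory=False makes pandas infer types from the whole file at once
toy = pandas.read_csv(
    io.StringIO(csv_text),
    dtype={"zipcode": str},
    low_memory=False,
)
```

For the real file, either option could be passed to the read_csv call above; which columns to declare depends on the positions listed in the warning.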

Let’s list the variables we want to keep:

x = ['id', 'longitude', 'latitude', 'property_type', 'room_type', 'accommodates',
     'bathrooms', 'bedrooms', 'beds', 'price', 'security_deposit',
     'number_of_reviews', 'reviews_per_month',
     'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
     'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
     'review_scores_value']
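
Two of these columns, price and security_deposit, do not appear in the aggregated table further down. A likely reason, assumed here rather than checked against the file, is that they are stored as text such as "$1,250.00", so they cannot be averaged. If numeric prices were needed, a conversion along these lines could be applied before aggregating (the sample values are made up):

```python
import pandas

# Made-up price strings in the assumed "$1,250.00" format
raw = pandas.Series(["$85.00", "$1,250.00", None])

# Strip the currency symbol and thousands separator, then parse as numbers
price = pandas.to_numeric(raw.str.replace(r"[$,]", "", regex=True))
```

Missing values stay missing after the conversion, so they are ignored when averaging.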

Now let’s turn listings into a GeoDataFrame:

pts = geopandas.points_from_xy(listings["longitude"], listings["latitude"])
geo_listings = geopandas.GeoDataFrame(listings[x].assign(geometry=pts),
                                      crs="EPSG:4326")
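
One detail worth checking in this step is the argument order: points_from_xy takes x first and y second, so longitude goes before latitude. A small sketch on two made-up points (the ids and coordinates are illustrative, not rows from the listings):

```python
import geopandas
import pandas

# Two made-up locations; coordinates are for illustration only
toy = pandas.DataFrame(
    {
        "id": [1, 2],
        "longitude": [-0.1276, -0.0759],
        "latitude": [51.5072, 51.5081],
    }
)

# x is longitude, y is latitude
pts = geopandas.points_from_xy(toy["longitude"], toy["latitude"])
geo_toy = geopandas.GeoDataFrame(toy.assign(geometry=pts), crs="EPSG:4326")

first = geo_toy.geometry.iloc[0]
```

Here first.x holds the longitude and first.y the latitude of the first row.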

London geographies

From the London Datastore, we can download a .zip file with statistical boundaries for London (as of Sep. 16th 2020):

! wget https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip
--2020-09-16 19:06:02--  https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip
Resolving data.london.gov.uk (data.london.gov.uk)... 99.84.10.83, 99.84.10.96, 99.84.10.34, ...
Connecting to data.london.gov.uk (data.london.gov.uk)|99.84.10.83|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13%3A52%3A28/statistical-gis-boundaries-london.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20200916%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20200916T190556Z&X-Amz-Expires=300&X-Amz-Signature=5bf7051cbe47f79bbdcd6a3b094f3d55d5f0f263bfe5426cff4bcdf4b723fe4d&X-Amz-SignedHeaders=host [following]
--2020-09-16 19:06:02--  https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13%3A52%3A28/statistical-gis-boundaries-london.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20200916%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20200916T190556Z&X-Amz-Expires=300&X-Amz-Signature=5bf7051cbe47f79bbdcd6a3b094f3d55d5f0f263bfe5426cff4bcdf4b723fe4d&X-Amz-SignedHeaders=host
Resolving airdrive-secure.s3-eu-west-1.amazonaws.com (airdrive-secure.s3-eu-west-1.amazonaws.com)... 52.218.109.32, 52.218.88.56
Connecting to airdrive-secure.s3-eu-west-1.amazonaws.com (airdrive-secure.s3-eu-west-1.amazonaws.com)|52.218.109.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28666674 (27M) [application/zip]
Saving to: ‘statistical-gis-boundaries-london.zip’

statistical-gis-bou 100%[===================>]  27.34M  2.48MB/s    in 12s     

2020-09-16 19:06:15 (2.29 MB/s) - ‘statistical-gis-boundaries-london.zip’ saved [28666674/28666674]

Unpack the compressed files:

! unzip statistical-gis-boundaries-london.zip
Archive:  statistical-gis-boundaries-london.zip
   creating: statistical-gis-boundaries-london/ESRI/
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.GSS_CODE.atx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.NAME.atx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.prj  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.BOROUGH.atx  
 extracting: statistical-gis-boundaries-london/ESRI/London_Ward.cpg  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.GSS_CODE.atx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.LB_GSS_CD.atx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.prj  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.shp  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward.shx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.BOROUGH.atx  
 extracting: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.cpg  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.GSS_CODE.atx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.LB_GSS_CD.atx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.prj  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.shp  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.shx  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2004_London_Low_Resolution.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2004_London_Low_Resolution.prj  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2004_London_Low_Resolution.shp  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2004_London_Low_Resolution.shx  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.prj  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.shp  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.shx  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.prj  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.shp  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.shx  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.prj  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.shp  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/MSOA_2011_London_gen_MHW.shx  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.dbf  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.prj  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.sbn  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.sbx  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.shp  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.shp.xml  
  inflating: statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.shx  
  inflating: statistical-gis-boundaries-london/Geography-licensing.pdf  
   creating: statistical-gis-boundaries-london/MapInfo/
  inflating: statistical-gis-boundaries-london/MapInfo/London_Borough_Excluding_MHW.dat  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Borough_Excluding_MHW.id  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Borough_Excluding_MHW.ind  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Borough_Excluding_MHW.map  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Borough_Excluding_MHW.tab  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Ward_CityMerged.dat  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Ward_CityMerged.id  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Ward_CityMerged.ind  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Ward_CityMerged.map  
  inflating: statistical-gis-boundaries-london/MapInfo/London_Ward_CityMerged.tab  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2004_London_Low_Resolution.DAT  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2004_London_Low_Resolution.ID  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2004_London_Low_Resolution.IND  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2004_London_Low_Resolution.MAP  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2004_London_Low_Resolution.TAB  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2011_London_gen_MHW.DAT  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2011_London_gen_MHW.ID  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2011_London_gen_MHW.MAP  
  inflating: statistical-gis-boundaries-london/MapInfo/LSOA_2011_London_gen_MHW.tab  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2004_London_Low_Resolution.DAT  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2004_London_Low_Resolution.ID  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2004_London_Low_Resolution.MAP  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2004_London_Low_Resolution.TAB  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2011_London_gen_MHW.DAT  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2011_London_gen_MHW.ID  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2011_London_gen_MHW.MAP  
  inflating: statistical-gis-boundaries-london/MapInfo/MSOA_2011_London_gen_MHW.tab  
  inflating: statistical-gis-boundaries-london/MapInfo/OA_2011_London_gen_MHW.DAT  
  inflating: statistical-gis-boundaries-london/MapInfo/OA_2011_London_gen_MHW.ID  
  inflating: statistical-gis-boundaries-london/MapInfo/OA_2011_London_gen_MHW.MAP  
  inflating: statistical-gis-boundaries-london/MapInfo/OA_2011_London_gen_MHW.tab  

Read in the MSOAs (2004 boundaries, high resolution):

msoas = geopandas.read_file("statistical-gis-boundaries-london/ESRI/MSOA_2004_London_High_Resolution.shp")
msoas.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 983 entries, 0 to 982
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   MSOA_CODE  983 non-null    object  
 1   MSOA_NAME  983 non-null    object  
 2   LA_CODE    983 non-null    object  
 3   LA_NAME    983 non-null    object  
 4   GEOEAST    983 non-null    int64   
 5   GEONORTH   983 non-null    int64   
 6   POPEAST    983 non-null    int64   
 7   POPNORTH   983 non-null    int64   
 8   AREA_KM2   983 non-null    float64 
 9   geometry   983 non-null    geometry
dtypes: float64(1), geometry(1), int64(4), object(4)
memory usage: 76.9+ KB

Read in boroughs:

boroughs = geopandas.read_file("statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp")
boroughs.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   NAME        33 non-null     object  
 1   GSS_CODE    33 non-null     object  
 2   HECTARES    33 non-null     float64 
 3   NONLD_AREA  33 non-null     float64 
 4   ONS_INNER   33 non-null     object  
 5   SUB_2009    0 non-null      object  
 6   SUB_2006    0 non-null      object  
 7   geometry    33 non-null     geometry
dtypes: float64(2), geometry(1), object(5)
memory usage: 2.2+ KB

Aggregate variables to MSOA

  • Attach MSOA code to each point

db = geopandas.sjoin(geo_listings, 
                     msoas[["geometry", "MSOA_CODE", "MSOA_NAME"]].to_crs(geo_listings.crs), 
                     how="left"
                    )

  • Aggregate stats to MSOA level and reattach geometries

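g = db.groupby("MSOA_CODE")
# Average numeric columns only: text and geometry columns cannot be averaged,
# and recent pandas versions raise an error instead of dropping them silently
msoas_abb = g.mean(numeric_only=True).drop("index_right", axis=1)
# Number of listings that fall within each MSOA
msoas_abb["property_count"] = g.size()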

msoas_abb = geopandas.GeoDataFrame(msoas_abb.join(msoas.set_index("MSOA_CODE")[["geometry"]]), 
                                   crs=msoas.crs
                                  )
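
The two steps above can be tried end to end on toy data. The sketch below uses two made-up square areas and three made-up points (AREA_CODE, beds and room_type are illustrative stand-ins for the real columns). It passes numeric_only=True to mean(), which recent pandas versions need when text or geometry columns are present.

```python
import geopandas
from shapely.geometry import Point, box

# Toy stand-ins for the MSOA polygons and the listings; all values are made up
areas = geopandas.GeoDataFrame(
    {"AREA_CODE": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
points = geopandas.GeoDataFrame(
    {
        "beds": [1.0, 3.0, 4.0],
        "room_type": ["Private room", "Entire home/apt", "Entire home/apt"],
    },
    geometry=[Point(0.2, 0.5), Point(0.8, 0.5), Point(1.5, 0.5)],
    crs="EPSG:4326",
)

# Step 1: tag every point with the code of the polygon it falls in
tagged = geopandas.sjoin(points, areas, how="left")

# Step 2: average the numeric columns per polygon and count the points
grouped = tagged.groupby("AREA_CODE")
stats = grouped.mean(numeric_only=True).drop("index_right", axis=1)
stats["property_count"] = grouped.size()

# Step 3: reattach the polygon geometries
stats = geopandas.GeoDataFrame(
    stats.join(areas.set_index("AREA_CODE")[["geometry"]]),
    crs=areas.crs,
)
```

Area A receives two points and area B one, and the text column room_type is left out of the averaged table.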

Attach boroughs

We assign each MSOA to the borough that contains its centroid.

msoa_cents = geopandas.GeoDataFrame({"MSOA11CD": msoas_abb.index,
                                     "geometry": msoas_abb.centroid
                                    }, crs=msoas_abb.crs
                                   )
msoa2borough = geopandas.sjoin(msoa_cents,
                               boroughs[["NAME", "GSS_CODE", "geometry"]]\
                                       .to_crs(msoas_abb.crs), 
                               how="left"
                              )
msoa2borough.head()
           MSOA11CD   geometry                       index_right  NAME                  GSS_CODE
MSOA_CODE
E02000001  E02000001  POINT (532464.075 181219.688)  32           City of London        E09000001
E02000002  E02000002  POINT (548312.704 189878.010)  31           Barking and Dagenham  E09000002
E02000003  E02000003  POINT (548456.427 188399.878)  31           Barking and Dagenham  E09000002
E02000004  E02000004  POINT (551009.985 186307.533)  31           Barking and Dagenham  E09000002
E02000005  E02000005  POINT (548666.108 186902.593)  31           Barking and Dagenham  E09000002
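
Joining on centroids rather than on the polygons themselves means each area gets at most one borough, even when its outline crosses a borough boundary. A toy sketch of that behaviour, with two made-up square boroughs and two made-up zones (all names and coordinates are hypothetical):

```python
import geopandas
from shapely.geometry import box

# Two adjacent made-up boroughs in a projected CRS
boroughs_toy = geopandas.GeoDataFrame(
    {"NAME": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:27700",
)

# z1 sits fully inside West; z2 overlaps both, but most of it lies in East
zones = geopandas.GeoDataFrame(
    {"ZONE": ["z1", "z2"]},
    geometry=[box(0.1, 0.1, 0.5, 0.5), box(0.6, 0.1, 1.8, 0.9)],
    crs="EPSG:27700",
)

# Replace each zone by its centroid, then find the borough containing it
cents = geopandas.GeoDataFrame(
    {"ZONE": zones["ZONE"], "geometry": zones.centroid},
    crs=zones.crs,
)
lookup = geopandas.sjoin(cents, boroughs_toy, how="left").set_index("ZONE")["NAME"]
```

Zone z2 touches both boroughs, yet its centroid at (1.2, 0.5) places it in East only.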

And add the borough name and code to the main table:

abb = msoas_abb.reset_index()\
               .join(msoa2borough[["NAME", "GSS_CODE"]], on="MSOA_CODE")\
               .rename({"NAME": "BOROUGH"}, axis=1)\
               .drop(["id", "longitude", "latitude"], axis=1)
abb.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 981 entries, 0 to 980
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   MSOA_CODE                    981 non-null    object  
 1   accommodates                 981 non-null    float64 
 2   bathrooms                    981 non-null    float64 
 3   bedrooms                     981 non-null    float64 
 4   beds                         981 non-null    float64 
 5   number_of_reviews            981 non-null    float64 
 6   reviews_per_month            977 non-null    float64 
 7   review_scores_rating         977 non-null    float64 
 8   review_scores_accuracy       977 non-null    float64 
 9   review_scores_cleanliness    977 non-null    float64 
 10  review_scores_checkin        976 non-null    float64 
 11  review_scores_communication  977 non-null    float64 
 12  review_scores_location       976 non-null    float64 
 13  review_scores_value          976 non-null    float64 
 14  property_count               981 non-null    int64   
 15  geometry                     981 non-null    geometry
 16  BOROUGH                      981 non-null    object  
 17  GSS_CODE                     981 non-null    object  
dtypes: float64(13), geometry(1), int64(1), object(3)
memory usage: 138.1+ KB

Keep Inner London

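To keep only MSOAs with a higher density of properties, we restrict the dataset to Inner London. The list follows the statutory definition of Inner London, which includes Greenwich and differs from the ONS classification.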

il = [
    "City of London",
    "Camden",
    "Greenwich",
    "Hackney",
    "Hammersmith and Fulham",
    "Islington",
    "Kensington and Chelsea",
    "Lambeth",
    "Lewisham",
    "Southwark",
    "Tower Hamlets",
    "Wandsworth",
    "Westminster",
]
fltr = abb["BOROUGH"].isin(il)
abb_il = abb[fltr]
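
Because isin matches names exactly, a misspelled borough would be dropped without any error. A quick way to guard against that is to compare the list with the names present in the table. The sketch below uses a made-up three-row table and a made-up list:

```python
import pandas

# Made-up table and selection list for illustration
toy = pandas.DataFrame(
    {
        "BOROUGH": ["Camden", "Barnet", "Hackney"],
        "property_count": [120, 15, 90],
    }
)
keep = ["Camden", "Hackney", "Islington"]

# Rows whose borough is in the list
subset = toy[toy["BOROUGH"].isin(keep)]

# Names in the list that matched nothing in the table
unmatched = sorted(set(keep) - set(toy["BOROUGH"]))
```

An empty unmatched list would indicate that every name in the selection was found.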

The map below shows why we keep only Inner London: it shades MSOAs by property count and outlines Inner London in red.

ax = abb.plot(column="property_count", scheme="quantiles")
geopandas.GeoSeries(abb_il.unary_union).plot(ax=ax, edgecolor="red", facecolor="none")
<AxesSubplot:>
[Figure: MSOAs shaded by property_count quantiles, with the Inner London outline in red]

Let’s also write the Inner London boroughs to a separate file:

! rm -f london_inner_boroughs.geojson
boroughs[boroughs["NAME"].isin(il)]\
        .to_crs(epsg=4326)\
        .to_file("london_inner_boroughs.geojson", 
                 driver="GeoJSON"
                )

Write out and clean up

! rm -f london_abb.gpkg
abb_il.to_crs(epsg=4326).to_file("london_abb.gpkg", driver="GPKG")
! du -h london_abb.gpkg
4.0M	london_abb.gpkg
! rm -rf statistical-gis-boundaries-london.zip statistical-gis-boundaries-london/