This session1 This note is part of Spatial Analysis Notes
Points – Kernel Density Estimation and Spatial interpolation by Dani Arribas-Bel is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. is based on the following references, which are great follow-up’s on the topic:
This tutorial is part of Spatial Analysis Notes, a compilation hosted as a GitHub repository that you can access it in a few ways:
.zip
file that contains all the materials.This tutorial relies on the following libraries that you will need to have installed on your machine to be able to interactively follow along2 You can install package mypackage
by running the command install.packages("mypackage")
on the R prompt or through the Tools --> Install Packages...
menu in RStudio.. Once installed, load them up with the following commands:
# Layout
library(tufte)
# For pretty table
library(knitr)
# Spatial Data management
library(rgdal)
# Pretty graphics
library(ggplot2)
# Thematic maps
library(tmap)
# Pretty maps
library(ggmap)
# Various GIS utilities
library(GISTools)
# For all your interpolation needs
library(gstat)
# For data manipulation
library(plyr)
Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with setwd()
. Please replace in the following line the path to the folder where you have placed this file -and where the house_transactions
folder with the data lives.
#setwd('/media/dani/baul/AAA/Documents/teaching/u-lvl/2016/envs453/code')
setwd('.')
For this session, we will use a subset of residential property transaction data for the city of Liverpool. These are provided by the Land Registry (as part of their Price Paid Data) but have been cleaned and re-packaged by Dani Arribas-Bel.
Let us start by reading the data, which comes in a shapefile:
db <- readOGR(dsn = 'house_transactions', layer = 'liv_house_trans')
## OGR data source with driver: ESRI Shapefile
## Source: "house_transactions", layer: "liv_house_trans"
## with 6324 features
## It has 18 fields
Before we forget, let us make sure price
is considered a number, not a factor:
db@data$price <- as.numeric(as.character((db@data$price)))
The dataset spans the year 2014:
# Format dates
dts <- as.Date(db@data$trans_date)
# Set up summary table
tab <- summary(dts)
tab
## Min. 1st Qu. Median Mean 3rd Qu.
## "2014-01-02" "2014-04-11" "2014-07-09" "2014-07-08" "2014-10-03"
## Max.
## "2014-12-30"
We can then examine the elements of the object with the summary
method:
summary(db)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## coords.x1 333536 345449
## coords.x2 382684 397833
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Number of points: 6324
## Data attributes:
## pcds id
## L1 6LS : 126 {00029226-80EF-4280-9809-109B8509656A}: 1
## L8 5TE : 63 {00041BD2-4A07-4D41-A5AE-6459CD5FD37C}: 1
## L1 5AQ : 34 {0005AE67-9150-41D4-8D56-6BFC868EECA3}: 1
## L24 1WA: 31 {00183CD7-EE48-434B-8A1A-C94B30A93691}: 1
## L17 6BT: 26 {003EA3A5-F804-458D-A66F-447E27569456}: 1
## L3 1EE : 24 {00411304-DD5B-4F11-9748-93789D6A000E}: 1
## (Other):6020 (Other) :6318
## price trans_date type new duration
## Min. : 1000 2014-06-27 00:00: 109 D: 505 N:5495 F:3927
## 1st Qu.: 70000 2014-12-19 00:00: 109 F:1371 Y: 829 L:2397
## Median : 110000 2014-02-28 00:00: 105 O: 119
## Mean : 144310 2014-10-31 00:00: 95 S:1478
## 3rd Qu.: 160000 2014-03-28 00:00: 94 T:2851
## Max. :26615720 2014-11-28 00:00: 94
## (Other) :5718
## paon saon street
## 3 : 203 FLAT 2 : 25 CROSSHALL STREET: 133
## 11 : 151 FLAT 3 : 25 STANHOPE STREET : 63
## 14 : 148 FLAT 1 : 24 PALL MALL : 47
## 5 : 146 APARTMENT 4: 23 DUKE STREET : 41
## 4 : 140 APARTMENT 2: 21 MANN ISLAND : 41
## 8 : 128 (Other) : 893 OLD HALL STREET : 39
## (Other):5408 NA's :5313 (Other) :5960
## locality town district county
## WAVERTREE : 126 LIVERPOOL:6324 KNOWSLEY : 12 MERSEYSIDE:6324
## MOSSLEY HILL: 102 LIVERPOOL:6311
## WALTON : 88 WIRRAL : 1
## WEST DERBY : 71
## WOOLTON : 66
## (Other) : 548
## NA's :5323
## ppd_cat status lsoa11 LSOA11CD
## A:5393 A:6324 E01033762: 144 E01033762: 144
## B: 931 E01033756: 98 E01033756: 98
## E01033752: 93 E01033752: 93
## E01033750: 71 E01033750: 71
## E01006518: 68 E01006518: 68
## E01033755: 65 E01033755: 65
## (Other) :5785 (Other) :5785
See how it contains several pieces, some relating to the spatial information, some relating to the tabular data attached to it. We can access each of the separately if we need it. For example, to pull out the names of the columns in the data.frame
, we can use the @data
appendix:
colnames(db@data)
## [1] "pcds" "id" "price" "trans_date" "type"
## [6] "new" "duration" "paon" "saon" "street"
## [11] "locality" "town" "district" "county" "ppd_cat"
## [16] "status" "lsoa11" "LSOA11CD"
The rest of this session will focus on two main elements of the shapefile: the spatial dimension (as stored in the point coordinates), and the house price values contained in the price
column. To get a sense of what they look like first, let us plot both. We can get a quick look at the non-spatial distribution of house values with the following commands:
Raw house prices in Liverpool
# Create the histogram
hist <- qplot(data=db@data,x=price)
hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This basically shows there is a lot of values concentrated around the lower end of the distribution but a few very large ones. A usual transformation to shrink these differences is to take logarithms:
Log of house price in Liverpool
# Create log and add it to the table
logpr <- log(as.numeric(db@data$price))
db@data['logpr'] <- logpr
# Create the histogram
hist <- qplot(x=logpr)
hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To obtain the spatial distribution of these houses, we need to turn away from the @data
component of db
. The easiest, quickest (and also dirtiest) way to get a sense of what the data look like over space is using plot
:
Spatial distribution of house transactions in Liverpool
plot(db)
Kernel Density Estimation (KDE) is a technique that creates a continuous representation of the distribution of a given variable, such as house prices. Although theoretically it can be applied to any dimension, usually, KDE is applied to either one or two dimensions.
KDE over a single dimension is essentially a contiguous version of a histogram. We can see that by overlaying a KDE on top of the histogram of logs that we have created before:
# Create the base
base <- ggplot(db@data, aes(x=logpr))
# Histogram
hist <- base +
geom_histogram(bins=50, aes(y=..density..))
# Overlay density plot
kde <- hist +
geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")
kde
Histogram and KDE of the log of house prices in Liverpool
The key idea is that we are smoothing out the discrete binning that the histogram involves. Note how the histogram is exactly the same as above shape-wise, but it has been rescalend on the Y axis to reflect probabilities rather than counts.
Geography, at the end of the day, is usually represented as a two-dimensional space where we locate objects using a system of dual coordinates, X
and Y
(or latitude and longitude). Thanks to that, we can use the same technique as above to obtain a smooth representation of the distribution of a two-dimensional variable. The crucial difference is that, instead of obtaining a curve as the output, we will create a surface, where intensity will be represented with a color gradient, rather than with the second dimension, as it is the case in the figure above.
To create a spatial KDE in R, there are several ways. If you do not want to necessarily acknowledge the spatial nature of your data, or you they are not stored in a spatial format, you can plot them using ggplot2
. Note we first need to convert the coordinates (stored in the spatial part of db
) into columns of X and Y coordinates, then we can plot them:
KDE of house transactions in Liverpool
# Attach XY coordinates
db@data['X'] <- db@coords[, 1]
db@data['Y'] <- db@coords[, 2]
# Set up base layer
base <- ggplot(data=db@data, aes(x=X, y=Y))
# Create the KDE surface
kde <- base + stat_density2d(aes(x = X, y = Y, alpha = ..level..),
size = 0.01, bins = 16, geom = 'polygon') +
scale_fill_gradient()
kde
Or, we can use a package such as the GISTools
, which allows to pass a spatial object directly:
KDE of house transactions in Liverpool
# Compute the KDE
kde <- kde.points(db)
# Plot the KDE
level.plot(kde)
Either of these approaches generate a surface that represents the density of dots, that is an estimation of the probability of finding a house transaction at a given coordinate. However, without any further information, they are hard to interpret and link with previous knowledge of the area. To bring such context to the figure, we can plot an underlying basemap, using a cloud provider such as Google Maps or, as in this case, OpenStreetMap. To do it, we will leverage the library ggmap
, which is designed to play nicely with the ggplot2
family (hence the seemingly counterintuitive example above). Before we can plot them with the online map, we need to reproject them though.
# Reproject coordinates
wgs84 <- CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
db_wgs84 <- spTransform(db, wgs84)
db_wgs84@data['lon'] <- db_wgs84@coords[, 1]
db_wgs84@data['lat'] <- db_wgs84@coords[, 2]
xys <- db_wgs84@data[c('lon', 'lat')]
# Bounding box
liv <- c(left = min(xys$lon), bottom = min(xys$lat),
right = max(xys$lon), top = max(xys$lat))
# Download map tiles
basemap <- get_stamenmap(liv, zoom = 12,
maptype = "toner-lite")
## Map from URL : http://tile.stamen.com/toner-lite/12/2013/1325.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2014/1325.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2015/1325.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2013/1326.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2014/1326.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2015/1326.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2013/1327.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2014/1327.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2015/1327.png
# Overlay KDE
final <- ggmap(basemap, extent = "device",
maprange=FALSE) +
stat_density2d(data = db_wgs84@data,
aes(x = lon, y = lat,
alpha=..level..,
fill = ..level..),
size = 0.01, bins = 16,
geom = 'polygon',
show.legend = FALSE) +
scale_fill_gradient2("Transaction\nDensity",
low = "#fffff8",
high = "#8da0cb")
final