This note is part of Spatial Analysis Notes. Points – Kernel Density Estimation and Spatial interpolation by Dani Arribas-Bel is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This session is based on the following references, which are great follow-ups on the topic:

This tutorial is part of Spatial Analysis Notes, a compilation hosted as a GitHub repository that you can access in a few ways:

Dependencies

This tutorial relies on the following libraries, which you will need to have installed on your machine to be able to follow along interactively. (You can install a package, say mypackage, by running install.packages("mypackage") at the R prompt or through the Tools --> Install Packages... menu in RStudio.) Once installed, load them up with the following commands:
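
If any of them are missing from your installation, the following sketch installs the whole set in one call (the names are simply those loaded in the chunk below):

# Install all of the required packages in one go (only needed once)
pkgs <- c("tufte", "knitr", "rgdal", "ggplot2", "tmap",
          "ggmap", "GISTools", "gstat", "plyr")
install.packages(pkgs)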

# Layout
library(tufte)
# For pretty table
library(knitr)
# Spatial Data management
library(rgdal)
# Pretty graphics
library(ggplot2)
# Thematic maps
library(tmap)
# Pretty maps
library(ggmap)
# Various GIS utilities
library(GISTools)
# For all your interpolation needs
library(gstat)
# For data manipulation
library(plyr)

Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with setwd(). In the following line, please replace the path with that of the folder where you have placed this file (and where the house_transactions folder with the data lives).

#setwd('/media/dani/baul/AAA/Documents/teaching/u-lvl/2016/envs453/code')
setwd('.')

Data

For this session, we will use a subset of residential property transaction data for the city of Liverpool. These are provided by the Land Registry (as part of their Price Paid Data) but have been cleaned and re-packaged by Dani Arribas-Bel.

Let us start by reading the data, which comes in a shapefile:

db <- readOGR(dsn = 'house_transactions', layer = 'liv_house_trans')
## OGR data source with driver: ESRI Shapefile 
## Source: "house_transactions", layer: "liv_house_trans"
## with 6324 features
## It has 18 fields

Before we forget, let us make sure price is considered a number, not a factor:

db@data$price <- as.numeric(as.character(db@data$price))
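
As an aside, the as.character() step matters: calling as.numeric() directly on a factor returns the internal level codes rather than the original values. A toy illustration:

# Why the as.character() step is needed (toy example)
f <- factor(c("100", "250"))
as.numeric(f)                # level codes: 1 2
as.numeric(as.character(f))  # actual values: 100 250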

The dataset spans the year 2014:

# Format dates
dts <- as.Date(db@data$trans_date)
# Set up summary table
tab <- summary(dts)
tab
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2014-01-02" "2014-04-11" "2014-07-09" "2014-07-08" "2014-10-03" 
##         Max. 
## "2014-12-30"

We can then examine the elements of the object with the summary method:

summary(db)
## Object of class SpatialPointsDataFrame
## Coordinates:
##              min    max
## coords.x1 333536 345449
## coords.x2 382684 397833
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Number of points: 6324
## Data attributes:
##       pcds                                           id      
##  L1 6LS : 126   {00029226-80EF-4280-9809-109B8509656A}:   1  
##  L8 5TE :  63   {00041BD2-4A07-4D41-A5AE-6459CD5FD37C}:   1  
##  L1 5AQ :  34   {0005AE67-9150-41D4-8D56-6BFC868EECA3}:   1  
##  L24 1WA:  31   {00183CD7-EE48-434B-8A1A-C94B30A93691}:   1  
##  L17 6BT:  26   {003EA3A5-F804-458D-A66F-447E27569456}:   1  
##  L3 1EE :  24   {00411304-DD5B-4F11-9748-93789D6A000E}:   1  
##  (Other):6020   (Other)                               :6318  
##      price                     trans_date   type     new      duration
##  Min.   :    1000   2014-06-27 00:00: 109   D: 505   N:5495   F:3927  
##  1st Qu.:   70000   2014-12-19 00:00: 109   F:1371   Y: 829   L:2397  
##  Median :  110000   2014-02-28 00:00: 105   O: 119                    
##  Mean   :  144310   2014-10-31 00:00:  95   S:1478                    
##  3rd Qu.:  160000   2014-03-28 00:00:  94   T:2851                    
##  Max.   :26615720   2014-11-28 00:00:  94                             
##                     (Other)         :5718                             
##       paon               saon                   street    
##  3      : 203   FLAT 2     :  25   CROSSHALL STREET: 133  
##  11     : 151   FLAT 3     :  25   STANHOPE STREET :  63  
##  14     : 148   FLAT 1     :  24   PALL MALL       :  47  
##  5      : 146   APARTMENT 4:  23   DUKE STREET     :  41  
##  4      : 140   APARTMENT 2:  21   MANN ISLAND     :  41  
##  8      : 128   (Other)    : 893   OLD HALL STREET :  39  
##  (Other):5408   NA's       :5313   (Other)         :5960  
##          locality           town           district           county    
##  WAVERTREE   : 126   LIVERPOOL:6324   KNOWSLEY :  12   MERSEYSIDE:6324  
##  MOSSLEY HILL: 102                    LIVERPOOL:6311                    
##  WALTON      :  88                    WIRRAL   :   1                    
##  WEST DERBY  :  71                                                      
##  WOOLTON     :  66                                                      
##  (Other)     : 548                                                      
##  NA's        :5323                                                      
##  ppd_cat  status         lsoa11          LSOA11CD   
##  A:5393   A:6324   E01033762: 144   E01033762: 144  
##  B: 931            E01033756:  98   E01033756:  98  
##                    E01033752:  93   E01033752:  93  
##                    E01033750:  71   E01033750:  71  
##                    E01006518:  68   E01006518:  68  
##                    E01033755:  65   E01033755:  65  
##                    (Other)  :5785   (Other)  :5785

See how it contains several pieces, some relating to the spatial information and some to the tabular data attached to it. We can access each of them separately if we need to. For example, to pull out the names of the columns in the data.frame, we can use the @data slot:

colnames(db@data)
##  [1] "pcds"       "id"         "price"      "trans_date" "type"      
##  [6] "new"        "duration"   "paon"       "saon"       "street"    
## [11] "locality"   "town"       "district"   "county"     "ppd_cat"   
## [16] "status"     "lsoa11"     "LSOA11CD"
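
Similarly, the spatial side of the object can be pulled out on its own. For example, the point coordinates live in the @coords slot, and the projection information can be queried with proj4string():

# Peek at the spatial components of the object
head(db@coords)   # XY coordinates of the first few transactions
proj4string(db)   # coordinate reference system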

The rest of this session will focus on two main elements of the shapefile: the spatial dimension (as stored in the point coordinates), and the house price values contained in the price column. To get a sense of what they look like first, let us plot both. We can get a quick look at the non-spatial distribution of house values with the following commands:

Raw house prices in Liverpool

# Create the histogram
hist <- qplot(data=db@data,x=price)
hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This basically shows that there are a lot of values concentrated around the lower end of the distribution, plus a few very large ones. A usual transformation to shrink these differences is to take logarithms:

Log of house price in Liverpool

# Create log and add it to the table
logpr <- log(as.numeric(db@data$price))
db@data['logpr'] <- logpr
# Create the histogram
hist <- qplot(x=logpr)
hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To obtain the spatial distribution of these houses, we need to turn away from the @data component of db. The easiest, quickest (and also dirtiest) way to get a sense of what the data look like over space is using plot:

Spatial distribution of house transactions in Liverpool

plot(db)

KDE

Kernel Density Estimation (KDE) is a technique that creates a continuous representation of the distribution of a given variable, such as house prices. Although it can theoretically be applied to any number of dimensions, KDE is usually applied to either one or two.
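
Before looking at each case in turn, note that, under the hood, a KDE places a small "bump" (the kernel) on every observation and sums them up into a smooth estimate. For a single variable, base R implements this directly in density(); a minimal sketch (not used in the rest of this note):

# One-dimensional KDE with base R
d <- density(db@data$logpr)
plot(d, main = "KDE of the log of house prices")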

One-dimensional KDE

KDE over a single dimension is essentially a continuous version of a histogram. We can see that by overlaying a KDE on top of the histogram of logs that we have created before:

# Create the base
base <- ggplot(db@data, aes(x=logpr))
# Histogram
hist <- base + 
  geom_histogram(bins=50, aes(y=..density..))
# Overlay density plot
kde <- hist + 
  geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")
kde
Histogram and KDE of the log of house prices in Liverpool

The key idea is that we are smoothing out the discrete binning that the histogram involves. Note how the histogram is exactly the same shape as above, but it has been rescaled on the Y axis to reflect probability densities rather than counts.
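
The amount of smoothing is controlled by the bandwidth of the kernel. In geom_density() this can be tuned with the adjust argument (a multiplier on the automatically chosen bandwidth), so a quick way to see its effect is to redraw the curve with less and more smoothing, reusing the base layer defined above. A minimal sketch:

# Effect of the bandwidth: adjust < 1 undersmooths, adjust > 1 oversmooths
base +
  geom_density(adjust = 0.25, colour = "blue") +
  geom_density(adjust = 2, colour = "red")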

Two-dimensional (spatial) KDE

Geography, at the end of the day, is usually represented as a two-dimensional space where we locate objects using a pair of coordinates, X and Y (or longitude and latitude). Thanks to that, we can use the same technique as above to obtain a smooth representation of the distribution of a two-dimensional variable. The crucial difference is that, instead of a curve, the output will be a surface, where intensity is represented with a colour gradient rather than with the second dimension, as is the case in the figure above.
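
Conceptually, this is what MASS::kde2d() computes: the two-dimensional density evaluated on a regular grid, which can then be drawn as an image. A minimal sketch (MASS ships with standard R installations; it is not used in the rest of this note):

# Raw two-dimensional KDE of the point coordinates with MASS
library(MASS)
surf <- kde2d(db@coords[, 1], db@coords[, 2], n = 100)
image(surf, asp = 1)   # density surface shown as a colour gradient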

There are several ways to create a spatial KDE in R. If you do not necessarily want to acknowledge the spatial nature of your data, or they are not stored in a spatial format, you can plot them using ggplot2. Note we first need to convert the coordinates (stored in the spatial part of db) into columns of X and Y coordinates; then we can plot them:

KDE of house transactions in Liverpool

# Attach XY coordinates
db@data['X'] <- db@coords[, 1]
db@data['Y'] <- db@coords[, 2]
# Set up base layer
base <- ggplot(data=db@data, aes(x=X, y=Y))
# Create the KDE surface
kde <- base + stat_density2d(aes(x = X, y = Y, alpha = ..level.., fill = ..level..), 
               size = 0.01, bins = 16, geom = 'polygon') +
            scale_fill_gradient()
kde

Or we can use a package such as GISTools, which allows us to pass a spatial object directly:

KDE of house transactions in Liverpool

# Compute the KDE
kde <- kde.points(db)
# Plot the KDE
level.plot(kde)
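
To see how the surface relates to the underlying observations, you can overlay the original points on top of the level plot. A quick sketch:

# Overlay the original transactions on the KDE surface
level.plot(kde)
plot(db, add = TRUE, pch = 16, cex = 0.2, col = rgb(0, 0, 0, 0.25))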

Either of these approaches generates a surface that represents the density of dots, that is, an estimate of the probability of finding a house transaction at a given coordinate. However, without any further information, they are hard to interpret and link with previous knowledge of the area. To bring such context to the figure, we can plot an underlying basemap, using a cloud provider such as Google Maps or, as in this case, OpenStreetMap. To do that, we will leverage the library ggmap, which is designed to play nicely with the ggplot2 family (hence the seemingly counterintuitive example above). Note that, before we can plot the points on top of the online map, we need to reproject them.

# Reproject coordinates
wgs84 <- CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
db_wgs84 <- spTransform(db, wgs84)
db_wgs84@data['lon'] <- db_wgs84@coords[, 1]
db_wgs84@data['lat'] <- db_wgs84@coords[, 2]
xys <- db_wgs84@data[c('lon', 'lat')]
# Bounding box
liv <- c(left = min(xys$lon), bottom = min(xys$lat), 
         right = max(xys$lon), top = max(xys$lat))
# Download map tiles
basemap <- get_stamenmap(liv, zoom = 12, 
                         maptype = "toner-lite")
## Map from URL : http://tile.stamen.com/toner-lite/12/2013/1325.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2014/1325.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2015/1325.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2013/1326.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2014/1326.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2015/1326.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2013/1327.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2014/1327.png
## Map from URL : http://tile.stamen.com/toner-lite/12/2015/1327.png
# Overlay KDE
final <- ggmap(basemap, extent = "device", 
               maprange=FALSE) +
  stat_density2d(data = db_wgs84@data, 
                aes(x = lon, y = lat, 
                    alpha=..level.., 
                    fill = ..level..), 
                size = 0.01, bins = 16, 
                geom = 'polygon', 
                show.legend = FALSE) +
  scale_fill_gradient2("Transaction\nDensity", 
                       low = "#fffff8", 
                       high = "#8da0cb")
final