This session covers spatial interaction flows. Using open data from the city of San Francisco about trips on its bikeshare system, we will estimate spatial interaction models that try to capture and explain the variation in the number of trips on each given route. After visualizing the dataset, we begin with a very simple model and then build complexity progressively by augmenting it with more information, refined measurements, and better modeling approaches. Throughout the note, we explore different ways to grasp the predictive performance of each model. We finish with a prediction example that illustrates how these models can be deployed in a real-world application.
This note is part of Spatial Analysis Notes. Flows – Exploring flows visually and through spatial interaction by Dani Arribas-Bel is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Content is based on the following references, which are great follow-ups on the topic:
This tutorial is part of Spatial Analysis Notes, a compilation hosted as a GitHub repository that you can access in a few ways:
As a .zip file that contains all the materials.
This tutorial relies on the following libraries, which you will need to have installed on your machine to follow along interactively. (You can install a package mypackage by running the command install.packages("mypackage") on the R prompt or through the Tools --> Install Packages... menu in RStudio.) Once installed, load them up with the following commands:
# Layout
library(tufte)
# Spatial Data management
library(rgdal)
# Pretty graphics
library(ggplot2)
# Thematic maps
library(tmap)
# Pretty maps
library(ggmap)
# Simulation methods
library(arm)
Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with setwd(). In the following line, please replace the path with that of the folder where you have placed this file, which is also where the sf_bikes folder with the data lives.
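As a minimal sketch of this step, the snippet below sets the working directory and then checks that the data file sits where the rest of the note expects it. The path '.' is only a placeholder, and the names work_dir and flows_path are introduced here for illustration.

```r
# Placeholder path: replace '.' with the folder that holds this file
# and the sf_bikes folder
work_dir <- '.'
setwd(work_dir)

# Relative path used later in the note to read the data
flows_path <- file.path('sf_bikes', 'flows.geojson')

# Warn early if the data is not where we expect it to be
if (!file.exists(flows_path)) {
  warning('Could not find ', flows_path, ' under ', getwd())
}
```

Checking for the file up front makes a wrong path show up here rather than as a less obvious error when the data is read.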
In this note, we will use data from the city of San Francisco representing bike trips on its public bike share system. The original source is the SF Open Data portal, and the dataset comprises both the location of each station in the Bay Area and information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the sf_bikes folder. The data file is named flows.geojson and, in case you are interested, the (Python) code required to create it from the original files in the SF Data Portal is also available in the flows_prep.ipynb notebook, in the same folder.
Let us then directly load the file with all the information necessary:
db <- readOGR(dsn='sf_bikes/flows.geojson', layer='OGRGeoJSON')
## OGR data source with driver: GeoJSON
## Source: "sf_bikes/flows.geojson", layer: "OGRGeoJSON"
## with 1722 features
## It has 9 fields
rownames(db@data) <- db$flow_id
db@data$flow_id <- NULL
Note how the interface is slightly different since we are reading a
GeoJSON file instead of a shapefile.
The data contains the geometries of the flows, as calculated from the Google Maps API, as well as a series of columns with characteristics of each flow:
##       dest orig straight_dist street_dist total_down total_up trips15
## 39-41   41   39      1452.201   1804.1150  11.205753 4.698162      68
## 39-42   42   39      1734.861   2069.1557  10.290236 2.897886      23
## 39-45   45   39      1255.349   1747.9928  11.015596 4.593927      83
## 39-46   46   39      1323.303   1490.8361   3.511543 5.038044     258
## 39-47   47   39       715.689    769.9189   0.000000 3.282495     127
## 39-48   48   39      1996.778   2740.1290  11.375186 3.841296      81
##       trips16
## 39-41      68
## 39-42      29
## 39-45      50
## 39-46     163
## 39-47      73
## 39-48      56
orig/dest are the station IDs of the origin and destination,
street/straight_dist is the distance in metres between stations measured along the street network or as the crow flies,
total_down/up is the total descent and climb in the trip, and
tripsXX contains the number of trips undertaken in each of the years of study.
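To get a feel for how these columns relate to each other, here is a small sketch that rebuilds the six rows printed above as a plain data.frame and derives two quantities from them. The object sample_flows and the columns detour and trips_change are not part of the dataset; they are introduced here only for illustration.

```r
# Six rows copied from the table printed above
sample_flows <- data.frame(
  dest = c(41, 42, 45, 46, 47, 48),
  orig = 39,
  straight_dist = c(1452.201, 1734.861, 1255.349, 1323.303, 715.689, 1996.778),
  street_dist = c(1804.1150, 2069.1557, 1747.9928, 1490.8361, 769.9189, 2740.1290),
  trips15 = c(68, 23, 83, 258, 127, 81),
  trips16 = c(68, 29, 50, 163, 73, 56),
  row.names = c('39-41', '39-42', '39-45', '39-46', '39-47', '39-48')
)

# Ratio of street to straight-line distance: 1 would be a perfectly direct route
sample_flows$detour <- sample_flows$street_dist / sample_flows$straight_dist

# Change in the number of trips between the two years of study
sample_flows$trips_change <- sample_flows$trips16 - sample_flows$trips15

sample_flows[, c('detour', 'trips_change')]
```

The same expressions should work on db@data once the full file is loaded, since it exposes the same column names.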
The easiest way to get a quick preview of what the data looks like spatially is to make a simple plot:
Equally, if we want to visualize a single route, we can simply subset the table. For example, to get the shape of the trip from station
39 to station
48, we can:
Trip from station 39 to 48
one39to48 <- db[ which( db@data$orig == 39 & db@data$dest == 48 ) , ]
plot(one39to48)
or, for the most popular route, we can:
Most popular trip
most_pop <- db[ which( db@data$trips15 == max(db@data$trips15) ) , ]
plot(most_pop)
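One detail of the which(... == max(...)) idiom is that it returns every row that reaches the maximum, so ties would select more than one route. The sketch below illustrates this on a plain data.frame, using the trips15 values from the rows printed earlier plus one made-up tied route ('39-99', which does not exist in the data). The names toy and toy_tied are hypothetical.

```r
# trips15 values copied from the first rows of the table
toy <- data.frame(
  trips15 = c(68, 23, 83, 258, 127, 81),
  row.names = c('39-41', '39-42', '39-45', '39-46', '39-47', '39-48')
)

# Add a made-up route that ties with the current maximum
toy_tied <- rbind(toy, data.frame(trips15 = 258, row.names = '39-99'))

# Every row that reaches the maximum (two rows here because of the tie)
top_all <- rownames(toy_tied)[which(toy_tied$trips15 == max(toy_tied$trips15))]

# Only the first row that reaches the maximum
top_first <- rownames(toy_tied)[which.max(toy_tied$trips15)]
```

If exactly one route is wanted regardless of ties, which.max is the safer choice. The idiom used above keeps all tied routes, which may be preferable for plotting.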
These plots, however, do not reveal a lot: there is no geographical context (why are there so many routes along the NE?) and no sense of how volumes of bikers are allocated along different routes. Let us fix those two.
The easiest way to bring in geographical context is by overlaying the routes on top of a background map of tiles downloaded from the internet. Let us download this using get_stamenmap from ggmap
sf_bb <- c(left=db@bbox['x', 'min'],
           right=db@bbox['x', 'max'],
           bottom=db@bbox['y', 'min'],
           top=db@bbox['y', 'max'])
SanFran <- get_stamenmap(sf_bb, zoom = 14, maptype = "toner-lite")
## Source : http://tile.stamen.com/toner-lite/14/2620/6330.png
## Source : http://tile.stamen.com/toner-lite/14/2621/6330.png
## Source : http://tile.stamen.com/toner-lite/14/2622/6330.png
## Source : http://tile.stamen.com/toner-lite/14/2620/6331.png
## Source : http://tile.stamen.com/toner-lite/14/2621/6331.png
## Source : http://tile.stamen.com/toner-lite/14/2622/6331.png
## Source : http://tile.stamen.com/toner-lite/14/2620/6332.png
## Source : http://tile.stamen.com/toner-lite/14/2621/6332.png
## Source : http://tile.stamen.com/toner-lite/14/2622/6332.png
## Source : http://tile.stamen.com/toner-lite/14/2620/6333.png
## Source : http://tile.stamen.com/toner-lite/14/2621/6333.png
## Source : http://tile.stamen.com/toner-lite/14/2622/6333.png
and make sure it looks like we intend it to look: