R
]R
: what is it and why should I care?R
overview
data.frame
data.frame
Open principles applied to the way you conduct statistical data analysis:
This typically involves three levels:
Being fully open on the three is not always possible (e.g. proprietary data/software), but that should be goal to which to get as close as possible.
Getting halfway is better than not starting
In this session we will focus on the last two: code and platform
Lessons:
R
R
: what is it?R is a language and environment for statistical computing and graphics
Characteristics:
S
language created by Ross Ihaka and Robert Gentleman in 1993Some people who care about R:
…Then sit back, relax, and enjoy being part of something big… [Tom Preston-Werner]
Being Free Software (“the users have the freedom to run, copy, distribute, study, change and improve the software”) has enhanced:
R
R
R
R
overviewLook for R info and packages
#rstats
hashtagInstall and load packages
instal.packages
functionlibrary
(e.d. library(maptools)
to load the package maptools
)Command | Function |
---|---|
?read.csv |
Check local documentation for read.csv function |
spdep::moran.test |
Check local documentation in package spdep for moran.test |
help("read.csv") |
Check local documentation for read.csv function |
help.search("read.csv") |
Search for “read.csv” in all help files |
RSiteSearch("plot maps") |
Search for the term “plot maps” in the RSiteSearch website (requires connectivity) |
Clean up working space
rm(list = ls())
Point to the folder
#setwd('~/code/WooWii/slides/')
getwd()
## [1] "/Users/dani/code/WooWii/slides"
Native csv
reading
nl <- read.csv("../data/RR - Netherlands.csv")
Foreign formats supported
library(foreign)
proc <- read.dta("../data/RR-processed.dta")
Many other formats supported (dbf
, xls
, sql
-like databases…)
data.frame
head
/tail
for the top/bottom of the table
head(nl)
## Country Year Debt GDP1 GDP2 RGDP1 RGDP2 GDPI1 GDPI2
## 1 Netherlands 1807 NA 490.3 NA 381.9 NA 128.4 NA
## 2 Netherlands 1808 NA 436.2 NA 339.3 NA 128.6 NA
## 3 Netherlands 1809 NA 407.9 NA 334.8 NA 121.8 NA
## 4 Netherlands 1810 NA NA NA NA NA NA NA
## 5 Netherlands 1811 NA NA NA NA NA NA NA
## 6 Netherlands 1812 NA NA NA NA NA NA NA
nl[1, ]
## Country Year Debt GDP1 GDP2 RGDP1 RGDP2 GDPI1 GDPI2
## 1 Netherlands 1807 NA 490.3 NA 381.9 NA 128.4 NA
data.frame
max(nl$GDP1, na.rm=TRUE)
## [1] 6489
min(nl$Debt, na.rm=TRUE)
## [1] 6.6
Create new variables
nl['dtg'] = nl$Debt / nl$GDP1
data.frame
summary
for basic statistics
summary(nl)
## Country Year Debt GDP1
## Netherlands:204 Min. :1807 Min. : 7 Min. : 406
## 1st Qu.:1858 1st Qu.: 26 1st Qu.: 562
## Median :1908 Median : 620 Median :1038
## Mean :1908 Mean : 1263 Mean :1668
## 3rd Qu.:1959 3rd Qu.: 1158 3rd Qu.:1533
## Max. :2010 Max. :12619 Max. :6489
## NA's :74 NA's :83
## GDP2 RGDP1 RGDP2 GDPI1
## Min. : 14.5 Min. : 335 Min. :243 Min. : 79.4
## 1st Qu.: 53.4 1st Qu.: 772 1st Qu.:279 1st Qu.: 89.0
## Median :178.8 Median : 2078 Median :343 Median : 97.1
## Mean :212.3 Mean : 61539 Mean :356 Mean :103.5
## 3rd Qu.:316.1 3rd Qu.: 83826 3rd Qu.:427 3rd Qu.:106.2
## Max. :595.9 Max. :385709 Max. :488 Max. :195.2
## NA's :150 NA's :9 NA's :173 NA's :83
## GDPI2 dtg
## Min. : 64.9 Min. :0.42
## 1st Qu.: 76.1 1st Qu.:0.59
## Median : 87.7 Median :0.72
## Mean : 91.5 Mean :0.72
## 3rd Qu.:108.1 3rd Qu.:0.86
## Max. :123.0 Max. :0.99
## NA's :174 NA's :151
data.frame
A data.frame
has fancy query features
with_debt <- nl[!is.na(nl$Debt), ]
head(with_debt, 3)
## Country Year Debt GDP1 GDP2 RGDP1 RGDP2 GDPI1 GDPI2 dtg
## 74 Netherlands 1880 942 1120 NA 1139 NA 98.36 NA 0.8407
## 75 Netherlands 1881 941 1134 NA 1160 NA 97.80 NA 0.8297
## 76 Netherlands 1882 999 1191 NA 1191 NA 100.00 NA 0.8388
nl_clean <- nl[!is.na(nl$GDP1), ]
mean_gdp <- mean(nl_clean$GDP1)
high_gdp <- nl_clean[nl_clean$GDP1 > mean_gdp, ]
head(high_gdp, 3)
## Country Year Debt GDP1 GDP2 RGDP1 RGDP2 GDPI1 GDPI2 dtg
## 99 Netherlands 1905 1106 1711 NA 1931 NA 88.62 NA 0.6466
## 100 Netherlands 1906 1145 1823 NA 1971 NA 92.50 NA 0.6280
## 101 Netherlands 1907 1140 1812 NA 1935 NA 93.61 NA 0.6291
You could use >
for greater than, <
for smaller than, or ==
for equal to
data.frame
Which you can combine:
super_clean <- nl[(!is.na(nl$GDP1)) & (!is.na(nl$Debt)), ]
ratio <- super_clean$Debt / super_clean$GDP1
good_years <- super_clean[(ratio < 0.9) & (super_clean$GDP1 > mean_gdp), ]
head(good_years, 3)
## Country Year Debt GDP1 GDP2 RGDP1 RGDP2 GDPI1 GDPI2 dtg
## 99 Netherlands 1905 1106 1711 NA 1931 NA 88.62 NA 0.6466
## 100 Netherlands 1906 1145 1823 NA 1971 NA 92.50 NA 0.6280
## 101 Netherlands 1907 1140 1812 NA 1935 NA 93.61 NA 0.6291
In proc
:
GDP
is largest?Debt1
to GDP1
ratio has ever been between 20% and 40%GDP
is largest?clean_gdp <- proc[!is.na(proc$GDP), ]
max_gdp <- max(clean_gdp$GDP)
top_gdp <- clean_gdp[clean_gdp$GDP == max_gdp, ]
top_gdp[, c('Country', 'Year', 'GDP')]
## Country Year GDP
## 1107 UK 2008 1446110
Debt1
to GDP1
ratio has ever been between 20% and 40%proc['dtg'] <- 100 * proc$Debt1 / proc$GDP1
debt_to_gdp_clean <- proc[!is.na(proc$dtg), ]
answer <- debt_to_gdp_clean[(debt_to_gdp_clean$dtg > 20) &
(debt_to_gdp_clean$dtg < 40), ]
answer[1:5, c('Country', 'Year', 'dtg')]
## Country Year dtg
## 65 Austria 1948 35.20
## 66 Austria 1949 26.45
## 67 Austria 1950 22.14
## 96 Austria 1979 23.59
## 97 Austria 1980 24.87
unique(answer$Country)
## [1] Austria Finland Portugal
## 20 Levels: Australia Austria Belgium Canada Denmark Finland ... US
ols <- lm('log(Debt) ~ log(GDP)', data=proc)
summary(ols)
##
## Call:
## lm(formula = "log(Debt) ~ log(GDP)", data = proc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.698 -0.259 -0.121 0.120 1.421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2086 0.1640 -1.27 0.21
## log(GDP) 0.9668 0.0164 59.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.473 on 125 degrees of freedom
## (1044 observations deleted due to missingness)
## Multiple R-squared: 0.965, Adjusted R-squared: 0.965
## F-statistic: 3.49e+03 on 1 and 125 DF, p-value: <2e-16
plot(log(proc$GDP), log(proc$Debt))
Try also plot(ols)
in RStudio!
Corrected Reinhart & Rogoff Table 1 (as shown by Herndon et al.)
library(xtable)
proc$dgcat.lm <- cut(proc$debtgdp, breaks=c(0,30,60,90,Inf))
proc$dgcat <- factor(proc$dgcat.lm, labels = c("0-30%","30-60%","60-90%","Above 90%"),ordered=TRUE)
(RR.correct.mean <- with(proc, tapply( dRGDP, dgcat, mean, na.rm=TRUE )))
## 0-30% 30-60% 60-90% Above 90%
## 4.174 3.116 3.222 2.168
Converted to a data.frame
table_df <- data.frame(RR.correct.mean, dgcat=names(RR.correct.mean) )
table_df
## RR.correct.mean dgcat
## 0-30% 4.174 0-30%
## 30-60% 3.116 30-60%
## 60-90% 3.222 60-90%
## Above 90% 2.168 Above 90%
Simply write to a csv
write.csv(table_df, '~/Desktop/table.csv')
Or to LaTeX
library(xtable)
xtable(table_df)
## % latex table generated in R 3.0.3 by xtable 1.7-3 package
## % Sun Sep 7 11:50:12 2014
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrl}
## \hline
## & RR.correct.mean & dgcat \\
## \hline
## 0-30\% & 4.17 & 0-30\% \\
## 30-60\% & 3.12 & 30-60\% \\
## 60-90\% & 3.22 & 60-90\% \\
## Above 90\% & 2.17 & Above 90\% \\
## \hline
## \end{tabular}
## \end{table}
If we wanted to write it out to a file
sink('~/Desktop/table.tex')
xtable(table_df)
sink()
Content by Dani Arribas-Bel and Thomas De Graaff, licensed under Creative Commons Attribution 4.0 International License.
For this session, we have borrowed important amounts of inspiration and material from Software Carpentry’s session on git and the freely available book Pro Git