Special sessions

Applications of new sources of (big) data in Regional Science

61st NARSC Meetings

Washington, DC
November 12-15, 2014

Session I - Time and location
- Big Data, Smart Cities and Research Infrastructure Innovation. Robert J. Stimson [abstract]
- Residential Foreclosure and Non-Housing Wealth. Sharon O'Donnell [abstract]
- Forecasting Regional Health Crises Using Google Trends. Jason Parker [abstract]
- Fast Food Data: The Usefulness of Social Media Byproducts. David Folch [abstract]
Session II - Time and location
- Neighborhood Effects in a Behavioral Randomized Controlled Trial. Tammy Leonard [abstract]
- The Spatial Pattern of Inequality within Cities and its Relation with the Local Economy. Norbert Schanne [abstract]
- Spatial and Social Frictions in the City: Evidence From Yelp. Donald R. Davis [abstract]
- “The magic’s in the recipe” - Urban Diversity and Popular Amenities. Dani Arribas-Bel [abstract]
Session III - Time and location
- From 'Big Noise' to 'Big Data': a case study of cross-validation between 3 large geographical datasets on visitor flows between regional urban centres. Robin Lovelace [abstract]
- Digital Neighborhoods. Luc Anselin [abstract]
- Sensitivity of Location-Sharing Services Data: Evidence from American Travel Pattern. Zhenhua Chen [abstract]
- A framework of Mapping Social Connections in Space and Time. Xinyue Ye [abstract]
Session IV - Time and location
- New Data, New Applications: A Method for Transportation System Performance Monitoring. Mohja Rhoads [abstract]
- Mobile phone data and motorway traffic: can the former predict the latter? Emmanouil Tranos [abstract]
- Freight Deliveries Directly Generated by Residential Units: An Analysis with the 2009 NHTS Data. Yiwei Zhou [abstract]
- Baltimore’s Post-recession Socioeconomic Environment and Local Job Access for Work-eligible Temporary Assistance for Needy Families (TANF) Recipients: a Locational Approach for Welfare-to-work Examination. Ting Zhang [abstract]

Session I

Big Data, Smart Cities and Research Infrastructure Innovation
Stimson, Robert J. (University of Melbourne); Pettit, Chris (University of Melbourne); Sinnott, Richard (University of Melbourne)

Advances in information technologies are opening new ways to approach research and policy analysis for cities and regions. This is being driven in part by what is now referred to as ‘big data’ and also by the emergence of policies that are championing the ‘creative commons’ and ‘open data’. Harnessing the opportunities presented by these innovations is being championed by what is being referred to as ‘smart cities’. This paper overviews these developments and then focuses on how innovations in building new research infrastructures are starting to revolutionise the way urban and regional research might be facilitated through a number of initiatives that are occurring around the world. That includes the Australian Urban Research Infrastructure Network (AURIN) project, which is taken as an example. AURIN is a A$24 million initiative by the Australian Government, led by the University of Melbourne but involving universities, research institutions and data agencies across the country. It is developing and operating a new national research infrastructure that facilitates access to a wide range of data types at multiple scales and from multiple sources, with the online capability to integrate and interrogate those data using open source spatial and statistical analysis and modelling e-research tools with visualization. An important feature of the AURIN e-research infrastructure is merit-based, secured access to unit record data and its integration with spatial objective data: the researcher cannot download the individual data but conducts interrogations online and receives the results, thus ensuring protection of privacy. The paper presents some of the applications that can be undertaken using the AURIN e-infrastructure capabilities.
These include: harvesting spatial data and conducting econometric analysis and modelling; applying customised open source tools such as a Walkability tool and a Planning What If? tool; and integrating survey-based unit record data with spatial objective data. The paper concludes by discussing some of the impediments faced at the interface between ‘big data’, ‘smart cities’ and research infrastructure initiatives, which are challenges that need to be addressed.

Residential Foreclosure and Non-Housing Wealth
Coulson, Ed (Pennsylvania State University); O'Donnell, Sharon (U.S. Census Bureau)

The foreclosure literature describes a "double trigger" mechanism that increases the probability that a homeowner will face foreclosure. The double trigger is made up of a negative economic shock (e.g., job loss, divorce) and the presence of an underwater mortgage. In this paper, we examine the role of non-housing wealth in foreclosure probabilities. Homeowners with underwater mortgages have more housing choices if they have sufficient non-housing assets that can be used to sell the home and pay off the deficiency. For those with only housing assets, the absence of non-housing assets intensifies the triggers. The presence of non-housing wealth also contributes to foreclosure probability if the homeowner uses the assets to purchase a second home and defaults on the first. This form of strategic default ("ruthless" strategic default) exists if the homeowner has the means to keep current on the mortgage of the first home and has sufficient non-housing assets to pay off the deficiency, but chooses to apply them to the down payment on a second home. The analysis is based on a commingled file of 60 months of panel survey data (2008 Survey of Income and Program Participation (SIPP)), record-linked at the address level with local government data that document stages of foreclosure for a portion of the SIPP mortgaged homes. The data file contains monthly household events synchronized with foreclosure events. Initial analysis, based on 40 months of panel data, shows an association between homeowners in financial distress and defaults. Homeowners who feel the strongest effect of the local housing bubble on the value of their homes are at greater risk of default. Using two measures of non-housing assets, the analysis failed to find any evidence of a relationship between wealth and default. The paper does not determine who is motivated to engage in "ruthless" strategic default, but empirical evidence suggests that less than 2.5% of homeowners have the means to attempt it.

Forecasting Regional Health Crises Using Google Trends
Parker, Jason (Michigan State University); Loveridge, Scott (Michigan State University)

Community behavioral health includes positive mental health and freedom from substance abuse. The incidence of conditions such as depression and alcoholism varies greatly from one place to the next and in a given location can change over time; sometimes, such as in the case of crystal meth, the incidence can grow rapidly. Community behavioral health outcomes are produced by an array of resources that include formal and informal educational systems, law enforcement, health care providers, and social service workers. Often this large array of participants works in an uncoordinated fashion due to poor information about emerging needs. Local decision-makers seem to ignore many of the health data resources available to them, possibly because these resources appear after the fact and don’t function well in predicting regional health problems. Before the advent of data science, the first sign of a serious health crisis often came from the local news media. Today, Google Trends provides new data that anyone can use to see patterns in search history at the regional level, including searches for web sites with information about illegal drugs and various mental health disorders. Because the information is free of charge and offered in real time, rather than based on laborious population sampling, multiple reminders to obtain a defensible survey response rate, and data cleaning, using the new Google Trends data may also make it possible to construct powerful localized predictors of health crises so that regional planners, law enforcement officials, and social service agencies can react much more quickly. The rich cyclicality of this data poses a serious statistical problem, which we address by estimating the model with a new panel frequency domain approach.
The purpose of this paper is to demonstrate the use of Google Trends data in conjunction with more traditional covariates to provide accurate forecast equations for regional health problems so that local decision-makers can adjust their programs more quickly in response to emerging needs. The covariates include measurements of unemployment, household income, local interest rates, demographics, regional government spending, and inequality. In our model, we control for local fixed effects and national common time effects so that our analysis is independent of any region-specific and year-specific differences. Any remaining cross-sectional dependence between regions in the model can be accounted for by factor augmentation.
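
As an illustration of what controlling for local fixed effects and common time effects means, the following minimal sketch applies the standard two-way within transformation to a balanced panel. It is not the authors' panel frequency domain estimator; the function name `two_way_demean` and the toy numbers are assumptions made for this example only.

```python
def two_way_demean(panel):
    """Remove region and period means from a balanced panel.

    panel[i][t] is the value for region i in period t. The result is
    y_it - mean_i - mean_t + grand_mean, which wipes out any additive
    region-specific and period-specific effects.
    """
    n_regions = len(panel)
    n_periods = len(panel[0])
    region_means = [sum(row) / n_periods for row in panel]
    period_means = [
        sum(panel[i][t] for i in range(n_regions)) / n_regions
        for t in range(n_periods)
    ]
    grand_mean = sum(region_means) / n_regions
    return [
        [
            panel[i][t] - region_means[i] - period_means[t] + grand_mean
            for t in range(n_periods)
        ]
        for i in range(n_regions)
    ]


# Hypothetical panel built only from additive region and period effects.
region_effects = [1.0, 5.0, 10.0]
period_effects = [0.0, 2.0, 4.0, 6.0]
panel = [[a + g for g in period_effects] for a in region_effects]

# After the transformation nothing is left, up to rounding error.
residual = two_way_demean(panel)
```

Because the toy panel contains nothing except region and period effects, every transformed value is zero up to rounding; in real data the remainder would carry the variation that the covariates are asked to explain.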

Fast Food Data: The Usefulness of Social Media Byproducts
Spielman, Seth (University of Colorado - Boulder); Folch, David C. (Florida State University); Manduca, Robert (Harvard University)

The promise of “big” sets of user-generated geographic content is that they provide a new way to understand cities. From a data collection perspective, user-generated big urban data offers clear advantages: 1) it is available in real time and has very short update cycles and 2) it is inexpensive to collect, in large part because it is often a byproduct of modern urban life. However, from a data use perspective the arguments for big urban data are much less compelling. There are real questions about the fidelity of big urban data to “real” on-the-ground conditions. Like “fast food”, big data is cheap and seemingly everywhere, but its quality is often suspect. Is big data a replacement for “traditional” data sources, a complement to traditional data or simply a source of empty calories? Because big urban data sets are not designed studies in the formal statistical sense, it is difficult to measure data quality. Thus much of the literature on big geographic data has focused on epistemological questions about what constitutes “good” user-generated content (Spielman 2014). These evaluations are further complicated when spatial accuracy is not necessarily the primary goal of the data collection process (Arribas-Bel, 2013). In this paper we examine the correspondence between Yelp (www.yelp.com) data and an administrative dataset of restaurants in the Phoenix, Arizona metro area produced by the Maricopa Association of Governments (MAG). Our initial goal was to examine the fidelity of Yelp to the “real world.” We assumed that the register of restaurants used for metropolitan planning purposes would represent reality and that Yelp would capture only some aspect of the “truth.” Instead we found little overlap between the administrative and “big” urban data sets. Extensive direct comparison revealed that only about a third of restaurants in each dataset are found in the other.
This lack of correspondence could be caused by trivial differences such as geocoding or spelling conventions, or it could indicate substantive differences in the data. We therefore turned to an indirect approach that compares the spatial distribution of the two datasets and found that they differ in systematic and predictable ways. Ripley’s Kdiff function (Bailey and Gatrell 1995), the difference between the expected number of restaurants within each dataset as a function of distance, indicates that across all spatial scales the Yelp data is significantly clustered relative to the MAG data. Specifically, restaurants in the MAG database are spread fairly evenly across the metro area while those in the Yelp data are more concentrated in certain parts of metro Phoenix, most notably the downtowns and upscale areas of Phoenix, Scottsdale, and Tempe. Further examination of the locations with a high density of Yelp restaurants suggests that they share some common traits. Areas with a high density of Yelp relative to MAG restaurants are in more walkable areas as judged by their average WalkScore. Additionally, logistic models using the US Census Longitudinal Employer Household Dynamics Database (http://lehd.ces.census.gov) indicate that census block groups with high probability of containing more Yelp than officially listed restaurants tend to have larger numbers of low-wage workers and people employed in the Arts, Entertainment, and Recreation sector and Accommodation and Food Services sector. These characteristics, combined with qualitative knowledge of the neighborhoods in question, begin to suggest that the Yelp data does an especially good job of documenting dense, hip areas, frequented by professionals, tourists, and the workers who serve them. MAG’s systematically collected data, on the other hand, is underrepresented in these areas, but is perhaps more complete in other parts of the metro area. 
Our comparison of Yelp and MAG data highlights the strengths and weaknesses of each. The two datasets had approximately the same number of restaurants, but the Yelp data is far more detailed and comprehensive in certain areas of Phoenix. It also may be better at documenting informal or family-run businesses, and it certainly is updated more frequently than MAG. However, Yelp does not represent the entire metro area equally, so analyses of Yelp data alone would likely overstate restaurant concentration in downtowns, and understate it in suburbs. MAG data may be less exhaustive in certain places, but due to its systematic construction it is likely to be more consistent across the entire region. Both datasets have a great deal of information to contribute. When combined, administrative and user generated databases seem to provide a more holistic and comprehensive picture of the world than either would do by itself. Effective planning might use the more evenly spread MAG data for metro-level research, supplementing it with Yelp data for detailed analysis of the neighborhoods well served by Yelp. The different strengths of the two datasets are just the most recent iteration of an established problem. Datasets on businesses have almost always been generated as the byproduct of some other commercial process. The nature of that process influences the form of the final dataset. By combining features of multiple datasets generated through different processes it may be possible to address the weaknesses of each.
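
The difference-of-K comparison described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: it uses a naive Ripley's K estimate without edge correction and without the simulation step needed to judge significance. The function names and the two toy point patterns are hypothetical.

```python
import math


def ripley_k(points, distances, area):
    """Naive Ripley's K estimate (no edge correction) at each distance."""
    n = len(points)
    counts = [0] * len(distances)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_ij = math.hypot(points[i][0] - points[j][0],
                              points[i][1] - points[j][1])
            for k, d in enumerate(distances):
                if d_ij <= d:
                    counts[k] += 1
    # Scale ordered pair counts by area / (n * (n - 1)).
    return [area * c / (n * (n - 1)) for c in counts]


def k_diff(points_a, points_b, distances, area):
    """K of pattern A minus K of pattern B; positive means A is more clustered."""
    k_a = ripley_k(points_a, distances, area)
    k_b = ripley_k(points_b, distances, area)
    return [a - b for a, b in zip(k_a, k_b)]


# Toy patterns on the unit square: one tight cluster versus an even grid.
clustered = [(0.5 + 0.01 * i, 0.5 + 0.01 * j) for i in range(4) for j in range(4)]
even_grid = [(0.125 + 0.25 * i, 0.125 + 0.25 * j) for i in range(4) for j in range(4)]

diff = k_diff(clustered, even_grid, distances=[0.1, 0.3], area=1.0)
```

In this toy setup the difference is positive at both distances, which is the qualitative pattern the abstract reports for Yelp relative to MAG. A real analysis would add edge correction and a random-labelling simulation to assess significance.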

Session II

Neighborhood Effects in a Behavioral Randomized Controlled Trial
Pruitt, Sandi L.; Leonard, Tammy (University of Dallas); Murdoch, James; Hughes, Amy; McQueen, Amy; Gupta, Samir

Electronic Medical Records (EMR) data are a (relatively) newly available source of space-time data. Particularly for regionally based large urban safety-net health systems, EMR data can allow for novel insights into the provision of health services for low-income under- and uninsured populations. However, the use and interpretation of EMR data has largely been unexplored and is quite challenging. The data were collected for administrative purposes and are large. Additionally, there is a high level of sensitivity around data security and privacy when using EMR data. Despite these challenges, we demonstrate the utility of EMR data to explore geographic and social influences on the outcomes in a Randomized Controlled Trial (RCT) by examining an RCT designed to increase colorectal cancer (CRC) screening. Cadastral geocoding of EMR address records was used to validate patient addresses and also to append housing data from the county appraisal district to the EMR records. Additionally, street network data were used to calculate travel times to the health clinics where health services were received. We found statistically significant neighborhood effects. Most notably, average CRC test use among neighboring study participants was significantly and positively associated with an individual patient’s CRC test use. This potentially important spatially-varying covariate has not previously been considered in health-behavior RCTs. The implications are both empirical and methodological. Empirically, we find that in the case of the RCT examined, neighborhood effects, while significant, do not modify the intervention effect size estimates. Methodologically, our results contribute to the understanding of neighborhood effects and RCTs. RCTs of interventions intended to modify health behaviors may be influenced by neighborhood effects, which can impede unbiased estimation of intervention effects.
Our results contribute to the growing literature suggesting that RCTs focused on individual behavior should assess potential social interactions between participants, which may cause intervention arm contamination.

The Spatial Pattern of Inequality within Cities and its Relation with the Local Economy
vom Berge, Philipp (Institute for Employment Research); Schanne, Norbert (Institute for Employment Research); Schild, Christopher-Johannes (IAB Institute for Employment Research); Wurdack, Anja (IAB Institute for Employment Research)

This paper investigates the intra-urban spatial structure of labour market inequality. Comparing the spatial pattern of intra-urban inequality across cities has been difficult so far because it requires standardized data collection at a very detailed spatial scale. The Research Data Centre (FDZ) of the Federal Employment Agency in the Institute for Employment Research (IAB) has recently gained access to geocoded register data on the German labour market which cover the entire workforce liable to statutory social security and all working-age social benefit recipients. We use the 2009 wave of these data to construct measures of labour market inequality at the level of regular 500 m x 500 m grid cells, for example the local proportion of low-wage employees. These measures form the basis of our further analysis. We start with a case study of the three largest German cities: Berlin, Hamburg, and Munich. Visualisation of the social inequality measure at the level of grid cells forms a context for interpreting commonly employed metrics of intra-urban social segregation and other spatial structures in inequality. The three cities show distinctly shaped spatial structures in social inequality. Besides this, they differ with regard to local economic development, the progress of structural change, and the policy of providing subsidised residences. In order to generalise this case-study evidence, we extend the analysis to cover all cities in Germany with more than 100,000 inhabitants. We establish quantitative relations between measures of the shape of urban labour market inequality and city size and growth, industry structure, structural change, and social policy. We discuss instrumental variables which potentially allow for interpreting the estimated correlations in a causal fashion.
Preliminary results (for the cities with population size over 500,000 persons) suggest a positive relationship between Duncan’s Segregation Index and the median wage within a city; with regard to other variables, relationships are less clear.
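
For readers unfamiliar with the segregation measure mentioned above, here is a minimal sketch. It assumes that the index in question is the Duncan and Duncan index computed between one group and the rest of the population over grid cells. The function name and the cell counts are hypothetical and not taken from the paper.

```python
def duncan_index(group_counts, rest_counts):
    """Duncan and Duncan index between a group and the remaining population.

    Each list holds one count per areal unit (for example, a grid cell).
    The result ranges from 0 (identical distributions) to 1 (no overlap).
    """
    total_group = sum(group_counts)
    total_rest = sum(rest_counts)
    if total_group == 0 or total_rest == 0:
        raise ValueError("both populations need at least one member")
    return 0.5 * sum(
        abs(g / total_group - r / total_rest)
        for g, r in zip(group_counts, rest_counts)
    )


# Hypothetical counts for four grid cells: low-wage versus other employees.
low_wage = [10, 0, 30, 60]
other = [40, 40, 10, 10]

segregation = duncan_index(low_wage, other)
```

The value can be read as the share of the group that would have to move to another cell for the two distributions to match.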

Spatial and Social Frictions in the City: Evidence From Yelp
Davis, Donald R (Columbia University); Dingel, Jonathan I (University of Chicago); Monras, Joan (Sciences Po); Morales, Eduardo (Princeton University)

A city means much more to its residents than just home and work. We eat out. We shop. We seek entertainment. We take advantage of the thousands of opportunities the city provides. While these choices are fundamental to how we use the city, they are also hard to observe. Surveys reveal what we say we do. Time diaries record what we do for how long, but not where we go. Even new GPS studies show where we go, but not why we go there or what other opportunities were relevant. In order to address these issues, we need information about individuals' residences and workplaces, places they go, and the alternatives available but not chosen. We need to pay attention not only to spatial frictions in the city but also to social frictions. And we need to be mindful that the spatial and social frictions themselves may vary with the characteristics of the residents. We construct the first data set that has all the features required to examine this problem. The starting point is data from the online user-generated review site, Yelp.com. In 2011, we downloaded all reviews written by about 50,000 Yelp users who had reviewed a venue in New York City. We randomly selected 25 percent of these users, jointly accounting for about 645,000 reviews, for closer study. We used a combination of keyword searches and close examination of review texts to identify approximate home and work locations for a subset of these users. We combine these locations with data on income levels and residential racial/ethnic composition from the 2000 Census of Population. To measure racial/ethnic demographic distances between two census tracts, we calculate the Euclidean distance between the two tracts' population shares for four racial and ethnic groups. To measure segregation, we calculate the Echenique and Fryer (2007) spectral segregation index for the modal race/ethnicity in each census tract. 
This particular index has the property that a census tract is more segregated if it is surrounded by more segregated tracts. We estimate travel times between home, work, and venues as the public transit travel time between the centroids of New York City census tracts from Google Maps. We restrict our estimation sample to users with home and work locations in Manhattan in order to mitigate the issue of transport-mode choice, since the large majority of Manhattan residents use public transit. To describe crime rates, we use geographically precise NYPD robbery statistics that we aggregate to the level of census tracts. We infer users' genders from their profile photos on the Yelp web site. These data allow us to estimate a discrete-choice model of restaurant visits with a rich set of user and venue characteristics. They also allow us to examine the separate spatial frictions owing to distance of venues from home and work. We can relate characteristics of users to their willingness to enter areas of the city with varying crime rates, incomes, and racial composition. This allows us to understand how these social factors may act as barriers to movement and commerce within the city. Using our estimated demand system, we construct counterfactuals in which we examine how proposed transport infrastructure additions or reversion to higher crime rates affect the degree of integration (in many ways) of the city. Our preliminary results suggest influential roles for travel times, demographic differences, crime rates, and user characteristics. We find that the “demographic distance” between two locations is as important as the travel time between them. Women in particular are significantly less likely to visit venues in neighborhoods with high crime or racial and ethnic demographics different from those of their own neighborhood.
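
The demographic distance described in this abstract, the Euclidean distance between two tracts' population shares for four groups, can be sketched in a few lines. The function name and the share vectors below are hypothetical illustrations, not the paper's data.

```python
import math


def demographic_distance(shares_a, shares_b):
    """Euclidean distance between two tracts' population-share vectors."""
    if len(shares_a) != len(shares_b):
        raise ValueError("share vectors must cover the same groups")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(shares_a, shares_b)))


# Hypothetical shares for four racial and ethnic groups in two census tracts.
tract_a = [0.5, 0.2, 0.2, 0.1]
tract_b = [0.1, 0.6, 0.2, 0.1]

distance = demographic_distance(tract_a, tract_b)
```

Two tracts with identical composition have distance 0, and the distance grows as the group shares diverge.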

“The magic’s in the recipe” - Urban Diversity and Popular Amenities
Arribas-Bel, Dani (University of Birmingham); Bakens, Jessie (VU University, Amsterdam)

This paper uses a novel source of (big) data to analyze the main factors behind the popularity of urban amenities in the Netherlands. In particular, we collect data from the location-based service Foursquare and employ it to obtain a rich catalogue of restaurant locations, as well as a database of other urban amenities. This, combined with traditional sources of socio-economic data, allows us to estimate regressions at the area and venue levels, uncovering the main determinants of the popularity of specific restaurants as well as of entire areas or neighborhoods of a city. In doing so, we contribute to the existing literature along three main dimensions: we provide insight and new knowledge about urban systems, in particular about the under-studied aspect of urban amenities; we demonstrate the use of a novel source of data available to urban researchers as a byproduct (Arribas-Bel 2014) to improve the understanding of phenomena of interest not only to researchers but also to practitioners such as urban planners and business owners; and we quantify, document and characterize some of the biases inherent to these new sources of data in the context of urban applications. From an economic point of view, cities have become not only agglomerations of production, but also important consumption arenas (Glaeser et al. 2001). Although this dual role is widely recognized by the literature, very little research has been devoted to analyzing and identifying the mechanisms that lead to attractive consumer cities. In other words, we know much more about the ingredients (i.e. cultural amenities, the presence of green open space, population composition) than about the recipe. How these elements are internally combined to create a successful “consumer city” remains largely uninvestigated.
There are at least two possible reasons why this is the case: first, previous studies usually consider cities in the aggregate and have thus focused only on the elements, the ingredients, failing to recognize the spatial arrangement within each urban area; second, but very much related, it has been traditionally difficult to obtain spatially detailed data on revealed preferences for urban amenities. During the last few decades, the world has witnessed an explosion in computing power that has put a powerful computer in the pocket of even non-experienced users. In parallel, location technology such as the global positioning system (GPS) has also undergone dramatic improvements and sharp drops in cost, enabling it to reach the consumer mass. The combination of these two trends is producing a vast amount of geo-referenced data, presenting many opportunities for research in the urban realms. A prime example of this is the phenomenon known as location-based services (LBSs), of which Foursquare is one of the main industry players. These are online applications that allow users to broadcast their location in real-time in what has come to be known as a checkin. The accumulation of this form of metadata is producing databases that effectively store a digital representation of some aspects of the world, as well as many traces of human behaviour. We believe this can help fill the need for quantifiable measures of revealed preferences about urban amenities.

Session III

From 'Big Noise' to 'Big Data': a case study of cross-validation between 3 large geographical datasets on visitor flows between regional urban centres
Lovelace, Robin (University of Leeds); Malleson, Nicolas (University of Leeds); Birkin, Mark (University of Leeds); Cross, Philip (University of Leeds)

Much has been written about 'Big Data': definitions, characteristics, the methodological challenges it poses (Boyd and Crawford, 2012). There has also been speculation about how it may or may not revolutionise Regional Science and related fields (Arribas-Bel, 2014). Amongst the excitement, there has been little time to pause for thought and reflect about the kinds of application where Big Data is most suited. Indeed, Big Data also has its critics (e.g. Taleb 2012) and their arguments should be heeded to avoid the field being tainted, for example, with the type of controversy that has engulfed national spying agencies since the Snowden leaks, or similar concerns of the ‘Big Brother’ variety which have set the patient records agenda back in the UK (Ganesh, 2014). What is needed in this context, we argue, is not wide-eyed speculation about an ambiguous concept of 'Big Data', but an honest appraisal of the applications for which different kinds of emerging data sources may be most and least useful. Specifically, with the growing volume of data available, there has been a tendency to uncritically proceed with the analysis, resulting in beautiful visualisations and new insights. Yet in many cases careful evaluation of the quality of Big Data sources is lacking. It is the aim of this paper to discuss how quality can be evaluated in the realm of big data sources, by cross-validation. The theoretical underpinning of this paper goes back to the definition of Big Data as information that is high in volume, velocity and variety (Laney, 2001). Although this definition is frequently mentioned in talks on the subject, rarely are the criteria which constitute whether a dataset is 'Big' or not explored in detail. Furthermore, the consequences of each attribute for the types of application for which Big datasets are suited is rarely discussed. 
We thus start from the premise that each of the aforementioned attributes of Big Data can provide advantages and disadvantages to the researcher, in equal measure. One Big dataset may be completely different from the next in terms of its relative merits. We thus use three unrelated datasets for the empirical part of this study:
- Geotagged Twitter data: The Twitter data were collected with the Twitter Streaming Application Programming Interface (API), which provides 'live' access to public messages posted to Twitter. Data were collected during 445 days between 2011-06-22 and 2012-09-09 in West Yorkshire.
- Mobile phone mast location data: A mobile telephone service provider provided aggregated data on home location as well as frequency and number of trips between major urban and retail centres across Yorkshire.
- Individual geolocated survey data on shopping habits: This dataset was provided by the consultancy Acxiom, which conducts surveys across the UK, collecting ~1 million records yearly. The data have a fine spatial resolution (full postcode) and many attributes of interest for market research.
The method was to test each dataset as an input into a spatial interaction model of movement between urban centres in Yorkshire, UK. In their raw form, it was found that each dataset is of little value to the majority of researchers, hence the term 'Big Noise'. It is only through a process of cleaning (to ensure consistency), filtering (to remove extraneous information) and aggregation that the raw datasets are transformed into a state that allows direct comparison between them and with the results of a spatial interaction model. We conclude by advocating a greater emphasis on these techniques of 'data tidying' in Big Data research, as this seems to be a major bottleneck in the field and an area where researchers can add most value to noisy information.
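
The abstract does not specify the form of its spatial interaction model. As a rough illustration of the general idea, the sketch below implements a textbook production-constrained model with exponential distance decay. The function name, the decay parameter `beta`, and all numbers are assumptions made for this example.

```python
import math


def production_constrained_flows(origins, attractiveness, distance, beta):
    """Allocate each origin's trips across destinations.

    Flow i -> j is origins[i] * w_j * exp(-beta * d_ij), divided by the sum
    of the same term over all destinations, so every row sums to origins[i].
    """
    flows = []
    for i, trips in enumerate(origins):
        weights = [
            w_j * math.exp(-beta * distance[i][j])
            for j, w_j in enumerate(attractiveness)
        ]
        total = sum(weights)
        flows.append([trips * w / total for w in weights])
    return flows


# Hypothetical example: two origin zones and two equally attractive centres.
origins = [1000.0, 500.0]          # trips generated by each origin zone
attractiveness = [1.0, 1.0]        # relative pull of each urban centre
distance = [[5.0, 20.0],           # distance[i][j] from origin i to centre j
            [20.0, 5.0]]

flows = production_constrained_flows(origins, attractiveness, distance, beta=0.1)
```

Modelled flows of this kind could then be set against the cleaned and aggregated flows from each dataset, which is the sort of comparison the abstract describes.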

Digital Neighborhoods
Anselin, Luc (Arizona State University); Williams, Sarah (Massachusetts Institute of Technology)

This paper investigates the spatial footprint of “digital neighborhoods,” i.e., a concept of neighborhood derived from the content of geo-located and time-stamped social media messages, which greatly extend the usual range of local data available to urban and regional scientists. The messages pertain to different types of contents and activities that tend to cluster in space and over time. We are interested in using different spatial clustering techniques to detect significant groupings and how these can be explained by underlying socio-economic characteristics. In addition to the spatial dimension, we examine the space-time distribution of messages during the day and over the course of a week to assess the extent to which the digital neighborhoods are dynamic across time and over space, and how this varies by type of message. We base our analysis on two sources of social media data for a period in early 2014 in New York City. One is a sample of over 5 million Twitter messages collected through February and March, of which close to 600,000 have geographic coordinates that correspond to over 450,000 venues. The second is a comprehensive set of Foursquare check-ins for the first week of February, which similarly contains close to 600,000 observations, but for a much smaller set of venues (65,000). In addition to the locations of the Twitter messages and the Foursquare check-ins, we consider more than 300,000 business locations from the comprehensive ESRI business data base. In our analysis, we take two different perspectives. In one, we take the geography of N.Y.C. block groups as the point of departure (n = 6454) and investigate the spatial and space-time density of messages within this framework. 
Using a variety of clustering methods (including measures of local spatial autocorrelation), we identify block groups that form “digital hot spots” and “digital deserts.” The former show much more digital activity than would be expected, given their population share or share of the business locations. The hot spots are dominated by Manhattan, but also include new up-and-coming areas, such as Long Island City in Queens, and Williamsburg and Smith and Court Streets in Brooklyn. Digital deserts are the opposite: block groups that are severely under-represented in the digital world. We relate these patterns to socio-economic characteristics of the block groups in a series of spatial regressions. In the second perspective, we take the location of the venues as the point of departure and address clustering by means of association matrices, i.e., a type of distance measure based on the similarity of check-ins among individuals, by type of venue. This replicates the approach taken by the “Livehoods” project (Cranshaw et al., 2012), but we also focus on the sensitivity of the obtained clusters to the parameters chosen in the clustering process. In addition, we investigate the dynamics of these clusters over the course of the day and the day of the week. These digital neighborhoods help to highlight the underlying economic dynamics of the matching geographic neighborhoods and tend to have a higher diversity of businesses.
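The idea of activity that exceeds what an area's population share would predict can be sketched with a simple share ratio. This is a deliberately simpler stand-in for the local spatial autocorrelation statistics named in the abstract, not a reproduction of them. The area labels, counts and the two thresholds are hypothetical.

```python
def activity_ratio(messages, population):
    """Ratio of each area's share of messages to its share of population."""
    total_messages = sum(messages.values())
    total_population = sum(population.values())
    return {
        area: (messages[area] / total_messages)
        / (population[area] / total_population)
        for area in messages
    }

def classify(ratios, hot=2.0, cold=0.5):
    """Label areas using hypothetical cut-offs on the share ratio."""
    labels = {}
    for area, ratio in ratios.items():
        if ratio >= hot:
            labels[area] = "hot spot"
        elif ratio <= cold:
            labels[area] = "desert"
        else:
            labels[area] = "neither"
    return labels

# Hypothetical block groups with message counts and residents.
messages = {"A": 600, "B": 300, "C": 100}
population = {"A": 1000, "B": 1000, "C": 2000}

ratios = activity_ratio(messages, population)
labels = classify(ratios)
print(labels)
```

A ratio of 1 means an area's digital activity matches its population share. Unlike a local autocorrelation measure, this ratio ignores whether neighbouring areas are also high or low.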

Sensitivity of Location-Sharing Services Data: Evidence from American Travel Pattern
Chen, Zhenhua (George Mason University); Schintler, Laurie

Location-sharing services (LSSs) enable individuals to “check in” to locations via GPS-equipped devices and to share this information with friends in real time. These services, and other related applications, are generating a huge amount of passively collected data on social and spatio-temporal behavior. Unlike traditional sources of data, the information produced by LSS users has broad geographic coverage; it is also rich in spatial and temporal detail. In fact, a number of studies have already exploited this type of data to understand different aspects of human and societal behavior, including patterns of travel behavior. However, one concern about location-sharing services data, as with other sources of Big Data, is that it is potentially biased. Users of such services tend to correspond to a particular demographic, i.e., low- to medium-income males between the ages of 19 and 29. Moreover, users can vary in the frequency with which they report their locations. For example, some individuals may only “check in” when travelling long distances, whereas others may do so on a more regular basis. There may also be a bias in terms of the types of locations, or activities, that users report to their friends. To complicate matters, there may be differences in the demographics of users and their behavior across different location-sharing services. These differences could relate, for example, to the relative popularity of the services or their stages of deployment. Without an understanding of these issues and sensitivities, any social or spatial behavior inferred from this type of data may end up being ad hoc, inaccurate, or ambiguous. Thus, it is critical to understand who and what is being represented by the data, and how these characteristics differ across different services. In this study, we begin to explore these issues. Specifically, the purpose of our study is three-fold: (1) to assess how well LSS data capture daily travel behavior patterns; (2) to examine how sensitive the estimates are across different location-sharing services; and (3) to develop a methodology for processing location-sharing services data to derive information on average daily travel behavior. For the purpose of the study, we focus on two aspects of daily travel behavior: person miles of travel (PMT) and daily person trips (DPT). The location-sharing services we examine are Brightkite, Gowalla and Foursquare. We use the National Household Travel Survey (NHTS) estimates of PMT and DPT as benchmarks for the study. The analysis is conducted at the national level (contiguous US) and for the 51 most populated metropolitan areas in the contiguous US. The study has five major findings. First, estimates of travel behavior from LSS data are more accurate for populated than for less-populated areas. Second, some variations in daily travel behavior are found in LSS data, although there are some consistencies, especially between Gowalla and Foursquare. Third, Brightkite is the least accurate in terms of representing daily travel behavior. Fourth, LSS data provide a better estimate of daily person miles of travel than of average daily person trips. Lastly, discrepancies between the travel behavior inferred from LSSs and that from the NHTS seem to correspond to the particular demographics and travel characteristics of metropolitan areas. Through the sensitivity analysis of three LSS datasets against the classical NHTS data, our results indicate that the accuracy of PMT and DPT estimates from LSS data is highly dependent on the number of check-in records. Since metropolitan areas with high population density tend to have a better representation of daily travel patterns as compared to the NHTS, the findings suggest that LSS data are more accurate and suitable for travel behavior analysis focused on big metropolitan areas.
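One plausible way to derive PMT and DPT from check-in records is sketched below. It is an assumption-laden illustration rather than the authors' methodology: it treats each pair of consecutive check-ins at different places by the same user on the same day as one trip, and measures trip length as great-circle distance. The record layout and function names are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def daily_travel(checkins):
    """checkins: iterable of (user, day, minute_of_day, lat, lon).

    Returns (mean person miles, mean person trips) per user-day.
    """
    by_person_day = defaultdict(list)
    for user, day, minute, lat, lon in checkins:
        by_person_day[(user, day)].append((minute, lat, lon))
    if not by_person_day:
        return 0.0, 0.0
    miles, trips = [], []
    for points in by_person_day.values():
        points.sort()  # chronological order within the day
        legs = [
            haversine_miles(a[1], a[2], b[1], b[2])
            for a, b in zip(points, points[1:])
        ]
        legs = [d for d in legs if d > 0]  # repeat check-in at one place is not a trip
        miles.append(sum(legs))
        trips.append(len(legs))
    n = len(by_person_day)
    return sum(miles) / n, sum(trips) / n

# Hypothetical check-ins: u1 goes out and back, u2 checks in once.
sample = [
    ("u1", "2010-05-01", 480, 0.0, 0.0),
    ("u1", "2010-05-01", 540, 0.0, 1.0),
    ("u1", "2010-05-01", 1020, 0.0, 0.0),
    ("u2", "2010-05-01", 600, 0.0, 0.0),
]
pmt, dpt = daily_travel(sample)
print(round(pmt, 2), dpt)
```

Because users who check in only once on a day contribute zero trips, estimates of this kind depend heavily on check-in frequency, which is consistent with the abstract's point that accuracy depends on the number of check-in records.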

A framework of Mapping Social Connections in Space and Time
Ye, Xinyue (Kent State University); Lai, Chih-Hui (Kent State University)

Emergency events such as natural disasters often precipitate the (re)activation of organized efforts in ways that differ from normal times. Not only individuals but also organizations, both relief and non-relief related, often engage in intensive communication with other individuals and organizations to offer and acquire support of any kind. Since the 2010 Haiti earthquake, Twitter has become an important emergency information and communication backbone where individuals and organizations request and share information for disaster relief within and outside the affected area. This unique system of information and communication allows for the identification of the dynamics of active and latent social connections as well as the temporal shifts of resource allocation geospatially. These geospatial details are either identified by the user in the text or automatically recorded by the system. To advance societal understanding about the transferability of virtual network systems into physical relief actions, this project aims to achieve four goals. First, it will examine the patterns of the global network of interorganizational communication on Twitter in two disaster contexts: 2012 New York/New Jersey Superstorm Sandy and 2013 Typhoon Haiyan in the Philippines. These events are chosen because of their widespread impacts as well as their geographical variations, which allow for the observation of similar and divergent patterns of disaster relief. Analytically, this longitudinal analysis is meant to generate a geospatial representation of the temporal change of the global virtual interorganizational network for each of these disasters. Findings will help illuminate the geo-economic disparities of organizational resource mobilization across disasters. Second, the analysis will identify the factors that differentiate the clusters of actors involved in different types of disaster relief around the world. As a result, volunteer coordination can be made more effective by collaborating with the relevant organizations within each cluster. Third, findings will locate the latent network of organizational collaboration for emergency response on a global scale. Predictions can be made about the timing and the geography of such a network being activated before, during, and after a disaster. These results will unveil the conditions and opportunities under which online links can translate into the provision of physical resources. Fourth, this research will reveal the broader patterns of the global emergency and humanitarian aid network. In addition to disaster relief, most non-governmental and volunteer organizations are dedicated to multiple types of humanitarian aid. Using these disasters as the starting point will help identify the mechanisms by which the virtual interorganizational network parallels or enhances interorganizational collaboration for humanitarian efforts. Methodologically, the use of Twitter data supplements conventional survey and interview techniques and allows for a more systematic way of obtaining data on different types of organizations (intergovernmental organizations and international, national, and local non-governmental organizations). It also enables longitudinal observation of the dynamic interaction among organizations of different types. In sum, this project presents significant scholarly, practical, and policy implications for disaster relief and humanitarian aid.

Session IV

New Data, New Applications: A Method for Transportation System Performance Monitoring
Giuliano, Genevieve (University of Southern California); Rhoads, Mohja (University of Southern California); Chakrabarti, Sandip (University of Southern California)

This paper is motivated by the availability of a new data source. We have developed a data archive from the real-time data feed used for transportation system monitoring in the Los Angeles region. This system, Regional Integration of Intelligent Transportation Systems (RIITS), includes freeway, arterial and public transit data produced by several state and local agencies. The availability of detailed, historical data across modes and facilities has obvious applications for transportation system modeling and simulation, but also provides opportunities for developing new analytical tools for transportation planning and management. This paper presents a method for monitoring the regional transportation system. Performance monitoring is an essential part of transportation planning and system management, yet historically the cost and complexity of gathering sufficient data and conducting performance analyses have limited regular monitoring. Our data archive includes geo-spatial freeway, arterial and transit operations data. The freeway and arterial data come from over 6,000 sensors across Los Angeles County. The transit information comes from a combination of GPS devices and passenger counts from all Los Angeles Metro transit bus and rail routes. The data are generated at intervals as short as 30 seconds, and all data are located by x-y coordinate. These data allow us to sample across time, space and modes at almost any time-space interval. In this paper we present our method for monitoring the highway system. The transportation network is diverse. Within the highway system, highways range from 2-lane rural roads to 12-lane urban freeways. We therefore use cluster analysis based on functional attributes to group segments of the highway system. Our cluster analysis yields three groups for the highway system. Operational data are not uniformly available: some parts of the system are more instrumented than others, and not all sensors report valid data. We therefore develop a weighting scheme to generate representative performance measures for each cluster group. We illustrate our method using 30 days of data from the highway system. Our results for highways, using average speed, volume, and variance as performance measures, show that performance varies significantly across clusters, time periods and days of the week, but that different weighting schemes do not significantly affect results.
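The abstract does not specify the weighting scheme, so the sketch below assumes one simple possibility: weighting each sensor's speed by the length of the segment it covers and skipping sensors that report no valid value. The cluster labels, the tuple layout and the length-based weights are all hypothetical.

```python
from collections import defaultdict

def cluster_speed(readings):
    """Length-weighted mean speed per cluster.

    readings: iterable of (cluster, segment_miles, speed_mph or None).
    Sensors with missing or non-positive speeds are treated as invalid.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for cluster, miles, speed in readings:
        if speed is None or speed <= 0:
            continue  # invalid sensor report: leave it out of the average
        weighted_sum[cluster] += miles * speed
        weight_total[cluster] += miles
    return {c: weighted_sum[c] / weight_total[c] for c in weight_total}

# Hypothetical sensor readings for two cluster groups.
readings = [
    ("urban freeway", 1.0, 30.0),
    ("urban freeway", 3.0, 50.0),
    ("urban freeway", 2.0, None),   # sensor not reporting
    ("rural road", 5.0, 60.0),
]
speeds = cluster_speed(readings)
print(speeds)
```

Swapping the length weights for another choice (for example, traffic volume) only changes the `miles` term, which makes it easy to check how sensitive the cluster-level measures are to the scheme.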

Mobile phone data and motorway traffic: can the former predict the latter?
Tranos, Emmanouil (University of Birmingham)

This paper aims to test the relationship between mobile phone usage and motorway traffic. Can we use data from mobile phone providers as a detector of motorway traffic? Such a modelling exercise can provide a useful tool for transport engineers, as it will enable the (near) real-time estimation of car traffic in specific segments of motorways using data from mobile phone operators, avoiding the use of other more expensive and less efficient surveying techniques. The case study for our research is the city of Amsterdam. The data utilised for this paper have been supplied by a major telecom operator and provide aggregated information about mobile phone usage at the level of the GSM cell for the year 2010. The temporal dimension provides information on an hourly basis, creating a very detailed pool of data. Such a rich dataset appears to be a ‘luxury’ for spatial analysts, but at the same time it increases the complexity of the analytical approach. In addition, extensive datasets for motorway traffic from detection loops, as well as weather data, are used for this paper. The richness of the mobile phone dataset will be utilised in two ways. In the first step, the effect of car traffic on mobile phone usage will be tested. The result of this exercise will provide the basis of the analysis, as it will establish the relation between mobile phone usage and motorway traffic. In the second step, the mobile phone dataset will be utilised in a more sophisticated way. Instead of using only data regarding mobile phone usage (e.g. new phone calls or erlangs), handovers will also be considered. The latter contain information regarding the transfer of calls from one GSM antenna to another. This usually happens when the mobile phone user crosses the boundaries of a GSM cell and therefore reflects movements in space. The hypothesis tested in the second step is that a low rate of handovers in relative terms can be related to bottlenecks and traffic jams. Simply put, handovers during a traffic jam are, in relative terms, fewer than when roads are open. This second step provides the main contribution of the paper, as it introduces a rather simple methodology to capture traffic jams in (near) real time.
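The relative handover rate in the second step can be sketched as handovers per call for each cell-hour, compared against that cell's typical rate. This is a hypothetical illustration, not the paper's model: the choice of calls as the denominator, the median baseline and the 0.5 cut-off are all assumptions.

```python
from statistics import median

def handover_rates(hours):
    """hours: list of (hour, calls, handovers) for one GSM cell.

    Returns handovers per call for every hour with at least one call.
    """
    return {hour: handovers / calls for hour, calls, handovers in hours if calls > 0}

def flag_congestion(hours, fraction=0.5):
    """Hours whose handover rate is below `fraction` of the cell's median rate."""
    rates = handover_rates(hours)
    if not rates:
        return []
    baseline = median(rates.values())
    return sorted(hour for hour, rate in rates.items() if rate < fraction * baseline)

# Hypothetical hourly counts for a cell covering a motorway segment.
cell_hours = [
    (7, 100, 80),
    (8, 200, 40),    # many calls, few handovers: phones are barely moving
    (9, 120, 90),
    (10, 0, 0),      # no calls: rate undefined, skipped
    (12, 100, 70),
    (17, 220, 150),
]
suspect_hours = flag_congestion(cell_hours)
print(suspect_hours)
```

Normalising by calls matters because raw handover counts rise with usage; a jam tends to show up as more calls but proportionally fewer cell crossings.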

Freight Deliveries Directly Generated by Residential Units: An Analysis with the 2009 NHTS Data
Zhou, Yiwei (Rensselaer Polytechnic Institute); Wang, Xiaokun (Rensselaer Polytechnic Institute)

As a result of the rapid growth of online shopping, more goods and services are delivered directly to residential units. Door-to-door deliveries improve residents’ accessibility to the retailing sector and at the same time create truck delivery trips. However, partly due to data limitations, most existing freight research focuses on freight trips generated by multiple industrial sectors, and little is known about freight trips generated by residential units. As more and more urban areas push for dense and mixed development, it is necessary to understand the pattern of truck freight trips directly generated by residential units. In this paper, a dataset from the NHTS is used to investigate the freight trips generated by residential units. The 2009 NHTS provides accurate, comprehensive and timely information on trips, land use, household characteristics and socioeconomic factors. This is the first time NHTS data have been used to estimate freight trips. A statistical model is established to explain freight trips generated by residential units and to identify influential factors. The model is also expected to predict the freight trips generated when applied to real residential units. A right-censored negative binomial model is used to identify the impacts of influential factors such as housing density, type of house and home ownership. An application is made to simulate the number of freight deliveries generated by residential units in New York City. Results are compared with derived real business freight trip data. To further validate the simulation results, the same model is applied to different education groups. The simulated freight trips generated by residential units are compared to those using the full dataset. A closer examination at the state level further discloses the spatial variation in their relationship. Such a study will supplement city logistics studies, which traditionally focus on business behaviors, and help reconstruct the complete picture of freight activities in urban areas.
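For readers unfamiliar with right-censored count models, the sketch below shows the log-likelihood of a negative binomial model in which some counts are only known to be "c or more". It uses the common mean-dispersion (NB2) parameterisation and a single mean `mu` for all observations; in a regression setting `mu` would depend on covariates such as housing density. The parameter values and example data are hypothetical and are not taken from the paper.

```python
import math

def nb_logpmf(y, mu, alpha):
    """Log P(Y = y) for a negative binomial with mean mu and dispersion alpha."""
    r = 1.0 / alpha
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )

def censored_loglik(observations, mu, alpha):
    """Log-likelihood with right censoring.

    observations: list of (count, censored). If censored is True, the true
    count is only known to be `count` or more, so the contribution is
    log P(Y >= count) instead of log P(Y = count).
    """
    total = 0.0
    for y, censored in observations:
        if censored:
            below = sum(math.exp(nb_logpmf(k, mu, alpha)) for k in range(y))
            total += math.log(1.0 - below)
        else:
            total += nb_logpmf(y, mu, alpha)
    return total

# Hypothetical deliveries per household; the last answer was top-coded at 5.
data = [(0, False), (2, False), (1, False), (5, True)]
print(censored_loglik(data, mu=2.0, alpha=0.5))
```

Censoring of this kind arises when a survey caps the recorded count at a top category; treating the capped values as exact would bias the fitted mean downward.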

Baltimore’s Post-recession Socioeconomic Environment and Local Job Access for Work-eligible Temporary Assistance for Needy Families (TANF) Recipients: a Locational Approach for Welfare-to-work Examination
Zhang, Ting (University of Baltimore)

This study examines the impact of Baltimore's local community socioeconomic environment and local job access on work-eligible Temporary Assistance for Needy Families (TANF) recipients’ welfare-to-work transfer propensity during the post-recession period between July 2009 and December 2012. The data we use include linked multi-agency, micro-level longitudinal administrative record extracts from the Maryland state government and census block level public data from the US Census Bureau, American Community Survey, US Bureau of Labor Statistics, and Baltimore City Police Department. We adopt hierarchical mixed-effect logistic regression, descriptive statistics, and spatial econometrics to estimate the impacts. ArcGIS will also be used to generate density maps, identify spatial hotspots and compute job access. The local community socioeconomic environment and the availability of local jobs (defined by the location weighted job-hotspot-to-home distance) are critical to work-eligible TANF recipients’ employment outcomes. This evidence-based study will inform Baltimore City government and local agencies, as well as Maryland State government agencies, of further strategies to redesign the TANF-related social safety net services and service delivery in the Central Baltimore area. The findings will not only identify the importance of home location and community environment to employment outcomes and generate implications for welfare, planning and transportation policies, but also identify disparities in education, health, and family responsibility across industries. The study will conclude with policy implications and directions for future research. Differences in local labor market opportunities and local socioeconomic community environment are critical to work-eligible welfare recipients. The February 2008 Reauthorization of the TANF Program Final Rule defined personal responsibility and serious effort to work expectations for work-eligible welfare recipients. Access to local labor market opportunities and the local socioeconomic community environment play important roles in work-eligible TANF recipients’ job finding propensity. Previous literature has indicated the importance of the local community socioeconomic environment for employment outcomes. Previous literature has also indicated that long distances and commuting times to the local labor market often affect individuals’ employment outcomes for various reasons. Our study therefore hypothesizes that the local community socioeconomic environment and the distance between home and potential job opportunities matter to TANF recipients’ welfare-to-work transfer propensity. Local community demographic composition, income and poverty level, local transit conditions, and crime level are important factors affecting TANF recipients’ job opportunities. The longer the distance between home and job opportunities, the lower the odds of finding a job, and this distance impact varies by industry. Considering the demographics of the observed TANF recipients, we also hypothesize that child responsibility, lower educational attainment, and poorer health are associated with lower odds of finding a job.
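The abstract defines job availability through a location weighted job-hotspot-to-home distance but does not give the formula. One possible reading, offered here purely as an assumption, is a mean home-to-hotspot distance weighted by the number of jobs at each hotspot. The planar coordinates, job counts and function name below are hypothetical.

```python
import math

def weighted_job_distance(home, hotspots):
    """Job-weighted mean distance from a home to a set of job hotspots.

    home: (x, y) in any planar unit, e.g. projected miles.
    hotspots: list of (x, y, jobs).
    """
    total_jobs = sum(jobs for _, _, jobs in hotspots)
    if total_jobs == 0:
        raise ValueError("no jobs in any hotspot")
    weighted = sum(jobs * math.dist(home, (x, y)) for x, y, jobs in hotspots)
    return weighted / total_jobs

# Hypothetical home location and two job hotspots.
home = (0.0, 0.0)
hotspots = [(3.0, 4.0, 100), (6.0, 8.0, 300)]
access = weighted_job_distance(home, hotspots)
print(access)
```

Under this reading, a larger value means the bulk of nearby jobs sits farther from home, which matches the hypothesis that longer distances lower the odds of finding a job.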