Spatial Data Processing with PySAL & Pandas

IPYNB

#by convention, we use these shorter two-letter names
import pysal as ps
import pandas as pd
import numpy as np

PySAL has two simple ways to read in data. But, first, you need to get the path from where your notebook is running on your computer to the place the data is. For example, to find where the notebook is running:

!pwd # on windows !cd
/Users/dani/code/gds_scipy16/content/part1

PySAL has a command that it uses to get the paths of its example datasets. Let's work with a commonly-used dataset first.

ps.examples.available()
['10740',
 'arcgis',
 'baltim',
 'book',
 'burkitt',
 'calemp',
 'chicago',
 'columbus',
 'desmith',
 'geodanet',
 'juvenile',
 'Line',
 'mexico',
 'nat',
 'networks',
 'newHaven',
 'Point',
 'Polygon',
 'sacramento2',
 'sids2',
 'snow_maps',
 'south',
 'stl',
 'street_net_pts',
 'taz',
 'us_income',
 'virginia',
 'wmat']
ps.examples.explain('us_income')
{'description': 'Per-capita income for the lower 47 US states 1929-2010',
 'explanation': [' * us48.shp: shapefile ',
  ' * us48.dbf: dbf for shapefile',
  ' * us48.shx: index for shapefile',
  ' * usjoin.csv: attribute data (comma delimited file)'],
 'name': 'us_income'}
csv_path = ps.examples.get_path('usjoin.csv')
f = ps.open(csv_path)
f.header[0:10]
['Name',
 'STATE_FIPS',
 '1929',
 '1930',
 '1931',
 '1932',
 '1933',
 '1934',
 '1935',
 '1936']
y2009 = f.by_col('2009')
y2009[0:10]
[32274, 32077, 31493, 40902, 40093, 52736, 40135, 36565, 33086, 30987]

Working with shapefiles

We can also work with local files outside the built-in examples.

To read in a shapefile, we will need the path to the file.

shp_path = '../data/texas.shp'
print(shp_path)
../data/texas.shp

Then, we open the file using the ps.open command:

f = ps.open(shp_path)

f is what we call a "file handle." That means that it only points to the data and provides ways to work with it. By itself, it does not read the whole dataset into memory. To see basic information about the file, we can use a few different methods.

For instance, the header of the file, which contains most of the metadata about the file:

f.header
{'BBOX Mmax': 0.0,
 'BBOX Mmin': 0.0,
 'BBOX Xmax': -93.50721740722656,
 'BBOX Xmin': -106.6495132446289,
 'BBOX Ymax': 36.49387741088867,
 'BBOX Ymin': 25.845197677612305,
 'BBOX Zmax': 0.0,
 'BBOX Zmin': 0.0,
 'File Code': 9994,
 'File Length': 49902,
 'Shape Type': 5,
 'Unused0': 0,
 'Unused1': 0,
 'Unused2': 0,
 'Unused3': 0,
 'Unused4': 0,
 'Version': 1000}

To actually read in the shapes from memory, you can use the following commands:

f.by_row(14) #gets the 14th shape from the file
<pysal.cg.shapes.Polygon at 0x10d8baa20>
all_polygons = f.read() #reads in all polygons from memory
len(all_polygons)
254

So, all 254 polygons have been read in from file. These are stored in PySAL shape objects, which can be used by PySAL and can be converted to other Python shape objects.

They typically have a few methods. So, since we've read in polygonal data, we can get some properties about the polygons. Let's just have a look at the first polygon:

all_polygons[0:5]
[<pysal.cg.shapes.Polygon at 0x10d8baba8>,
 <pysal.cg.shapes.Polygon at 0x10d8ba908>,
 <pysal.cg.shapes.Polygon at 0x10d8ba860>,
 <pysal.cg.shapes.Polygon at 0x10d8ba8d0>,
 <pysal.cg.shapes.Polygon at 0x10d8baa90>]
all_polygons[0].centroid #the centroid of the first polygon
(-100.27156110567945, 36.27508640938005)
all_polygons[0].area
0.23682222998468205
all_polygons[0].perimeter
1.9582821721538344

While in the Jupyter Notebook, you can examine what properties an object has by using the tab key.

polygon = all_polygons[0]
polygon. #press tab when the cursor is right after the dot
  File "<ipython-input-20-aa03438a2fa8>", line 1
    polygon. #press tab when the cursor is right after the dot
                                                              ^
SyntaxError: invalid syntax

Working with Data Tables

dbf_path = "../data/texas.dbf"
print(dbf_path)
../data/texas.dbf

When you're working with tables of data, like a csv or dbf, you can extract your data in the following way. Let's open the dbf file we got the path for above.

f = ps.open(dbf_path)

Just like with the shapefile, we can examine the header of the dbf file.

f.header
['NAME',
 'STATE_NAME',
 'STATE_FIPS',
 'CNTY_FIPS',
 'FIPS',
 'STFIPS',
 'COFIPS',
 'FIPSNO',
 'SOUTH',
 'HR60',
 'HR70',
 'HR80',
 'HR90',
 'HC60',
 'HC70',
 'HC80',
 'HC90',
 'PO60',
 'PO70',
 'PO80',
 'PO90',
 'RD60',
 'RD70',
 'RD80',
 'RD90',
 'PS60',
 'PS70',
 'PS80',
 'PS90',
 'UE60',
 'UE70',
 'UE80',
 'UE90',
 'DV60',
 'DV70',
 'DV80',
 'DV90',
 'MA60',
 'MA70',
 'MA80',
 'MA90',
 'POL60',
 'POL70',
 'POL80',
 'POL90',
 'DNL60',
 'DNL70',
 'DNL80',
 'DNL90',
 'MFIL59',
 'MFIL69',
 'MFIL79',
 'MFIL89',
 'FP59',
 'FP69',
 'FP79',
 'FP89',
 'BLK60',
 'BLK70',
 'BLK80',
 'BLK90',
 'GI59',
 'GI69',
 'GI79',
 'GI89',
 'FH60',
 'FH70',
 'FH80',
 'FH90']

So, the header is a list containing the names of all of the fields we can read. If we just wanted to grab the data of interest, HR90, we can use either by_col or by_col_array, depending on the format we want the resulting data in:

HR90 = f.by_col('HR90')
print(type(HR90).__name__, HR90[0:5])
HR90 = f.by_col_array('HR90')
print(type(HR90).__name__, HR90[0:5])
list [0.0, 0.0, 18.31166453, 0.0, 3.6517674554]
ndarray [[  0.        ]
 [  0.        ]
 [ 18.31166453]
 [  0.        ]
 [  3.65176746]]

As you can see, the by_col function returns a list of data, with no shape. It can only return one column at a time:

HRs = f.by_col('HR90', 'HR80')
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-25-1fef6a3c3a50> in <module>()
----> 1 HRs = f.by_col('HR90', 'HR80')


TypeError: __call__() takes 2 positional arguments but 3 were given

This error message is called a "traceback," as you see in the top right, and it usually provides feedback on why the previous command did not execute correctly. Here, you see that one-too-many arguments was provided to __call__, which tells us we cannot pass as many arguments as we did to by_col.

If you want to read in many columns at once and store them to an array, use by_col_array:

HRs = f.by_col_array('HR90', 'HR80')
HRs[0:10]
array([[  0.        ,   0.        ],
       [  0.        ,  10.50199538],
       [ 18.31166453,   5.10386362],
       [  0.        ,   0.        ],
       [  3.65176746,  10.4297038 ],
       [  0.        ,   0.        ],
       [  0.        ,  18.85369532],
       [  2.59514448,   6.33617194],
       [  0.        ,   0.        ],
       [  5.59753708,   6.0331825 ]])

It is best to use by_col_array on data of a single type. That is, if you read in a lot of columns, some of them numbers and some of them strings, all columns will get converted to the same datatype:

allcolumns = f.by_col_array(['NAME', 'STATE_NAME', 'HR90', 'HR80'])
allcolumns
array([['Lipscomb', 'Texas', '0.0', '0.0'],
       ['Sherman', 'Texas', '0.0', '10.501995379'],
       ['Dallam', 'Texas', '18.31166453', '5.1038636248'],
       ..., 
       ['Hidalgo', 'Texas', '7.3003167816', '8.2383277607'],
       ['Willacy', 'Texas', '5.6481219994', '7.6212251119'],
       ['Cameron', 'Texas', '12.302014455', '11.761321464']], 
      dtype='<U13')

Note that the numerical columns, HR90 & HR80 are now considered strings, since they show up with the single tickmarks around them, like '0.0'.

These methods work similarly for .csv files as well.

Using Pandas with PySAL

A new functionality added to PySAL recently allows you to work with shapefile/dbf pairs using Pandas. This optional extension is only turned on if you have Pandas installed. The extension is the ps.pdio module:

ps.pdio
<module 'pysal.contrib.pdutilities' from '/Users/dani/anaconda/envs/gds-scipy16/lib/python3.5/site-packages/pysal/contrib/pdutilities/__init__.py'>

To use it, you can read in shapefile/dbf pairs using the ps.pdio.read_files command.

shp_path = ps.examples.get_path('NAT.shp')
data_table = ps.pdio.read_files(shp_path)

This reads in the entire database table and adds a column to the end, called geometry, that stores the geometries read in from the shapefile.

Now, you can work with it like a standard pandas dataframe.

data_table.head()
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO SOUTH HR60 ... BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
0 Lake of the Woods Minnesota 27 077 27077 27 77 27077 0 0.000000 ... 0.024534 0.285235 0.372336 0.342104 0.336455 11.279621 5.4 5.663881 9.515860 <pysal.cg.shapes.Polygon object at 0x104125ef0>
1 Ferry Washington 53 019 53019 53 19 53019 0 0.000000 ... 0.317712 0.256158 0.360665 0.361928 0.360640 10.053476 2.6 10.079576 11.397059 <pysal.cg.shapes.Polygon object at 0x10d8ba390>
2 Stevens Washington 53 065 53065 53 65 53065 0 1.863863 ... 0.210030 0.283999 0.394083 0.357566 0.369942 9.258437 5.6 6.812127 10.352015 <pysal.cg.shapes.Polygon object at 0x104142400>
3 Okanogan Washington 53 047 53047 53 47 53047 0 2.612330 ... 0.155922 0.258540 0.371218 0.381240 0.394519 9.039900 8.1 10.084926 12.840340 <pysal.cg.shapes.Polygon object at 0x10dadd5c0>
4 Pend Oreille Washington 53 051 53051 53 51 53051 0 0.000000 ... 0.134605 0.243263 0.365614 0.358706 0.387848 8.243930 4.1 7.557643 10.313002 <pysal.cg.shapes.Polygon object at 0x10dadd630>

5 rows × 70 columns

The read_files function only works on shapefile/dbf pairs. If you need to read in data using CSVs, use pandas directly:

usjoin = pd.read_csv(csv_path)
#usjoin = ps.pdio.read_files(csv_path) #will not work, not a shp/dbf pair
usjoin.head()
Name STATE_FIPS 1929 1930 1931 1932 1933 1934 1935 1936 ... 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
0 Alabama 1 323 267 224 162 166 211 217 251 ... 23471 24467 25161 26065 27665 29097 30634 31988 32819 32274
1 Arizona 4 600 520 429 321 308 362 416 462 ... 25578 26232 26469 27106 28753 30671 32552 33470 33445 32077
2 Arkansas 5 310 228 215 157 157 187 207 247 ... 22257 23532 23929 25074 26465 27512 29041 31070 31800 31493
3 California 6 991 887 749 580 546 603 660 771 ... 32275 32750 32900 33801 35663 37463 40169 41943 42377 40902
4 Colorado 8 634 578 471 354 353 368 444 542 ... 32949 34228 33963 34092 35543 37388 39662 41165 41719 40093

5 rows × 83 columns

The nice thing about working with pandas dataframes is that they have very powerful baked-in support for relational-style queries. By this, I mean that it is very easy to find things like:

The number of counties in each state:

data_table.groupby("STATE_NAME").size()
STATE_NAME
Alabama                  67
Arizona                  14
Arkansas                 75
California               58
Colorado                 63
Connecticut               8
Delaware                  3
District of Columbia      1
Florida                  67
Georgia                 159
Idaho                    44
Illinois                102
Indiana                  92
Iowa                     99
Kansas                  105
Kentucky                120
Louisiana                64
Maine                    16
Maryland                 24
Massachusetts            12
Michigan                 83
Minnesota                87
Mississippi              82
Missouri                115
Montana                  55
Nebraska                 93
Nevada                   17
New Hampshire            10
New Jersey               21
New Mexico               32
New York                 58
North Carolina          100
North Dakota             53
Ohio                     88
Oklahoma                 77
Oregon                   36
Pennsylvania             67
Rhode Island              5
South Carolina           46
South Dakota             66
Tennessee                95
Texas                   254
Utah                     29
Vermont                  14
Virginia                123
Washington               38
West Virginia            55
Wisconsin                70
Wyoming                  23
dtype: int64

Or, to get the rows of the table that are in Arizona, we can use the query function of the dataframe:

data_table.query('STATE_NAME == "Arizona"')
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO SOUTH HR60 ... BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
1707 Navajo Arizona 04 017 04017 4 17 4017 0 5.263989 ... 0.905251 0.366863 0.414135 0.401999 0.445299 13.146998 12.1 13.762783 18.033782 <pysal.cg.shapes.Polygon object at 0x10e280080>
1708 Coconino Arizona 04 005 04005 4 5 4005 0 3.185449 ... 1.469081 0.301222 0.377785 0.381655 0.403188 9.475171 8.5 11.181563 15.267643 <pysal.cg.shapes.Polygon object at 0x10e2800f0>
1722 Mohave Arizona 04 015 04015 4 15 4015 0 0.000000 ... 0.324075 0.279339 0.347150 0.375790 0.374383 11.508554 4.8 7.018268 9.214294 <pysal.cg.shapes.Polygon object at 0x10e280710>
1726 Apache Arizona 04 001 04001 4 1 4001 0 10.951223 ... 0.162361 0.395913 0.450552 0.431013 0.489132 15.014738 14.6 18.727548 22.933635 <pysal.cg.shapes.Polygon object at 0x10e2808d0>
2002 Yavapai Arizona 04 025 04025 4 25 4025 0 3.458771 ... 0.298011 0.289509 0.378195 0.376313 0.384089 9.930032 8.6 7.516372 9.483521 <pysal.cg.shapes.Polygon object at 0x10e394518>
2182 Gila Arizona 04 007 04007 4 7 4007 0 6.473749 ... 0.246171 0.265294 0.337519 0.353848 0.386976 10.470261 8.1 9.934237 11.706102 <pysal.cg.shapes.Polygon object at 0x10e4500b8>
2262 Maricopa Arizona 04 013 04013 4 13 4013 0 6.179259 ... 3.499221 0.277828 0.352374 0.366015 0.372756 10.642382 9.8 11.857260 14.404902 <pysal.cg.shapes.Polygon object at 0x10e49d438>
2311 Greenlee Arizona 04 011 04011 4 11 4011 0 2.896284 ... 0.349650 0.177691 0.257158 0.283518 0.337256 9.806115 6.7 5.295110 10.453284 <pysal.cg.shapes.Polygon object at 0x10e4c1a58>
2326 Graham Arizona 04 009 04009 4 9 4009 0 4.746648 ... 1.890487 0.310256 0.362926 0.383554 0.408379 11.979335 10.1 11.961367 16.129032 <pysal.cg.shapes.Polygon object at 0x10e4ea128>
2353 Pinal Arizona 04 021 04021 4 21 4021 0 13.828390 ... 3.134586 0.304294 0.369974 0.361193 0.400130 10.822965 8.8 10.341699 15.304144 <pysal.cg.shapes.Polygon object at 0x10e4ead68>
2499 Pima Arizona 04 019 04019 4 19 4019 0 5.520841 ... 3.118252 0.268266 0.367218 0.375039 0.392144 11.381626 10.2 12.689768 16.163178 <pysal.cg.shapes.Polygon object at 0x10e575e80>
2514 Cochise Arizona 04 003 04003 4 3 4003 0 4.845049 ... 5.201590 0.261208 0.359500 0.359701 0.399208 10.197573 8.7 9.912732 13.733872 <pysal.cg.shapes.Polygon object at 0x10e59b550>
2615 Santa Cruz Arizona 04 023 04023 4 23 4023 0 9.252406 ... 0.326863 0.327130 0.396807 0.393240 0.413795 19.007213 14.7 15.690913 18.272244 <pysal.cg.shapes.Polygon object at 0x10e60a2e8>
3080 La Paz Arizona 04 012 04012 4 12 4012 0 5.046682 ... 2.628811 0.271556 0.364110 0.372662 0.405743 9.216414 8.0 9.296093 12.379134 <pysal.cg.shapes.Polygon object at 0x10e79bb70>

14 rows × 70 columns

Behind the scenes, this uses a fast vectorized library, numexpr, to essentially do the following.

First, compare each row's STATE_NAME column to 'Arizona' and return True if the row matches:

data_table.STATE_NAME == 'Arizona'
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
3055    False
3056    False
3057    False
3058    False
3059    False
3060    False
3061    False
3062    False
3063    False
3064    False
3065    False
3066    False
3067    False
3068    False
3069    False
3070    False
3071    False
3072    False
3073    False
3074    False
3075    False
3076    False
3077    False
3078    False
3079    False
3080     True
3081    False
3082    False
3083    False
3084    False
Name: STATE_NAME, dtype: bool

Then, use that to filter out rows where the condition is true:

data_table[data_table.STATE_NAME == 'Arizona']
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO SOUTH HR60 ... BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
1707 Navajo Arizona 04 017 04017 4 17 4017 0 5.263989 ... 0.905251 0.366863 0.414135 0.401999 0.445299 13.146998 12.1 13.762783 18.033782 <pysal.cg.shapes.Polygon object at 0x10e280080>
1708 Coconino Arizona 04 005 04005 4 5 4005 0 3.185449 ... 1.469081 0.301222 0.377785 0.381655 0.403188 9.475171 8.5 11.181563 15.267643 <pysal.cg.shapes.Polygon object at 0x10e2800f0>
1722 Mohave Arizona 04 015 04015 4 15 4015 0 0.000000 ... 0.324075 0.279339 0.347150 0.375790 0.374383 11.508554 4.8 7.018268 9.214294 <pysal.cg.shapes.Polygon object at 0x10e280710>
1726 Apache Arizona 04 001 04001 4 1 4001 0 10.951223 ... 0.162361 0.395913 0.450552 0.431013 0.489132 15.014738 14.6 18.727548 22.933635 <pysal.cg.shapes.Polygon object at 0x10e2808d0>
2002 Yavapai Arizona 04 025 04025 4 25 4025 0 3.458771 ... 0.298011 0.289509 0.378195 0.376313 0.384089 9.930032 8.6 7.516372 9.483521 <pysal.cg.shapes.Polygon object at 0x10e394518>
2182 Gila Arizona 04 007 04007 4 7 4007 0 6.473749 ... 0.246171 0.265294 0.337519 0.353848 0.386976 10.470261 8.1 9.934237 11.706102 <pysal.cg.shapes.Polygon object at 0x10e4500b8>
2262 Maricopa Arizona 04 013 04013 4 13 4013 0 6.179259 ... 3.499221 0.277828 0.352374 0.366015 0.372756 10.642382 9.8 11.857260 14.404902 <pysal.cg.shapes.Polygon object at 0x10e49d438>
2311 Greenlee Arizona 04 011 04011 4 11 4011 0 2.896284 ... 0.349650 0.177691 0.257158 0.283518 0.337256 9.806115 6.7 5.295110 10.453284 <pysal.cg.shapes.Polygon object at 0x10e4c1a58>
2326 Graham Arizona 04 009 04009 4 9 4009 0 4.746648 ... 1.890487 0.310256 0.362926 0.383554 0.408379 11.979335 10.1 11.961367 16.129032 <pysal.cg.shapes.Polygon object at 0x10e4ea128>
2353 Pinal Arizona 04 021 04021 4 21 4021 0 13.828390 ... 3.134586 0.304294 0.369974 0.361193 0.400130 10.822965 8.8 10.341699 15.304144 <pysal.cg.shapes.Polygon object at 0x10e4ead68>
2499 Pima Arizona 04 019 04019 4 19 4019 0 5.520841 ... 3.118252 0.268266 0.367218 0.375039 0.392144 11.381626 10.2 12.689768 16.163178 <pysal.cg.shapes.Polygon object at 0x10e575e80>
2514 Cochise Arizona 04 003 04003 4 3 4003 0 4.845049 ... 5.201590 0.261208 0.359500 0.359701 0.399208 10.197573 8.7 9.912732 13.733872 <pysal.cg.shapes.Polygon object at 0x10e59b550>
2615 Santa Cruz Arizona 04 023 04023 4 23 4023 0 9.252406 ... 0.326863 0.327130 0.396807 0.393240 0.413795 19.007213 14.7 15.690913 18.272244 <pysal.cg.shapes.Polygon object at 0x10e60a2e8>
3080 La Paz Arizona 04 012 04012 4 12 4012 0 5.046682 ... 2.628811 0.271556 0.364110 0.372662 0.405743 9.216414 8.0 9.296093 12.379134 <pysal.cg.shapes.Polygon object at 0x10e79bb70>

14 rows × 70 columns

We might need this behind the scenes knowledge when we want to chain together conditions, or when we need to do spatial queries.

This is because spatial queries are somewhat more complex. Let's say, for example, we want all of the counties in the US to the West of -121 longitude. We need a way to express that question. Ideally, we want something like:

SELECT
        *
FROM
        data_table
WHERE
        x_centroid < -121

So, let's refer to an arbitrary polygon in the the dataframe's geometry column as poly. The centroid of a PySAL polygon is stored as an (X,Y) pair, so the longitude is the first element of the pair, poly.centroid[0].

Then, applying this condition to each geometry, we get the same kind of filter we used above to grab only counties in Arizona:

data_table.geometry.apply(lambda x: x.centroid[0] < -121)\
                   .head()
0    False
1    False
2    False
3    False
4    False
Name: geometry, dtype: bool

If we use this as a filter on the table, we can get only the rows that match that condition, just like we did for the STATE_NAME query:

data_table[data_table.geometry.apply(lambda x: x.centroid[0] < -119)].head()
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO SOUTH HR60 ... BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
3 Okanogan Washington 53 047 53047 53 47 53047 0 2.612330 ... 0.155922 0.258540 0.371218 0.381240 0.394519 9.039900 8.1 10.084926 12.840340 <pysal.cg.shapes.Polygon object at 0x10dadd5c0>
27 Whatcom Washington 53 073 53073 53 73 53073 0 1.422131 ... 0.508687 0.247630 0.346935 0.369436 0.358418 9.174415 7.1 9.718054 11.135022 <pysal.cg.shapes.Polygon object at 0x10df05080>
31 Skagit Washington 53 057 53057 53 57 53057 0 2.596560 ... 0.351958 0.239346 0.344830 0.364623 0.362265 8.611518 7.9 10.480031 11.382484 <pysal.cg.shapes.Polygon object at 0x10df05240>
42 Chelan Washington 53 007 53007 53 7 53007 0 4.908698 ... 0.153110 0.246292 0.367681 0.374505 0.383486 8.787907 8.1 9.968454 12.236493 <pysal.cg.shapes.Polygon object at 0x10df05710>
44 Clallam Washington 53 009 53009 53 9 53009 0 3.330891 ... 0.568504 0.240573 0.349320 0.361619 0.366854 8.788882 6.5 9.660900 12.281690 <pysal.cg.shapes.Polygon object at 0x10df057f0>

5 rows × 70 columns

len(data_table[data_table.geometry.apply(lambda x: x.centroid[0] < -119)]) #how many west of -119?
109

Other types of spatial queries

Everybody knows the following statements are true:

  1. If you head directly west from Reno, Nevada, you will shortly enter California.
  2. San Diego is in California.

But what does this tell us about the location of San Diego relative to Reno?

Or for that matter, how many counties in California are to the east of Reno?

geom = data_table.query('(NAME == "Washoe") & (STATE_NAME == "Nevada")').geometry
lon,lat = geom.values[0].centroid
lon
-119.6555030699793
cal_counties = data_table.query('(STATE_NAME=="California")')
cal_counties[cal_counties.geometry.apply(lambda x: x.centroid[0] > lon)]
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO SOUTH HR60 ... BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
1312 Mono California 06 051 06051 6 51 6051 0 15.062509 ... 0.431900 0.229888 0.327520 0.388414 0.366316 6.743421 6.6 8.187135 10.083699 <pysal.cg.shapes.Polygon object at 0x10dfd5e48>
1591 Fresno California 06 019 06019 6 19 6019 0 5.192037 ... 5.007266 0.286651 0.379884 0.394981 0.412947 11.788963 11.8 13.998747 18.523541 <pysal.cg.shapes.Polygon object at 0x10e0ecc50>
1620 Inyo California 06 027 06027 6 27 6027 0 8.558713 ... 0.432143 0.242293 0.334361 0.353784 0.378516 9.676365 8.1 8.480065 12.067279 <pysal.cg.shapes.Polygon object at 0x10e220940>
1765 Tulare California 06 107 06107 6 107 6107 0 3.364944 ... 1.480503 0.303303 0.380229 0.395059 0.410811 10.398757 9.5 11.957928 16.303423 <pysal.cg.shapes.Polygon object at 0x10e2a4a58>
1956 Kern California 06 029 06029 6 29 6029 0 6.393044 ... 5.544117 0.273083 0.364603 0.380617 0.395393 10.225236 10.4 12.037755 16.118827 <pysal.cg.shapes.Polygon object at 0x10e348fd0>
1957 San Bernardino California 06 071 06071 6 71 6071 0 3.243373 ... 8.103188 0.259038 0.350927 0.362389 0.372659 9.857519 10.0 12.999831 15.403925 <pysal.cg.shapes.Polygon object at 0x10e36f080>
2117 Ventura California 06 111 06111 6 111 6111 0 3.180374 ... 2.336118 0.259643 0.325908 0.350430 0.329175 10.304761 8.9 11.047165 12.044930 <pysal.cg.shapes.Polygon object at 0x10e40c390>
2255 Riverside California 06 065 06065 6 65 6065 0 4.898903 ... 5.433210 0.284000 0.376737 0.383226 0.368177 9.769959 9.2 11.145612 12.678340 <pysal.cg.shapes.Polygon object at 0x10e49d128>
2279 Orange California 06 059 06059 6 59 6059 0 2.083555 ... 1.770587 0.255864 0.316731 0.355117 0.330447 8.802545 9.1 12.405423 12.974648 <pysal.cg.shapes.Polygon object at 0x10e49dc18>
2344 San Diego California 06 073 06073 6 73 6073 0 2.387842 ... 6.377301 0.269147 0.352428 0.382815 0.368301 11.425041 11.5 14.384523 15.489702 <pysal.cg.shapes.Polygon object at 0x10e4ea908>
2351 Los Angeles California 06 037 06037 6 37 6037 0 4.564947 ... 11.203381 0.267207 0.353837 0.402031 0.392669 13.071850 13.4 17.767317 18.808224 <pysal.cg.shapes.Polygon object at 0x10e4eac18>
2358 Imperial California 06 025 06025 6 25 6025 0 4.160599 ... 2.398836 0.286568 0.374913 0.397430 0.426348 11.466417 9.7 12.590249 17.243741 <pysal.cg.shapes.Polygon object at 0x10e4eaf98>

12 rows × 70 columns

len(cal_counties)
58

This works on any type of spatial query.

For instance, if we wanted to find all of the counties that are within a threshold distance from an observation's centroid, we can do it in the following way.

But first, we need to handle distance calculations on the earth's surface.

from math import radians, sin, cos, sqrt, asin

def gcd(loc1, loc2, R=3961):
    """Great circle distance via Haversine formula

    Parameters
    ----------

    loc1: tuple (long, lat in decimal degrees)

    loc2: tuple (long, lat in decimal degrees)

    R: Radius of the earth (3961 miles, 6367 km)

    Returns
    -------
    great circle distance between loc1 and loc2 in units of R


    Notes
    ------
    Does not take into account non-spheroidal shape of the Earth



    >>> san_diego = -117.1611, 32.7157
    >>> austin = -97.7431, 30.2672
    >>> gcd(san_diego, austin)
    1155.474644164695


    """
    lon1, lat1 = loc1
    lon2, lat2 = loc2
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))

    return R * c

def gcdm(loc1, loc2):
    return gcd(loc1, loc2)

def gcdk(loc1, loc2):
    return gcd(loc1, loc2, 6367 )
san_diego = -117.1611, 32.7157
austin = -97.7431, 30.2672
gcd(san_diego, austin)
1155.474644164695
gcdk(san_diego, austin)
1857.3357887898544
loc1 = (-117.1611, 0.0)
loc2 = (-118.1611, 0.0)
gcd(loc1, loc2)
69.13249167149539
loc1 = (-117.1611, 45.0)
loc2 = (-118.1611, 45.0)
gcd(loc1, loc2)
48.88374342930467
loc1 = (-117.1611, 89.0)
loc2 = (-118.1611, 89.0)
gcd(loc1, loc2)
1.2065130336642724
lats = range(0, 91)
onedeglon = [ gcd((-117.1611,lat),(-118.1611,lat)) for lat in lats]
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(lats, onedeglon)
plt.ylabel('miles')
plt.xlabel('degree of latitude')
plt.title('Length of a degree of longitude')
<matplotlib.text.Text at 0x114174470>

png

san_diego = -117.1611, 32.7157
austin = -97.7431, 30.2672
gcd(san_diego, austin)
1155.474644164695

Now we can use our distance function to pose distance-related queries on our data table.

# Find all the counties with centroids within 50 miles of Austin
def near_target_point(polygon, target=austin, threshold=50):
    return gcd(polygon.centroid, target) < threshold 

data_table[data_table.geometry.apply(near_target_point)]
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO SOUTH HR60 ... BLK90 GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90 geometry
2698 Burnet Texas 48 053 48053 48 53 48053 1 0.000000 ... 1.186224 0.327508 0.449285 0.385079 0.405890 10.774142 6.5 7.115629 10.568742 <pysal.cg.shapes.Polygon object at 0x10e6567f0>
2716 Williamson Texas 48 491 48491 48 491 48491 1 9.511852 ... 4.916482 0.363603 0.379902 0.341976 0.345201 13.532731 9.0 7.582572 12.032589 <pysal.cg.shapes.Polygon object at 0x10e656fd0>
2742 Travis Texas 48 453 48453 48 453 48453 1 4.242561 ... 10.959791 0.299292 0.372293 0.378953 0.388149 12.976379 10.9 14.459691 17.307113 <pysal.cg.shapes.Polygon object at 0x10e67eba8>
2751 Lee Texas 48 287 48287 48 287 48287 1 7.449622 ... 13.847829 0.376002 0.433132 0.394000 0.394959 12.305699 10.1 8.875542 10.530896 <pysal.cg.shapes.Polygon object at 0x10e67eef0>
2754 Blanco Texas 48 031 48031 48 31 48031 1 0.000000 ... 0.937709 0.369814 0.436449 0.394609 0.394414 9.365854 6.0 8.074074 9.080119 <pysal.cg.shapes.Polygon object at 0x10e6aa160>
2762 Bastrop Texas 48 021 48021 48 21 48021 1 3.938946 ... 11.792071 0.370264 0.419933 0.390927 0.380907 14.747191 12.5 10.559006 12.281387 <pysal.cg.shapes.Polygon object at 0x10e6aa518>
2769 Hays Texas 48 209 48209 48 209 48209 1 5.016555 ... 3.383424 0.385061 0.424973 0.378992 0.372266 13.064187 9.4 10.003691 10.337188 <pysal.cg.shapes.Polygon object at 0x10e6aa908>
2795 Caldwell Texas 48 055 48055 48 55 48055 1 9.677544 ... 10.704001 0.380080 0.412614 0.410802 0.413940 15.229616 10.5 12.894034 17.191502 <pysal.cg.shapes.Polygon object at 0x10e6d04a8>
2798 Comal Texas 48 091 48091 48 91 48091 1 3.359538 ... 0.854684 0.274182 0.359174 0.375810 0.380032 11.315107 7.6 8.693149 9.104427 <pysal.cg.shapes.Polygon object at 0x10e6d05f8>
2808 Guadalupe Texas 48 187 48187 48 187 48187 1 5.743759 ... 5.649500 0.340374 0.388821 0.354559 0.378073 11.758921 9.6 10.418149 12.918448 <pysal.cg.shapes.Polygon object at 0x10e6d0a58>

10 rows × 70 columns

Moving in and out of the dataframe

Most things in PySAL will be explicit about what type their input should be. Most of the time, PySAL functions require either lists or arrays. This is why the file-handler methods are the default IO method in PySAL: the rest of the computational tools are built around their datatypes.

However, it is very easy to get the correct datatype from Pandas using the values and tolist commands.

tolist() will convert its entries to a list. But, it can only be called on individual columns (called Series in pandas documentation).

So, to turn the NAME column into a list:

data_table.NAME.tolist()[0:10]
['Lake of the Woods',
 'Ferry',
 'Stevens',
 'Okanogan',
 'Pend Oreille',
 'Boundary',
 'Lincoln',
 'Flathead',
 'Glacier',
 'Toole']

To extract many columns, you must select the columns you want and call their .values attribute.

If we were interested in grabbing all of the HR variables in the dataframe, we could first select those column names:

HRs = [col for col in data_table.columns if col.startswith('HR')]
HRs
['HR60', 'HR70', 'HR80', 'HR90']

We can use this to focus only on the columns we want:

data_table[HRs].head()
HR60 HR70 HR80 HR90
0 0.000000 0.000000 8.855827 0.000000
1 0.000000 0.000000 17.208742 15.885624
2 1.863863 1.915158 3.450775 6.462453
3 2.612330 1.288643 3.263814 6.996502
4 0.000000 0.000000 7.770008 7.478033

With this, calling .values gives an array containing all of the entries in this subset of the table:

data_table[['HR90', 'HR80']].values
array([[  0.        ,   8.85582713],
       [ 15.88562351,  17.20874204],
       [  6.46245315,   3.4507747 ],
       ..., 
       [  4.36732988,   5.2803488 ],
       [  3.72771194,   3.00003   ],
       [  2.04885495,   1.19474313]])

Using the PySAL pdio tools means that if you're comfortable with working in Pandas, you can continue to do so.

If you're more comfortable using Numpy or raw Python to do your data processing, PySAL's IO tools naturally support this.

Exercises

  1. Find the county with the western most centroid that is within 1000 miles of Austin.
  2. Find the distance between Austin and that centroid.

results matching ""

    No results matching ""