Reading & Manipulating Tabular Data

👨‍🏫 [Sync]

import pandas
import numpy as np
from sqlalchemy import create_engine

To note:

  • matplotlib magic

  • Library import and alias

Data

  • Read in .csv

Assuming you have the file downloaded on the path ../data/:

db = pandas.read_csv("../data/paris_abb.csv.zip")

If you’re online, you can do:

db = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")
  • Connect to SQLite and query

engine = create_engine("sqlite:///../data/paris_abb_mini.db")
qry = """
SELECT * 
FROM db 
LIMIT 5;
"""
dbs = pandas.read_sql(qry, engine)
dbs
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
0 3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0
1 5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0
2 7397 Hôtel-de-Ville Apartment Entire home/apt 4 1.0 2.0 2.0 Real Bed 119.0
3 7964 Opéra Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 130.0
4 9952 Popincourt Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 75.0
qry = """
SELECT id, Price 
FROM db 
WHERE neighbourhood_cleansed = 'Popincourt';
"""
dbs = pandas.read_sql(qry, engine)
dbs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5137 entries, 0 to 5136
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      5137 non-null   int64  
 1   Price   5137 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 80.4 KB
  • Explore

db.head()
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
0 3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0
1 5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0
2 7397 Hôtel-de-Ville Apartment Entire home/apt 4 1.0 2.0 2.0 Real Bed 119.0
3 7964 Opéra Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 130.0
4 9952 Popincourt Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 75.0
db.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50280 entries, 0 to 50279
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      50280 non-null  int64  
 1   neighbourhood_cleansed  50280 non-null  object 
 2   property_type           50280 non-null  object 
 3   room_type               50280 non-null  object 
 4   accommodates            50280 non-null  int64  
 5   bathrooms               50280 non-null  float64
 6   bedrooms                50280 non-null  float64
 7   beds                    50280 non-null  float64
 8   bed_type                50280 non-null  object 
 9   Price                   50280 non-null  float64
dtypes: float64(4), int64(2), object(4)
memory usage: 3.8+ MB
db.describe()
id accommodates bathrooms bedrooms beds Price
count 5.028000e+04 50280.000000 50280.000000 50280.000000 50280.000000 50280.000000
mean 1.968671e+07 3.063425 1.117045 1.077804 1.664698 110.630549
std 1.140675e+07 1.556731 0.687370 1.001000 1.164917 157.848189
min 3.109000e+03 1.000000 0.000000 0.000000 0.000000 0.000000
25% 9.340209e+06 2.000000 1.000000 1.000000 1.000000 60.000000
50% 1.999153e+07 2.000000 1.000000 1.000000 1.000000 80.000000
75% 2.987492e+07 4.000000 1.000000 1.000000 2.000000 120.000000
max 3.858893e+07 17.000000 50.000000 50.000000 50.000000 10000.000000

DataFrame objects can hold different types of data

  • “Whole” numbers (int)

  • Decimals (float)

  • Categorical (pandas.Category)

  • Dates (pandas.Timestamp)

  • Geo (geopandas.GeoDataFrame)

Indices

db.head(2)
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
0 3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0
1 5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0
db.set_index("id").head(2)
neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
id
3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0
5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0
dbi = db.set_index("id")
dbi.index
Int64Index([    3109,     5396,     7397,     7964,     9952,    10710,
               11170,    11213,    11265,    11798,
            ...
            38485563, 38489275, 38513797, 38516223, 38516341, 38517692,
            38520175, 38537044, 38546594, 38588929],
           dtype='int64', name='id', length=50280)
dbi.columns
Index(['neighbourhood_cleansed', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'Price'],
      dtype='object')

Slicing and Dicing Data

Index-based queries

  • Columns

db["Price"].head()
0     60.0
1    115.0
2    119.0
3    130.0
4     75.0
Name: Price, dtype: float64
  • Generic point-querying

db.loc[0, "Price"]
60.0
  • Full slice of one dimension

db.loc[0, :]
id                                   3109
neighbourhood_cleansed       Observatoire
property_type                   Apartment
room_type                 Entire home/apt
accommodates                            2
bathrooms                               1
bedrooms                                0
beds                                    1
bed_type                         Real Bed
Price                                  60
Name: 0, dtype: object
db.loc[:, "Price"]
0         60.0
1        115.0
2        119.0
3        130.0
4         75.0
         ...  
50275    250.0
50276     40.0
50277     60.0
50278     65.0
50279     69.0
Name: Price, Length: 50280, dtype: float64
  • Range queries

db.loc[0:5, "neighbourhood_cleansed"]
0      Observatoire
1    Hôtel-de-Ville
2    Hôtel-de-Ville
3             Opéra
4        Popincourt
5            Élysée
Name: neighbourhood_cleansed, dtype: object
db.loc[5, "property_type":"bed_type"]
property_type          Apartment
room_type        Entire home/apt
accommodates                   4
bathrooms                      1
bedrooms                       1
beds                           2
bed_type                Real Bed
Name: 5, dtype: object
  • List-based queries

db.loc[[0, 49, 19, 29, 39, 9], ["Price", "id"]]
Price id
0 60.0 3109
49 128.0 32082
19 65.0 17919
29 80.0 21264
39 49.0 26567
9 120.0 11798

Order-based queries

dbi.iloc[0:5, :]
neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
id
3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0
5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0
7397 Hôtel-de-Ville Apartment Entire home/apt 4 1.0 2.0 2.0 Real Bed 119.0
7964 Opéra Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 130.0
9952 Popincourt Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 75.0

EXERCISE:

  • Slice the dataset to keep only properties with the following IDs, in that order: 38520175, 619716, and 37847454

  • Extract the section of the dataset that includes the 50th to the 100th rows, and the room_type and bedrooms columns


Conditional queries

  • Using loc

db.loc[db["neighbourhood_cleansed"] == "Observatoire", 
       ["neighbourhood_cleansed", "Price"]]\
  .head()
neighbourhood_cleansed Price
0 Observatoire 60.0
47 Observatoire 90.0
52 Observatoire 150.0
104 Observatoire 84.0
144 Observatoire 140.0
db.loc[db["Price"] < 100, ["id", "neighbourhood_cleansed"]].head()
id neighbourhood_cleansed
0 3109 Observatoire
4 9952 Popincourt
5 10710 Élysée
6 11170 Panthéon
10 11848 Popincourt
db.loc[(db["Price"] < 100) & \
       (db["bathrooms"] >= 8),
       :]
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
12066 8876983 Luxembourg Boutique hotel Private room 16 50.0 50.0 16.0 Real Bed 85.0
22964 18766792 Luxembourg Boutique hotel Private room 16 50.0 50.0 17.0 Real Bed 85.0
24788 19819352 Luxembourg Boutique hotel Private room 16 50.0 50.0 20.0 Real Bed 85.0
25798 20433587 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26048 20691340 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26108 20747725 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26121 20768013 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
  • Conditional filters

fltr = db["bathrooms"] > 8
fltr.head()
0    False
1    False
2    False
3    False
4    False
Name: bathrooms, dtype: bool
db[fltr]
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
12066 8876983 Luxembourg Boutique hotel Private room 16 50.0 50.0 16.0 Real Bed 85.0
22964 18766792 Luxembourg Boutique hotel Private room 16 50.0 50.0 17.0 Real Bed 85.0
24788 19819352 Luxembourg Boutique hotel Private room 16 50.0 50.0 20.0 Real Bed 85.0
25798 20433587 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26048 20691340 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26108 20747725 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26121 20768013 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
  • Concatenated queries

db.loc[(db["Price"] < 100) & \
       (db["bathrooms"] >= 8),
       :]
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
12066 8876983 Luxembourg Boutique hotel Private room 16 50.0 50.0 16.0 Real Bed 85.0
22964 18766792 Luxembourg Boutique hotel Private room 16 50.0 50.0 17.0 Real Bed 85.0
24788 19819352 Luxembourg Boutique hotel Private room 16 50.0 50.0 20.0 Real Bed 85.0
25798 20433587 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26048 20691340 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26108 20747725 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26121 20768013 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
db.loc[(db["Price"] < 5) | \
       (db["Price"] > 5000),
       :]
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
8149 6088687 Temple Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 8500.0
10113 7225849 Buttes-Montmartre Apartment Entire home/apt 8 2.0 2.0 5.0 Real Bed 9379.0
11286 8093890 Passy Apartment Entire home/apt 3 1.0 2.0 2.0 Real Bed 8721.0
25106 19974916 Buttes-Montmartre Condominium Entire home/apt 4 1.0 1.0 2.0 Real Bed 0.0
25558 20219162 Buttes-Chaumont Apartment Entire home/apt 1 1.0 0.0 1.0 Real Bed 0.0
25676 20291987 Passy Apartment Entire home/apt 3 1.0 0.0 2.0 Real Bed 0.0
25697 20313940 Temple Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 0.0
26899 21422028 Popincourt Apartment Entire home/apt 3 2.0 2.0 3.0 Real Bed 0.0
32190 25448670 Vaugirard Apartment Entire home/apt 1 1.0 0.0 1.0 Real Bed 8576.0
35291 27608896 Observatoire Apartment Private room 1 1.0 0.0 1.0 Real Bed 9059.0
43961 34380017 Bourse Serviced apartment Hotel room 8 2.5 3.0 4.0 Real Bed 5150.0
46767 36019554 Louvre Serviced apartment Hotel room 8 2.5 2.0 4.0 Real Bed 8000.0
47574 36402651 Batignolles-Monceau House Entire home/apt 12 3.0 4.0 5.0 Real Bed 10000.0
  • Using query

db.query("neighbourhood_cleansed == 'Observatoire'")\
  [["neighbourhood_cleansed", "Price"]]\
  .head()
neighbourhood_cleansed Price
0 Observatoire 60.0
47 Observatoire 90.0
52 Observatoire 150.0
104 Observatoire 84.0
144 Observatoire 140.0
db.query("Price < 100")\
  [["id", "neighbourhood_cleansed"]]\
  .head()
id neighbourhood_cleansed
0 3109 Observatoire
4 9952 Popincourt
5 10710 Élysée
6 11170 Panthéon
10 11848 Popincourt
db.query("(Price < 100) & (bathrooms >= 8)")
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
12066 8876983 Luxembourg Boutique hotel Private room 16 50.0 50.0 16.0 Real Bed 85.0
22964 18766792 Luxembourg Boutique hotel Private room 16 50.0 50.0 17.0 Real Bed 85.0
24788 19819352 Luxembourg Boutique hotel Private room 16 50.0 50.0 20.0 Real Bed 85.0
25798 20433587 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26048 20691340 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26108 20747725 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0
26121 20768013 Luxembourg Boutique hotel Private room 16 50.0 50.0 50.0 Real Bed 85.0

Editing tables

  • Modifying single values

db.loc[1, "bed_type"]
'Pull-out Sofa'
db.loc[1, "bed_type"] = "Pullout Sofa"
db.loc[1, "bed_type"]
'Pullout Sofa'
  • Modifying blocks of values

db.neighbourhood_cleansed.unique()
array(['Observatoire', 'Hôtel-de-Ville', 'Opéra', 'Popincourt', 'Élysée',
       'Panthéon', 'Entrepôt', 'Buttes-Montmartre', 'Gobelins',
       'Buttes-Chaumont', 'Luxembourg', 'Louvre', 'Palais-Bourbon',
       'Reuilly', 'Bourse', 'Ménilmontant', 'Vaugirard',
       'Batignolles-Monceau', 'Temple', 'Passy'], dtype=object)
db.loc[db["neighbourhood_cleansed"] == "Hôtel-de-Ville",
       "neighbourhood_cleansed"] = "City Council"
db.neighbourhood_cleansed.unique()
array(['Observatoire', 'City Council', 'Opéra', 'Popincourt', 'Élysée',
       'Panthéon', 'Entrepôt', 'Buttes-Montmartre', 'Gobelins',
       'Buttes-Chaumont', 'Luxembourg', 'Louvre', 'Palais-Bourbon',
       'Reuilly', 'Bourse', 'Ménilmontant', 'Vaugirard',
       'Batignolles-Monceau', 'Temple', 'Passy'], dtype=object)
  • Creating new columns and rows

db["more_beds_than_accomodates"] = db["beds"] > db["accommodates"]

EXERCISE

  • Find how many properties have more than ten bathrooms

  • Can you rent an AirBnb in Paris with only one bed but three bedrooms?

  • In which neighbourhoods can you rent an “Airbed”?


Writing Data

[i for i in dir(db) if i[:3]=="to_"]
['to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_feather',
 'to_gbq',
 'to_hdf',
 'to_html',
 'to_json',
 'to_latex',
 'to_markdown',
 'to_numpy',
 'to_parquet',
 'to_period',
 'to_pickle',
 'to_records',
 'to_sql',
 'to_stata',
 'to_string',
 'to_timestamp',
 'to_xarray']