Advanced Tabular Manipulation

👨‍🏫 [Sync]

%matplotlib inline

import pandas

Assuming you have the file downloaded on the path ../data/:

db = pandas.read_csv("../data/paris_abb.csv.zip")

If you’re online, you can do:

db = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")

Sorting

  • By values

# Top-5 cheapes properties
db.sort_values("Price")\
  .head(5)
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
25676 20291987 Passy Apartment Entire home/apt 3 1.0 0.0 2.0 Real Bed 0.0
25697 20313940 Temple Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 0.0
26899 21422028 Popincourt Apartment Entire home/apt 3 2.0 2.0 3.0 Real Bed 0.0
25558 20219162 Buttes-Chaumont Apartment Entire home/apt 1 1.0 0.0 1.0 Real Bed 0.0
25106 19974916 Buttes-Montmartre Condominium Entire home/apt 4 1.0 1.0 2.0 Real Bed 0.0
# Top-5 most expensive properties
db.sort_values("Price", ascending=False)\
  .head(5)
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price
47574 36402651 Batignolles-Monceau House Entire home/apt 12 3.0 4.0 5.0 Real Bed 10000.0
10113 7225849 Buttes-Montmartre Apartment Entire home/apt 8 2.0 2.0 5.0 Real Bed 9379.0
35291 27608896 Observatoire Apartment Private room 1 1.0 0.0 1.0 Real Bed 9059.0
11286 8093890 Passy Apartment Entire home/apt 3 1.0 2.0 2.0 Real Bed 8721.0
32190 25448670 Vaugirard Apartment Entire home/apt 1 1.0 0.0 1.0 Real Bed 8576.0
  • By index

tmp = db.set_index("property_type")\
        .sort_index()
tmp.head()
id neighbourhood_cleansed room_type accommodates bathrooms bedrooms beds bed_type Price
property_type
Aparthotel 31728955 Vaugirard Hotel room 2 1.0 1.0 1.0 Real Bed 141.0
Aparthotel 13003443 Vaugirard Entire home/apt 4 1.0 0.0 3.0 Real Bed 75.0
Aparthotel 31733851 Temple Hotel room 4 1.0 1.0 3.0 Real Bed 141.0
Aparthotel 17630233 Bourse Hotel room 4 1.0 2.0 2.0 Real Bed 330.0
Aparthotel 24387710 Observatoire Private room 12 5.0 5.0 11.0 Real Bed 698.0

(Useful for quick subsetting:)

tmp.loc["Tiny house", 
       ["id", "neighbourhood_cleansed", "Price"]
       ]
id neighbourhood_cleansed Price
property_type
Tiny house 20976623 Panthéon 110.0
Tiny house 6838781 Entrepôt 60.0
Tiny house 29322690 Passy 30.0
Tiny house 18919023 Louvre 100.0
Tiny house 4191080 Louvre 80.0
Tiny house 9582468 Popincourt 80.0
Tiny house 37312602 Ménilmontant 40.0
Tiny house 2555221 Buttes-Montmartre 114.0
Tiny house 34572791 Passy 35.0
Tiny house 37174177 Vaugirard 50.0
Tiny house 26292760 Ménilmontant 75.0
Tiny house 36864294 Passy 38.0
Tiny house 20273867 Panthéon 100.0
Tiny house 34816113 Observatoire 40.0
Tiny house 768986 Temple 79.0
Tiny house 13220094 Buttes-Chaumont 160.0
Tiny house 5699102 Buttes-Montmartre 71.0
Tiny house 36048024 Buttes-Montmartre 70.0
Tiny house 20688679 Buttes-Montmartre 73.0
Tiny house 15157047 Temple 42.0
Tiny house 12588228 Passy 75.0

Joinning

  • Additional data (linked through ID!)

reviews = pandas.read_csv("../data/paris_abb_review.csv.zip")
reviews.head()
id review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value
0 3109 100.0 10.0 10.0 10.0 10.0 10.0 10.0
1 5396 90.0 9.0 8.0 9.0 9.0 10.0 8.0
2 7397 94.0 10.0 9.0 10.0 10.0 10.0 10.0
3 7964 96.0 10.0 10.0 10.0 10.0 10.0 10.0
4 9952 98.0 10.0 10.0 10.0 10.0 10.0 10.0
  • Join to original table:

dbj1 = db.join(reviews.set_index("id"),
              on = "id"
             )
dbj1.head()
id neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value
0 3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0 100.0 10.0 10.0 10.0 10.0 10.0 10.0
1 5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0 90.0 9.0 8.0 9.0 9.0 10.0 8.0
2 7397 Hôtel-de-Ville Apartment Entire home/apt 4 1.0 2.0 2.0 Real Bed 119.0 94.0 10.0 9.0 10.0 10.0 10.0 10.0
3 7964 Opéra Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 130.0 96.0 10.0 10.0 10.0 10.0 10.0 10.0
4 9952 Popincourt Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 75.0 98.0 10.0 10.0 10.0 10.0 10.0 10.0
dbj2 = db.set_index("id")\
        .join(reviews.set_index("id"))
dbj2.head()
neighbourhood_cleansed property_type room_type accommodates bathrooms bedrooms beds bed_type Price review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value
id
3109 Observatoire Apartment Entire home/apt 2 1.0 0.0 1.0 Real Bed 60.0 100.0 10.0 10.0 10.0 10.0 10.0 10.0
5396 Hôtel-de-Ville Apartment Entire home/apt 2 1.0 0.0 1.0 Pull-out Sofa 115.0 90.0 9.0 8.0 9.0 9.0 10.0 8.0
7397 Hôtel-de-Ville Apartment Entire home/apt 4 1.0 2.0 2.0 Real Bed 119.0 94.0 10.0 9.0 10.0 10.0 10.0 10.0
7964 Opéra Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 130.0 96.0 10.0 10.0 10.0 10.0 10.0 10.0
9952 Popincourt Apartment Entire home/apt 2 1.0 1.0 1.0 Real Bed 75.0 98.0 10.0 10.0 10.0 10.0 10.0 10.0

Note:

  • Left can choose index/column, right needs to be index (results “almost” the same)

  • For more flexibility, check out merge:

Grouping

  • Get the mean price for “Louvre”:

db.query("neighbourhood_cleansed == 'Louvre'")\
  ["Price"]\
  .mean()
175.2634194831014
  • Get the mean price for “Luxembourg”:

db.query("neighbourhood_cleansed == 'Luxembourg'")\
  ["Price"]\
  .mean()
160.02262142381903
  • Get the mean price for “Palais-Bourbon”:

db.query("neighbourhood_cleansed == 'Palais-Bourbon'")\
  ["Price"]\
  .mean()
169.8526690391459
  • For every neighbourhood???

db.groupby("neighbourhood_cleansed")\
  ["Price"]\
  .mean()
neighbourhood_cleansed
Batignolles-Monceau    103.355239
Bourse                 138.898017
Buttes-Chaumont         76.971745
Buttes-Montmartre       84.867211
Entrepôt                99.130493
Gobelins                80.000000
Hôtel-de-Ville         152.547723
Louvre                 175.263419
Luxembourg             160.022621
Ménilmontant            71.074504
Observatoire            98.907340
Opéra                  122.495921
Palais-Bourbon         169.852669
Panthéon               124.683662
Passy                  149.237946
Popincourt              90.832782
Reuilly                 84.225256
Temple                 147.904592
Vaugirard              106.269580
Élysée                 194.158416
Name: Price, dtype: float64

MultiIndex Tables

Grouping can be based on more than one variable only…

nr = dbj1.groupby(["neighbourhood_cleansed", "room_type"])\
       [["Price", "review_scores_rating"]]\
       .mean()

This generates a MultiIndex:

nr.head()
Price review_scores_rating
neighbourhood_cleansed room_type
Batignolles-Monceau Entire home/apt 103.501922 92.484096
Hotel room 343.660714 93.607143
Private room 58.698305 95.064407
Shared room 53.142857 90.214286
Bourse Entire home/apt 139.618020 91.608503
nr.index
MultiIndex([('Batignolles-Monceau', 'Entire home/apt'),
            ('Batignolles-Monceau',      'Hotel room'),
            ('Batignolles-Monceau',    'Private room'),
            ('Batignolles-Monceau',     'Shared room'),
            (             'Bourse', 'Entire home/apt'),
            (             'Bourse',      'Hotel room'),
            (             'Bourse',    'Private room'),
            (             'Bourse',     'Shared room'),
            (    'Buttes-Chaumont', 'Entire home/apt'),
            (    'Buttes-Chaumont',      'Hotel room'),
            (    'Buttes-Chaumont',    'Private room'),
            (    'Buttes-Chaumont',     'Shared room'),
            (  'Buttes-Montmartre', 'Entire home/apt'),
            (  'Buttes-Montmartre',      'Hotel room'),
            (  'Buttes-Montmartre',    'Private room'),
            (  'Buttes-Montmartre',     'Shared room'),
            (           'Entrepôt', 'Entire home/apt'),
            (           'Entrepôt',      'Hotel room'),
            (           'Entrepôt',    'Private room'),
            (           'Entrepôt',     'Shared room'),
            (           'Gobelins', 'Entire home/apt'),
            (           'Gobelins',      'Hotel room'),
            (           'Gobelins',    'Private room'),
            (           'Gobelins',     'Shared room'),
            (     'Hôtel-de-Ville', 'Entire home/apt'),
            (     'Hôtel-de-Ville',      'Hotel room'),
            (     'Hôtel-de-Ville',    'Private room'),
            (     'Hôtel-de-Ville',     'Shared room'),
            (             'Louvre', 'Entire home/apt'),
            (             'Louvre',      'Hotel room'),
            (             'Louvre',    'Private room'),
            (             'Louvre',     'Shared room'),
            (         'Luxembourg', 'Entire home/apt'),
            (         'Luxembourg',      'Hotel room'),
            (         'Luxembourg',    'Private room'),
            (         'Luxembourg',     'Shared room'),
            (       'Ménilmontant', 'Entire home/apt'),
            (       'Ménilmontant',      'Hotel room'),
            (       'Ménilmontant',    'Private room'),
            (       'Ménilmontant',     'Shared room'),
            (       'Observatoire', 'Entire home/apt'),
            (       'Observatoire',      'Hotel room'),
            (       'Observatoire',    'Private room'),
            (       'Observatoire',     'Shared room'),
            (              'Opéra', 'Entire home/apt'),
            (              'Opéra',      'Hotel room'),
            (              'Opéra',    'Private room'),
            (              'Opéra',     'Shared room'),
            (     'Palais-Bourbon', 'Entire home/apt'),
            (     'Palais-Bourbon',      'Hotel room'),
            (     'Palais-Bourbon',    'Private room'),
            (     'Palais-Bourbon',     'Shared room'),
            (           'Panthéon', 'Entire home/apt'),
            (           'Panthéon',      'Hotel room'),
            (           'Panthéon',    'Private room'),
            (           'Panthéon',     'Shared room'),
            (              'Passy', 'Entire home/apt'),
            (              'Passy',      'Hotel room'),
            (              'Passy',    'Private room'),
            (              'Passy',     'Shared room'),
            (         'Popincourt', 'Entire home/apt'),
            (         'Popincourt',      'Hotel room'),
            (         'Popincourt',    'Private room'),
            (         'Popincourt',     'Shared room'),
            (            'Reuilly', 'Entire home/apt'),
            (            'Reuilly',      'Hotel room'),
            (            'Reuilly',    'Private room'),
            (            'Reuilly',     'Shared room'),
            (             'Temple', 'Entire home/apt'),
            (             'Temple',      'Hotel room'),
            (             'Temple',    'Private room'),
            (             'Temple',     'Shared room'),
            (          'Vaugirard', 'Entire home/apt'),
            (          'Vaugirard',      'Hotel room'),
            (          'Vaugirard',    'Private room'),
            (          'Vaugirard',     'Shared room'),
            (             'Élysée', 'Entire home/apt'),
            (             'Élysée',      'Hotel room'),
            (             'Élysée',    'Private room'),
            (             'Élysée',     'Shared room')],
           names=['neighbourhood_cleansed', 'room_type'])

These indices allow us to do several more things than single index objects. For example:

  • One-level queries:

nr.xs("Bourse", level="neighbourhood_cleansed")
Price review_scores_rating
room_type
Entire home/apt 139.618020 91.608503
Hotel room 321.244898 93.367347
Private room 71.088710 93.225806
Shared room 35.062500 90.875000
nr.xs("Shared room", level="room_type")
Price review_scores_rating
neighbourhood_cleansed
Batignolles-Monceau 53.142857 90.214286
Bourse 35.062500 90.875000
Buttes-Chaumont 25.228571 90.400000
Buttes-Montmartre 42.555556 89.333333
Entrepôt 30.458333 90.458333
Gobelins 35.600000 92.533333
Hôtel-de-Ville 64.000000 95.375000
Louvre 72.000000 93.400000
Luxembourg 55.000000 88.666667
Ménilmontant 34.923077 90.730769
Observatoire 41.000000 97.142857
Opéra 44.125000 94.500000
Palais-Bourbon 35.400000 95.000000
Panthéon 73.800000 95.200000
Passy 39.785714 93.785714
Popincourt 31.880952 89.261905
Reuilly 30.300000 93.200000
Temple 30.833333 92.833333
Vaugirard 51.333333 94.750000
Élysée 70.000000 96.000000
nr.loc[("Bourse", "Shared room"), :]
Price                   35.0625
review_scores_rating    90.8750
Name: (Bourse, Shared room), dtype: float64

But also “unstack” it so we can cross-tab:

unstacked = nr.unstack()
unstacked
Price review_scores_rating
room_type Entire home/apt Hotel room Private room Shared room Entire home/apt Hotel room Private room Shared room
neighbourhood_cleansed
Batignolles-Monceau 103.501922 343.660714 58.698305 53.142857 92.484096 93.607143 95.064407 90.214286
Bourse 139.618020 321.244898 71.088710 35.062500 91.608503 93.367347 93.225806 90.875000
Buttes-Chaumont 82.568452 117.947368 45.866667 25.228571 93.077806 91.368421 93.528205 90.400000
Buttes-Montmartre 87.478750 118.342105 55.217899 42.555556 92.940811 90.250000 93.729572 89.333333
Entrepôt 101.621981 311.736842 55.144893 30.458333 92.995046 87.508772 93.320665 90.458333
Gobelins 86.284188 106.961538 53.238671 35.600000 91.858974 91.346154 93.305136 92.533333
Hôtel-de-Ville 154.687543 412.291667 79.372881 64.000000 93.123193 90.000000 93.745763 95.375000
Louvre 165.077367 426.830508 102.828947 72.000000 91.763279 91.457627 94.500000 93.400000
Luxembourg 164.863531 204.746667 87.226562 55.000000 92.269854 92.186667 92.515625 88.666667
Ménilmontant 75.727787 128.200000 45.138425 34.923077 93.332795 92.800000 92.076372 90.730769
Observatoire 95.354978 306.525000 90.521401 41.000000 92.864564 93.150000 93.501946 97.142857
Opéra 115.688708 287.089655 75.561905 44.125000 93.126144 89.172414 93.704762 94.500000
Palais-Bourbon 170.620910 450.722222 76.144144 35.400000 92.757382 93.527778 94.180180 95.000000
Panthéon 129.989522 141.925000 69.246753 73.800000 92.206942 90.575000 94.006494 95.200000
Passy 155.033317 331.766667 77.497674 39.785714 91.810702 92.600000 93.660465 93.785714
Popincourt 95.073171 138.333333 53.348606 31.880952 93.042408 94.238095 93.541833 89.261905
Reuilly 89.299656 104.727273 53.468165 30.300000 93.082090 91.636364 94.441948 93.200000
Temple 150.560496 241.977273 77.474453 30.833333 92.378794 94.295455 93.043796 92.833333
Vaugirard 108.483296 288.891892 58.145729 51.333333 92.632517 93.067568 93.778894 94.750000
Élysée 194.715942 300.590361 94.835165 70.000000 91.547826 89.867470 92.780220 96.000000

This in turn creates a MultiIndex on the columns instead, which works similarly:

unstacked["Price"]
room_type Entire home/apt Hotel room Private room Shared room
neighbourhood_cleansed
Batignolles-Monceau 103.501922 343.660714 58.698305 53.142857
Bourse 139.618020 321.244898 71.088710 35.062500
Buttes-Chaumont 82.568452 117.947368 45.866667 25.228571
Buttes-Montmartre 87.478750 118.342105 55.217899 42.555556
Entrepôt 101.621981 311.736842 55.144893 30.458333
Gobelins 86.284188 106.961538 53.238671 35.600000
Hôtel-de-Ville 154.687543 412.291667 79.372881 64.000000
Louvre 165.077367 426.830508 102.828947 72.000000
Luxembourg 164.863531 204.746667 87.226562 55.000000
Ménilmontant 75.727787 128.200000 45.138425 34.923077
Observatoire 95.354978 306.525000 90.521401 41.000000
Opéra 115.688708 287.089655 75.561905 44.125000
Palais-Bourbon 170.620910 450.722222 76.144144 35.400000
Panthéon 129.989522 141.925000 69.246753 73.800000
Passy 155.033317 331.766667 77.497674 39.785714
Popincourt 95.073171 138.333333 53.348606 31.880952
Reuilly 89.299656 104.727273 53.468165 30.300000
Temple 150.560496 241.977273 77.474453 30.833333
Vaugirard 108.483296 288.891892 58.145729 51.333333
Élysée 194.715942 300.590361 94.835165 70.000000

EXERCISE

Create a table that shows the average price for properties by room and property type


More at: