Visualising Tabular Data

👨‍🏫 [Sync]

%matplotlib inline

import pandas
import seaborn as sns
import matplotlib.pyplot as plt

Assuming you have the file downloaded on the path ../data/:

db = pandas.read_csv("../data/paris_abb.csv.zip")

If you’re online, you can do:

db = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")

Simple & Quick (pandas)

db["Price"].plot.hist()
<AxesSubplot:ylabel='Frequency'>
../../_images/04-Tabular_data_viz_5_1.png
db["Price"].plot.hist(bins=100)
<AxesSubplot:ylabel='Frequency'>
../../_images/04-Tabular_data_viz_6_1.png
db.query("Price < 500")["Price"].plot.kde()
<AxesSubplot:ylabel='Density'>
../../_images/04-Tabular_data_viz_7_1.png
db.plot.scatter("accommodates", "Price")
<AxesSubplot:xlabel='accommodates', ylabel='Price'>
../../_images/04-Tabular_data_viz_8_1.png

EXERCISE

  • Create a histogram of the distribution of number of people a property accommodates (accommodates)

  • Create a scatter plot of the number of beds and number of bedrooms


More

We can also combine this with groupings from the previous notebook:

db.groupby("neighbourhood_cleansed")\
  ["Price"]\
  .mean()\
  .plot.bar()
<AxesSubplot:xlabel='neighbourhood_cleansed'>
../../_images/04-Tabular_data_viz_11_1.png

EXERCISE

Create a bar plot of the average review_scores_rating by neighbourhood, sorted by average price.

For a “pro” touch (optional), substract 90 from the reviews score before plotting.


seaborn

  • Univariate

sns.distplot(db.query("Price < 500")["Price"], 
             kde=True, 
             rug=True)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2055: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Price', ylabel='Density'>
../../_images/04-Tabular_data_viz_15_2.png
  • Bivariate

sns.jointplot(x = "accommodates",
              y = "Price",
              data=db,
              kind="hex"
             );
../../_images/04-Tabular_data_viz_17_0.png
sns.regplot(x = "accommodates",
            y = "Price",
            data=db.query("Price < 500")
           );
../../_images/04-Tabular_data_viz_18_0.png
sns.jointplot(x = "accommodates",
              y = "Price",
              data=db.query("Price < 500")\
                     .sample(1000),
              kind="kde"
             );
../../_images/04-Tabular_data_viz_19_0.png

EXERCISE

Explore the documentation of jointplot and create a figure similar to the one above where you replace the hexagonal binning for a KDE. Since this will probably take too long, subset your data before plotting to only properties cheaper than $500, and then randomly sample 1,000 observations (tip: check out the sample method).


More

  • Categorical

sns.stripplot(x = "Price",
              y = "neighbourhood_cleansed",
              data=db
             );
../../_images/04-Tabular_data_viz_22_0.png
sns.swarmplot(x = "room_type",
              y = "Price",
              data=db.query("neighbourhood_cleansed == 'Louvre'")\
                     .sample(250),
              alpha=0.25
             );
/opt/conda/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 58.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
../../_images/04-Tabular_data_viz_23_1.png

To note:

  • With larger datasets, it’s hard to see any pattern

  • This is true even if you jitter the points around to avoid overlap and/or you play with transparency (alpha)

  • Algorithms to separate out dots exist but they’re computationally intensive and can only do so much

sns.stripplot(x = "Price",
              y = "property_type",
              hue = "room_type",
              data=db.query("neighbourhood_cleansed == 'Louvre'")\
                     .sample(250),
              alpha=0.25
             );
../../_images/04-Tabular_data_viz_25_0.png
sns.boxplot(x = "room_type",
            y = "Price",
            data = db
           );
../../_images/04-Tabular_data_viz_26_0.png
nei_list = ["Hôtel-de-Ville", "Louvre", "Buttes-Montmartre"]
sub = db["neighbourhood_cleansed"].isin(nei_list)

sns.violinplot(x = "Price",
               y = "neighbourhood_cleansed",
               data = db[sub].query("Price < 500")
              );
../../_images/04-Tabular_data_viz_27_0.png

EXERCISE

Explore the distribution of price by property type


Full control (matplotlib)

One

f, ax = plt.subplots(1)

ax.hist(db['Price'], bins=100)

plt.show()
../../_images/04-Tabular_data_viz_31_0.png
sub = db.query("Price < 500")\
        .sample(1000)

f, ax = plt.subplots(1)

sns.kdeplot(sub['Price'], sub['accommodates'], 
            shade=True, ax=ax)
ax.scatter(sub['Price'], sub['accommodates'],
           alpha=0.1, s=0.75, color='orange')

plt.show()
/opt/conda/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
../../_images/04-Tabular_data_viz_32_1.png
f, ax = plt.subplots(1, figsize=(4, 4))

sns.kdeplot(sub['Price'],
            sub['accommodates'], 
            shade=True, ax=ax, cmap='Purples')

plt.show()
/opt/conda/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
../../_images/04-Tabular_data_viz_33_1.png

CHALLENGE - Create a visualisation that includes a KDE and a scatter plot and that explores the relationship between number of beds and number of people it accommodates, for a random sample of 500 properties.

Two or more

f, axs = plt.subplots(1, 2, figsize=(12, 6))

sns.distplot(sub['Price'], 
             kde=False, 
             rug=True, 
             ax=axs[0]
            )

sns.distplot(sub['accommodates'], 
             hist=False, 
             kde=True, 
             rug=True, 
             ax=axs[1]
            )

plt.show()
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2055: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2055: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
../../_images/04-Tabular_data_viz_36_1.png

CHALLENGE - Create a visualisation for all of the properties with three subplots:

  1. Histogram of price

  2. Scatter plot of price Vs number of people it accommodates

  3. Histogram of number of people the property accommodates