Do-It-Yourself#
import pandas
This section is all about you taking charge of the steering wheel and choosing your own adventure. For this block, we are going to use what we’ve learnt before to take a look at a dataset of casualties in the war in Afghanistan. The data was originally released by Wikileaks, and the version we will use is published by The Guardian.
Data preparation#
Before you can set off on your data journey, the dataset needs to be read, and there’s a couple of details we will get out of the way so it is then easier for you to start working.
The data are published on a Google Sheet you can check out at:
As you will see, each row includes casualties recorded month by month, split by Taliban, Civilians, Afghan forces, and NATO.
To read it into a Python session, we need to slightly modify the URL to access it into:
url = ("https://docs.google.com/spreadsheets/d/"\
"1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/"\
"export?format=csv&gid=1")
url
'https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/export?format=csv&gid=1'
Note how we split the url into three lines so it is more readable in narrow screens. The result however, stored in url
, is the same as one long string.
This allows us to read the data straight into a DataFrame, as we have done in the previous session:
db = pandas.read_csv(url, skiprows=[0, -1], thousands=",")
Note also we use the skiprows=[0, -1]
to avoid reading the top (0
) and bottom (-1
) rows which, if you check on the Google Sheet, involves the title of the table.
Now we are good to go!
db.head()
Year | Month | Taliban | Civilians | Afghan forces | Nato (detailed in spreadsheet) | Nato - official figures | |
---|---|---|---|---|---|---|---|
0 | 2004.0 | January | 15 | 51 | 23 | NaN | 11.0 |
1 | 2004.0 | February | NaN | 7 | 4 | 5 | 2.0 |
2 | 2004.0 | March | 19 | 2 | NaN | 2 | 3.0 |
3 | 2004.0 | April | 5 | 3 | 19 | NaN | 3.0 |
4 | 2004.0 | May | 18 | 29 | 56 | 6 | 9.0 |
Tasks#
Now, the challenge is to put to work what we have learnt in this block. For that, the suggestion is that you carry out an analysis of the Afghan Logs in a similar way as how we looked at population composition in Liverpool. These are of course very different datasets reflecting immensely different realities. Their structure, however, is relatively parallel: both capture counts aggregated by a spatial (neighbourhood) or temporal unit (month), and each count is split by a few categories.
Try to answer the following questions:
Obtain the minimum number of civilian casualties (in what month was that?)
How many NATO casualties were registered in August 2008?
What is the month with the most total number of casualties?
Can you make a plot of the distribution of casualties over time?