For anyone like me who wants to do DIY election analysis, first step is to get the data and make it ready for analysis. This is a time consuming part of the process and sadly it’s also must do part. According to opinion polls most important subject in this election is brexit, given that referendum result and 2015 general election results are vital for any analysis. In this post I will demonstrate how did I get the data and what did I do to clean it. It’s important partially because I made this datasets public so anyone whom wishes to do their own analysis can use clean and reliable data.
Data source for general election is from electoral commission website and for referendum result its from wikipedia and the reason why I choose wikipedia is that electoral commission doesn’t have a result by constituency. If you don’t want to be involve with data wrangling then you can get the end product of this process from my Github repo here.
Let’s get the open source libraries I used and take a peak at the data.
import requests
import bs4 as bs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('hocl-ge2015-results-summary.csv')
df.head()
ons_id | ons_region_id | constituency_name | county_name | region_name | country_name | constituency_type | declaration_time | result | first_party | ... | ukip | green | snp | pc | dup | sf | sdlp | uup | alliance | other | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | W07000049 | W92000004 | Aberavon | West Glamorgan | Wales | Wales | County | 08/05/2015 03:19 | Lab hold | Lab | ... | 4971 | 711 | 0 | 3663 | 0 | 0 | 0 | 0 | 0 | 1623 |
1 | W07000058 | W92000004 | Aberconwy | Clwyd | Wales | Wales | County | 08/05/2015 02:48 | Con hold | Con | ... | 3467 | 727 | 0 | 3536 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | S14000001 | S92000003 | Aberdeen North | Scotland | Scotland | Scotland | Borough | 08/05/2015 03:40 | SNP gain from Lab | SNP | ... | 0 | 0 | 24793 | 0 | 0 | 0 | 0 | 0 | 0 | 392 |
3 | S14000002 | S92000003 | Aberdeen South | Scotland | Scotland | Scotland | Borough | 08/05/2015 09:39 | SNP gain from Lab | SNP | ... | 897 | 964 | 20221 | 0 | 0 | 0 | 0 | 0 | 0 | 139 |
4 | S14000003 | S92000003 | Airdrie and Shotts | Scotland | Scotland | Scotland | County | 08/05/2015 03:07 | SNP gain from Lab | SNP | ... | 1088 | 0 | 23887 | 0 | 0 | 0 | 0 | 0 | 0 | 136 |
5 rows × 28 columns
Uk is divided to 650 constituency, we should check the sum before proceed to any other operation.
len(df.constituency_name.unique())
650
It appears number of entries are correct. Let’s now examine the all the parties hold seats and check if there is any error.
df.first_party.unique()
array(['Lab', 'Con', 'SNP', 'PC', 'DUP', 'SDLP', 'SF', 'Green', 'Spk',
'LD', 'UKIP', 'UUP', 'Ind'], dtype=object)
As we can see there is two non-party entry in the data. One of it is ‘Ind’ for Independent candidate and other one is ‘Spk’ which is short hand for House Speaker. Vote count for both of these instances are recorded in column ‘other’, given that condition it’s a good practice to store these instances as other so we can make cross calculation between records to corresponding column data. One way this could be helpful for us is to calculate how many votes first party received in all constituencies.
df.loc[df.first_party == 'Ind', ['first_party']] = 'other'
df.loc[df.first_party == 'Spk', ['first_party']] = 'other'
We should check the ‘second_party’ column too.
df.second_party.unique()
array(['UKIP', 'Lab', 'Con', 'LD', 'Alliance', 'SF', 'DUP', 'PBPA',
'Respect', 'Green', 'PC', 'SNP', 'UUP', 'Ind', 'TUV'], dtype=object)
We should apply same change to ‘pbpa’ and ‘respect’ too.
df.loc[df.second_party == 'pbpa', ['second_party']] = 'other'
df.loc[df.second_party == 'respect', ['second_party']] = 'other'
Now we can change to lower case for all the string entries so when we merge with referendum result, we don’t get any conflicted constituency names.
col_to_lower = ['constituency_name', 'county_name', 'region_name', 'country_name', 'constituency_type',
'result', 'first_party', 'second_party']
for col in col_to_lower:
df[col] = [_.lower() for _ in df[col]]
Also constituencies that have a special character in their name will also create conflict when merged. Here I will remove all characters from constituency names.
const_name = []
for i in df.constituency_name:
if ',' in i:
j = "".join(i.split(','))
if ' &' in j:
j = "".join(i.split(' &'))
const_name.append(j)
else:
const_name.append(j)
elif '-' in i:
j = " ".join(i.split('-'))
const_name.append(j)
else:
const_name.append(i)
df.constituency_name = const_name
We now can fetch the referendum results. Data we are looking for stored in the 36th (Python use 0 based index.) table in that page. I will print the table head variables to see if thats the correct table.
url = 'https://en.wikipedia.org/wiki/Results_of_the_United_Kingdom_European_Union_membership_referendum,_2016'
respond = requests.get(url)
soup = bs.BeautifulSoup(respond.text, 'lxml')
div = soup.find_all('div', class_='mw-content-ltr')[0]
tables = div.find_all('table')
tb = tables[36]
tb.find_all('th')
[<th colspan="2" rowspan="2">Constituency</th>,
<th rowspan="2">Member of Parliament</th>,
<th rowspan="2">MP position</th>,
<th colspan="2" rowspan="2"></th>,
<th rowspan="2">Region</th>,
<th colspan="2">Proportion of votes</th>,
<th rowspan="2">Notes</th>,
<th>Remain %</th>,
<th>Leave %</th>]
That’s looks correct, but changing the column names will help us during the validation process and prevent confusion. New column names can be set as below.
colums = ['constituency_name_eu', 'parliament_member', 'mp_position', 'region_name_eu', 'remain', 'leave']
Now we can parse all input to data frame and take a peak at the data set.
df_cons = pd.DataFrame(columns=colums)
for i, row in enumerate(tb.find_all('tr')[2:]):
rows = []
rows.append(row.find_all('td')[1].text)
rows.append(row.find_all('td')[2].text)
rows.append(row.find_all('td')[3].text)
rows.append(row.find_all('td')[6].text)
rows.append(row.find_all('td')[7].text)
rows.append(row.find_all('td')[8].text)
df_cons.loc[i] = rows
df_cons.head()
constituency_name_eu | parliament_member | mp_position | region_name_eu | remain | leave | |
---|---|---|---|---|---|---|
0 | Streatham | Chuka Umunna | Remain | Greater London | 79.5% | 20.5% |
1 | Bristol West | Thangam Debbonaire | Remain | South West England | 79.3% | 20.7% |
2 | Hackney North and Stoke Newington | Diane Abbott | Remain | Greater London | 79.1% | 20.9% |
3 | Glasgow North | Patrick Grady | Remain | Scotland | 78.4% | 21.6% |
4 | Islington North | Jeremy Corbyn | Remain | Greater London | 78.4% | 21.6% |
Again validating the number of constituencies.
len(df.constituency_name)
650
Percentage sign within the remain and leave columns should be removed so we can turn that columns into floating numbers and make it available to arithmetic operations.
df_cons.remain = [float(_.split('%')[0]) for _ in df_cons.remain]
df_cons.leave = [float(_.split('%')[0]) for _ in df_cons.leave]
Finally, unifying constituency names by turning then to lower case and removing special characters. It’s also important to notice that region names slightly different from general election results. In the second data set region named as south west England instead of south west, we should also unify them too.
df_cons.constituency_name_eu = [_.lower() for _ in df_cons.constituency_name_eu]
df_cons.region_name_eu = [_.lower() for _ in df_cons.region_name_eu]
df_cons.mp_position = [_.lower() for _ in df_cons.mp_position]
region_n = []
for i in df_cons.region_name_eu:
if 'of' in i.split(' '):
region_n.append(i[:-11])
elif 'england' in i.split(' '):
region_n.append(i[:-8])
elif 'greater' in i.split(' '):
region_n.append(i[-6:])
else:
region_n.append(i)
df_cons.region_name_eu = region_n
df_cons.loc[df_cons.constituency_name_eu == 'berwickshire, roxburgh selkirk',
'constituency_name_eu'] = 'berwickshire roxburgh and selkirk'
df_cons.loc[df_cons.constituency_name_eu == 'ynys môn', 'constituency_name_eu'] = 'ynys mon'
cons_name = []
for i in df_cons.constituency_name_eu:
if ',' in i:
j = "".join(i.split(','))
if ' &' in j:
j = "".join(i.split(' &'))
cons_name.append(j)
else:
cons_name.append(j)
elif ' &' in i:
cons_name.append("".join(i.split(' &')))
elif '-' in i:
j = " ".join(i.split('-'))
cons_name.append(j)
else:
cons_name.append(i)
df_cons.constituency_name_eu = cons_name
Time for Testing
Now that our data entries are unified we can run some validation tests to see if everything is as expected. First test is for referendum data integrity. I will check if the percentage sums of leave and remain add up to 100%.
l_validate = df_cons.remain + df_cons.leave
df_cons.ix[l_validate[l_validate != 100].index]
constituency_name_eu | parliament_member | mp_position | region_name_eu | remain | leave | |
---|---|---|---|---|---|---|
24 | richmond park | Sarah Olney | remain | london | 73.3 | 27.7 |
94 | cardiff north | Craig Williams | remain | wales | 60.9 | 39.0 |
143 | brighton kemptown | Simon Kirby | remain | south east | 57.6 | 43.4 |
197 | leeds central | Hilary Benn | remain | yorkshire and the humber | 52.4 | 47.4 |
353 | wolverhampton south west | Rob Marris | remain | west midlands | 45.6 | 54.6 |
459 | islwyn | Chris Evans | remain | wales | 45.3 | 58.8 |
554 | dover | Charlie Elphicke | remain | south east | 36.8 | 63.1 |
555 | old bexley and sidcup | James Brokenshire | remain | london | 36.8 | 63.1 |
It appears we have 8 instances of inaccuracies. This problem is particularly hard to automate. I had to look up other reported results and manually change them.
remain_man = [69.3, 60.9, 57.6, 52.5, 45.6, 45.3, 36.9, 36.9]
leave_man = [30.7, 39.1, 42.4, 47.5, 54.4, 54.7, 63.1, 63.1]
df_cons.ix[l_validate[l_validate != 100].index, 'remain'] = remain_man
df_cons.ix[l_validate[l_validate != 100].index, 'leave'] = leave_man
Final test is about constituency names, I will check if there is any constituency name not appears in other data set.
for i in df.constituency_name.unique():
if i not in df_cons.constituency_name_eu.unique():
print('{} name is not available in the referandum data set'.format(i))
else:
print('Constituency names are parsed correctly.')
Constituency names are parsed correctly.