Uk General Election 2017 Powered By Python

For anyone like me who wants to do DIY election analysis, first step is to get the data and make it ready for analysis. This is a time consuming part of the process and sadly it’s also must do part. According to opinion polls most important subject in this election is brexit, given that referendum result and 2015 general election results are vital for any analysis. In this post I will demonstrate how did I get the data and what did I do to clean it. It’s important partially because I made this datasets public so anyone whom wishes to do their own analysis can use clean and reliable data.

Data source for general election is from electoral commission website and for referendum result its from wikipedia and the reason why I choose wikipedia is that electoral commission doesn’t have a result by constituency. If you don’t want to be involve with data wrangling then you can get the end product of this process from my Github repo here.

Let’s get the open source libraries I used and take a peak at the data.

import requests
import bs4 as bs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('hocl-ge2015-results-summary.csv')

df.head()

	ons_id	ons_region_id	constituency_name	county_name	region_name	country_name	constituency_type	declaration_time	result	first_party	...	ukip	green	snp	pc	other
0	W07000049	W92000004	Aberavon	West Glamorgan	Wales	Wales	County	08/05/2015 03:19	Lab hold	Lab	...	4971	711	0	3663	1623
1	W07000058	W92000004	Aberconwy	Clwyd	Wales	Wales	County	08/05/2015 02:48	Con hold	Con	...	3467	727	0	3536	0
2	S14000001	S92000003	Aberdeen North	Scotland	Scotland	Scotland	Borough	08/05/2015 03:40	SNP gain from Lab	SNP	...	0	0	24793	0	392
3	S14000002	S92000003	Aberdeen South	Scotland	Scotland	Scotland	Borough	08/05/2015 09:39	SNP gain from Lab	SNP	...	897	964	20221	0	139
4	S14000003	S92000003	Airdrie and Shotts	Scotland	Scotland	Scotland	County	08/05/2015 03:07	SNP gain from Lab	SNP	...	1088	0	23887	0	136

5 rows × 28 columns

Uk is divided to 650 constituency, we should check the sum before proceed to any other operation.

len(df.constituency_name.unique())

It appears number of entries are correct. Let’s now examine the all the parties hold seats and check if there is any error.

df.first_party.unique()

array(['Lab', 'Con', 'SNP', 'PC', 'DUP', 'SDLP', 'SF', 'Green', 'Spk',
       'LD', 'UKIP', 'UUP', 'Ind'], dtype=object)

As we can see there is two non-party entry in the data. One of it is ‘Ind’ for Independent candidate and other one is ‘Spk’ which is short hand for House Speaker. Vote count for both of these instances are recorded in column ‘other’, given that condition it’s a good practice to store these instances as other so we can make cross calculation between records to corresponding column data. One way this could be helpful for us is to calculate how many votes first party received in all constituencies.

df.loc[df.first_party == 'Ind', ['first_party']] = 'other'
df.loc[df.first_party == 'Spk', ['first_party']] = 'other'

We should check the ‘second_party’ column too.

df.second_party.unique()

array(['UKIP', 'Lab', 'Con', 'LD', 'Alliance', 'SF', 'DUP', 'PBPA',
       'Respect', 'Green', 'PC', 'SNP', 'UUP', 'Ind', 'TUV'], dtype=object)

We should apply same change to ‘pbpa’ and ‘respect’ too.

df.loc[df.second_party == 'pbpa', ['second_party']] = 'other'
df.loc[df.second_party == 'respect', ['second_party']] = 'other'

Now we can change to lower case for all the string entries so when we merge with referendum result, we don’t get any conflicted constituency names.

col_to_lower = ['constituency_name', 'county_name', 'region_name', 'country_name', 'constituency_type',
               'result', 'first_party', 'second_party']

for col in col_to_lower:
    df[col] = [_.lower() for _ in df[col]]

Also constituencies that have a special character in their name will also create conflict when merged. Here I will remove all characters from constituency names.

const_name = []
for i in df.constituency_name:
    if ',' in i:
        j = "".join(i.split(','))
        if ' &' in j:
            j = "".join(i.split(' &'))
            const_name.append(j)
        else:
            const_name.append(j)
    elif '-' in i:
        j = " ".join(i.split('-'))
        const_name.append(j)
    else:
        const_name.append(i)

df.constituency_name = const_name

We now can fetch the referendum results. Data we are looking for stored in the 36th (Python use 0 based index.) table in that page. I will print the table head variables to see if thats the correct table.

url = 'https://en.wikipedia.org/wiki/Results_of_the_United_Kingdom_European_Union_membership_referendum,_2016'
respond = requests.get(url)

soup = bs.BeautifulSoup(respond.text, 'lxml')
div = soup.find_all('div', class_='mw-content-ltr')[0]
tables = div.find_all('table')
tb = tables[36]
tb.find_all('th')

[<th colspan="2" rowspan="2">Constituency</th>,
 <th rowspan="2">Member of Parliament</th>,
 <th rowspan="2">MP position</th>,
 <th colspan="2" rowspan="2"></th>,
 <th rowspan="2">Region</th>,
 <th colspan="2">Proportion of votes</th>,
 <th rowspan="2">Notes</th>,
 <th>Remain %</th>,
 <th>Leave %</th>]

That’s looks correct, but changing the column names will help us during the validation process and prevent confusion. New column names can be set as below.

colums = ['constituency_name_eu', 'parliament_member', 'mp_position', 'region_name_eu', 'remain', 'leave']

Now we can parse all input to data frame and take a peak at the data set.

df_cons = pd.DataFrame(columns=colums)

for i, row in enumerate(tb.find_all('tr')[2:]):
    rows = []
    rows.append(row.find_all('td')[1].text)
    rows.append(row.find_all('td')[2].text)
    rows.append(row.find_all('td')[3].text)
    rows.append(row.find_all('td')[6].text)
    rows.append(row.find_all('td')[7].text)
    rows.append(row.find_all('td')[8].text)
    df_cons.loc[i] = rows

df_cons.head()

	constituency_name_eu	parliament_member	mp_position	region_name_eu	remain	leave
0	Streatham	Chuka Umunna	Remain	Greater London	79.5%	20.5%
1	Bristol West	Thangam Debbonaire	Remain	South West England	79.3%	20.7%
2	Hackney North and Stoke Newington	Diane Abbott	Remain	Greater London	79.1%	20.9%
3	Glasgow North	Patrick Grady	Remain	Scotland	78.4%	21.6%
4	Islington North	Jeremy Corbyn	Remain	Greater London	78.4%	21.6%

Again validating the number of constituencies.

len(df.constituency_name)

Percentage sign within the remain and leave columns should be removed so we can turn that columns into floating numbers and make it available to arithmetic operations.

df_cons.remain = [float(_.split('%')[0]) for _ in df_cons.remain]
df_cons.leave = [float(_.split('%')[0]) for _ in df_cons.leave]

Finally, unifying constituency names by turning then to lower case and removing special characters. It’s also important to notice that region names slightly different from general election results. In the second data set region named as south west England instead of south west, we should also unify them too.

df_cons.constituency_name_eu = [_.lower() for _ in df_cons.constituency_name_eu]
df_cons.region_name_eu = [_.lower() for _ in df_cons.region_name_eu]
df_cons.mp_position = [_.lower() for _ in df_cons.mp_position]

region_n = []
for i in df_cons.region_name_eu:
    if 'of' in i.split(' '):
        region_n.append(i[:-11])
    elif 'england' in i.split(' '):
        region_n.append(i[:-8])
    elif 'greater' in i.split(' '):
        region_n.append(i[-6:])
    else:
        region_n.append(i)

df_cons.region_name_eu = region_n

df_cons.loc[df_cons.constituency_name_eu == 'berwickshire, roxburgh selkirk',
            'constituency_name_eu'] = 'berwickshire roxburgh and selkirk'
df_cons.loc[df_cons.constituency_name_eu == 'ynys môn', 'constituency_name_eu'] = 'ynys mon'

cons_name = []
for i in df_cons.constituency_name_eu:
    if ',' in i:
        j = "".join(i.split(','))
        if ' &' in j:
            j = "".join(i.split(' &'))
            cons_name.append(j)
        else:
            cons_name.append(j)
    elif ' &' in i:
        cons_name.append("".join(i.split(' &')))
    elif '-' in i:
        j = " ".join(i.split('-'))
        cons_name.append(j)
    else:
        cons_name.append(i)

df_cons.constituency_name_eu = cons_name

Time for Testing

Now that our data entries are unified we can run some validation tests to see if everything is as expected. First test is for referendum data integrity. I will check if the percentage sums of leave and remain add up to 100%.

l_validate = df_cons.remain + df_cons.leave

df_cons.ix[l_validate[l_validate != 100].index]

	constituency_name_eu	parliament_member	mp_position	region_name_eu	remain	leave
24	richmond park	Sarah Olney	remain	london	73.3	27.7
94	cardiff north	Craig Williams	remain	wales	60.9	39.0
143	brighton kemptown	Simon Kirby	remain	south east	57.6	43.4
197	leeds central	Hilary Benn	remain	yorkshire and the humber	52.4	47.4
353	wolverhampton south west	Rob Marris	remain	west midlands	45.6	54.6
459	islwyn	Chris Evans	remain	wales	45.3	58.8
554	dover	Charlie Elphicke	remain	south east	36.8	63.1
555	old bexley and sidcup	James Brokenshire	remain	london	36.8	63.1

It appears we have 8 instances of inaccuracies. This problem is particularly hard to automate. I had to look up other reported results and manually change them.

remain_man = [69.3, 60.9, 57.6, 52.5, 45.6, 45.3, 36.9, 36.9]
leave_man = [30.7, 39.1, 42.4, 47.5, 54.4, 54.7, 63.1, 63.1]

df_cons.ix[l_validate[l_validate != 100].index, 'remain'] = remain_man
df_cons.ix[l_validate[l_validate != 100].index, 'leave'] = leave_man

Final test is about constituency names, I will check if there is any constituency name not appears in other data set.

for i in df.constituency_name.unique():
    if i not in df_cons.constituency_name_eu.unique():
        print('{} name is not available in the referandum data set'.format(i))
else:
    print('Constituency names are parsed correctly.')

Constituency names are parsed correctly.