Uk General Election 2017 Powered By Python

Posted by Baran Buluttekin on May 1, 2017

For anyone like me who wants to do DIY election analysis, first step is to get the data and make it ready for analysis. This is a time consuming part of the process and sadly it’s also must do part. According to opinion polls most important subject in this election is brexit, given that referendum result and 2015 general election results are vital for any analysis. In this post I will demonstrate how did I get the data and what did I do to clean it. It’s important partially because I made this datasets public so anyone whom wishes to do their own analysis can use clean and reliable data.

Data source for general election is from electoral commission website and for referendum result its from wikipedia and the reason why I choose wikipedia is that electoral commission doesn’t have a result by constituency. If you don’t want to be involve with data wrangling then you can get the end product of this process from my Github repo here.

Let’s get the open source libraries I used and take a peak at the data.

import requests
import bs4 as bs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('hocl-ge2015-results-summary.csv')

df.head()
ons_id ons_region_id constituency_name county_name region_name country_name constituency_type declaration_time result first_party ... ukip green snp pc dup sf sdlp uup alliance other
0 W07000049 W92000004 Aberavon West Glamorgan Wales Wales County 08/05/2015 03:19 Lab hold Lab ... 4971 711 0 3663 0 0 0 0 0 1623
1 W07000058 W92000004 Aberconwy Clwyd Wales Wales County 08/05/2015 02:48 Con hold Con ... 3467 727 0 3536 0 0 0 0 0 0
2 S14000001 S92000003 Aberdeen North Scotland Scotland Scotland Borough 08/05/2015 03:40 SNP gain from Lab SNP ... 0 0 24793 0 0 0 0 0 0 392
3 S14000002 S92000003 Aberdeen South Scotland Scotland Scotland Borough 08/05/2015 09:39 SNP gain from Lab SNP ... 897 964 20221 0 0 0 0 0 0 139
4 S14000003 S92000003 Airdrie and Shotts Scotland Scotland Scotland County 08/05/2015 03:07 SNP gain from Lab SNP ... 1088 0 23887 0 0 0 0 0 0 136

5 rows × 28 columns

Uk is divided to 650 constituency, we should check the sum before proceed to any other operation.

len(df.constituency_name.unique())
650

It appears number of entries are correct. Let’s now examine the all the parties hold seats and check if there is any error.

df.first_party.unique()
array(['Lab', 'Con', 'SNP', 'PC', 'DUP', 'SDLP', 'SF', 'Green', 'Spk',
       'LD', 'UKIP', 'UUP', 'Ind'], dtype=object)

As we can see there is two non-party entry in the data. One of it is ‘Ind’ for Independent candidate and other one is ‘Spk’ which is short hand for House Speaker. Vote count for both of these instances are recorded in column ‘other’, given that condition it’s a good practice to store these instances as other so we can make cross calculation between records to corresponding column data. One way this could be helpful for us is to calculate how many votes first party received in all constituencies.

df.loc[df.first_party == 'Ind', ['first_party']] = 'other'
df.loc[df.first_party == 'Spk', ['first_party']] = 'other'

We should check the ‘second_party’ column too.

df.second_party.unique()
array(['UKIP', 'Lab', 'Con', 'LD', 'Alliance', 'SF', 'DUP', 'PBPA',
       'Respect', 'Green', 'PC', 'SNP', 'UUP', 'Ind', 'TUV'], dtype=object)

We should apply same change to ‘pbpa’ and ‘respect’ too.

df.loc[df.second_party == 'pbpa', ['second_party']] = 'other'
df.loc[df.second_party == 'respect', ['second_party']] = 'other'

Now we can change to lower case for all the string entries so when we merge with referendum result, we don’t get any conflicted constituency names.

col_to_lower = ['constituency_name', 'county_name', 'region_name', 'country_name', 'constituency_type',
               'result', 'first_party', 'second_party']

for col in col_to_lower:
    df[col] = [_.lower() for _ in df[col]]

Also constituencies that have a special character in their name will also create conflict when merged. Here I will remove all characters from constituency names.

const_name = []
for i in df.constituency_name:
    if ',' in i:
        j = "".join(i.split(','))
        if ' &' in j:
            j = "".join(i.split(' &'))
            const_name.append(j)
        else:
            const_name.append(j)
    elif '-' in i:
        j = " ".join(i.split('-'))
        const_name.append(j)
    else:
        const_name.append(i)

df.constituency_name = const_name

We now can fetch the referendum results. Data we are looking for stored in the 36th (Python use 0 based index.) table in that page. I will print the table head variables to see if thats the correct table.

url = 'https://en.wikipedia.org/wiki/Results_of_the_United_Kingdom_European_Union_membership_referendum,_2016'
respond = requests.get(url)

soup = bs.BeautifulSoup(respond.text, 'lxml')
div = soup.find_all('div', class_='mw-content-ltr')[0]
tables = div.find_all('table')
tb = tables[36]
tb.find_all('th')
[<th colspan="2" rowspan="2">Constituency</th>,
 <th rowspan="2">Member of Parliament</th>,
 <th rowspan="2">MP position</th>,
 <th colspan="2" rowspan="2"></th>,
 <th rowspan="2">Region</th>,
 <th colspan="2">Proportion of votes</th>,
 <th rowspan="2">Notes</th>,
 <th>Remain %</th>,
 <th>Leave %</th>]

That’s looks correct, but changing the column names will help us during the validation process and prevent confusion. New column names can be set as below.

colums = ['constituency_name_eu', 'parliament_member', 'mp_position', 'region_name_eu', 'remain', 'leave']

Now we can parse all input to data frame and take a peak at the data set.

df_cons = pd.DataFrame(columns=colums)

for i, row in enumerate(tb.find_all('tr')[2:]):
    rows = []
    rows.append(row.find_all('td')[1].text)
    rows.append(row.find_all('td')[2].text)
    rows.append(row.find_all('td')[3].text)
    rows.append(row.find_all('td')[6].text)
    rows.append(row.find_all('td')[7].text)
    rows.append(row.find_all('td')[8].text)
    df_cons.loc[i] = rows

df_cons.head()
constituency_name_eu parliament_member mp_position region_name_eu remain leave
0 Streatham Chuka Umunna Remain Greater London 79.5% 20.5%
1 Bristol West Thangam Debbonaire Remain South West England 79.3% 20.7%
2 Hackney North and Stoke Newington Diane Abbott Remain Greater London 79.1% 20.9%
3 Glasgow North Patrick Grady Remain Scotland 78.4% 21.6%
4 Islington North Jeremy Corbyn Remain Greater London 78.4% 21.6%

Again validating the number of constituencies.

len(df.constituency_name)
650

Percentage sign within the remain and leave columns should be removed so we can turn that columns into floating numbers and make it available to arithmetic operations.

df_cons.remain = [float(_.split('%')[0]) for _ in df_cons.remain]
df_cons.leave = [float(_.split('%')[0]) for _ in df_cons.leave]

Finally, unifying constituency names by turning then to lower case and removing special characters. It’s also important to notice that region names slightly different from general election results. In the second data set region named as south west England instead of south west, we should also unify them too.

df_cons.constituency_name_eu = [_.lower() for _ in df_cons.constituency_name_eu]
df_cons.region_name_eu = [_.lower() for _ in df_cons.region_name_eu]
df_cons.mp_position = [_.lower() for _ in df_cons.mp_position]
region_n = []
for i in df_cons.region_name_eu:
    if 'of' in i.split(' '):
        region_n.append(i[:-11])
    elif 'england' in i.split(' '):
        region_n.append(i[:-8])
    elif 'greater' in i.split(' '):
        region_n.append(i[-6:])
    else:
        region_n.append(i)

df_cons.region_name_eu = region_n
df_cons.loc[df_cons.constituency_name_eu == 'berwickshire, roxburgh selkirk',
            'constituency_name_eu'] = 'berwickshire roxburgh and selkirk'
df_cons.loc[df_cons.constituency_name_eu == 'ynys môn', 'constituency_name_eu'] = 'ynys mon'
cons_name = []
for i in df_cons.constituency_name_eu:
    if ',' in i:
        j = "".join(i.split(','))
        if ' &' in j:
            j = "".join(i.split(' &'))
            cons_name.append(j)
        else:
            cons_name.append(j)
    elif ' &' in i:
        cons_name.append("".join(i.split(' &')))
    elif '-' in i:
        j = " ".join(i.split('-'))
        cons_name.append(j)
    else:
        cons_name.append(i)

df_cons.constituency_name_eu = cons_name

Time for Testing

Now that our data entries are unified we can run some validation tests to see if everything is as expected. First test is for referendum data integrity. I will check if the percentage sums of leave and remain add up to 100%.

l_validate = df_cons.remain + df_cons.leave

df_cons.ix[l_validate[l_validate != 100].index]
constituency_name_eu parliament_member mp_position region_name_eu remain leave
24 richmond park Sarah Olney remain london 73.3 27.7
94 cardiff north Craig Williams remain wales 60.9 39.0
143 brighton kemptown Simon Kirby remain south east 57.6 43.4
197 leeds central Hilary Benn remain yorkshire and the humber 52.4 47.4
353 wolverhampton south west Rob Marris remain west midlands 45.6 54.6
459 islwyn Chris Evans remain wales 45.3 58.8
554 dover Charlie Elphicke remain south east 36.8 63.1
555 old bexley and sidcup James Brokenshire remain london 36.8 63.1

It appears we have 8 instances of inaccuracies. This problem is particularly hard to automate. I had to look up other reported results and manually change them.

remain_man = [69.3, 60.9, 57.6, 52.5, 45.6, 45.3, 36.9, 36.9]
leave_man = [30.7, 39.1, 42.4, 47.5, 54.4, 54.7, 63.1, 63.1]

df_cons.ix[l_validate[l_validate != 100].index, 'remain'] = remain_man
df_cons.ix[l_validate[l_validate != 100].index, 'leave'] = leave_man

Final test is about constituency names, I will check if there is any constituency name not appears in other data set.

for i in df.constituency_name.unique():
    if i not in df_cons.constituency_name_eu.unique():
        print('{} name is not available in the referandum data set'.format(i))
else:
    print('Constituency names are parsed correctly.')
Constituency names are parsed correctly.