Identifying Fraud from Enron Emails
Objective
Using the Enron email corpus data to extract and engineer model features, we will attempt to develop a classifier able to identify a "Person of Interest" (PoI) who may have been involved in, or had an impact on, the fraud that occurred within the Enron scandal. A list of known PoI has been hand generated from this USATODAY article by the friendly folks at Udacity, who define a PoI as an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity. We will use these PoI labels with the Enron email corpus data to develop the classifier.
Data Structure
The dataset used in this analysis was generated by Udacity. It is a dictionary keyed by each person's name, with each value being a data dictionary containing the following features:
financial features: 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees' - all units are in US dollars
email features: 'to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi' - units are generally the number of email messages; the notable exception is 'email_address', which is a text string
POI label: 'poi' - boolean, represented as integer
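As a minimal illustration of this structure (a sketch, assuming the dataset has been loaded from the pickle file used later in this notebook), accessing a single record looks like this:
import pickle

# Load the Udacity-generated dataset: a dict keyed by person name
with open("my_dataset.pkl", "r") as data_file:
    my_dataset = pickle.load(data_file)

# Each value is a feature dictionary for one person
record = my_dataset['LAY KENNETH L']
print record['salary']        # financial feature, in US dollars
print record['to_messages']   # email feature, a message count
print record['poi']           # PoI label, boolean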
Model Development Plan
Developing the classification model will consist of 6 steps that can be iterated over until an acceptable model performance is obtained. These steps include:
1. Exploratory Data Analysis on the dataset to understand the data and investigate informative features
2. Select features likely to provide predictive power in the classification model
3. Clean or remove erroneous and outlier data from the selected features
4. Engineer/transform selected features into new features appropriate for classification modelling
5. Test different classifiers and review performance
6. Investigate new data sources that may exist to provide better model performance
Data Exploration and Cleaning
Data Overview
The dataset includes 146 unique individuals, each with 21 features extracted from the email corpus. The amount of feature data for each individual varies. Table 1 summarizes the data count for each feature in the dataset and the count of PoI included in the non-NaN feature values.
Feature | Data Count | Count PoI |
---|---|---|
name | 146 | 18 |
poi | 146 | 18 |
total_stock_value | 126 | 18 |
total_payments | 125 | 18 |
email_address | 111 | 18 |
restricted_stock | 110 | 17 |
exercised_stock_options | 102 | 12 |
salary | 95 | 17 |
expenses | 95 | 18 |
other | 93 | 18 |
to_messages | 86 | 14 |
shared_receipt_with_poi | 86 | 14 |
from_messages | 86 | 14 |
from_this_person_to_poi | 86 | 14 |
from_poi_to_this_person | 86 | 14 |
bonus | 82 | 16 |
long_term_incentive | 66 | 12 |
deferred_income | 49 | 11 |
deferral_payments | 39 | 5 |
restricted_stock_deferred | 18 | 0 |
director_fees | 17 | 0 |
loan_advances | 4 | 1 |
Some features in the dataset are missing information that will be important to our use case. Several features will not be very useful in the classification model because they have no labelled PoI in their subset of available data, such as restricted_stock_deferred and director_fees. loan_advances has such a small data sample that it will likely not provide statistically significant predictive strength. deferral_payments is borderline, but there are other feature datasets that will likely provide better predictive information. Reviewing the data for missing values, we will exclude restricted_stock_deferred, director_fees, loan_advances and deferral_payments as features from the model.
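As a sketch of how the counts in Table 1 might be reproduced (assuming the dictionary form of the dataset and that missing values are stored as the string 'NaN', as in the Udacity data):
import numpy as np
import pandas as pd

# View the person-keyed dictionary as a DataFrame, converting the
# 'NaN' placeholder strings into true missing values
df = pd.DataFrame.from_dict(my_dataset, orient='index').replace('NaN', np.nan)

# Non-missing count per feature, and the PoI count within those rows
data_count = df.count()
poi_count = df[df['poi'] == True].count()
print pd.concat([data_count, poi_count], axis=1, keys=['Data Count', 'Count PoI'])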
Outliers
In analyzing the histograms for each feature, a large outlier was noticed across many variables. After further investigation, this turned out to be attributable to an aggregate row of the dataset labelled "TOTAL". This was removed from the dataset.
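Removing the aggregate row is a one-line operation on the dataset dictionary (a minimal sketch, assuming the dictionary form described above):
# Drop the spreadsheet aggregate row that dominates every histogram
my_dataset.pop('TOTAL', None)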
It also appears that 'LAY KENNETH L' is a large outlier in many features. Kenneth Lay is a PoI, however, so we will keep his data for building the classifier and see if we can maintain the ability to build a classifier that generalizes to the other PoI as well as to him.
Data Imputation
In reviewing the dataset, many of the email feature datasets only contain 14 of the 18 identified PoI, while most of the financial features include the full 18 or 17 PoI. Throwing out data for over 20% of our identified PoI seems unacceptable for such a scarce dataset. Imputing the mean value for any missing email feature seems appropriate, as there is likely a reasonable average number of emails any one employee might send. We will also impute 0 for any missing financial feature, making the assumption that if the value is missing the employee did not receive that form of financial compensation. A review of this data imputation may be required as the model is developed and the results are analyzed.
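A minimal sketch of this imputation strategy, assuming the DataFrame view of the data from above (the feature groupings follow the lists in the Data Structure section, minus the four excluded features):
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages',
                  'from_this_person_to_poi', 'shared_receipt_with_poi']
financial_features = ['salary', 'total_payments', 'bonus', 'deferred_income',
                      'total_stock_value', 'expenses', 'exercised_stock_options',
                      'other', 'long_term_incentive', 'restricted_stock']

# Missing email counts get the feature mean; missing financial values
# are assumed to be compensation the employee never received
df[email_features] = df[email_features].fillna(df[email_features].mean())
df[financial_features] = df[financial_features].fillna(0)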
Feature Selection and Engineering
Selection
As discussed above, restricted_stock_deferred, director_fees, loan_advances and deferral_payments have been omitted as features for the classification model due to their small sample sizes. There are also features that are labels rather than useful data, such as name and email_address. These will not be used as features in the classification model. This leaves us with 15 potentially useful features to select from. Reviewing the data structure, there are two general categories of data: financial and meta-email. These overarching data categories seem well suited for Principal Component Analysis (PCA), which builds a featureset that captures the most predictive information from the features and avoids dependent features, such as total_payments and salary, causing erroneous classification or overfitting. Applying PCA to the 15 features to reduce the dimensionality to 2 overarching components, we can visualize the transformed feature relationships.
%matplotlib inline
import numpy as np
import pandas as pd
import pickle
from ggplot import *
from feature_format import featureFormat, targetFeatureSplit

# Load the cleaned dataset and the list of selected features
with open("my_dataset.pkl", "r") as data_file:
    my_dataset = pickle.load(data_file)
with open("my_feature_list.pkl", "r") as data_file:
    features_list = pickle.load(data_file)

# Convert the dictionary to a numpy array and split off the PoI labels
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# Apply PCA to the dataset to allow for 2D visualization
from sklearn.decomposition import PCA
reduced_data = PCA(n_components=2).fit_transform(features)

# Re-attach the labels so each point can be colored by PoI status
labels = np.array(labels).reshape(len(reduced_data[:,0]),1)
reduced_data = np.hstack((reduced_data,labels))
df = pd.DataFrame(reduced_data,columns = ['param1', 'param2','poi'])

pca_plt = ggplot(df, aes(x='param1', y='param2', color='poi')) +\
    geom_point()
print pca_plt
pca_plt_zoomed = pca_plt + xlim(-2500000,5000000) + ylim(-1000000,3500000)
print pca_plt_zoomed
[Figure: scatter plots of the two principal components (param1 vs param2), colored by PoI status; full range and zoomed views]
Reviewing the PCA plots, there appear to be a few PoI outliers that should be easy to classify, but zooming in on the cluster of datapoints shows that PoI data points are heavily commingled with non-PoI data points and may be difficult to adequately classify. This model may need more advanced features to develop an adequate classifier.
Feature Engineering
It is likely that additional features are required to adequately model a PoI classifier. Most of the supplied features are absolute values of a person's financial compensation or email correspondence. It makes sense that a more relative measure of these features is appropriate when comparing whether someone is a PoI or not. Three additional features have been engineered for use in classifier development (a sketch of how they might be computed follows the list):
1. perc_salary - The percentage each employee's salary makes up of their total payments, defined as salary/total_payments. This represents whether the employee's compensation is made up of complex additional payments or just a basic salary.
2. perc_to_poi - The percentage of an employee's total sent emails that went to a PoI. This represents the degree to which an employee spends their time interacting with a PoI.
3. perc_from_poi - The percentage of an employee's total received emails that came from a PoI. This represents the degree to which a PoI spends their time interacting with an employee.
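A minimal sketch of deriving these ratios, assuming the dictionary form of the dataset with missing values already imputed as described above:
# Derive the relative features for each person in the dataset;
# guard against division by zero where totals are missing or zero
for person in my_dataset.values():
    if person['total_payments']:
        person['perc_salary'] = float(person['salary']) / person['total_payments']
    else:
        person['perc_salary'] = 0.
    if person['from_messages']:
        person['perc_to_poi'] = float(person['from_this_person_to_poi']) / person['from_messages']
    else:
        person['perc_to_poi'] = 0.
    if person['to_messages']:
        person['perc_from_poi'] = float(person['from_poi_to_this_person']) / person['to_messages']
    else:
        person['perc_from_poi'] = 0.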
# Columns 16-18 of the formatted data array hold the three engineered features
eng_data = np.hstack((data[:,(16,17,18)],labels))
df = pd.DataFrame(eng_data,columns = ['perc_salary', 'perc_to_poi','perc_from_poi','poi'])

print ggplot(df, aes(x='perc_salary', y='perc_to_poi', color='poi')) +\
    geom_point()
print ggplot(df, aes(x='perc_salary', y='perc_from_poi', color='poi')) +\
    geom_point()
print ggplot(df, aes(x='perc_to_poi', y='perc_from_poi', color='poi')) +\
    geom_point()
[Figure: pairwise scatter plots of the engineered features (perc_to_poi vs perc_salary, perc_from_poi vs perc_salary, perc_from_poi vs perc_to_poi), colored by PoI status]
In reviewing the plots, both the perc_to_poi vs perc_salary and the perc_from_poi vs perc_to_poi plots show good separation of PoI from non-PoI datapoints. The perc_from_poi vs perc_salary plot shows less separation, with many PoI data points closely clustered with non-PoI data. These engineered features seem to show more promise than the simple PCA approach above, but we will test and compare the relative performance of tuned classifiers using the two approaches.
Model Development, Tuning and Evaluation
Using sklearn's Machine Learning Map as a guide, we will test the PCA and engineered feature datasets with various classifier algorithms and tuning parameters in an attempt to develop the optimal PoI classifier. sklearn's Pipeline and GridSearchCV modules will be helpful in performing this analysis. We will also use sklearn's StandardScaler to scale our input datasets, as some of our classifier algorithms highly recommend the use of scaled variables, such as the SVM classifier, which is not scale invariant.
Model Validation
In order to determine which model has the "best" performance, model validation is required to prove the model's ability to make correct predictions on a generalized dataset. Model validation is the practice of holding out a subset of the dataset, known as the test data, training the model on the remaining data, and then feeding the model the held-out data to measure its performance in predicting the known results. This is important because it allows for quantified comparison of different models.
To compare the performance of each tuned model, we will use Udacity's test_classifier function. This function uses Precision and Recall as the performance metrics for each model. The Precision of the classifier is its ability to correctly identify PoI without incorrectly labelling people who are not PoI as a PoI. The Recall of the classifier is its ability to correctly identify all of the PoI that should be identified. In order to perform the validation classification, the test function uses sklearn's StratifiedShuffleSplit method to split the dataset into training and test data. This maximizes the data points used in both the training and testing datasets by performing multiple training and validation experiments on shuffled subsets of the full dataset. With a fairly sparse dataset containing only 18 labelled examples of PoI, it is important for the testing methodology to run multiple iterations of training and test data to ensure accurate performance results. It is also important to include shuffling in the test data selection to ensure a random distribution of PoI and non-PoI data points in each dataset.
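The following is a sketch of this validation scheme, not Udacity's exact tester: it uses the pre-0.18 sklearn.cross_validation API that the rest of this notebook uses, the fold count and test size are illustrative assumptions, and clf stands for any of the candidate pipelines defined in the sections below.
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import precision_score, recall_score

# 1000 shuffled, class-stratified train/test splits with a 10% test size
cv = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.1)
precisions, recalls = [], []
for train_idx, test_idx in cv:
    clf.fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
    preds = clf.predict([features[i] for i in test_idx])
    precisions.append(precision_score([labels[i] for i in test_idx], preds))
    recalls.append(recall_score([labels[i] for i in test_idx], preds))
print 'Precision: %0.5f Recall: %0.5f' % (np.mean(precisions), np.mean(recalls))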
PCA Featureset Model
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
import tester

data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

# SVM on the 2-component PCA featureset, tuned over kernel and regularization
pca_svm = Pipeline([('pca',PCA(n_components=2)),('scaler',StandardScaler()),('svm',svm.SVC())])
param_grid = ([{'svm__C': [1000,10000],
                'svm__gamma': [0.01,0.0001],
                'svm__degree':[2,3],
                'svm__kernel': ['linear','rbf','poly']}])
svm_clf = GridSearchCV(pca_svm,param_grid,scoring='recall').fit(features,labels).best_estimator_

# KNeighbors on the PCA featureset, tuned over neighborhood size
pca_knb = Pipeline([('pca',PCA(n_components=2)),('scaler',StandardScaler()),('knb',KNeighborsClassifier())])
param_grid = ([{'knb__n_neighbors': [4,5,6]}])
knb_clf = GridSearchCV(pca_knb,param_grid,scoring='recall').fit(features,labels).best_estimator_

# RandomForest on the PCA featureset, tuned over ensemble size
pca_rfst = Pipeline([('pca',PCA(n_components=2)),('scaler',StandardScaler()),
                     ('rfst',RandomForestClassifier())])
param_grid = ([{'rfst__n_estimators': [4,5,6]}])
rfst_clf = GridSearchCV(pca_rfst,param_grid,scoring='recall').fit(features,labels).best_estimator_

print svm_clf
tester.test_classifier(svm_clf,my_dataset,features_list)
print knb_clf
tester.test_classifier(knb_clf,my_dataset,features_list)
print rfst_clf
tester.test_classifier(rfst_clf,my_dataset,features_list)
Pipeline(steps=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm', SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0, degree=2,
gamma=0.01, kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001, verbose=False))])
Accuracy: 0.87573 Precision: 0.68579 Recall: 0.12550 F1: 0.21217 F2: 0.15001
Total predictions: 15000 True positives: 251 False positives: 115 False negatives: 1749 True negatives: 12885
Pipeline(steps=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('knb', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=4, p=2, weights='uniform'))])
Accuracy: 0.86807 Precision: 0.54393 Recall: 0.06500 F1: 0.11612 F2: 0.07889
Total predictions: 15000 True positives: 130 False positives: 109 False negatives: 1870 True negatives: 12891
Pipeline(steps=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rfst', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=6, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False))])
Accuracy: 0.83913 Precision: 0.25150 Recall: 0.10450 F1: 0.14765 F2: 0.11833
Total predictions: 15000 True positives: 209 False positives: 622 False negatives: 1791 True negatives: 12378
Reviewing the results, we see that the SVM and KNeighbors classifiers have fairly good Precision, meaning they are able to properly identify PoI without incorrectly labelling too many non-PoI. The RandomForestClassifier has poorer Precision performance. The Recall for all classifiers is rather poor, with the SVM classifier having the highest Recall at 0.1255. This indicates that, although the classifiers do not mislabel many non-PoI employees, they tend to miss people who should correctly be labelled PoI. Considering the use case for this model is likely a filter providing a short-list of people to investigate further, high Recall is important to ensure the short-list captures people who are likely to be PoI. Given this, the above classifiers are likely not performant enough to provide sufficient value in this investigation. From visual inspection, the data relationships showed little classification potential, and after exhaustively tuning these classifiers, that interpretation seems to be correct. The PCA featureset does not seem to provide enough informative characteristics to build a suitable classifier, and further feature engineering is likely required.
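As a sanity check on how these metrics are derived, Precision and Recall can be recomputed directly from the SVM classifier's confusion counts reported above:
# Confusion counts for the PCA SVM classifier, from the tester output
TP, FP, FN = 251, 115, 1749

precision = float(TP) / (TP + FP)   # 251/366  = 0.68579
recall    = float(TP) / (TP + FN)   # 251/2000 = 0.12550
print 'Precision: %0.5f Recall: %0.5f' % (precision, recall)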
Engineered Feature Model
from sklearn.feature_selection import SelectKBest

# SVM on the engineered featureset, with SelectKBest choosing the
# k most informative features ahead of the classifier
eng_svm = Pipeline([('scaler',StandardScaler()),('kbest',SelectKBest()),('svm',svm.SVC())])
param_grid = ([{'kbest__k':[3,4,5,6],
                'svm__C': [1,10,100,1000],
                'svm__gamma': [1,0.1,0.01,0.001],
                'svm__degree':[2,3,4],
                'svm__kernel': ['linear','rbf','poly']}])
svm_clf = GridSearchCV(eng_svm,param_grid,scoring='recall').fit(features,labels).best_estimator_

# KNeighbors with SelectKBest feature selection
eng_knb = Pipeline([('scaler',StandardScaler()),('kbest',SelectKBest()),('knb',KNeighborsClassifier())])
param_grid = ([{'kbest__k':[3,4,5,6],'knb__n_neighbors': [2,3,4,5,6]}])
knb_clf = GridSearchCV(eng_knb,param_grid,scoring='recall').fit(features,labels).best_estimator_

# RandomForest with SelectKBest feature selection
eng_rfst = Pipeline([('scaler',StandardScaler()),('kbest',SelectKBest()),
                     ('rfst',RandomForestClassifier())])
param_grid = ([{'kbest__k':[3,4,5,6],'rfst__n_estimators': [2,3,4,5,6]}])
rfst_clf = GridSearchCV(eng_rfst,param_grid,scoring='recall').fit(features,labels).best_estimator_

print svm_clf
tester.test_classifier(svm_clf,my_dataset,features_list)
print knb_clf
tester.test_classifier(knb_clf,my_dataset,features_list)
print rfst_clf
tester.test_classifier(rfst_clf,my_dataset,features_list)
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=6, score_func=<function f_classif at 0x000000001F36F128>)), ('svm', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, degree=4, gamma=1,
kernel='poly', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
Accuracy: 0.81753 Precision: 0.25351 Recall: 0.18950 F1: 0.21688 F2: 0.19958
Total predictions: 15000 True positives: 379 False positives: 1116 False negatives: 1621 True negatives: 11884
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=3, score_func=<function f_classif at 0x000000001F36F128>)), ('knb', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=3, p=2, weights='uniform'))])
Accuracy: 0.86127 Precision: 0.46341 Recall: 0.25650 F1: 0.33022 F2: 0.28165
Total predictions: 15000 True positives: 513 False positives: 594 False negatives: 1487 True negatives: 12406
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=4, score_func=<function f_classif at 0x000000001F36F128>)), ('rfst', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_l...n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False))])
Accuracy: 0.83020 Precision: 0.32066 Recall: 0.24450 F1: 0.27745 F2: 0.25669
Total predictions: 15000 True positives: 489 False positives: 1036 False negatives: 1511 True negatives: 11964
Although we managed to improve our best Recall score with a tuned KNeighbors classifier, the Precision performance was dramatically reduced. At this point, it would likely be beneficial to perform another iteration of the model development process in order to better explore the data, develop features and investigate other classifier algorithms. However, there is one more approach that may be useful to investigate: developing a hybrid model by combining the two input approaches outlined above.
Hybrid Model
Using sklearn's FeatureUnion module, we can combine the PCA reduced-dimensionality dataset with the engineered feature dataset into a hybrid featureset for model development.
from sklearn.pipeline import FeatureUnion
# Concatenate the PCA components with the k best individual features
combined_features = FeatureUnion([("pca", PCA()), ("kbest", SelectKBest())])
hybrid_svm = Pipeline([('features',combined_features),('scaler',StandardScaler()),('svm',svm.SVC())])
param_grid = ([{'features__pca__n_components':[2,3,4,5,6,7],'features__kbest__k':[2,3,4,5,6,7],
'svm__C': [1,10,100,1000],
'svm__gamma': [1,0.1,0.01,0.001],
'svm__degree':[2,3,4],
'svm__kernel': ['rbf','poly']}])
svm_clf = GridSearchCV(hybrid_svm,param_grid,scoring='recall').fit(features,labels).best_estimator_
hybrid_knb = Pipeline([('features',combined_features),('scaler',StandardScaler()),('knb',KNeighborsClassifier())])
param_grid = ([{'features__pca__n_components':[2,3,4,5,6],'features__kbest__k':[2,3,4,5,6],'knb__n_neighbors': [1,2,3,4,5,6,7]}])
knb_clf = GridSearchCV(hybrid_knb,param_grid,scoring='recall').fit(features,labels).best_estimator_
hybrid_rfst = Pipeline([('features',combined_features),('scaler',StandardScaler()),
('rfst',RandomForestClassifier())])
param_grid = ([{'features__pca__n_components':[2,3,4,5,6],'features__kbest__k':[2,3,4,5,6],'rfst__n_estimators': [2,3,4,5,6,7]}])
rfst_clf = GridSearchCV(hybrid_rfst,param_grid,scoring='recall').fit(features,labels).best_estimator_
print svm_clf
tester.test_classifier(svm_clf,my_dataset,features_list)
print knb_clf
tester.test_classifier(knb_clf,my_dataset,features_list)
print rfst_clf
tester.test_classifier(rfst_clf,my_dataset,features_list)
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=7, whiten=False)), ('kbest', SelectKBest(k=5, score_func=<function f_classif at 0x000000001F36F128>))],
transformer_weights=None)), ('scaler', StandardScaler(copy=True, with_mean=True, with...y', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
Accuracy: 0.82167 Precision: 0.33234 Recall: 0.33450 F1: 0.33342 F2: 0.33407
Total predictions: 15000 True positives: 669 False positives: 1344 False negatives: 1331 True negatives: 11656
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=4, whiten=False)), ('kbest', SelectKBest(k=6, score_func=<function f_classif at 0x000000001F36F128>))],
transformer_weights=None)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('knb', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=1, p=2, weights='uniform'))])
Accuracy: 0.80880 Precision: 0.22462 Recall: 0.17700 F1: 0.19799 F2: 0.18484
Total predictions: 15000 True positives: 354 False positives: 1222 False negatives: 1646 True negatives: 11778
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('kbest', SelectKBest(k=5, score_func=<function f_classif at 0x000000001F36F128>))],
transformer_weights=None)), ('scaler', StandardScaler(copy=True, with_mean=True, with...n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False))])
Accuracy: 0.83400 Precision: 0.32771 Recall: 0.23300 F1: 0.27236 F2: 0.24729
Total predictions: 15000 True positives: 466 False positives: 956 False negatives: 1534 True negatives: 12044
The hybrid model was able to produce a classifier that satisfies the minimum target of 0.3 for both Precision and Recall with an SVM classifier. Although this model's Precision is reduced compared to previous iterations, it is better suited for correctly identifying employees who are PoI in order to develop a short-list of people to investigate further.
Final Model
import operator

# Rebuild the best hybrid pipeline with the tuned parameters found above
combined_features = FeatureUnion([("pca", PCA(n_components=7)), ("kbest", SelectKBest(k=6))])
final_svm = Pipeline([('features',combined_features),('scaler',StandardScaler()),
                      ('svm',svm.SVC(C=1,degree=3,kernel='poly',gamma=1))])
svm_clf = final_svm.fit(features,labels)

# Pair feature names with their SelectKBest scores and sort descending
feature_scores = sorted({features_list[i]:svm_clf.get_params()['features'].get_params()['kbest'].scores_[i]
                         for i in range(0,18)}.items(),reverse=True, key=operator.itemgetter(1))
print feature_scores
print svm_clf.get_params()
tester.test_classifier(svm_clf,my_dataset,features_list)
[('restricted_stock', 25.380105299760199), ('from_poi_to_this_person', 24.752523020258508), ('other', 21.327890413979102), ('exercised_stock_options', 18.861795316466416), ('perc_salary', 16.719777335704574), ('long_term_incentive', 11.732698076065354), ('bonus', 10.222904205832778), ('total_payments', 9.4807432034789336), ('total_stock_value', 8.9678193476776205), ('salary', 6.3746144901977475), ('to_messages', 5.7652373136035786), ('expenses', 4.2635766381444693), ('from_this_person_to_poi', 3.0545709279872115), ('deferred_income', 2.8591257010691469), ('perc_to_poi', 1.5752718701560835), ('from_messages', 1.3690711377259386), ('shared_receipt_with_poi', 0.58945562335007018), ('poi', 0.37046177768797534)]
{'features': FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=7, whiten=False)), ('kbest', SelectKBest(k=6, score_func=<function f_classif at 0x000000001F36F128>))],
transformer_weights=None), 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'features__pca__copy': True, 'svm__shrinking': True, 'svm__gamma': 1, 'svm__verbose': False, 'svm__probability': False, 'features__pca__whiten': False, 'features__kbest__k': 6, 'features__kbest__score_func': <function f_classif at 0x000000001F36F128>, 'svm__cache_size': 200, 'scaler__copy': True, 'svm__degree': 3, 'scaler__with_mean': True, 'features__kbest': SelectKBest(k=6, score_func=<function f_classif at 0x000000001F36F128>), 'svm__kernel': 'poly', 'svm__max_iter': -1, 'svm__coef0': 0.0, 'svm__random_state': None, 'scaler__with_std': True, 'svm': SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=1,
kernel='poly', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False), 'features__pca__n_components': 7, 'svm__C': 1, 'svm__class_weight': None, 'svm__tol': 0.001, 'features__pca': PCA(copy=True, n_components=7, whiten=False)}
Accuracy: 0.81413 Precision: 0.31398 Recall: 0.33250 F1: 0.32297 F2: 0.32862
Total predictions: 15000 True positives: 665 False positives: 1453 False negatives: 1335 True negatives: 11547
Conclusions and Reflection
In this analysis, we attempted to build a classification model that could predict whether someone is likely to be a Person of Interest (PoI) in the Enron scandal given their email and financial data. After exploring the data and removing outliers, we investigated two input featuresets: PCA-transformed data and engineered features that provided a normalized representation of the email and financial data. Applying multiple classifier algorithms and tuning via GridSearchCV, the final model used a hybrid of the PCA and engineered featuresets as the model inputs and a polynomial SVM classifier with degree 3 and C and gamma both equal to 1. This final classifier was selected by reviewing the comparative performance of each model's Precision and Recall scores, with the selected model having a Precision of 0.314 and Recall of 0.333. These metrics were chosen because they balance the goal of producing a short-list of potential PoI candidates to flag for further investigation against preventing the over-classification of non-PoI individuals. Using other performance metrics, such as accuracy, would lead to sub-optimal classifiers in this situation, as the raw number of correct predictions is not as useful as ensuring that people who are PoI are identified as such. This is a common pitfall of performance measurement: the goal of the model needs to be well defined before its performance can be measured.
Using the GridSearchCV functionality of sklearn allowed for the automated tuning of the classifier algorithms. This tuning allowed different parameters to be tried in the classifiers, such as the type of SVM kernel (linear, rbf or polynomial) or the number of nearest neighbors to use in the KNeighbors classifier. Tuning is important in producing an effective classifier, as different datasets will exhibit different patterns. A linear SVM may work well with clearly split datasets, but a polynomial SVM will be better suited to datasets that exhibit curved decision boundaries. Without correctly tuning the classification algorithms, a sufficient model will be difficult to develop.
Although the final model meets the minimum goal of Precision and Recall > 0.3, there is likely a more optimal classifier that could be developed by further exploring the dataset, engineering more intelligent features and investigating other classifier algorithms better suited to this problem. This model is also likely overfitted to this dataset and probably not directly useful in future investigations, but the proof of concept could be utilized by investigators to generate their own model as they begin to identify PoI in an investigation of a similar nature to the Enron scandal. It would also be interesting to investigate whether other features could be developed to help identify potential PoI, such as performing Natural Language Processing on the content of email messages to discern whether any conversation patterns emerge between PoI.