Prep Data for Correlation Evaluation
- From the sklearn.preprocessing Python library, import LabelEncoder.
- Use LabelEncoder to encode the column ‘Opportunity Result’ as 0 and 1 instead of the text values Loss and Won.
Reference Material
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Import the LabelEncoder from the python library sklearn.preprocessing:
from sklearn.preprocessing import LabelEncoder
Instantiate a LabelEncoder:
responsetimeencoder = LabelEncoder()
The encoding needs both a fit and a transform; fit_transform does both in one call.
Perform the encoding:
dataset['Opportunity Result'] = responsetimeencoder.fit_transform(dataset['Opportunity Result'])
# Encode columns using label encoding
from sklearn.preprocessing import LabelEncoder
responsetimeencoder = LabelEncoder()
dataset['Opportunity Result'] = responsetimeencoder.fit_transform(dataset['Opportunity Result'])
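To see which integer each label receives, the fitted encoder can be inspected. The sketch below uses a small made-up dataframe (the name toy and its four rows are assumptions, not the lab's data). LabelEncoder numbers the labels in sorted order, so with the values Loss and Won, Loss becomes 0 and Won becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the lab's 'Opportunity Result' column.
toy = pd.DataFrame({"Opportunity Result": ["Won", "Loss", "Loss", "Won"]})

encoder = LabelEncoder()
toy["Opportunity Result"] = encoder.fit_transform(toy["Opportunity Result"])

# classes_ holds the original labels in sorted order;
# each label's position in classes_ is its encoded value.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy["Opportunity Result"].tolist())

# inverse_transform recovers the original text labels from the integers.
restored = encoder.inverse_transform(toy["Opportunity Result"])
print(list(restored))
```

Keeping the encoder object around (rather than discarding it after fit_transform) is what makes the reverse lookup possible later.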
- Using the dataframe, find the correlations with ‘Opportunity Result’, assign them to a variable named correlations, and sort them.
Reference Material
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html
Find the correlations:
correlations = dataset.corr()['Opportunity Result']
Sort the values:
correlations = correlations.sort_values()
Both steps combined into a single statement:
## What are the correlations between columns and target
correlations = dataset.corr()['Opportunity Result'].sort_values()
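The same call can be tried on a tiny all-numeric dataframe to see what the result looks like. In this sketch the columns ‘Deal Size’ and ‘Days Open’ and all of the values are made up for illustration; only the corr() and sort_values() calls mirror the lab.

```python
import pandas as pd

# Hypothetical toy data; only 'Opportunity Result' is a real column name from the lab.
toy = pd.DataFrame({
    "Opportunity Result": [1, 0, 0, 1, 1, 0],
    "Deal Size": [5, 1, 2, 6, 4, 1],
    "Days Open": [10, 40, 35, 12, 15, 50],
})

# corr() returns a square matrix; selecting one column gives that column's
# correlation with every other column as a Series.
correlations = toy.corr()["Opportunity Result"].sort_values()
print(correlations)
```

sort_values() sorts in ascending order by default, so the most negative correlations appear first and the target's correlation with itself (1.0) appears last.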
- Use Variable Explorer to display the correlations variable. It should look something like this:
- Notice that Opportunity Result shows a correlation of 1 with itself, because any column is perfectly correlated with itself. To judge the other values, consider a scale similar to this, applied to the absolute value of the correlation:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
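The scale above can be written as a small helper. This is only a sketch: the function name describe_strength is made up, and treating the bands as applying to the absolute value of the correlation is an assumption about how the scale is meant to be read.

```python
def describe_strength(r):
    """Return the label from the scale above for a correlation coefficient r."""
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"


print(describe_strength(-0.45))
print(describe_strength(0.05))
```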
- Notice that we only see correlations for 14 columns, yet the dataframe (dataset) has 19 columns. The reason is that correlations can only be computed between numeric columns, and five of the columns contain text. Let's fix that!
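One way to check which columns are being left out is to list the non-numeric ones. The sketch below uses a made-up dataframe (the values are hypothetical). It also passes numeric_only=True to corr(), which assumes pandas 1.5 or newer; as far as I know, pandas 2.0 and later raise an error on text columns instead of silently skipping them unless this argument is given.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Opportunity Result": [1, 0, 1, 0],
    "Region": ["Pacific", "Midwest", "Pacific", "Northwest"],
    "Route To Market": ["Reseller", "Other", "Reseller", "Other"],
    "Deal Size": [3, 1, 4, 2],
})

# Columns that are not numeric are the ones missing from the correlation output.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# numeric_only=True makes the exclusion of text columns explicit (pandas 1.5+).
numeric_corr = toy.corr(numeric_only=True)
print(numeric_corr.columns.tolist())
```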
- Label encode the following columns: ‘Supplies Group’, ‘Supplies Subgroup’, ‘Region’, ‘Route To Market’, and ‘Competitor Type’, and then run the correlation again.
Reference Material
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html
This code should come after the label encoding of Opportunity Result, but before the call to dataset.corr().
Add the following to label encode ‘Supplies Group’:
suppliesgroupencoder = LabelEncoder()
dataset['Supplies Group'] = suppliesgroupencoder.fit_transform(dataset['Supplies Group'])
Add the following to label encode ‘Supplies Subgroup’:
suppliessubgroupencoder = LabelEncoder()
dataset['Supplies Subgroup'] = suppliessubgroupencoder.fit_transform(dataset['Supplies Subgroup'])
Add the following to label encode ‘Region’:
regionencoder = LabelEncoder()
dataset['Region'] = regionencoder.fit_transform(dataset['Region'])
Add the following to label encode ‘Competitor Type’:
competitortypeencoder = LabelEncoder()
dataset['Competitor Type'] = competitortypeencoder.fit_transform(dataset['Competitor Type'])
Add the following to label encode ‘Route To Market’:
routetomarketencoder = LabelEncoder()
dataset['Route To Market'] = routetomarketencoder.fit_transform(dataset['Route To Market'])
suppliesgroupencoder = LabelEncoder()
dataset['Supplies Group'] = suppliesgroupencoder.fit_transform(dataset['Supplies Group'])
suppliessubgroupencoder = LabelEncoder()
dataset['Supplies Subgroup'] = suppliessubgroupencoder.fit_transform(dataset['Supplies Subgroup'])
regionencoder = LabelEncoder()
dataset['Region'] = regionencoder.fit_transform(dataset['Region'])
competitortypeencoder = LabelEncoder()
dataset['Competitor Type'] = competitortypeencoder.fit_transform(dataset['Competitor Type'])
routetomarketencoder = LabelEncoder()
dataset['Route To Market'] = routetomarketencoder.fit_transform(dataset['Route To Market'])
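The five near-identical blocks can also be written as a loop that keeps one encoder per column in a dictionary. This is an alternative sketch, not the lab's own code: the encoders dictionary is an assumed name, and the small dataset built here is a made-up stand-in with hypothetical values. Note that the integers follow the sorted order of the labels, so for categories with no natural order the resulting correlations should be read with some caution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the lab's dataframe; all values are made up.
dataset = pd.DataFrame({
    "Supplies Group": ["g1", "g2", "g1"],
    "Supplies Subgroup": ["s2", "s1", "s3"],
    "Region": ["West", "East", "West"],
    "Route To Market": ["r1", "r1", "r2"],
    "Competitor Type": ["c2", "c1", "c2"],
})

text_columns = ["Supplies Group", "Supplies Subgroup", "Region",
                "Route To Market", "Competitor Type"]

# One encoder per column, kept by name so labels can be decoded later.
encoders = {}
for column in text_columns:
    encoders[column] = LabelEncoder()
    dataset[column] = encoders[column].fit_transform(dataset[column])

print(dataset)
print(list(encoders["Region"].classes_))
```

A separate encoder per column matters because each one stores its own label-to-integer mapping; reusing a single encoder would overwrite the mapping on every fit.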
- Use Variable Explorer to display the correlations variable. It should look something like this:
- Remember to think of a scale similar to this:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
- Notice that we now see correlation results for all 19 columns! Remember that a negative correlation indicates an inverse relationship and a positive correlation indicates a direct relationship.
- Select which columns to discard. I chose 9 to discard.
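One possible way to shortlist columns for discarding is to filter on the absolute correlation. The lab does not say which 9 columns were chosen or how, so everything here is an assumption: the 0.20 cutoff (the top of the “very weak” band), the placeholder column names, and the correlation values, which are made up.

```python
import pandas as pd

# Hypothetical correlation values; in the lab these come from dataset.corr().
correlations = pd.Series({
    "Opportunity Result": 1.00,
    "Column A": -0.35,
    "Column B": 0.05,
    "Column C": 0.22,
    "Column D": -0.10,
})

threshold = 0.20  # assumed cutoff, not a value given in the lab

# abs() so that strongly negative correlations are kept as well.
weak = correlations[correlations.abs() < threshold]
columns_to_discard = weak.index.tolist()
print(columns_to_discard)
```

A cutoff like this is only a starting point; columns can still be kept or dropped based on what is known about the data.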
Lab Complete!
Extra Credit – Pearson's Correlation
Learn more about how the correlations work:
http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf
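DataFrame.corr() uses the Pearson method by default. As a minimal sketch of what that coefficient measures, the function below computes it by hand in plain Python; the name pearson_r and the sample lists are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance of xs and ys divided by the
    product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The 1/n factors cancel between numerator and denominator,
    # so plain sums of deviations are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear, increasing
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly linear, decreasing
```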