Neural Network Workshop – Lab 3 Correlations

Prep Data for Correlation Evaluation

Label Encode Opportunity Result

  1. Import LabelEncoder from the sklearn.preprocessing Python library.
  2. Use LabelEncoder to encode the column ‘Opportunity Result’ as 0 and 1 instead of Won/Loss.

Reference Material

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Hint 1

Import LabelEncoder from the Python library sklearn.preprocessing:

from sklearn.preprocessing import LabelEncoder

Hint 2

Instantiate a LabelEncoder:

responsetimeencoder = LabelEncoder()

Hint 3

Perform the Encoding:

dataset['Opportunity Result'] = responsetimeencoder.fit_transform(dataset['Opportunity Result'])

Full Solution
#Encode columns using label encoding 
from sklearn.preprocessing import LabelEncoder

responsetimeencoder = LabelEncoder()
dataset['Opportunity Result'] = responsetimeencoder.fit_transform(dataset['Opportunity Result'])
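
To check which label became 0 and which became 1, the fitted encoder can be inspected. The following is a minimal sketch on a toy dataframe (the values are made up for illustration and the variable names `toy`, `encoder`, `mapping`, and `restored` are not part of the lab). LabelEncoder assigns codes in sorted label order, so 'Loss' should map to 0 and 'Won' to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the workshop dataframe (hypothetical values).
toy = pd.DataFrame({'Opportunity Result': ['Won', 'Loss', 'Loss', 'Won']})

encoder = LabelEncoder()
toy['Opportunity Result'] = encoder.fit_transform(toy['Opportunity Result'])

# classes_ lists the original labels; each label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
restored = encoder.inverse_transform(toy['Opportunity Result'])
print(list(restored))
```

Keeping the fitted encoder around is what makes it possible to translate codes back to labels later.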

Find the Correlations

  1. Using the dataframe, find the correlations with ‘Opportunity Result’, sort them, and assign the result to a variable named correlations.

Reference Material

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html

Hint 1

Find the correlations:

correlations = dataset.corr()['Opportunity Result']

Hint 2

Sort the values:

correlations = correlations.sort_values()

Full Solution

Find the correlations with the target column and sort them in one step:

# What are the correlations between each column and the target?
correlations = dataset.corr()['Opportunity Result'].sort_values()
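
Depending on the pandas version, the call above may behave differently while text columns are still present: older releases silently skip them, whereas pandas 2.0 and later raise an error unless `numeric_only=True` is passed (the parameter exists from pandas 1.5). The sketch below assumes a recent pandas and uses a toy dataframe whose column names and values, other than 'Opportunity Result', are hypothetical.

```python
import pandas as pd

# Toy stand-in for the workshop dataframe (hypothetical columns and values).
toy = pd.DataFrame({
    'Opportunity Result': [1, 0, 0, 1, 1, 0],
    'Deal Size': [5, 1, 2, 6, 4, 1],
    'Days In Stage': [10, 40, 35, 12, 15, 50],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],  # text column
})

# numeric_only=True (pandas 1.5+) leaves text columns out of the calculation.
correlations = toy.corr(numeric_only=True)['Opportunity Result'].sort_values()
print(correlations)
```

The text column does not appear in the result, and the target column sorts last because its correlation with itself is 1.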

Interpret the Correlations

  1. Use Variable Explorer to display the correlations variable. It should look something like this:

  2. Notice that Opportunity Result shows a correlation of 1 with itself, because any column is perfectly correlated with itself. To judge the other values, consider a scale similar to this, applied to the absolute value of the correlation:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
  3. Notice that we only see correlations for 14 columns, yet the dataframe (dataset) has 19 columns. The reason is that correlations can only be computed between numeric columns, and five of the columns contain text. Let's fix that!
  4. Label encode the following columns: ‘Supplies Group’, ‘Supplies Subgroup’, ‘Region’, ‘Route To Market’, ‘Competitor Type’, and then run the correlation again.
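
The strength scale in the list above can be turned into a small helper for reading the results. This is only an illustrative sketch: the function name `correlation_strength` is hypothetical, and it assumes the scale is applied to the magnitude of the correlation, with the sign indicating direction only.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the lab's descriptive scale."""
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude < 0.20:
        return 'very weak'
    if magnitude < 0.40:
        return 'weak'
    if magnitude < 0.60:
        return 'moderate'
    if magnitude < 0.80:
        return 'strong'
    return 'very strong'


# Example values chosen for illustration only.
for r in (1.0, -0.45, 0.07):
    print(r, correlation_strength(r))
```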

Reference Material

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.fit_transform

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html

Hint 1

This code should be after the label encoding of Opportunity Result, but before the call to

dataset.corr()

Hint 2

Add the following to label encode ‘Supplies Group’:

suppliesgroupencoder = LabelEncoder()
dataset['Supplies Group'] = suppliesgroupencoder.fit_transform(dataset['Supplies Group'])

Hint 3

Add the following to label encode ‘Supplies Subgroup’:

suppliessubgroupencoder = LabelEncoder()
dataset['Supplies Subgroup'] = suppliessubgroupencoder.fit_transform(dataset['Supplies Subgroup'])

Hint 4

Add the following to label encode ‘Region’:

regionencoder = LabelEncoder()
dataset['Region'] = regionencoder.fit_transform(dataset['Region'])

Hint 5

Add the following to label encode ‘Competitor Type’:

competitortypeencoder = LabelEncoder()
dataset['Competitor Type'] = competitortypeencoder.fit_transform(dataset['Competitor Type'])

Full Solution
suppliesgroupencoder = LabelEncoder()
dataset['Supplies Group'] = suppliesgroupencoder.fit_transform(dataset['Supplies Group'])

suppliessubgroupencoder = LabelEncoder()
dataset['Supplies Subgroup'] = suppliessubgroupencoder.fit_transform(dataset['Supplies Subgroup'])

regionencoder = LabelEncoder()
dataset['Region'] = regionencoder.fit_transform(dataset['Region'])

competitortypeencoder = LabelEncoder()
dataset['Competitor Type'] = competitortypeencoder.fit_transform(dataset['Competitor Type'])

routetomarketencoder = LabelEncoder()
dataset['Route To Market'] = routetomarketencoder.fit_transform(dataset['Route To Market'])
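
The five blocks above repeat the same two lines per column. As an alternative sketch (not the lab's own solution), the same encoding can be written as a loop that stores one fitted encoder per column in a dictionary. The toy dataframe, its values, and the names `toy`, `text_columns`, and `encoders` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the workshop dataframe (hypothetical values).
toy = pd.DataFrame({
    'Supplies Group': ['Group B', 'Group A', 'Group B'],
    'Region': ['West', 'East', 'West'],
})

text_columns = ['Supplies Group', 'Region']
encoders = {}  # keep each fitted encoder so codes can be decoded later

for column in text_columns:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

print(toy)
```

Note that the integer codes follow the sorted order of the labels, so the correlation computed for such a column reflects that ordering.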

Interpret the Correlations Take 2

  1. Use Variable Explorer to display the correlations variable. It should look something like this:

  2. Remember to think of a scale similar to this:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
  3. Notice that we see correlation results for all 19 columns! Remember that a negative correlation indicates an inverse relationship and a positive correlation indicates a direct relationship.
  4. Select which columns to discard. I chose 9 to discard.
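
One way to shortlist candidates for discarding is to filter the correlations by absolute value. The sketch below uses made-up correlation values and a hypothetical cutoff at the top of the "very weak" band; it does not reproduce the nine columns chosen in this lab.

```python
import pandas as pd

# Hypothetical correlation values, standing in for the lab's correlations Series.
correlations = pd.Series({
    'Column A': -0.31,
    'Column B': -0.04,
    'Column C': 0.02,
    'Column D': 0.22,
    'Opportunity Result': 1.0,
})

threshold = 0.20  # hypothetical cutoff: anything below it counts as "very weak"
weak_columns = correlations[correlations.abs() < threshold].index.tolist()
print(weak_columns)

# With the real dataframe, the drop could then look like:
# dataset = dataset.drop(columns=weak_columns)
```
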
Hint 1

The following columns have the weakest relationship and will be discarded:


 

Lab Complete!

 

Extra Credit – Pearson's Correlation

Read More

Learn more about how the correlations work:

http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf
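
As a small illustration of what the correlation measures, the sketch below computes Pearson's r by hand (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with pandas, whose `corr` uses the Pearson method by default. The two series are made-up values.

```python
import numpy as np
import pandas as pd

# Made-up sample values for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Deviations of each value from its series mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r computed directly from the deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from pandas (default method is Pearson).
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```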
