Neural Network Workshop – Lab 4: Data Shaping and Scaling

 

Cleanup

Cleanup from Last Lab

  1. To find correlations in the last lab, we encoded several columns that we are now about to discard. Comment out all of your code from the previous lab except the encoding of “Opportunity Result”.

Reference Material

https://stackoverflow.com/questions/36644144/shortcut-key-for-commenting-out-lines-of-python-code-in-spyder

Full Solution
#Encode columns using label encoding 
from sklearn.preprocessing import LabelEncoder

responsetimeencoder = LabelEncoder()
dataset['Opportunity Result'] = responsetimeencoder.fit_transform(dataset['Opportunity Result'])

#suppliesgroupencoder = LabelEncoder()
#dataset['Supplies Group'] = suppliesgroupencoder.fit_transform(dataset['Supplies Group'])
#
#suppliessubgroupencoder = LabelEncoder()
#dataset['Supplies Subgroup'] = suppliessubgroupencoder.fit_transform(dataset['Supplies Subgroup'])
#
#regionencoder = LabelEncoder()
#dataset['Region'] = regionencoder.fit_transform(dataset['Region'])
#
#competitortypeencoder = LabelEncoder()
#dataset['Competitor Type'] = competitortypeencoder.fit_transform(dataset['Competitor Type'])
#
#routetomarketencoder = LabelEncoder()
#dataset['Route To Market'] = routetomarketencoder.fit_transform(dataset['Route To Market'])
#
##What are the correlations between columns and target
##http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf
##.00-.19 “very weak”
##.20-.39 “weak”
##.40-.59 “moderate”
##.60-.79 “strong”
##.80-1.0 “very strong”
#correlations = dataset.corr()['Opportunity Result'].sort_values()

[collapse]

[collapse]
Drop Poorly Correlated Columns

  1. In lab 3 we selected columns to throw out; drop those columns now.

Reference Material

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

Hint 1

The following columns were selected to be dropped: Client Size By Employee Count, Client Size By Revenue, Elapsed Days In Sales Stage, Opportunity Number, Opportunity Amount USD, Competitor Type, Supplies Group, Supplies Subgroup, and Region.

[collapse]
Hint 2

Call this method, passing in the list of columns to drop:

dataset = dataset.drop()

[collapse]
Full Solution
#Throw out unneeded columns 
dataset = dataset.drop(columns=['Client Size By Employee Count',
'Client Size By Revenue',
'Elapsed Days In Sales Stage',
'Opportunity Number',
'Opportunity Amount USD',
'Competitor Type',
'Supplies Group',
'Supplies Subgroup',
'Region'])

[collapse]

[collapse]
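To see how DataFrame.drop behaves in isolation, here is a minimal sketch on a hypothetical toy frame (the column names are invented for illustration, not taken from the lab dataset):

```python
import pandas as pd

# Hypothetical toy frame with three columns.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# drop(columns=[...]) returns a NEW frame; reassign to keep the change.
df = df.drop(columns=['B', 'C'])
print(list(df.columns))  # ['A']
```

Note that drop does not modify the frame in place by default, which is why the lab solution reassigns the result back to dataset.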

 

 

Encoding and Scaling

One Hot Encoding

  1. Use One Hot Encoding from the pandas library to encode the Route To Market column.  Don’t forget to avoid the dummy variable trap: you should have one fewer column than the number of distinct values in the Route To Market column.  Prefix the new columns with Route To Market.
  2. Drop the original Route To Market column after it has been encoded

Reference Material

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.get_dummies.html

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.concat.html

Hint 1

Start with the encoding:

pd.get_dummies(dataset['Route To Market'])

[collapse]
Hint 2

Add the prefix to the new columns:

pd.get_dummies(dataset['Route To Market'], prefix='Route To Market')

[collapse]
Hint 3

Drop the first column to avoid the dummy variable trap:

pd.get_dummies(dataset['Route To Market'], prefix='Route To Market', drop_first=True)

[collapse]
Hint 4

Now concat the new columns to the front of the dataset:

dataset = pd.concat([pd.get_dummies(dataset['Route To Market'], prefix='Route To Market', drop_first=True),dataset], axis=1)

[collapse]
Hint 5

Drop Route To Market Column:

dataset = dataset.drop(columns=['Route To Market'])

[collapse]
Full Solution
#One Hot Encode columns that are more than binary
# avoid the dummy variable trap
dataset = pd.concat([pd.get_dummies(dataset['Route To Market'], prefix='Route To Market', drop_first=True),dataset], axis=1)
dataset = dataset.drop(columns=['Route To Market'])

[collapse]

[collapse]
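To make the dummy variable trap concrete, here is a minimal sketch on a toy series (the three category values are hypothetical, not the actual values in the dataset):

```python
import pandas as pd

# Toy column with three categories: three values should yield only
# two indicator columns once we avoid the dummy variable trap.
s = pd.Series(['Field', 'Reseller', 'Telesales'], name='Route To Market')

dummies = pd.get_dummies(s, prefix='Route To Market', drop_first=True)
print(list(dummies.columns))
# With drop_first=True the first category ('Field') becomes the implicit
# baseline, encoded as all zeros in the remaining indicator columns.
```

The dropped category is still fully recoverable: a row of all zeros means the baseline value, which is exactly why the extra column is redundant.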
Separate the X (inputs) from the y (result)

  1. Set X to contain all columns except Opportunity Result
  2. Set y to contain only Opportunity Result
  3. Array slicing is the easiest way to accomplish both; pay special attention to that part of the array indexing reference.

Reference Material

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.values.html

https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html

Hint 1

Start by creating y, partial answer (???):

y = dataset.iloc[???].values

[collapse]
Hint 2

Here is the full answer to create y including the array slicing:

y = dataset.iloc[:, dataset.columns.get_loc('Opportunity Result')].values

[collapse]
Hint 3

To create X, first drop the column Opportunity Result:

X = dataset.drop(columns=['Opportunity Result'])

[collapse]
Hint 4

Use slicing again to convert the rest of the dataframe into an array:

X = dataset.drop(columns=['Opportunity Result']).iloc[:, 0:dataset.shape[1] - 1].values

[collapse]
Full Solution


#Create the input data set (X) and the outcome (y)
X = dataset.drop(columns=['Opportunity Result']).iloc[:, 0:dataset.shape[1] - 1].values
y = dataset.iloc[:, dataset.columns.get_loc('Opportunity Result')].values

[collapse]
  1. Open up y in the Variable Explorer; note that it is a simple array containing the encoded Opportunity Result
  2. Open up X in the Variable Explorer alongside dataset and compare.  Notice that X holds the same data, minus the Opportunity Result column!

[collapse]
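As a quick sanity check of the X/y split, here is a minimal sketch on a hypothetical two-column frame standing in for the lab dataset:

```python
import pandas as pd

# Hypothetical frame: one feature column plus the encoded outcome.
df = pd.DataFrame({'Feature': [1.0, 2.0, 3.0],
                   'Opportunity Result': [0, 1, 0]})

# iloc[:, n] selects every row of column n; .values converts to a NumPy array.
y = df.iloc[:, df.columns.get_loc('Opportunity Result')].values

# Dropping the outcome column and taking .values gives the input matrix.
X = df.drop(columns=['Opportunity Result']).values

print(y.shape, X.shape)  # (3,) (3, 1)
```

y comes out as a 1-D array and X as a 2-D matrix, which is the shape most scikit-learn estimators expect.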
Scale the Inputs

  1. Import StandardScaler from sklearn.preprocessing
  2. Use both fit and transform to standardize X by scaling to unit variance

Reference Material

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Hint 1

Import the class:

from sklearn.preprocessing import StandardScaler

[collapse]
Hint 2

Instantiate a StandardScaler:

sc = StandardScaler()

[collapse]
Hint 3

Use fit_transform to rescale X:

X = sc.fit_transform(X)

[collapse]
Full Solution
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

[collapse]
  1. Open up X in the Variable Explorer alongside dataset and compare.  Notice how dramatically the data has been rescaled

[collapse]
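To verify what StandardScaler actually does, here is a minimal sketch on toy data: after fit_transform, each column has mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature matrix.
X = np.array([[1.0], [2.0], [3.0]])

sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# Each column is centered on its mean and divided by its standard deviation.
print(X_scaled.ravel())                  # roughly [-1.2247, 0.0, 1.2247]
print(X_scaled.mean(), X_scaled.std())   # ~0.0 and 1.0
```

fit learns the per-column mean and standard deviation, and transform applies them; fit_transform simply does both in one call.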

 

Lab Complete!

 

 

Extra Credit – The Effect of Different Scalers on Data with Outliers
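As one possible starting point for the extra credit, the sketch below compares StandardScaler against RobustScaler on toy data with a single large outlier (the choice of RobustScaler and the data are assumptions for illustration, not prescribed by the lab):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical data: four inliers and one large outlier (100.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)  # centers on the median, scales by IQR

# The outlier inflates StandardScaler's mean and std, squashing the
# inliers into a narrow band; RobustScaler uses the median and
# interquartile range, which the outlier barely affects.
print(standard.ravel())
print(robust.ravel())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

Compare the spread of the first four values in each output: RobustScaler keeps the inliers well separated, while StandardScaler compresses them, which is why robust scaling is often preferred for data with outliers.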