Cleanup
- To find correlations, we label encoded the ‘Route To Market’ column, which we now need to one hot encode instead. Comment out the label encoding for ‘Route To Market’ and the correlation code from the previous lab.
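A toy illustration (made-up route values, not the lab dataset) of why label encoding is a poor fit here: LabelEncoder assigns integers in sorted order, which implies a ranking among routes that does not actually exist.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical 'Route To Market' values for demonstration only.
routes = pd.Series(['Reseller', 'Fields Sales', 'Telesales', 'Reseller'])
encoded = LabelEncoder().fit_transform(routes)
# Classes are sorted alphabetically, so 'Fields Sales'=0, 'Reseller'=1, 'Telesales'=2.
print(encoded)  # [1 0 2 1] -- suggests an ordering that has no real meaning
```

One hot encoding avoids this by giving each route its own 0/1 column.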
Reference Material
#Encode columns using label encoding
from sklearn.preprocessing import LabelEncoder
responsetimeencoder = LabelEncoder()
dataset['Opportunity Result'] = responsetimeencoder.fit_transform(dataset['Opportunity Result'])
suppliesgroupencoder = LabelEncoder()
dataset['Supplies Group'] = suppliesgroupencoder.fit_transform(dataset['Supplies Group'])
suppliessubgroupencoder = LabelEncoder()
dataset['Supplies Subgroup'] = suppliessubgroupencoder.fit_transform(dataset['Supplies Subgroup'])
regionencoder = LabelEncoder()
dataset['Region'] = regionencoder.fit_transform(dataset['Region'])
competitortypeencoder = LabelEncoder()
dataset['Competitor Type'] = competitortypeencoder.fit_transform(dataset['Competitor Type'])
#routetomarketencoder = LabelEncoder()
#dataset['Route To Market'] = routetomarketencoder.fit_transform(dataset['Route To Market'])

##What are the correlations between columns and target
#correlations = dataset.corr()['Opportunity Result'].sort_values()
- In lab 3 we selected columns to throw out; drop those columns now.
Reference Material
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
Call this function to drop the columns:
dataset = dataset.drop()
#Throw out unneeded columns
dataset = dataset.drop('Client Size By Employee Count', axis=1)
dataset = dataset.drop('Client Size By Revenue', axis=1)
dataset = dataset.drop('Elapsed Days In Sales Stage', axis=1)
dataset = dataset.drop('Opportunity Number', axis=1)
dataset = dataset.drop('Opportunity Amount USD', axis=1)
dataset = dataset.drop('Competitor Type', axis=1)
dataset = dataset.drop('Supplies Group', axis=1)
dataset = dataset.drop('Supplies Subgroup', axis=1)
dataset = dataset.drop('Region', axis=1)
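As a stylistic alternative (not required by the lab), DataFrame.drop also accepts a list of labels, so all nine drops can be one call. A runnable sketch on a minimal toy frame with the lab’s column names and placeholder values:

```python
import pandas as pd

# Toy frame: real column names, placeholder data (not the lab dataset).
cols = ['Client Size By Employee Count', 'Client Size By Revenue',
        'Elapsed Days In Sales Stage', 'Opportunity Number',
        'Opportunity Amount USD', 'Competitor Type', 'Supplies Group',
        'Supplies Subgroup', 'Region', 'Opportunity Result']
dataset = pd.DataFrame([[0] * len(cols)], columns=cols)

# Drop all unneeded columns in a single call by passing a list of labels.
dataset = dataset.drop(cols[:-1], axis=1)
print(list(dataset.columns))  # ['Opportunity Result']
```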
Encoding and Scaling
- Use one hot encoding from the pandas library to encode the ‘Route To Market’ column. Don’t forget to avoid the dummy variable trap: you should end up with one fewer column than the number of distinct values in ‘Route To Market’. Prefix the new columns with ‘Route To Market’.
- Drop the original ‘Route To Market’ column after it has been encoded.
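A toy demonstration of the dummy variable trap (hypothetical route values, not the lab dataset): with three distinct values, drop_first=True leaves 3 − 1 = 2 dummy columns, since the dropped category is implied when both remaining columns are 0.

```python
import pandas as pd

# Toy column with three distinct values -> expect 3 - 1 = 2 dummy columns.
route = pd.Series(['Reseller', 'Telesales', 'Fields Sales', 'Reseller'],
                  name='Route To Market')
dummies = pd.get_dummies(route, prefix='Route To Market', drop_first=True)
# Categories sort alphabetically, so 'Fields Sales' is the dropped first column.
print(list(dummies.columns))  # ['Route To Market_Reseller', 'Route To Market_Telesales']
```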
Reference Material
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.get_dummies.html
https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.concat.html
Start with the encoding:
pd.get_dummies(dataset['Route To Market'])
Add the prefix to the new columns:
pd.get_dummies(dataset['Route To Market'], prefix='Route To Market')
Drop the first column to avoid the dummy variable trap:
pd.get_dummies(dataset['Route To Market'], prefix='Route To Market', drop_first=True)
Now concat the new columns to the front of the dataset:
dataset = pd.concat([pd.get_dummies(dataset['Route To Market'], prefix='Route To Market', drop_first=True),dataset], axis=1)
Drop Route To Market Column:
dataset = dataset.drop('Route To Market', axis=1)
#One Hot Encode columns that are more than binary
#avoid the dummy variable trap
dataset = pd.concat([pd.get_dummies(dataset['Route To Market'], prefix='Route To Market', drop_first=True), dataset], axis=1)
dataset = dataset.drop('Route To Market', axis=1)
- Set X to contain all columns except Opportunity Result.
- Set y to contain only ‘Opportunity Result’.
- Array slicing is the easiest way to accomplish both; pay special attention to the “Slicing” section in the array indexing reference below.
Reference Material
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html
https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.values.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
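Before tackling the real dataset, here is a toy illustration (made-up columns, not the lab data) of the pieces involved: iloc slicing, columns.get_loc, and .values to convert a DataFrame into a NumPy array.

```python
import pandas as pd

# Toy frame standing in for the lab dataset; 'a' and 'b' are hypothetical features.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'Opportunity Result': [0, 1]})

# get_loc finds the column's position; iloc slices by position; .values gives an ndarray.
y = df.iloc[:, df.columns.get_loc('Opportunity Result')].values
X = df.drop('Opportunity Result', axis=1).iloc[:, 0:df.shape[1] - 1].values

print(y.shape, X.shape)  # (2,) (2, 2)
```

Note that y comes out as a 1-D array while X is 2-D, which is the shape scikit-learn expects.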
Start by creating y:
y = dataset.iloc[???].values
Here is the full answer to create y including the array slicing:
y = dataset.iloc[:, dataset.columns.get_loc('Opportunity Result')].values
To create X, first drop the ‘Opportunity Result’ column:
X = dataset.drop('Opportunity Result', axis=1)
Use slicing again to convert the rest of the dataframe into an array:
X = dataset.drop('Opportunity Result', axis=1).iloc[:, 0:dataset.shape[1] - 1].values
Here is the full answer to create both X and y:
#Create the input data set (X) and the outcome (y)
X = dataset.drop('Opportunity Result', axis=1).iloc[:, 0:dataset.shape[1] - 1].values
y = dataset.iloc[:, dataset.columns.get_loc('Opportunity Result')].values
- Open up y in Variable explorer, note that this is a simple array containing the encoded opportunity result.
- Open up X in Variable explorer, also open dataset at the same time and compare. Notice that it is the same data, just missing the ‘Opportunity Result’ column!
- Import StandardScaler from sklearn.preprocessing.
- Use both fit and transform (via fit_transform) to standardize X by removing the mean and scaling to unit variance.
Reference Material
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Import the class:
from sklearn.preprocessing import StandardScaler
Instantiate a StandardScaler:
sc = StandardScaler()
Use fit_transform to rescale X:
X = sc.fit_transform(X)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
- Open up X in Variable explorer, also open dataset at the same time and compare. Notice how dramatically the data has been rescaled.
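You can also verify the rescaling numerically. On toy data (made-up numbers, not the lab dataset), every column of the scaled output has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X)

# Each column is now centered at 0 with unit variance.
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```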
Lab Complete!
Extra Credit – The Effect of Different Scalers on Data with Outliers
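As a starting point for the extra credit (a sketch, not a prescribed solution): StandardScaler’s mean and standard deviation are dragged around by outliers, while scikit-learn’s RobustScaler centers on the median and scales by the interquartile range, so the inliers keep a usable spread.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature with a single large outlier (made-up numbers).
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

s = StandardScaler().fit_transform(X).ravel()  # outlier inflates the std; inliers get squashed together
r = RobustScaler().fit_transform(X).ravel()    # median/IQR scaling keeps the inliers spread out
print(s)
print(r)
```

Compare how far apart the first four values end up under each scaler.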