Preprocessing
This section covers the vocabulary elements for preprocessing.
Nylon has built-in vocabulary components for a variety of elements. All of the vocabulary elements outlined in this section belong under the preprocessor tag. For example:
"preprocessor": {
    "ordinal" : "column name1"
},
All built-in preprocessing elements work with a type: column_name pair. The key is the grammar component, and the value is the name of the column on which you want to perform that operation. To perform an operation on more than one column, pass a list of column names:
"preprocessor": {
    "ordinal" : ["column name1", "column name2"]
},
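Because a value may be either a single column name or a list of names, code that reads a specification has to handle both shapes. The following is a minimal sketch of that normalization, assuming the specification has already been parsed into a Python dict. The helper name `columns_for` is hypothetical and is not part of Nylon's API:

```python
def columns_for(preprocessor, element):
    """Return the columns listed for one vocabulary element, always as a list.

    `columns_for` is a hypothetical helper for illustration, not a Nylon function.
    """
    value = preprocessor.get(element, [])
    # A bare string names a single column; wrap it so callers can always iterate.
    if isinstance(value, str):
        return [value]
    return list(value)


spec = {"preprocessor": {"ordinal": ["column name1", "column name2"]}}
print(columns_for(spec["preprocessor"], "ordinal"))
```

Treating both forms as a list keeps the rest of a pipeline free of special cases for the single-column form.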
Below are the different vocabulary elements available to you for preprocessing.

Preprocessing Vocabulary

scale: Standardizes the target feature between 0 and 1 based on the variance of the data. Usually applied to numerical variables in order to reduce the effect of larger numbers during model training.
Min-Max Scaler
min-max: Transforms features by scaling them to a given range defined by the maximum and minimum of the variable.
"preprocessor": {
    "scale" : "column name",
    "min-max" : "column name"
},
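The min-max transformation can be written out directly. The following is a minimal sketch of the general formula, not Nylon's internal implementation. The function name `min_max_scale` and its `feature_range` parameter are assumptions made for illustration:

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale numbers linearly so the smallest maps to the low end of
    feature_range and the largest maps to the high end.

    Hypothetical helper for illustration; not a Nylon function.
    """
    low, high = feature_range
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        # A constant column has no spread; map every value to the low end.
        return [low for _ in values]
    return [low + (v - smallest) / span * (high - low) for v in values]


print(min_max_scale([10, 20, 40]))
```

Each value keeps its relative position between the column's minimum and maximum, so a single extreme value compresses all the others toward one end of the range.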

Categorical Data

Label Encoding
label-encode: Label encoding encodes the target variable with a value between 0 and the number of classes minus one. It is performed on columns composed of string values in order to encode them as numbers.

Ordinal Features

ordinal: Ordinal features are used to encode categorical variables that have a clear ordering. An example is a feature describing three economic statuses: low, medium, and high.

One Hot Encoding

one-hot: One Hot Encoding encodes categorical variables as a one-hot numeric array, in which each category gets its own 0/1 column.
"preprocessor": {
    "label-encode" : "column name",
    "ordinal" : "column name",
    "one-hot" : "column name"
},
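To make the difference between the three encodings concrete, here is a minimal sketch in plain Python that applies each one to the economic status example. These functions are hypothetical illustrations of the general techniques, not Nylon's implementation. Sorting the classes alphabetically for label and one-hot encoding is an assumption:

```python
def label_encode(values):
    """Map each distinct value to an integer from 0 to n_classes - 1."""
    classes = sorted(set(values))  # assumption: ids follow alphabetical order
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values]


def ordinal_encode(values, order):
    """Map each value to its position in an explicit low-to-high ordering."""
    rank = {label: i for i, label in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with a single 1 marking its category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


status = ["low", "high", "medium", "low"]
print(label_encode(status))                               # [1, 0, 2, 1]
print(ordinal_encode(status, ["low", "medium", "high"]))  # [0, 2, 1, 0]
print(one_hot_encode(status))
```

In this sketch the label encoding assigns ids alphabetically, so "high" receives 0 and "low" receives 1, while the ordinal encoding preserves the intended order. One-hot encoding avoids implying any order at all, at the cost of one column per category.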

Text Data

clean-text: Runs a text-cleaning pipeline on the specified column. Use this on any column in your dataset that holds longer strings of text (more than 3-4 words). The pipeline includes lemmatization, tokenization, and spell checking.
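Like the other built-in elements, clean-text follows the type: column_name pattern. As a small sketch, the fragment can be built in Python and printed as JSON. The column name `review_text` is hypothetical, and only the preprocessor block is shown rather than a complete specification file:

```python
import json

# "review_text" is a hypothetical column name used only for illustration.
spec_fragment = {"preprocessor": {"clean-text": "review_text"}}

# Print the fragment in the JSON form used by the specification file.
print(json.dumps(spec_fragment, indent=4))
```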

Custom Preprocessing

Nylon supports custom modules for preprocessing different columns of your input dataset. You can combine custom modules with built-in grammar components. Your custom function should take the following parameters, in this order:
Parameter           Description
Pandas DataFrame    DataFrame
JSON File           Nylon Specifications File

Example

"preprocessor": {
    "custom": {
        "loc": "absolute_path_to_file",
        "name": "function_name"
    },

    "label-encode": "a_column"
}
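A custom function might look like the following sketch. The function name `fill_missing_ages` and the `age` column are hypothetical. Two further assumptions are labeled in the code: that the specification arrives as an already-parsed dict, and that the function returns the modified DataFrame. The source states neither.

```python
import pandas as pd


def fill_missing_ages(df, spec):
    """Hypothetical custom preprocessing step.

    Parameters follow the documented order: the DataFrame first, then the
    Nylon specification. Assumption: `spec` is the parsed specification
    (unused here). Assumption: the modified DataFrame is returned.
    """
    df = df.copy()  # avoid mutating the caller's data
    # Replace missing ages with the median of the observed ages.
    df["age"] = df["age"].fillna(df["age"].median())
    return df


frame = pd.DataFrame({"age": [20.0, None, 40.0]})
print(fill_missing_ages(frame, spec={}))
```

With a function like this, "name" in the custom block would be fill_missing_ages and "loc" would be the absolute path of the file that defines it.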