The following section covers four Machine Learning plugin steps.
The last two steps, ‘Build Model for Intent Classification and Entity Extraction’ and ‘Intent Classification and Entity Extraction’, let you build a model for intent classification and entity extraction and then use that model to classify intents and extract entities.
Prerequisites:
This step lets you build a classification model based on training data. One column or attribute of your data set can typically be considered one feature. Features should ideally be independent; they are also referred to as dimensions. The value you want to predict is called the label. This step can be used to build a model when features are of Number type, String type, or a mix of both.
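As an illustrative sketch (the column names and helper below are hypothetical, not part of the plugin), a row of training data can be viewed as a set of features plus the label to predict:

```python
# Hypothetical training rows: 'age' and 'message' are features, 'label' is the value to predict.
rows = [
    {"age": 34, "message": "please reset my password", "label": "support"},
    {"age": 28, "message": "what is the price of the pro plan", "label": "sales"},
]

def split_features_label(row, label_key="label"):
    """Separate the feature columns from the label column."""
    features = {k: v for k, v in row.items() if k != label_key}
    return features, row[label_key]

features, label = split_features_label(rows[0])
print(features)  # {'age': 34, 'message': 'please reset my password'}
print(label)     # support
```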
Configuration Tab
No.
Field Name
Description
Row Handling
1
Step name
Used to specify the name of the step. The step name should be unique within the workflow.
2
Number of Rows to Process
Can have one of the following two values:
- All
- Batch
Governs whether all rows of the dataset are passed in one shot or in batches. Typically, if you are building a model on a very large dataset, use Batch row processing.
3
Size
Applies only when ‘Batch’ is selected for ‘Number of Rows to Process’. For example, if your dataset has 50,000 rows, 1,000 is a reasonable batch size.
Data Model Location
4
File name
Used to specify the name and location of the file that will contain the model.
Algorithm
5
Algorithm
Used to specify the algorithm to be used for building the model. The step supports the following algorithms:
- Linear SVC
- SVC
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
- Multinomial NB
- SGD Classifier
- K Neighbors Classifier
6
Algorithm Parameters*
Based on the algorithm selected, corresponding algorithm parameters are shown. These are described in the last table of this plugin description.
Fields Tab
No.
Field Name
Description
Fields
1
Name
Name of the field
2
Incoming Type
Used to specify the data type of the field. It can be either Number or String.
3
Text Processing
All classification algorithms work on vectors of numbers. Fields of type String need to be converted internally to numeric vectors, and this cell lets you specify the Text Processing attributes for that field. The cell can be clicked only for fields with String data type. The dialog that opens has two tabs.
- The first tab lets you specify one or more text processing options:
- Remove Punctuation: removes standard punctuation marks from the text.
- Remove Stop Words: removes stop words like ‘the’, ‘as’, ‘in’, etc.
- Additional Stop Words: lets you choose a plain text file with one additional stop word per line. These are your domain-specific stop words.
- Lemmatization: converts words to their dictionary form, e.g. mice to mouse, houses to house.
- Stemming: reduces a word to its stem no matter what word form is used in the text, so going, went, goes, etc. would be converted to go.
- The second tab lets you test your text processing options. In the text box next to ‘Value:’, type any text; clicking the ‘Test’ button shows, in the text box next to ‘Result:’, the text after the selected processing options have been applied.
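A minimal pure-Python sketch of the first two options (the stop-word list here is a simplified stand-in; the plugin’s actual implementation is not shown):

```python
import string

# Simplified stand-in; the plugin's real stop-word list is much richer.
STOP_WORDS = {"the", "as", "in", "is", "a", "an"}

def preprocess(text, extra_stop_words=()):
    """Lowercase, strip punctuation, and drop stop words."""
    stops = STOP_WORDS | set(extra_stop_words)
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.lower().split() if w not in stops]

print(preprocess("The weather, as expected, is good!"))
# ['weather', 'expected', 'good']
```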
When you process a feature of type String, as mentioned in the ‘Text Processing’ row of the table above, the feature needs to be converted into numeric features. The Text Vectorization tab governs how all String features are converted into numeric features. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The table below shows how a string is tokenized internally for different n-gram values.
1. String: Weather today is good; N Gram Start/End: 1-1; Tokens: 'Weather', 'today', 'good'
2. String: Weather today is good; N Gram Start/End: 1-2; Tokens: 'Weather', 'today', 'good', 'Weather today', 'today good'
3. String: Weather today is good; N Gram Start/End: 1-3; Tokens: 'Weather', 'today', 'good', 'Weather today', 'today good', 'Weather today good'
4. String: Weather today is good; N Gram Start/End: 2-3; Tokens: 'Weather today', 'today good', 'Weather today good'
* ‘is’ is treated as a stop word and not considered.
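The tokenization above can be sketched in a few lines of Python (an illustration of the n-gram idea, not the plugin’s actual tokenizer):

```python
def ngram_tokens(text, n_start, n_end, stop_words={"is"}):
    """Tokenize, drop stop words, then emit every n-gram with n_start <= n <= n_end."""
    words = [w for w in text.split() if w.lower() not in stop_words]
    grams = []
    for n in range(n_start, n_end + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

print(ngram_tokens("Weather today is good", 1, 2))
# ['Weather', 'today', 'good', 'Weather today', 'today good']
```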
Text Vectorization Tab
No.
Field Name
Description
1
N Gram start
Should be a numeric value with minimum of 1
2
N Gram end
Should be a numeric value greater than or equal to N Gram start
3
Vectorization
The N-Gram operation tokenizes the input String feature. Vectorization is the operation where these tokens are converted to the numeric features needed by the algorithms. Three types of vectorizers are supported:
- Count Vectorizer: counts the number of times a token shows up in the document and uses this value as its weight.
- Tfidf Vectorizer: TF-IDF stands for “term frequency-inverse document frequency”, meaning the weight assigned to each token depends not only on its frequency in a document but also on how common that term is in the entire corpus.
- Hashing Vectorizer: designed to be as memory efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside of this method is that once vectorized, the features’ names can no longer be retrieved.
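Pure-Python sketches of the Count and Hashing vectorizers may make the difference concrete (illustrative only; the plugin uses library implementations, and TF-IDF additionally rescales counts by inverse document frequency):

```python
from collections import Counter

def count_vectorize(docs):
    """Count Vectorizer sketch: each token's weight is its raw count per document."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return [[Counter(doc.split())[tok] for tok in vocab] for doc in docs], vocab

def hashing_vectorize(doc, n_features=8):
    """Hashing trick sketch: token -> index via a hash; token names are not recoverable."""
    vec = [0] * n_features
    for tok in doc.split():
        vec[hash(tok) % n_features] += 1
    return vec

vectors, vocab = count_vectorize(["good weather today", "good good day"])
print(vocab)    # ['day', 'good', 'today', 'weather']
print(vectors)  # [[0, 1, 1, 1], [1, 2, 0, 0]]
```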
Evaluation Tab
No.
Field Name
Description
1
Evaluation Type
Used to select the evaluation method: Train/Test Split or Stratified k-Fold Cross-Validation.
2
Test Percentage
For Train/Test Split:
Data Types allowed: float, int or None, optional (default=None)
- If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
- If int, represents the absolute number of test samples.
- If None, it will be set to 0.25.
3
Number of Folds
For Stratified k-Fold Cross-Validation:
Data Types allowed: int, optional (default=3)
- Must be at least 2.
4
Random State
For Train/Test Split:
Data Types allowed: int, RandomState instance or None, optional (default=None)
- If int, random_state is the seed used by the random number generator;
- If RandomState instance, random_state is the random number generator;
- If None, the random number generator is the RandomState instance used by np.random.
5
Shuffle
For Stratified k-Fold Cross-Validation:
Data Types allowed: boolean, optional (default=True)
- Whether to shuffle each class’s samples before splitting into batches.
6
Evaluation Output File Name
Used to specify the name of the HTML evaluation report output file.
7
Add output filename to result
Enable checkbox to display downloadable link of html report output file on AE portal.
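The parameter names and defaults above mirror scikit-learn’s model-selection utilities; assuming that mapping, the two evaluation types can be sketched as:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Train/Test Split: Test Percentage 0.25 with a fixed Random State, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # 6 2

# Stratified k-Fold Cross-Validation: every fold keeps the class proportions of y.
folds = list(StratifiedKFold(n_splits=3, shuffle=True, random_state=42).split(X, y))
print(len(folds))  # 3
```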
*The following rows list the algorithms; the right-hand column describes the corresponding algorithm parameters.
Algorithm Description
Algorithm Parameter Description
1
loss: Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.
In machine learning, the loss function measures the quality of your solution, while the penalty function imposes constraints on the solution for regularization.
C is the penalty parameter of the error term. It maximizes the kernel margin while keeping the misclassification error minimal. C is 1 by default, a reasonable choice that works well for the majority of common datasets. If your data set contains many noisy observations, decrease C: lower C values give better results on noisy data, and the opposite holds for clean data.
max_iter (int, default=1000) is the maximum number of iterations to be run for convergence.
2
Kernel (string, optional (default=’rbf’)) Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
Currently, the plugin supports ‘linear’, ‘poly’ and ‘rbf’, as explained below:
- A Linear kernel works well only when the data is linearly separable (in some dimension of the feature space). The learned hyperplane model can then be used for prediction.
- The RBF kernel generally does a decent job on most non-linear datasets and is the most widely used kernel for them.
- A Poly kernel is suitable if the data is separable by higher-order functions. Its practical benefits are limited, so it is not a commonly used kernel.
C is the penalty parameter. It maximizes the margin while keeping the misclassification error minimal. C is 1 by default, a reasonable choice that works well for the majority of common datasets. If your data set contains many noisy observations, decrease C: lower C values give better results on noisy data, and the opposite holds for clean data.
probability: boolean, optional (default=False). Choose True or False from the drop-down list. Determines whether to enable probability estimates. This must be enabled prior to calling fit (fitting the SVM model to the given training data).
3
max_depth: int or None, optional (default=None). The maximum depth of the tree. If None, nodes are expanded until all leaves are pure.
4
max_depth: int or None, optional (default=None). The maximum depth of each tree in the Random Forest. If None, nodes are expanded until all leaves are pure.
5
C is the penalty parameter. It maximizes the margin while keeping the misclassification error minimal. C is 1 by default, a reasonable choice that works well for the majority of common datasets. If your data set contains many noisy observations, decrease C: lower C values give better results on noisy data, and the opposite holds for clean data.
max_iter (int, default=1000) is the maximum number of iterations to be run for convergence.
6
alpha (float, optional (default=1.0)) Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
7
max_iter (int, default=1000) is the maximum number of iterations to be run for convergence.
In machine learning, the loss function measures the quality of your solution, while the penalty function imposes constraints on the solution for regularization.
penalty: string, ‘l1’ or ‘l2’ (default=’l2’). Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. ‘l1’ leads to coef_ vectors that are sparse.
loss: Specifies the loss function. Options are hinge, log, modified_huber, squared_hinge, and perceptron.
8
n_neighbors: Defines the number of nearest neighbors to be considered for prediction based on distance.
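Since the algorithm and parameter names above match scikit-learn’s, a model equivalent to ‘Linear SVC’ with 1-2 grams and TF-IDF vectorization might be sketched as follows (toy data for illustration, not the plugin’s internals):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data; real training sets need far more examples per label.
texts = ["hello there", "hi how are you", "rain expected today", "sunny and warm outside"]
labels = ["greeting", "greeting", "weather", "weather"]

model = Pipeline([
    ("vectorize", TfidfVectorizer(ngram_range=(1, 2))),  # N Gram start/end = 1/2
    ("classify", LinearSVC()),                           # 'Linear SVC' algorithm
])
model.fit(texts, labels)
print(model.predict(["hi there"])[0])  # greeting
```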
Glossary:
Prediction step lets you predict the label based on the model built in ‘Classification Model Builder’ step.
Model Tab
No.
Field Name
Description
1
Model File
Used to specify path of the model file built with ‘Classification Model Builder’ Step
2
Load Model
Used to load the model and show all the relevant information of the model, like Algorithm, Vectorization algorithm, N Gram, Model parameters. All these values are read-only and only show you the values you had selected during ‘Classification Model Builder’ step
Field Mapping Tab
No.
Field Name
Description
1
Feature
Feature name used during model building step
2
Type
Type of the feature, it can be either String or Number
3
Field
Field name you want to map to the corresponding feature. It is important that you map the right field to each feature.
4
Text Preprocessing
If type is String, preprocessing options to be used to process the string. This is explained in detail in ‘Classification Model Builder’ step.
5
Target Field
Used to specify the field name where the value of the predicted label will be stored.
6
Prediction Confidence
Used to indicate whether you also want prediction confidence. This field is clickable only when the algorithm used for model building supports prediction confidence.
7
Prediction Confidence for all classes
Used to indicate whether you also want prediction confidence for all the classes. Say the possible prediction values are ‘A’, ‘B’ and ‘C’; clicking this field will give you prediction confidence for all of these labels/classes. This field is clickable only when the algorithm used for model building supports prediction confidence.
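Which algorithms support prediction confidence follows from the underlying estimators: for example, scikit-learn’s LogisticRegression exposes per-class probabilities, while LinearSVC does not. An illustrative sketch with toy data:

```python
from sklearn.linear_model import LogisticRegression

# Logistic Regression supports per-class prediction confidence via predict_proba;
# LinearSVC, by contrast, only exposes a decision_function margin.
X = [[0.0], [0.2], [0.8], [1.0]]
y = ["A", "A", "B", "B"]

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[0.9]])[0]   # one confidence value per class
for cls, p in zip(clf.classes_, probs):
    print(cls, round(p, 3))
print(abs(sum(probs) - 1.0) < 1e-9)     # True: confidences sum to 1
```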
Introduction:
Identifying intents and entities has a wide variety of industry use cases wherever there is a need to understand the intention behind user utterances and automate processes.
The following terminology is used in this plugin:
Utterance: Anything the user says. For example, if a user types “What's the weather outside today in San Francisco”, the entire sentence is the utterance.
Intent: An intent is the user’s intention. For example, if a user types “What's the weather outside today in San Francisco”, the user’s intent is to get the weather reports. Intents are given a name, often a verb and a noun, such as “getWeather”.
Entity: An entity modifies an intent. For example, if a user types “What's the weather outside today in San Francisco”, the entities are “today” and “San Francisco”. Entities are given a name, such as “dateTime” and “location”. Entities are sometimes referred to as slots.
This step builds a model for Intent Classification and Entity Extraction.
No.
Field Name
Description
1
Step name
Specify the name of the step. Step names should be unique within a workflow.
Input Fields:
2
Use custom configuration file to build model?
Select this checkbox to enable ‘Custom Configuration FileName’ field below to provide a custom configuration file to build the model.
3
Custom Configuration FileName
This field is editable if the checkbox ‘Use custom configuration file to build model?’ is selected.
A default configuration file is used to build the intent/entity model. However, you may specify the path of a custom configuration file (.yml) here to build the model.
4
JSON Filename
Specify the path of a JSON file containing Intent and Entities data.
Sample JSON file contents:
{
  "nlu_data": {
    "common_examples": [
      {
        "text": "i'm looking for a place to eat",
        "intent": "restaurant_search",
        "entities": []
      },
      {
        "text": "i'm looking for a place in the north of town",
        "intent": "restaurant_search",
        "entities": [
          {
            "start": 31,
            "end": 36,
            "value": "north",
            "entity": "location"
          }
        ]
      }
    ]
  }
}
5
Button: Browse
Click to browse for a JSON filename.
6
Model Directory Name
Specify or Browse for a Directory for the built Model file.
7
Button: Browse
Click to browse for a Model Directory.
Output Field:
8
Model Directory Field Name
Specify a fieldname to hold the complete path of the model (including the directory and model filename). The default value is outputModelDirectoryFieldName.
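Before building a model it can be worth sanity-checking the training JSON, since each entity’s start/end offsets must select exactly its value in the utterance text. A stdlib-only sketch using the sample data above (the validator is a hypothetical helper, not part of the plugin):

```python
import json

# The sample training JSON from above, with closing brackets completed.
raw = """
{
  "nlu_data": {
    "common_examples": [
      {"text": "i'm looking for a place to eat",
       "intent": "restaurant_search", "entities": []},
      {"text": "i'm looking for a place in the north of town",
       "intent": "restaurant_search",
       "entities": [{"start": 31, "end": 36, "value": "north", "entity": "location"}]}
    ]
  }
}
"""

def validate_entity_spans(data):
    """Return every entity whose start/end offsets do not select its value."""
    bad = []
    for ex in data["nlu_data"]["common_examples"]:
        for ent in ex["entities"]:
            if ex["text"][ent["start"]:ent["end"]] != ent["value"]:
                bad.append(ent)
    return bad

data = json.loads(raw)
print(validate_entity_spans(data))  # [] -> all offsets are consistent
```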
Common Buttons:
No.
Field Name
Description
Buttons:
1
OK
On click of this button, the field values are checked. If any required field values are missing, a validation error message is displayed.
If all the required field values are provided, the field values are saved.
2
Cancel
On click of this button, the window is closed and no values are saved.
This step predicts Intent Classification and Entity Extraction based on the model built in ‘Build Model for Intent Classification and Entity Extraction’ step.
Model Tab
No.
Field Name
Description
1
Step name
Specify the name of the step. Step names should be unique within a workflow.
Input Fields:
1
Model Directory Name
Specify path of the model file built with ‘Build model for Intent Classification and Entity Extraction’ Step
2
Button: Browse
Click to browse for a Model file.
3
Input Data to Parse
Specify the input data (string) to be parsed for Intent Classification and Entity Extraction.
Output Fields:
4
Intent Field Name
Specify a fieldname to hold the predicted Intent. The default value is intent.
5
Show intent confidence?
Enable checkbox to enable the Intent Confidence field below.
6
Intent Confidence Field Name
Specify a fieldname to hold Intent Confidence. The default value of the field name is intentConfidence.
7
Show Entities (in JSON format)?
Enable checkbox to enable the Entities field below.
8
Entities Field Name
Specify a fieldname to hold the Entities in JSON format. The default value of the field name is jsonEntities.
9
Show Intent Ranking (in JSON format)?
Enable checkbox to enable the Intent Ranking field below.
10
Intent Ranking Field Name
Specify a fieldname to hold the Intent Ranking in JSON format. All probable intents with confidence values (between 0 and 1), are generated in the JSON file. The default field name is jsonIntentRanking.
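Downstream steps can parse the jsonIntentRanking field with any JSON library. The sketch below assumes a ranking shaped as a list of name/confidence objects; the exact schema produced by the plugin may differ:

```python
import json

# Assumed shape of the jsonIntentRanking field: a list of intents with
# confidence values between 0 and 1.
json_intent_ranking = json.dumps([
    {"name": "restaurant_search", "confidence": 0.87},
    {"name": "greeting", "confidence": 0.09},
    {"name": "goodbye", "confidence": 0.04},
])

ranking = json.loads(json_intent_ranking)
top = max(ranking, key=lambda r: r["confidence"])
print(top["name"], top["confidence"])  # restaurant_search 0.87
```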
Common Buttons:
No.
Field Name
Description
Buttons:
1
OK
On click of this button, the field values are checked. If any required field values are missing, a validation error message is displayed.
If all the required field values are provided, the field values are saved.
2
Cancel
On click of this button, the window is closed and no values are saved.