Advanced NLU settings

When editing project settings, you can configure NLU. Settings are passed as a JSON object.

caution

To train the classifier with the new settings:

In the sidebar, go to NLU → Intents section.
In the bottom right corner, click Test.

Common settings

Common settings include parameters that do not depend on the algorithm of the classifier in the project:

{
    "classificationAlgorithmVersion": 1,
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "dictionaryAutogeneration": true
}

classificationAlgorithmVersion is the classifier version. Possible values:
- null. This version is used in previously created projects. In this version, the STS algorithm does not always detect the longest entity variant in a phrase.
- 1. It is the current version with improved STS. This value is specified by default when you create a new project.
caution
If you change the value of this parameter, the intent weights will be calculated differently when using STS. This might affect, for example, the results of the automated tests.
patternsEnabled — if the parameter is active, you can use patterns in training phrases.
tokenizerEngine — the tokenizer that will tokenize and lemmatize the text.
dictionaryAutogeneration — when the parameter is active, the entity content will fill the user dictionary.

tokenizerEngine

Different tokenizer engines are supported for different NLU languages.

NLU language	Tokenizers	Notes
Russian	`udpipe` `mystem` `morphsrus`	The `mystem` and `morphsrus` tokenizers are used for migrating projects to NLU.
Chinese	`pinyin`
Portuguese	`udpipe`
Kazakh	`kaznlp`
Any other language	`spacy`
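
The table above can be turned into a small lookup that checks a tokenizerEngine value before it is written to the settings. This is only a sketch: the function names are hypothetical, and the language names are used as keys exactly as they appear in the table.

```python
# Tokenizers per NLU language, transcribed from the table above.
SUPPORTED_TOKENIZERS = {
    "Russian": ["udpipe", "mystem", "morphsrus"],
    "Chinese": ["pinyin"],
    "Portuguese": ["udpipe"],
    "Kazakh": ["kaznlp"],
}
# The "Any other language" row of the table.
FALLBACK_TOKENIZERS = ["spacy"]


def allowed_tokenizers(language: str) -> list[str]:
    """Return the tokenizers listed in the table for the given NLU language."""
    return SUPPORTED_TOKENIZERS.get(language, FALLBACK_TOKENIZERS)


def is_valid_tokenizer(language: str, tokenizer_engine: str) -> bool:
    """Check a tokenizerEngine value against the table (hypothetical helper)."""
    return tokenizer_engine in allowed_tokenizers(language)


print(is_valid_tokenizer("Russian", "mystem"))
print(allowed_tokenizers("German"))
```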

STS

STS classifier default settings:

{
    "allowedPatterns": [],

    "stsSettings": {
        "exactMatch": 1.0,
        "lemmaMatch": 0.95,
        "jaccardMatch": 0.5,
        "jaccardMatchThreshold": 0.82,
        "acronymMatch": 1.0,
        "synonymMatch": 0.5,
        "synonymContextWeight": 0.0,
        "patternMatch": 1,
        "throughPatternMatch": 0.8,
        "wordSequence1": 0.8,
        "wordSequence2": 0.9,
        "wordSequence3": 1.0,
        "idfShift": 0.0,
        "idfMultiplier": 1.0,
        "namedEntitiesRequired": false
    }
}

allowedPatterns — the list of entities that have the Automatically expand intents setting enabled.
exactMatch — if the user’s words match words in training phrases, the weight of each word is multiplied by this coefficient. For example, home and home.
lemmaMatch — if dictionary forms (lemmas) of the user’s words match the lemmas of words in training phrases, the weight of each word will be multiplied by this coefficient. For example, homes and home.
jaccardMatch — when words match on their Jaccard index, the weight of a matching word is multiplied by this coefficient. jaccardMatch is triggered if:
- The letters in the words are the same, but in a different order. For example, cat and act.
- The letters in the words are almost the same, and their similarity coefficient is greater than or equal to jaccardMatchThreshold. For example, system and sstem.
jaccardMatchThreshold — the minimum value of the Jaccard index. By default, jaccardMatch considers two words to match if their similarity coefficient is greater than or equal to 0.82.
acronymMatch — if phrases and their abbreviations match, the phrase weight is multiplied by this coefficient. Abbreviations are determined by regular expressions. For example, University College London and UCL.
synonymMatch — if synonyms match, the word weight is multiplied by this coefficient. A ready-to-use synonym dictionary is built into NLU. It is supported only for the Russian language.
synonymContextWeight — the weight of the synonym is penalized by this coefficient:
- If "synonymContextWeight": 0.0, the synonym weight is not reduced.
- If "synonymContextWeight": 1.0, the synonym weight is significantly reduced.
patternMatch — if a word matches an entity specified in a training phrase, the word weight is multiplied by this coefficient.

For example, let’s take an intent that contains the phrase Call @agent. The @agent entity contains the synonyms agent, specialist, and consultant. If the user asks the bot to call a consultant, the word consultant is recognized as an entity, and its weight is multiplied by the patternMatch value.
throughPatternMatch — if a word matches an entity specified in allowedPatterns, the weight of the word is multiplied by this coefficient.
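
To make the jaccardMatch rule more concrete, here is a minimal sketch of a letter-based Jaccard index. The exact formula used by the STS algorithm is not given on this page. The sketch assumes the index is computed over bags (multisets) of letters, which happens to agree with both examples above. The function names are hypothetical.

```python
from collections import Counter


def letter_jaccard(a: str, b: str) -> float:
    """Jaccard index over the bags (multisets) of letters of two words.

    Assumption: this is one plausible reading of the rule, not the
    documented STS implementation.
    """
    bag_a, bag_b = Counter(a.lower()), Counter(b.lower())
    union = sum((bag_a | bag_b).values())  # per-letter maximum counts
    if union == 0:
        return 0.0
    return sum((bag_a & bag_b).values()) / union  # per-letter minimum counts


def jaccard_matches(a: str, b: str, threshold: float = 0.82) -> bool:
    """True if the similarity reaches jaccardMatchThreshold."""
    return letter_jaccard(a, b) >= threshold


print(letter_jaccard("cat", "act"))                 # same letters, different order
print(round(letter_jaccard("system", "sstem"), 3))  # one letter missing
```

Under this assumption, cat and act get a similarity of 1.0, and system and sstem get about 0.833, which is above the default threshold of 0.82.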

If there is a matching sequence of words in the phrase, the word weight is multiplied by one of these coefficients:

The weight of the first word in the sequence is multiplied by wordSequence1.
The weight of the second word in the sequence is multiplied by wordSequence2.
The weight of the third word in the sequence is multiplied by wordSequence3. The fourth and subsequent words are also multiplied by wordSequence3.

It is recommended to set each of these parameters to a value greater than 0 and no greater than 1, and to keep wordSequence1 < wordSequence2 < wordSequence3.

For example, the intent contains the training phrase I want to buy a course at a good price. The user writes to the bot: I have decided to buy your course at a good price. The algorithm finds the following matching sequences:

Sequence	Word	Word weight multiplier
I	I	`wordSequence1`
to	to	`wordSequence1`
to buy	buy	`wordSequence2`
course	course	`wordSequence1`
course at	at	`wordSequence2`
course at a	a	`wordSequence3`
course at a good	good	`wordSequence3`
course at a good price	price	`wordSequence3`
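
The table above can be reproduced with a short sketch. It assumes that a matching sequence is a run of consecutive words shared by the training phrase and the user phrase, compared case-insensitively after splitting on whitespace. The real STS algorithm may tokenize and match differently, and the function name is hypothetical.

```python
def sequence_multipliers(training_phrase: str, user_phrase: str) -> list[tuple[str, str]]:
    """For each matched user word, name the wordSequence coefficient applied to it."""
    train = training_phrase.lower().split()
    user = user_phrase.lower().split()

    # best[j] = length of the longest common run of words ending at user[j]
    best = [0] * len(user)
    prev = [0] * (len(user) + 1)
    for t_word in train:
        cur = [0] * (len(user) + 1)
        for j, u_word in enumerate(user, start=1):
            if t_word == u_word:
                # Extend the run that ended at the previous word in both phrases.
                cur[j] = prev[j - 1] + 1
                best[j - 1] = max(best[j - 1], cur[j])
        prev = cur

    names = {1: "wordSequence1", 2: "wordSequence2"}  # third and later words use wordSequence3
    return [(w, names.get(run, "wordSequence3")) for w, run in zip(user, best) if run > 0]


for word, coefficient in sequence_multipliers(
    "I want to buy a course at a good price",
    "I have decided to buy your course at a good price",
):
    print(word, coefficient)
```

Words that do not occur in the training phrase (have, decided, your) are left out of the result.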

idfShift and idfMultiplier — parameters that affect the word weight calculation via IDF. It is not recommended to change their values.
namedEntitiesRequired — if this parameter is active, a system entity must be found in the user’s request for it to activate the corresponding intent.

For example, a phrase with a system entity I need @duckling.number of apples is added to the intent. If this parameter is active, the user’s request I need apples will not activate the intent, because it contains no system entity.

Classic ML

Classic ML classifier settings:

{
    "classicMLSettings": {
        "C": 1,
        "lang": "en",
        "word_ngrams": [
            1,
            2
        ],
        "lemma_ngrams": [
            0
        ],
        "stemma_ngrams": [
            1,
            2
        ],
        "char_ngrams": [
            3,
            4
        ],
        "lower": true,
        "useTfIdf": true
    }
}

C — the regularisation coefficient that can be used to control model overfitting. It determines how strongly large coefficient values of the objective function are penalized. Possible values: 0.01, 0.1, 1, 10.
word_ngrams — the number of words to be combined into word combinations. For "word_ngrams": [2, 3] combinations of two and three words will be used. For instance, the following word combinations will be generated for I like green apples:
- I like,
- like green,
- green apples,
- I like green,
- like green apples.
caution
Values greater than 3 are not recommended for this parameter.
lemma_ngrams — the number of words to be normalized and combined into word combinations. For "lemma_ngrams": [2] combinations of two words will be used. For instance, the following word combinations will be generated for I like green apples:
- I like,
- like green,
- green apple.
caution
Values greater than 3 are not recommended for this parameter.
stemma_ngrams — the number of stems to be combined into word combinations. A stem is not necessarily equal to the morphological root of the word. For "stemma_ngrams": [2] combinations of two stems will be used. For instance, the following word combinations will be generated for I like green apples:
- I like,
- like green,
- green apple.
caution
Using both the lemma_ngrams and stemma_ngrams parameters is not recommended due to possible model overfitting. Setting the value of stemma_ngrams to be greater than 3 is not recommended either.
char_ngrams — the number of characters to be combined and treated as a single unit of a phrase. For instance, for "char_ngrams": [3] the phrase green apples is converted to the following set:
- gre,
- ree,
- een etc.
lower — if set to true, all the phrases are converted to lowercase.
useTfIdf — the parameter determines which algorithm to use when vectorizing training phrases. The default value is false.
- If true, TF-IDF is used. It calculates the significance of a word or expression in the context of all training phrases. Use it on projects with a small dataset to improve the quality of intent recognition. The vectorization will be slower than when false is set, but its quality will be higher.
- If false, CountVectorizer is used. It calculates how often words or expressions are present in the intent. Use it on projects with a medium or large dataset. The vectorization will be faster, but the algorithm accuracy will decrease when working with a small dataset.
min_document_frequency — the minimum number of times a word must occur in training phrases to be vectorized and classified. The default value is 1.
- If you work with a medium or large dataset, increase the parameter value to speed up classifier training. Rare words in the dataset will not be taken into account.
- If you work with a small dataset, it is not recommended to change the default value.
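
The word and character combinations described above can be generated with a few lines of code. This is an illustrative sketch, not the classifier’s own implementation: the function names are hypothetical, words are split on whitespace, and character n-grams are assumed to run across the whole phrase, spaces included.

```python
def word_ngrams(phrase: str, sizes: list[int]) -> list[str]:
    """Word combinations of each requested size, in reading order."""
    words = phrase.split()
    return [
        " ".join(words[i:i + n])
        for n in sizes
        for i in range(len(words) - n + 1)
    ]


def char_ngrams(phrase: str, sizes: list[int]) -> list[str]:
    """Character combinations of each requested size (spaces count as characters here)."""
    return [
        phrase[i:i + n]
        for n in sizes
        for i in range(len(phrase) - n + 1)
    ]


print(word_ngrams("I like green apples", [2, 3]))
print(char_ngrams("green apples", [3])[:3])
```

Lemma and stem n-grams would follow the same scheme after each word is replaced with its lemma or stem. That step depends on the tokenizer and is not shown.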

Deep Learning

Deep Learning classifier settings:

{
    "cnnSettings": {
        "lang": "en",
        "kernel_sizes": [
            1,
            2
        ],
        "n_filters": 1024,
        "emb_drp": 0.25,
        "cnn_drp": 0.25,
        "bs": 64,
        "n_epochs": 15,
        "lr": 0.001,
        "pooling_name": "max"
    }
}

lang is the classifier language:
- If a value is not specified, the language from the project settings is used by default.
- If you set the multi value, the classifier can process phrases in 275 BPEmb languages. Add training phrases to your intents for every language that the bot needs to understand.
  
  tip
  We recommend using multi if user requests contain characters from multiple alphabets. This helps reduce false positives in your intents.
kernel_sizes is the list of convolution kernel sizes. A kernel size is the size of the context window that the classifier takes into account. For instance, "kernel_sizes": [3] means that the model will use all the triplets of adjacent words to find features in the text. Multiple convolution kernels can be defined for a single model.
n_filters is the number of filters. A filter is a specific pattern learned by the model. A model has a unique set of patterns for each kernel. For instance, if you specify "kernel_sizes": [2, 3] and "n_filters": 512, the total number of filters will be 1024 (512 per kernel).
emb_drp is the probability of the drop-out on the embedding layer. Drop-out is a mechanism that forcibly disables some of the weights in the network in the course of training. It prevents the network from overfitting and helps it generalize instead of merely memorizing the entire dataset. It can take any value from 0 to 1.
cnn_drp is the probability of the drop-out on the convolution layers of the network.
bs is the size of the input batch for training. This value defines the number of training examples per step to be fed to the network in the course of training. If the dataset has fewer than 3,000 examples, a value from 16 to 32 is recommended. For larger datasets, this value can be from 32 to 128.
n_epochs is the number of learning epochs, i.e. the number of times the model will see all the training data.
lr is the learning rate: the factor the model uses to update its weights in the course of training.
pooling_name is the aggregation strategy. The model has to aggregate the patterns found in the input string before the final classification layer. The following aggregation strategies are possible: max, mean, concat.
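
The arithmetic behind n_filters, bs, and n_epochs can be sketched as follows. The function names are hypothetical. The step count assumes that the last, possibly incomplete, batch of each epoch is still used, which this page does not confirm.

```python
import math


def total_filters(kernel_sizes: list[int], n_filters: int) -> int:
    """Each kernel size gets its own set of n_filters filters."""
    return len(kernel_sizes) * n_filters


def training_steps(n_examples: int, bs: int, n_epochs: int) -> int:
    """Number of weight updates, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(n_examples / bs)
    return steps_per_epoch * n_epochs


print(total_filters([2, 3], 512))      # 1024, as in the n_filters description
print(training_steps(3000, 32, 15))    # 94 steps per epoch, 15 epochs
```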

Recommended settings

Deep Learning classifier settings:

Parameter / dataset size	1 to 3 thousand examples	3 to 10 thousand examples	10 to 30 thousand examples	30 to 100 thousand examples	Over 100 thousand examples
`kernel_sizes`	`[2, 3]`	`[2, 3]` or `[2, 3, 4]`	`[2, 3]` or `[2, 3, 4]`	`[2, 3, 4]`	`[2, 3, 4]`
`n_filters`	512	1024	1024	1024–2048	1024–2048
`emb_drp`	0.5	0.4–0.5	0.3–0.5	0.3–0.4	0.3–0.4
`cnn_drp`	0.5	0.4–0.5	0.3–0.5	0.3–0.4	0.3–0.4
`bs`	16–32	32	32–64	32–128	64–128
`n_epochs`	7–15	4–7	3–5	3	3
`lr`	0.001	0.001	0.001	0.001	0.001
`pooling_name`	`"max"`	`"max"`	`"max"`	`"max"` or `"concat"`	`"max"` or `"concat"`
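
The recommendations above can also be kept as data and looked up by dataset size. The sketch below only transcribes the table. Ranges are stored as (min, max) pairs and alternatives as lists. Treating each upper bound as inclusive, and mapping datasets smaller than 1,000 examples to the first column, are assumptions. The names are hypothetical.

```python
# (inclusive upper bound on dataset size, recommended values from the table)
RECOMMENDED_CNN_SETTINGS = [
    (3_000, dict(kernel_sizes=[[2, 3]], n_filters=(512, 512),
                 emb_drp=(0.5, 0.5), cnn_drp=(0.5, 0.5), bs=(16, 32),
                 n_epochs=(7, 15), lr=0.001, pooling_name=["max"])),
    (10_000, dict(kernel_sizes=[[2, 3], [2, 3, 4]], n_filters=(1024, 1024),
                  emb_drp=(0.4, 0.5), cnn_drp=(0.4, 0.5), bs=(32, 32),
                  n_epochs=(4, 7), lr=0.001, pooling_name=["max"])),
    (30_000, dict(kernel_sizes=[[2, 3], [2, 3, 4]], n_filters=(1024, 1024),
                  emb_drp=(0.3, 0.5), cnn_drp=(0.3, 0.5), bs=(32, 64),
                  n_epochs=(3, 5), lr=0.001, pooling_name=["max"])),
    (100_000, dict(kernel_sizes=[[2, 3, 4]], n_filters=(1024, 2048),
                   emb_drp=(0.3, 0.4), cnn_drp=(0.3, 0.4), bs=(32, 128),
                   n_epochs=(3, 3), lr=0.001, pooling_name=["max", "concat"])),
    (float("inf"), dict(kernel_sizes=[[2, 3, 4]], n_filters=(1024, 2048),
                        emb_drp=(0.3, 0.4), cnn_drp=(0.3, 0.4), bs=(64, 128),
                        n_epochs=(3, 3), lr=0.001, pooling_name=["max", "concat"])),
]


def recommended_settings(n_examples: int) -> dict:
    """Return the table column that covers the given number of training examples."""
    for upper_bound, settings in RECOMMENDED_CNN_SETTINGS:
        if n_examples <= upper_bound:
            return settings
    raise ValueError("unreachable: the last bound is infinite")


print(recommended_settings(5_000)["n_epochs"])
```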

Transformer

The Transformer classifier runs on the Caila platform. If you have selected this classifier type for your project, you can:

Change the training settings for the built-in Transformer service.
Replace the built-in Transformer with any other Caila classifier.

Change training settings

You can change the training settings:

If you use the built-in service.
If you connect the text-classifier-logreg-caila-roberta or text-classifier-logreg-labse classifier.

Specify the settings in the mlp.trainingModel.config section:

"mlp": {
    "trainingModel": {
        "config": {
            "regularizationCoefficient": 1.0,
            "maxNumIterations": 100,
            "toLowercase": true,
            "doRemovePunctuations": true
        }
    }
}

Here:

regularizationCoefficient is the regularisation coefficient. You can use it to control the overfitting of the model:
- At values less than 1.0, the model might underfit and fail to detect patterns in the data.
- At high values, the model might overfit and remember only the training data.
Default value: 1.0.
maxNumIterations is the maximum number of iterations for training:
- If few iterations are specified, for example, 10, the model might not have enough time to train.
- If too many iterations are specified, for example, 1000, the model might overfit and remember only the training data.
Default value: 100.
toLowercase converts texts to lowercase. If the value is true:
- During training, the data will be converted to lowercase.
- In the script, the case will be ignored during the user request classification.
Default value: false.
doRemovePunctuations removes all punctuation marks from texts except those in URLs, dates, time, and emails. If the value is true:
- Punctuation marks will be removed during training.
- In the script, punctuation marks will be ignored during the user request classification.
Default value: false.

tip

The values of the toLowercase and doRemovePunctuations parameters are passed to the predict configuration of the Caila service. Therefore, they affect the classification of requests in the script.

Connect another classifier

note

The description of the settings below mentions such concepts as base service, derived service, predict configuration. You can learn more about them in the Caila documentation.

Transformer classifier settings:

{
    "mlp": {
        "restUrl": "https://caila.io",
        "mlpAccessKey": "1234567.8.xxxxxxxxxxx",
        "noTraining": false,
        "trainingModel": {
            "account": "just-ai",
            "model": "text-classifier-logreg-labse",
            "config": {
                "regularizationCoefficient": 1.0,
                "maxNumIterations": 100,
                "toLowercase": true,
                "doRemovePunctuations": true
            }
        },
        "derivedModel": {
            "account": "1000005",
            "model": "my-little-wizard",
            "token": "8765432.1.xxxxxxxxxxx"
        },
        "nerModels": [
            {
                "account": "just-ai",
                "model": "ner-duckling",
                "config": {
                    "language": "ru",
                    "version": "v1"
                }
            }
        ],
        "markup": {
            "account": "just-ai",
            "model": "ner-duckling",
            "config": {}
        }
    }
}

restUrl — the Caila server URL. If you set this parameter, be sure to specify mlpAccessKey as well.
mlpAccessKey — the Caila access token. To create a token, visit My space → API tokens in the Caila user interface.
noTraining — this flag ensures that JAICP doesn’t retrain the Caila service during bot deployment. Use this parameter when the service is updated manually.
trainingModel — the parameters of the base service that a derived one should be created from:
- account — the ID of the account that created the service.
- model — the service ID. Both IDs can be either string or numeric.
  Where can I find the account and model values?
  Both IDs can be found in the service card URL, for example:
  - For a public service like caila.io/catalog/just-ai/FAQ, account = just-ai, model = FAQ.
  - For a private service like caila.io/workspace/model/12345/67890, account = 12345, model = 67890.
- config is the training settings. Specify only for text-classifier-logreg-caila-roberta and text-classifier-logreg-labse.
derivedModel — the parameters of the derived service. You should specify either only trainingModel or only derivedModel at a time. Including both will result in an error.
- token — the derived service access token, optional parameter. Include it when using a private service from a different account.
nerModels — the list of services for named entity recognition. Besides the primary parameters, each service can be supplied with config — the service predict configuration.

markup — the parameters for the morphological annotation service. Besides the primary parameters, the service can be supplied with config — its predict configuration.

note

If you would like to use the nerModels and markup parameters, you also need to add external NLU service settings and set nluType values of the necessary providers to mlp:

{
    "externalNluSettings": {
        "nluProviderSettings": {
            "markup": {
                "nluType": "mlp",
                "url": null
            },
            "ner": {
                "nluType": "mlp",
                "url": null
            },
            "classification": {
                "nluType": "mlp",
                "url": null
            }
        }
    }
}
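
The constraints described in this section can be checked before the settings are saved. The sketch below is a hypothetical helper, not part of JAICP or Caila. It only encodes the rules stated above: restUrl requires mlpAccessKey, trainingModel and derivedModel are mutually exclusive, and nerModels and markup need the corresponding provider set to mlp.

```python
def validate_mlp_settings(settings: dict) -> list[str]:
    """Return a list of problems found in the Transformer (mlp) settings."""
    problems = []
    mlp = settings.get("mlp", {})

    if "restUrl" in mlp and "mlpAccessKey" not in mlp:
        problems.append("restUrl is set, so mlpAccessKey must be specified too")

    if "trainingModel" in mlp and "derivedModel" in mlp:
        problems.append("specify either trainingModel or derivedModel, not both")

    providers = (
        settings.get("externalNluSettings", {}).get("nluProviderSettings", {})
    )
    # mlp key -> provider that must have nluType "mlp"
    for mlp_key, provider in (("nerModels", "ner"), ("markup", "markup")):
        provider_type = (providers.get(provider) or {}).get("nluType")
        if mlp_key in mlp and provider_type != "mlp":
            problems.append(f"{mlp_key} requires nluType 'mlp' for the {provider} provider")

    return problems


settings = {
    "mlp": {
        "trainingModel": {"account": "just-ai", "model": "text-classifier-logreg-labse"},
        "nerModels": [{"account": "just-ai", "model": "ner-duckling"}],
    }
}
print(validate_mlp_settings(settings))
```

In this usage example the check reports one problem, because nerModels is set without the matching externalNluSettings entry.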

External NLU service

JAICP lets you connect an external NLU service using the Model API. You can use third-party services to recognize entities and intents in JAICP projects.

To use an external NLU service in a project, use externalNluSettings in the Advanced NLU settings field:

{
    "externalNluSettings": {
        "nluProviderSettings": {
            "markup": {
                "nluType": "external",
                "url": "http://example.com"
            },
            "ner": {
                "nluType": "external",
                "url": "http://example.com"
            },
            "classification": {
                "nluType": "external",
                "url": "http://example.com"
            }
        },
        "language": "ja",
        "nluActionAdditionalProperties": {
            "markup": null,
            "ner": null,
            "classification": {
                "modelId": "123",
                "classifierName": "example",
                "properties": null
            }
        }
    }
}

nluProviderSettings — the object that determines where the NLU action is going to be performed.
markup — the map of parameters for markup requests.
nluType — the NLU type. You can set either external or internal NLU type.
ner — the parameters of named entity recognition.
classification — the map of parameters for classification requests.
language — the external NLU language parameter. If not set, the language from the project settings is used.
nluActionAdditionalProperties — additional NLU service properties.
modelId — the classifier model ID.
classifierName — the classifier name.

How to use

caution

You cannot use intents and entities from an external NLU service and from the NLU core at the same time in a project.

In a JAICP project, you can:

Use entities and intents from the external NLU service.
- Set the nluType to external for the markup, ner and classification parameters.
- Intents are available in the script using the intent tag. Entities are available in the script using the q tag.
- Visual customization in the NLU section for the external NLU service intents and entities is not supported.
Use intents from the NLU core and entities from the external NLU service.
- Set the nluType to external for the ner parameter and to internal for the markup and classification parameters.
- The use of entities from the external NLU service isn’t available while setting up the intents and slot filling.
- Entities are available in the script using the q tag.
Use entities from the NLU core and intents from the external NLU service.
- Set the nluType to external for the classification parameter and to internal for the markup and ner parameters.
- Intents are available in the script using the intent tag.
Use external NLU service markup with NLU core intents and entities.
- Set the nluType to external for the markup parameter and to internal for the classification and ner parameters.
- In the NLU > Intents section, you can use training phrases in languages that are not supported by JAICP, and these phrases will still be recognized in the script.
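
The four combinations above differ only in which actions have nluType set to external. The sketch below builds the nluProviderSettings object for each of them. The scenario keys and the function name are hypothetical, and setting url to null for internal actions is an assumption.

```python
import json

# Which NLU type handles each action in the four scenarios listed above.
SCENARIOS = {
    "external_intents_and_entities": {"markup": "external", "ner": "external", "classification": "external"},
    "external_entities": {"markup": "internal", "ner": "external", "classification": "internal"},
    "external_intents": {"markup": "internal", "ner": "internal", "classification": "external"},
    "external_markup": {"markup": "external", "ner": "internal", "classification": "internal"},
}


def provider_settings(scenario: str, url: str) -> dict:
    """Build externalNluSettings for one scenario (hypothetical helper)."""
    return {
        "externalNluSettings": {
            "nluProviderSettings": {
                action: {
                    "nluType": nlu_type,
                    # Assumption: internal actions do not need a URL.
                    "url": url if nlu_type == "external" else None,
                }
                for action, nlu_type in SCENARIOS[scenario].items()
            }
        }
    }


print(json.dumps(provider_settings("external_entities", "http://example.com"), indent=4))
```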

tip

You can look through an example of an external NLU service in the GitHub repository.
