Text must not be keyword tokenized

Description

With this PR, “keyword” tokenization was added to the textual field types.

This means no tokenization happens at all: a single (potentially huge) token is passed on to the token filters.

That analysis may be acceptable for short fields like “name”, but definitely not for long fields like “content”.
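As an illustration, the difference can be checked with Elasticsearch’s _analyze API (the sample text here is just a placeholder):

    POST _analyze
    {
      "tokenizer": "keyword",
      "text": "lorem ipsum dolor sit amet"
    }

    POST _analyze
    {
      "tokenizer": "standard",
      "text": "lorem ipsum dolor sit amet"
    }

The first request returns the whole input as a single token; the second returns one token per word. With a multi-megabyte “content” value, the keyword tokenizer therefore feeds one enormous token into every downstream token filter.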

The scope of this issue is to revert that change and make sure proper analysis is defined for both long and short textual fields.

When indexing the attached document, this error is shown in the log:

I also attached the extracted text in a text file.

Acceptance Criteria:

  • Remove the keyword analyzer

  • Analysis for all types except d:content must use the whitespace analyzer

  • Analysis for d:content must use the standard analyzer (original mapping); see the mapping sketch after this list

  • Add tests:

    • unit tests for field mapping builder

    • e2e/integration test that reproduces this problem (indexing a big text file)

  • Documentation:

    • Confluence: behaviour of the field mapping builder (internal)
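A minimal sketch of the target analysis, assuming hypothetical index and field names (the real mapping is produced by the FieldMappingBuilder from the Alfresco model):

    PUT alfresco-example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "general_text_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name":    { "type": "text", "analyzer": "general_text_analyzer" },
          "content": { "type": "text", "analyzer": "standard" }
        }
      }
    }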

Environment

None

Activity

Andrea Gazzarini
2 days ago
Edited

Verification Steps

  • create a node using the attached file Longtext.txt and make sure no indexing errors are printed in the Elasticsearch log

  • execute a search for “Lavabo”; the node above should be returned (see the query sketch after these steps)

  • create a node using the other attached file and make sure no indexing errors are printed in the Elasticsearch log

  • execute a search for “maecenas”; the node above should be returned
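The searches above can also be checked directly against Elasticsearch; the index and field names below are assumptions, not the connector’s actual naming:

    GET alfresco-example/_search
    {
      "query": {
        "match": { "content": "lavabo" }
      }
    }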

Andrea Gazzarini
4 days ago
Edited

The FieldMappingBuilder has been changed in order to differentiate between

  • mltext

  • content

  • text (d:text, d:encrypted)

Apart from the multi-language text, which is not related to this ticket, the behaviour of the other two datatypes has been implemented according to the approach already in place. That means:

  • the analyzer for d:content can be missing from the configuration (default behaviour): in that case a symmetric default (standard) analyzer is used

  • the analyzer for d:content can be explicitly customised using a symmetric chain (one analyzer for both index and query time)

  • the analyzer for d:content can be explicitly customised using an asymmetric chain (one analyzer for index time and another for query time)

The behaviour is the same for plain text. The connector will pick up the right analyzer following a naming convention:

  • the content analyzer name starts with the “content_” prefix

  • the general text analyzer name starts with the “general_” prefix

Following that rule, the three scenarios above can be implemented by declaring the following analyzers (a sketch follows the list):

  • (nothing) in case the default has to be used

  • <prefix>_text_analyzer: symmetric chain

  • <prefix>_text_query_analyzer, <prefix>_text_index_analyzer: asymmetric chain
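As a sketch, such declarations could look as follows in the index settings; the filter chains here are placeholders, only the naming convention matters:

    "analysis": {
      "analyzer": {
        "content_text_analyzer": {
          "type": "custom",
          "tokenizer": "standard"
        },
        "general_text_index_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        },
        "general_text_query_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }

Here content_text_analyzer declares a symmetric chain for d:content, while the general_* pair declares an asymmetric chain for plain text; in the field mapping the asymmetric case translates to the analyzer and search_analyzer properties.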

The information above is intended only for the review process; everything is documented in Confluence.

Davide Cerbo
April 29, 2021, 11:43 AM
Edited

Just a note: the analyzer was changed because, once we started searching on other fields too and not only on content, search didn’t behave as in Search Service 2.

I changed it when I added the word_delimiter_graph token filter; this is why I switched the tokenizer to keyword, following the official documentation:

“Avoid using the word_delimiter_graph filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter_graph filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead.”

I tried using the whitespace tokenizer and it seems to work; could this be an easy solution? A sketch follows.
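A sketch of that idea, using the tokenizer/filter combination the documentation recommends (the analyzer and index names are illustrative):

    PUT whitespace-example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "general_text_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [ "word_delimiter_graph" ]
            }
          }
        }
      }
    }

One caveat from the same documentation: when such a chain is used at index time, the flatten_graph filter should be appended after word_delimiter_graph, because the index cannot store token graphs.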

Alessandro Benedetti
April 29, 2021, 8:34 AM

From Slack:

“hi, while working on my current ticket I ended up looking at the text analysis definition approved in this PR:

this text analysis is used across various fields, not only name-related ones; I couldn’t find any comment in the related Jira ticket, which actually seems to be about multi-field expansion from TEXT (and not text analysis?)

what was the reason to move the entire text analysis to be keyword tokenised and then split with the word delimiter filter?

was there a reason to involve all textual fields in this rather than just the name (if such analysis was designed for the name)?
as a side note, that analysis seems reasonably OK for short field contents; for longer ones, if it’s really required, I would recommend careful design and benchmarking, as I doubt token filters such as the word delimiter were designed to deal with extremely long tokens.
I’m not even sure there’s a token length limit somewhere in ES; long story short, I would be very careful not to keyword-tokenize a long field value (such as content values).”
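On the token length question, the behaviour of the chain under discussion can be probed with the _analyze API (the text below stands in for a long extracted content value):

    POST _analyze
    {
      "tokenizer": "keyword",
      "filter": [ "word_delimiter_graph" ],
      "text": "imagine several megabytes of extracted document text here"
    }

As a data point, Lucene rejects single indexed terms longer than 32,766 bytes (the “immense term” error at index time), which is one concrete way an untokenized content value can break indexing.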

Assignee

Keerat Lalia

Reporter

Davide Cerbo

Bug Priority

Category 2

Delivery Team

Search