Elasticsearch types and index/query analyzers

You can select the data type used when Field Boosting an attribute or custom property. Most of these types are situational; the Default Data Type usually lets the system handle the indexing and searching of the data appropriately.

Boolean

Treats the field as a binary True or False value.

Format: "true" or "false"

Elasticsearch Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/boolean.html

Date

Parses the field into a datetime that you can use to create date range searches.

Standard Format: "2015-01-01" or "2015/01/01 12:10:30".

Elasticsearch Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html
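The two standard formats above can be parsed with Python's standard datetime module, for example:

```python
from datetime import datetime

# Hypothetical field values in the two standard formats shown above.
date_only = datetime.strptime("2015-01-01", "%Y-%m-%d")
date_time = datetime.strptime("2015/01/01 12:10:30", "%Y/%m/%d %H:%M:%S")

print(date_only.date())  # the date portion
print(date_time.hour)    # the hour from the datetime value
```

A field typed as Date can then back range filters such as "released in the last 30 days".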

Keyword

Use for structured content such as IDs or email addresses. Useful for sorting and for fields that have a fixed set of values.

Elasticsearch Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html

Number

Treats the field as numeric, which gives more relevant results when numbers are used in searches.

Elasticsearch Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html

Text

Allows for full text searches and the use of an Index/Query analyzer for more flexibility in how the text in the field is parsed and searched.

Elasticsearch Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html
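As a rough illustration, the data types above correspond to field types in an Elasticsearch mapping like the following. The field names are hypothetical, and "Number" stands for a concrete Elasticsearch numeric type such as integer or float:

```python
# Hypothetical mapping showing how the data types above translate to
# Elasticsearch field types.
mapping = {
    "properties": {
        "inStock":     {"type": "boolean"},   # Boolean
        "releaseDate": {"type": "date"},      # Date
        "sku":         {"type": "keyword"},   # Keyword
        "weight":      {"type": "float"},     # Number
        "description": {"type": "text"},      # Text
    }
}
```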

Index/Query Analyzers

The Index and Query Analyzer options appear after you select Text as the Data Type for Field Boosting an attribute or custom property.

The analyzer list is a collection of Elasticsearch-compatible analyzers. When selecting an analyzer, consider when it runs and what data it works against.

The Index Analyzer processes data as it is added to the search collection. As a product is added to the index, the analyzer normalizes the product's data for the field tied to the Field Boosting record. How the field data is normalized depends on the analyzer setup, as described below.

The Query Analyzer processes the search query. As a frontend user types a search, the query is normalized and then matched against the data that was already indexed.

Mixing different types of analyzers for indexing and querying can be very powerful but can also give unexpected results if the data being processed is not structured to handle the different types.
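In Elasticsearch mapping terms, this pair corresponds to the analyzer and search_analyzer settings on a text field. A minimal sketch (the analyzer names here are illustrative, not the platform's actual registered names):

```python
# Sketch of how an Index Analyzer and a Query Analyzer pair up on a text
# field: "analyzer" runs when documents are indexed, "search_analyzer"
# runs against the user's query.
field_mapping = {
    "description": {
        "type": "text",
        "analyzer": "isc_index_analyzer",         # applied at index time
        "search_analyzer": "isc_query_analyzer",  # applied to the search query
    }
}
```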

Below is the list of standard platform analyzers. Each behaves the same whether used as the Index or Query analyzer; only the data it works against changes.

Currently, none of the analyzers support numeric normalization: 1/2 and 0.5 remain 1/2 and 0.5, respectively.

IscIndexAnalyzer

This is the standard Index Analyzer. If you have not configured anything else, it processes each field during indexing. It also removes HTML from the data.

Processing done on data: Standard/whitespace splitting, lowercasing, synonyms replacements, stop word removal, word stemming, and dimensional normalization.

Examples:

  • 'Hello, world! How are you DOING 2day?' -> ['hello', 'world', 'how', 'you', 'do', '2dai']

  • '<h1>Header</h1><p>This is a paragraph.</p>' -> ['header', 'paragraph']

  • 'United States of America USA United States Totally Different Text' -> ['usa', 'usa', 'usa', 'usa']

  • 'United Manager Stemming Day today reading reader Helloing' -> ['unit', 'manag', 'stem', 'dai', 'todai', 'read', 'reader', 'hello']

  • 'ft inch yards miles meter milli centi kilo mi2 yd2 in2 pounds' -> ['foot', 'inch', 'yard', 'mile', 'meter', 'millimeter', 'centimeter', 'kilometer', 'square', 'mile', 'square', 'yard', 'square', 'inch', 'pound']
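A simplified Python approximation of this pipeline. Synonym replacement, stemming, and dimensional normalization are omitted for brevity, and the stop word list is a tiny illustrative subset, not the platform's actual list:

```python
import re

# Simplified sketch of the IscIndexAnalyzer pipeline: HTML stripping,
# standard splitting, lowercasing, and stop word removal.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "this"}

def index_analyze(text):
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # standard split + lowercase
    return [t for t in tokens if t not in STOP_WORDS]

print(index_analyze("<h1>Header</h1><p>This is a paragraph.</p>"))
# ['header', 'paragraph']
```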

IscQueryAnalyzer

This is the same as the IscIndexAnalyzer but does not strip out HTML.

Processing done on data: Standard/whitespace splitting, lowercasing, synonyms replacements, stop word removal, word stemming, and dimensional normalization.

Examples:

  • 'Hello, world! How are you DOING 2day?' -> ['hello', 'world', 'how', 'you', 'do', '2dai']

  • '<h1>Header</h1><p>This is a paragraph.</p>' -> ['h1', 'header', 'h1', 'p', 'paragraph', 'p']

  • 'United States of America USA United States Totally Different Text' -> ['usa', 'usa', 'usa', 'usa']

  • 'United Manager Stemming Day today reading reader Helloing' -> ['unit', 'manag', 'stem', 'dai', 'todai', 'read', 'reader', 'hello']

  • 'ft inch yards miles meter milli centi kilo mi2 yd2 in2 pounds' -> ['foot', 'inch', 'yard', 'mile', 'meter', 'millimeter', 'centimeter', 'kilometer', 'square', 'mile', 'square', 'yard', 'square', 'inch', 'pound']

IscLowercaseAnalyzer

This analyzer splits on whitespace and lowercases the characters.

Processing done on data: Whitespace splitting and lowercasing

Examples:

  • 'Hello, world! How are you DOING 2day?' -> ['hello', 'world!', 'how', 'are', 'you', 'doing', '2day?']

  • '<h1>Header</h1><p>This is a paragraph.</p>' -> ['<h1>header</h1><p>this', 'is', 'a', 'paragraph.</p>']

  • 'United States of America USA United States Totally Different Text' -> ['united', 'states', 'of', 'america', 'usa', 'united', 'states', 'totally', 'different', 'text']

  • 'United Manager Stemming Day today reading reader Helloing' -> ['united', 'manager', 'stemming', 'day', 'today', 'reading', 'reader', 'helloing']

  • 'ft inch yards miles meter milli centi kilo mi2 yd2 in2 pounds' -> ['ft', 'inch', 'yards', 'miles', 'meter', 'milli', 'centi', 'kilo', 'mi2', 'yd2', 'in2', 'pounds']
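This behavior is straightforward to reproduce: a plain whitespace split plus lowercasing, with punctuation left attached to the tokens:

```python
# Sketch of the IscLowercaseAnalyzer: split on whitespace, lowercase each
# token, and leave punctuation and markup untouched.
def lowercase_analyze(text):
    return text.lower().split()

print(lowercase_analyze("United Manager Stemming Day today reading reader Helloing"))
# ['united', 'manager', 'stemming', 'day', 'today', 'reading', 'reader', 'helloing']
```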

IscStandardLowercaseAnalyzer

This analyzer splits on non-alphanumeric characters (standard tokenization) and lowercases the tokens.

Processing done on data: Standard splitting and lowercasing

Examples:

  • 'Hello, world! How are you DOING 2day?' -> ['hello', 'world', 'how', 'are', 'you', 'doing', '2day']

  • '<h1>Header</h1><p>This is a paragraph.</p>' -> ['h1', 'header', 'h1', 'p', 'this', 'is', 'a', 'paragraph', 'p']

  • 'United States of America USA United States Totally Different Text' -> ['united', 'states', 'of', 'america', 'usa', 'united', 'states', 'totally', 'different', 'text']

  • 'United Manager Stemming Day today reading reader Helloing' -> ['united', 'manager', 'stemming', 'day', 'today', 'reading', 'reader', 'helloing']

  • 'ft inch yards miles meter milli centi kilo mi2 yd2 in2 pounds' -> ['ft', 'inch', 'yards', 'miles', 'meter', 'milli', 'centi', 'kilo', 'mi2', 'yd2', 'in2', 'pounds']
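A minimal sketch of this behavior, splitting on non-alphanumeric characters and lowercasing:

```python
import re

# Sketch of the IscStandardLowercaseAnalyzer: tokens are maximal runs of
# alphanumeric characters, lowercased; punctuation and HTML brackets are
# dropped, so tag names like "h1" become tokens too.
def standard_lowercase_analyze(text):
    return re.findall(r"[a-z0-9]+", text.lower())

print(standard_lowercase_analyze("Hello, world! How are you DOING 2day?"))
# ['hello', 'world', 'how', 'are', 'you', 'doing', '2day']
```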

IscNgramAnalyzer

Ngram processing splits words into fragments to help with partial matching during a search.

The Ngram Analyzer turns the words of the field into partial tokens, triggering more relevant hits on documents, products, categories, and content during a frontend search.

Processing done on data: Ngram tokenization and lowercasing.
You can configure the Ngram settings to change the way the words are broken up.

Settings: SearchIndexSettings

  • MinimumNgramLength – The minimum length of the tokens a word is broken into.

  • MaximumNgramLength – The maximum length of the tokens a word is broken into.

Longer words will create more data, which increases the memory used by an index.
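A sketch of how these settings drive tokenization: every substring of each lowercased word whose length falls between the minimum and maximum becomes a token:

```python
# Sketch of ngram tokenization: for each whitespace-separated, lowercased
# word, emit every substring whose length is between min_len
# (MinimumNgramLength) and max_len (MaximumNgramLength).
def ngram_tokens(text, min_len, max_len):
    tokens = []
    for word in text.lower().split():
        for n in range(min_len, max_len + 1):
            for i in range(len(word) - n + 1):
                tokens.append(word[i:i + n])
    return tokens

print(ngram_tokens("world!", 5, 8))
# ['world', 'orld!', 'world!']
```

Words shorter than MinimumNgramLength produce no tokens at all, which is why 'are' and 'you' disappear from the Min: 5 example below.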

Examples:

Example: Min: 5 | Max: 8

'Hello, world! How are you DOING 2day?' -> ['hello', 'world', 'world!', 'orld!', 'doing', '2day?']

Example: Min: 3 | Max: 4

'United States of America USA United States Totally Different Text' -> ['uni', 'unit', 'unite', 'nit', 'nite', 'nited', 'ite', 'ited', 'ted', 'sta', 'stat', 'state', 'tat', 'tate', 'tates', 'ate', 'ates', 'tes', 'ame', 'amer', 'ameri', 'mer', 'meri', 'meric', 'eri', 'eric', 'erica', 'ric', 'rica', 'ica', 'usa', 'uni', 'unit', 'unite', 'nit', 'nite', 'nited', 'ite', 'ited', 'ted', 'sta', 'stat', 'state', 'tat', 'tate', 'tates', 'ate', 'ates', 'tes', 'tot', 'tota', 'total', 'ota', 'otal', 'otall', 'tal', 'tall', 'tally', 'all', 'ally', 'lly', 'dif', 'diff', 'diffe', 'iff', 'iffe', 'iffer', 'ffe', 'ffer', 'ffere', 'fer', 'fere', 'feren', 'ere', 'eren', 'erent', 'ren', 'rent', 'ent', 'tex', 'text', 'ext']

Example: Min: 5 | Max: 6

'United States of America USA United States Totally Different Text' -> ['unite', 'united', 'nited', 'state', 'states', 'tates', 'ameri', 'americ', 'meric', 'merica', 'erica', 'unite', 'united', 'nited', 'state', 'states', 'tates', 'total', 'totall', 'otall', 'otally', 'tally', 'diffe', 'differ', 'iffer', 'iffere', 'ffere', 'fferen', 'feren', 'ferent', 'erent']

IscDimensionalAnalyzer

Elasticsearch version 7.10 supports fractional numbers in search terms. See Elasticsearch v7 - Dimensional Analyzer and Fractional Number Search.

The Dimensional Analyzer is very situational because it only normalizes specific terms to a single canonical form. This makes matching easier, since a given value is always represented the same way across a mapping.

Processing done on data: Whitespace splitting, lowercasing, and dimensional normalization. (Fractional normalization added in Elasticsearch version 7)
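A sketch of the idea behind dimensional normalization: a lookup table maps unit variants to one canonical token sequence so indexed data and queries always agree. The table here is an illustrative subset, not the platform's actual unit list:

```python
# Illustrative subset of a dimensional-normalization table: each unit
# variant maps to its canonical token(s); unrecognized words pass through.
UNIT_MAP = {
    "ft": ["foot"], "in": ["inch"], "yards": ["yard"], "miles": ["mile"],
    "milli": ["millimeter"], "centi": ["centimeter"], "kilo": ["kilometer"],
    "mi2": ["square", "mile"], "yd2": ["square", "yard"],
    "in2": ["square", "inch"], "pounds": ["pound"],
}

def dimensional_analyze(text):
    tokens = []
    for word in text.lower().split():
        tokens.extend(UNIT_MAP.get(word, [word]))
    return tokens

print(dimensional_analyze("EMT Tubing 1/2 in White Conduit 10 ft"))
# ['emt', 'tubing', '1/2', 'inch', 'white', 'conduit', '10', 'foot']
```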

Examples:

  • '1/2 in' -> ['1/2', 'inch']

  • '1/2 0.5 1.5 3/4 2-1/2 1.75 10/20' -> ['1/2', '0.5', '1.5', '3/4', '2-1/2', '1.75', '10/20'] with Elasticsearch version 5; ['1/2', '1/2', '1-1/2', '3/4', '2-1/2', '1-3/4', '10/20'] with version 7

  • 'EMT Electrical Metallic Tubing 1/2 in White Conduit 10 ft' -> ['emt', 'electrical', 'metallic', 'tubing', '1/2', 'inch', 'white', 'conduit', '10', 'foot']

  • '<h1>Header</h1><p>This is a paragraph.</p>' -> ['<h1>header</h1><p>this', 'is', 'a', 'paragraph.</p>']

  • 'United States of America USA United States Totally Different Text' -> ['united', 'states', 'of', 'america', 'usa', 'united', 'states', 'totally', 'different', 'text']

  • 'United Manager Stemming Day today reading reader Helloing' -> ['united', 'manager', 'stemming', 'day', 'today', 'reading', 'reader', 'helloing']

  • 'ft inch yards miles meter milli centi kilo mi2 yd2 in2 pounds' -> ['foot', 'inch', 'yard', 'mile', 'meter', 'millimeter', 'centimeter', 'kilometer', 'square', 'mile', 'square', 'yard', 'square', 'inch', 'pound']