
Anatomy Of Setting Up An Elasticsearch N-Gram Word Analyzer

Adrienne Gessler | Development Technologies, Java


To say that n-grams are a massive topic would be an understatement. Do a quick search and you will find yourself staring down volumes of information on linguistics and language models, on data mining, or on the implication of the breakdown of specific proteins on the decline of debutante culture.

Okay, I’m kidding about that last one. But if you are a developer setting about using Elasticsearch for searches in your application, there is a really good chance you will need to work with n-gram analyzers in a practical way for some of your searches and may need some targeted information to get your search to behave in the way that you expect. There are many, many possibilities for what you can do with an n-gram search in Elasticsearch. This blog will give you a start on how to think about using them in your searches.

An Example

Firstly, let’s narrow the field a little here. In a lot of cases, using n-grams might refer to searching sentences, where your gram would refer to the words of the sentence. (Those word-level n-grams are what Elasticsearch calls shingles.) But for today, I want to focus on the breakdown of single words into character n-grams.

Let’s further narrow ourselves by assuming that we want to use this search for approximate matching. It is common for an application to need to search words (names, usernames) or data similar to a word (telephone numbers) and then give the searcher more information in the form of close matches to the search term. Here we also want partial matching somewhere within the word, not always at the front and not always at the end.

For the sake of a specific application for reference, let’s pretend we have a site where animals can be looked up by name. Maybe it’s the front line of a veterinarian’s office and the office wants to do all lookups by the pet’s name first. Of course, you would probably find yourself expanding this search to include other criteria quickly, but for the sake of an example let’s say that all dog lovers at this office are crazy and must use the dog’s name.

The Analyzer

Now let’s think about what we want in terms of an analyzer. Firstly, we already know we want an n-gram of some sort, because we want partial matching. Secondly, we have already decided above that we want that partial matching to occur within the word. In this case, this will only be true to an extent, as we will see later, but we can now determine that we need the NGram Tokenizer and not the Edge NGram Tokenizer, which only keeps n-grams that start at the beginning of a token.
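
As a quick illustration of that difference, you can run the two built-in tokenizers through the _analyze API (this sketch assumes their default settings of min_gram 1 and max_gram 2, which is why the grams here are so short):

curl -XGET 'localhost:9200/_analyze?tokenizer=edgeNGram' -d 'Raven'
# only grams anchored at the start of the token: R, Ra

curl -XGET 'localhost:9200/_analyze?tokenizer=nGram' -d 'Raven'
# grams starting at every position: R, Ra, a, av, v, ve, e, en, n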

Elasticsearch n-grams allow for a minimum and maximum gram size. Starting with the minimum, how much of the name do we want to match? Well, the default is one, but since we are already dealing in what is largely single-word data, if we go with one letter (a unigram) we will certainly get way too many results. Realistically, the same thing is going to apply to a bigram, too. However, enough people have pets with three-letter names that we’d better not raise the minimum any further, or we might never return the puppies named ‘Ace’ and ‘Rex’ in the search results. Now we know that our minimum gram is going to be three. What about the max gram? The default is two, and we’ve already exceeded that with our minimum. Our goal is to include as many potentially accurate matches as possible while still not going crazy in terms of index storage size.

Think about picking an excessively large number like 52 and breaking names down into every possible gram between 3 and 52 characters, and you can see how this adds up quickly as your data grows: a single 52-character name alone would produce 50 + 49 + … + 1 = 1,275 grams. There is a bit of give and take here, though, because capping the max gram means you can end up excluding data that exceeds it in some cases.

There are a couple of ways around this exclusion issue. One is to include a second mapping of your field that uses a different analyzer, such as a standard analyzer; another is to use a second mapping and benefit from the speed and accuracy of an exact-match term query.
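
For instance, a minimal sketch of that second-mapping idea, assuming the ngram_analyzer we are about to define below and a hypothetical name.raw sub-field (pre-2.x string/not_analyzed mapping syntax), might look like:

curl -XPUT 'http://localhost:9200/searchpets/_mapping/pet' -d '
{
    "pet": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "ngram_analyzer",
                "fields": {
                    "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}'

A term query against name.raw then gives you fast, exact matches on the untouched full name, while name itself keeps the n-gram behavior.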

In our case, we are going to take advantage of the ability to use separate analyzers for search and index. We assume that the data beyond the max gram is largely irrelevant to our search, which in this case it most likely is.

So here we create the index and then set up a custom analyzer. The examples here are going to be a bit simple in relation to the overall content, but I hope they aid in understanding.

Note: Slightly off topic, but in real life you will want to go about this in a much more reusable way, such as a template, so that you can easily use aliases and versions and make updates to your index. For the sake of this example, though, I’m just showing the simplest curl-based index creation.
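
As a rough sketch of what that more reusable route could look like (hypothetical template and alias names, using the 1.x-era _template API; any new index whose name matches the pattern picks up the settings and the alias automatically):

curl -XPUT 'localhost:9200/_template/searchpets_template' -d '
{
    "template": "searchpets_v*",
    "settings": {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": { "tokenizer": "ngram_tokenizer" }
            },
            "tokenizer": {
                "ngram_tokenizer": { "type": "nGram", "min_gram": "3", "max_gram": "8" }
            }
        }
    },
    "aliases": {
        "searchpets": {}
    }
}'

That way you can create versioned indexes (searchpets_v1, searchpets_v2, and so on) and manage which one the searchpets alias points at, while your application only ever talks to the alias.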

Here is our first analyzer, creating a custom analyzer that uses an ngram_tokenizer with our settings. If you are here, you probably know this already, but the tokenizer is used to break a string down into a stream of terms or tokens. You could swap in a whitespace tokenizer or add many other options here, depending on your needs:

curl -XPUT 'localhost:9200/searchpets' -d '
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "ngram_analyzer" : {
                        "tokenizer" : "ngram_tokenizer"
                    }
                },
                "tokenizer" : {
                    "ngram_tokenizer" : {
                        "type" : "nGram",
                        "min_gram" : "3",
                        "max_gram" : "8"
                    }
                }
            }
        }
    }'

And our response to this index creation is {"acknowledged":true}. Excellent.

Alright, now that we have our index, what will the data look like when our new analyzer is used?

curl -XGET 'localhost:9200/searchpets/_analyze?analyzer=ngram_analyzer' -d 'Raven'

And the response is:

{"tokens":[{"token":"Rav","start_offset":0,"end_offset":3,"type":"word","position":1},{"token":"Rave","start_offset":0,"end_offset":4,"type":"word","position":2},{"token":"Raven","start_offset":0,"end_offset":5,"type":"word","position":3},{"token":"ave","start_offset":1,"end_offset":4,"type":"word","position":4},{"token":"aven","start_offset":1,"end_offset":5,"type":"word","position":5},{"token":"ven","start_offset":2,"end_offset":5,"type":"word","position":6}]}

This is reasonable: all of the tokens generated are between 3 and 5 characters long (the word itself is only 5 characters, well under our max of 8).

Okay, great, now let’s apply this to a field. And, yes, you can absolutely do it all in one step; I’m just breaking it down.

$ curl -XPUT 'http://localhost:9200/searchpets/_mapping/pet' -d '
{
    "pet": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "ngram_analyzer"
            }
        }
    }
}
'

We test the analysis on the field:

curl -XGET 'http://localhost:9200/searchpets/_analyze?field=pet.name' -d 'Raven'

And, again, we get the results we expect:

{"tokens":[{"token":"Rav","start_offset":0,"end_offset":3,"type":"word","position":1},{"token":"Rave","start_offset":0,"end_offset":4,"type":"word","position":2},{"token":"Raven","start_offset":0,"end_offset":5,"type":"word","position":3},{"token":"ave","start_offset":1,"end_offset":4,"type":"word","position":4},{"token":"aven","start_offset":1,"end_offset":5,"type":"word","position":5},{"token":"ven","start_offset":2,"end_offset":5,"type":"word","position":6}]}

Now let’s assume that I’ve gone ahead and added a few records here and run a simple match query for: {"query":{"match":{"name":"Pegasus"}}}.
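
If you want to follow along, the records and the query were roughly as follows (the names and IDs match the response below; I’m only showing the two pets that actually come back):

curl -XPUT 'localhost:9200/searchpets/pet/2' -d '{"name": "Degas"}'
curl -XPUT 'localhost:9200/searchpets/pet/3' -d '{"name": "Pegasus"}'

curl -XGET 'localhost:9200/searchpets/pet/_search' -d '{"query":{"match":{"name":"Pegasus"}}}'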

With my data, we get back the following:

"hits": {
	"total": 2,
	"max_score": 0.29710895,
	"hits": [
		{
			"_index": "searchpets",
			"_type": "pet",
			"_id": "3",
			"_score": 0.29710895,
			"_source": {
				"name": "Pegasus"
			}
		}
		,{
			"_index": "searchpets",
			"_type": "pet",
			"_id": "2",
			"_score": 0.0060450486,
			"_source": {
				"name": "Degas"
			}
		}
	]
}

We get the closest match plus a close option that might actually be what the user is looking for.

Custom Analyzer

Alright, but right now we are using a pretty basic case of an analyzer. What if we need a custom analyzer so that we can handle a situation where we need a different tokenizer on the search side versus on the indexing side? What if we want to keep the whole search input together as a single token by using the keyword tokenizer?

Let’s change this to set up a custom analyzer using a filter for the n-grams. Since we are using a keyword tokenizer and a match query in this next search, the results here will actually be the same as before in these test cases, but you will notice a difference in how they are scored.
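
One bit of housekeeping first: the searchpets index from the previous example already exists, so the PUT below will be rejected with an "index already exists" error unless you delete it (or pick a new index name) before recreating it:

curl -XDELETE 'localhost:9200/searchpets'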

$ curl -XPUT 'localhost:9200/searchpets' -d '
 {
    "settings": {
        "analysis": {
            "analyzer": {
                "namegrams": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "ngrams_filter"
                    ]
                }
            },
            "filter": {
                "ngrams_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 8
                }
            }
        }
    }
}'

Now we add a mapping and data as before:

curl -XPUT 'http://localhost:9200/searchpets/_mapping/pet' -d '
{
    "pet": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "namegrams"
            }
        }
    }
}
'
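
If you want to sanity-check the new analyzer before querying, the same _analyze trick from earlier works here too. With the keyword tokenizer the whole name stays as one token, and the filter then produces grams of 3 to 8 characters from every starting position (Peg, Pega, Pegas, …, egasus, gasus, and so on):

curl -XGET 'localhost:9200/searchpets/_analyze?analyzer=namegrams' -d 'Pegasus'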

I run another match query: {"query":{"match":{"name":"Pegasus"}}} and the response is:

hits": {
"total": 2,
"max_score": 1.1884358,
"hits": [
	{
		"_index": "searchpets",
		"_type": "pet",
		"_id": "2",
		"_score": 1.1884358,
		"_source": {
			"name": "Pegasus"
		}
	}
	,{
		"_index": "searchpets",
		"_type": "pet",
		"_id": "3",
		"_score": 0.08060065,
		"_source": {
			"name": "Degas"
		}
	}
]
}

So we have this set up, and we are getting the results and scoring that we expect based on the keyword tokenizer and n-grams filter. Let’s say we are doing some more complex queries. We may have also added some other filters or tokenizers. Things are looking great, right? Well, almost.

There is one small factor to keep in mind with all of this, which I mentioned earlier. We have a max gram of 8. So, what happens when we have a name that exceeds that size as our search criteria? Well, depending on your search, you may not get any data back.
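
For a concrete (hypothetical) example: with the setup above, a pet named 'Bartholomew' is only indexed as grams of 3 to 8 characters, so the full 11-character name never exists as a token in the index. Any query clause that does not get broken down the same way, such as an un-analyzed term query for the exact name, comes back empty:

curl -XGET 'localhost:9200/searchpets/pet/_search' -d '
{"query": {"term": {"name": "Bartholomew"}}}'
# no hits: term queries skip analysis, and "Bartholomew" itself was never indexed as a token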

Probably not what you were expecting to happen! How do you avoid this situation? One way is to use a different index_analyzer and search_analyzer. Splitting these up gives you much more control over your search.

So, here’s what your final setup might look like assuming everything we said about this original search is true. I won’t dive into the details of the query itself, but we will assume it will use the search_analyzer specified (I recommend reading the hierarchy of how analyzers are selected for a search in the ES documentation).

Note: the lowercase tokenizer on the search_ngram analyzer here normalizes token text and splits on anything that is not a letter, so any numbers will be stripped out. This works for this example, but with different data it could have unintended results.
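
You can see that behavior directly with the _analyze API (hypothetical name with a digit in it):

curl -XGET 'localhost:9200/_analyze?tokenizer=lowercase' -d 'Rex2'
# returns the single token "rex": the digit acts as a separator and is dropped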

$ curl -XPUT 'localhost:9200/searchpets' -d '
 {
    "settings": {
        "analysis": {
            "analyzer": {
                "namegrams": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "ngrams_filter"
                    ]
                },
                "search_ngram": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": [
                        "truncate_filter"
                    ]
                }
            },
            "filter": {
                "ngrams_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 8
                },
                "truncate_filter": {
                    "type": "truncate",
                    "length": 8
                }
            }
        }
    }
}
'

And then, finally, we set up our mapping again:

curl -XPUT 'http://localhost:9200/searchpets/_mapping/pet' -d '
{
    "pet": {
        "properties": {
            "name": {
                "type": "string",
                "index_analyzer": "namegrams",
                 "search_analyzer": "search_ngram"
            }
        }
    }
}'
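
With everything in place, you can check what a long search term turns into on the query side; for example, with a hypothetical long name:

curl -XGET 'localhost:9200/searchpets/_analyze?analyzer=search_ngram' -d 'Bartholomew'
# returns the single token "bartholo": lowercased and truncated to 8 characters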

Final Thoughts

And there you have it. This does make the assumption, though, that the data exceeding 8 characters is less important. If you have a lot of data that is longer than the max gram and similar to one another, you might find yourself needing further tweaking.

There are many, many possibilities for what you can do with an n-gram search in Elasticsearch. I’m hoping that this gives you a start on how to think about using them in your searches.

— Adrienne Gessler, [email protected]
