On various news portals and e-commerce platforms, you may have seen recommendations for articles and products related to the main article or product.
Recommending similar products, articles, or documents involves complex algorithms, but Elasticsearch empowers us to utilize recommendation algorithms effortlessly.
Mainly, recommending similar products, articles, or documents is accomplished through content similarity or by considering the user’s query history.
Take Amazon product recommendations as an example. In the image below, the first section, “Get similar item Fast”, is based on content similarity, while the second one, “Customers who viewed this item also viewed”, is based on the user’s query history.
Setup Elasticsearch locally
Before diving into finding related documents, let’s set up Elasticsearch on your machine. I won’t go into detail on how to set up Elasticsearch locally; instead, I will use a project that I have already created on GitHub.
You can clone it from my public GitHub repo, dockerize-elasticsearch.
Or you can use the docker-compose.yml file below to run Elasticsearch and Kibana locally.
Once Elasticsearch and Kibana are up, you can open Kibana at http://localhost:5701/
Kibana is an open source analytics and visualization platform designed to work with Elasticsearch.
Similar Document Search With Elasticsearch
In this post, I will focus on content similarity with an example in Elasticsearch.
In Elasticsearch, content similarity can be found in two ways:
- k-nearest neighbor (kNN) search.
- More-like-this query.
k-nearest neighbor (kNN) search
A k-nearest neighbor (kNN) search finds the k nearest vectors to a query vector, as measured by a similarity metric.
Common use cases for kNN include:
- Relevance ranking based on natural language processing (NLP) algorithms
- Product recommendations and recommendation engines
- Similarity search for images or videos
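To make the idea of "the k nearest vectors" concrete, here is a minimal sketch in plain Python. It compares the query vector against every stored vector using cosine similarity, which is only an illustration of the concept and not how Elasticsearch implements kNN search. The document ids and the 2-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical embeddings, invented purely for illustration.
vectors = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.9, 0.1],
    "doc-c": [0.0, 1.0],
}

nearest = knn([1.0, 0.05], vectors, k=2)
print(nearest)
```

Here the query points almost in the same direction as doc-a and doc-b, so those two are returned and doc-c is left out.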
More-like-this query
The More Like This (MLT) query finds documents that are similar to a given input.
It selects the most relevant terms from the input, builds a new query from those terms, and runs it against the index.
In this post, I will focus on the More-like-this query. I will write a separate blog post about k-nearest neighbor (kNN) search.
You can find the full Elasticsearch documentation at elastic.co for input parameters, term selection parameters, and query formation parameters. I will not delve into the details of these aspects.
In this post, I will demonstrate how to recommend similar reviews for Apple products.
For this demo, I will use the like document input parameter and the min_term_freq, max_query_terms, and min_doc_freq term selection parameters.
min_term_freq: This sets the minimum term frequency; terms that occur fewer times than this in the input document are ignored.
In our case, min_term_freq is set to 2, so only terms that occur two or more times in the input document are used.
max_query_terms: This defines the maximum number of query terms that will be selected for the generated query. In our case, max_query_terms is set to 12, which means that at most 12 terms from the input document are used to build the query, no matter how many terms the input contains.
min_doc_freq: This specifies the minimum document frequency that a term must have to be considered when generating a query.
Document frequency refers to the number of documents in which a term appears.
Terms with a low document frequency are less common and more specific,
whereas terms with a high document frequency are more common and general.
In our case, min_doc_freq is set to 5, indicating that a term must be present in at least 5 documents.
If a term’s document frequency is less than 5, the term is ignored and does not become part of the generated query.
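The effect of these three parameters can be sketched in a few lines of plain Python. This is a simplified illustration, not Elasticsearch's implementation: the real query ranks candidate terms by tf-idf, while this sketch simply orders them by how often they occur in the input. The tokenizer is a rough stand-in for an analyzer, and the corpus consists of placeholder strings invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (rough stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())


def select_terms(input_text, corpus, min_term_freq, min_doc_freq, max_query_terms):
    """Pick the input terms that survive the three term selection filters."""
    term_freq = Counter(tokenize(input_text))

    # Document frequency: in how many corpus documents does each term appear?
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))

    candidates = [
        term
        for term, tf in term_freq.items()
        if tf >= min_term_freq and doc_freq[term] >= min_doc_freq
    ]
    # Simplification: order by frequency in the input, then alphabetically.
    candidates.sort(key=lambda term: (-term_freq[term], term))
    return candidates[:max_query_terms]


# Placeholder corpus, invented for illustration: 5 documents mention "Apple".
corpus = [
    "Apple review one",
    "Apple review two",
    "Apple review three",
    "Apple review four",
    "Apple review five",
    "Other review six",
    "Other review seven",
]

input_text = "Love Apple AirPods, the sound and quality are amazing as always with Apple"
selected = select_terms(input_text, corpus, min_term_freq=2, min_doc_freq=5, max_query_terms=12)
print(selected)
```

With these settings only "apple" survives: it is the only term that occurs twice in the input, and it appears in at least 5 documents of the placeholder corpus.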
Create Index with config map
You can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:
- Settings for the index
- Mappings for fields in the index
- Index aliases
The create index API allows for providing a mapping definition.
For this demo, I am creating the apple-review-index index with a review text field.
In the review field, I will store each user’s review of an Apple product.
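As a rough sketch, the create index request for this mapping could be built in Python as follows. The base URL http://localhost:9200 and an Elasticsearch instance without authentication are assumptions about the local setup, and the names ES_URL and create_index_request are only illustrative. The snippet builds the request without sending it.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local Elasticsearch on its default port
INDEX = "apple-review-index"

# Mapping with a single text field named "review".
index_definition = {
    "mappings": {
        "properties": {
            "review": {"type": "text"},
        }
    }
}


def create_index_request():
    """Build (but do not send) the PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}",
        data=json.dumps(index_definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = create_index_request()
print(request.get_method(), request.full_url)
# To actually create the index against a running cluster:
# urllib.request.urlopen(request)
```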
Once the index is successfully created, you will see a 200 response code with the response below.
The result of creating the index is attached below.
Store data in Field
We have already created the apple-review-index index with a review text field.
In apple-review-index, I will store the 7 reviews below.
Let’s store all the reviews one by one in the Elasticsearch index
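A minimal sketch of storing the reviews one by one from Python could look like this. The review strings are placeholders to be replaced with the actual review texts, and the base URL and the helper name index_request are assumptions for illustration. Each request posts one document to the index and lets Elasticsearch generate the document id.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local Elasticsearch on its default port
INDEX = "apple-review-index"

# Placeholders only; substitute the actual review texts here.
reviews = [
    "<first review text>",
    "<second review text>",
]


def index_request(review_text):
    """Build (but do not send) a POST request that stores one review."""
    body = {"review": review_text}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


requests_to_send = [index_request(text) for text in reviews]
for req in requests_to_send:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually store the review against a running cluster:
    # urllib.request.urlopen(req)
```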
Once the record is successfully created in the index, you will see a 201 (Created) response code with the response below.
A screenshot of storing a review in Elasticsearch from the Kibana UI is attached below.
Retrieve all the data from index
We have already inserted all 7 reviews into apple-review-index.
Let’s retrieve all of them to confirm that every review is in the Elasticsearch index.
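One way to retrieve everything is a match_all query sent to the index's _search endpoint. The sketch below shows the request body and a small helper that pulls the review texts out of a search response. The sample response is hand-written in the general shape of an Elasticsearch search response with placeholder values, and the helper name extract_reviews is only illustrative.

```python
import json

# Request body for the _search endpoint of apple-review-index.
search_body = {"query": {"match_all": {}}}


def extract_reviews(response):
    """Return the review text of every hit in a search response."""
    return [hit["_source"]["review"] for hit in response["hits"]["hits"]]


# Hand-written sample in the shape of a search response; values are placeholders.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_index": "apple-review-index", "_id": "1", "_source": {"review": "<first review text>"}},
            {"_index": "apple-review-index", "_id": "2", "_source": {"review": "<second review text>"}},
        ],
    }
}

print(json.dumps(search_body))
print(extract_reviews(sample_response))
```

A search returns 10 hits by default, which is enough to cover the 7 reviews in this demo.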
You will see the response below, which contains all the reviews that we created earlier.
A screenshot of the search result from the index is attached below.
Apply more_like_this Query
Let’s proceed with finding similar reviews in apple-review-index by applying a “more_like_this” query.
To do this, you’ll need to provide the actual review content that you want to use as a source for finding similar reviews.
Let’s create a query for Elasticsearch using the “more_like_this” query. Here are the details:
- Actual review content: “Love Apple AirPods, the sound and quality are amazing as always with Apple”
- min_term_freq: 2 (a term in the actual review content must occur two or more times to be used)
- max_query_terms: 12 (at most 12 terms from the actual review content are used to build the query)
- min_doc_freq: 5 (a term must appear in at least 5 reviews to be used in the query)
In our actual review content, the term “Apple” appears twice.
Here is the similar review search query
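As a sketch, the request body for this search could be assembled in Python as shown here. It uses the review text and the min_term_freq, max_query_terms, and min_doc_freq values described in the term selection parameters section, and it assumes the query targets the review field of apple-review-index. The body would be sent to the index's _search endpoint.

```python
import json

# more_like_this query body for the _search endpoint of apple-review-index.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["review"],  # assumption: compare against the review field
            "like": "Love Apple AirPods, the sound and quality are amazing as always with Apple",
            "min_term_freq": 2,
            "max_query_terms": 12,
            "min_doc_freq": 5,
        }
    }
}

print(json.dumps(mlt_query, indent=2))
```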
The query should return only 5 reviews, while the two reviews below should be ignored because they do not contain the term “Apple”, the only term that the query is built from.
Let’s see the response
In the above search response, we have observed only five hits, which is as expected. :)
The search result for similar reviews from the index screenshot is attached below.
Conclusion
In conclusion, this post has provided an overview of performing similar document searches with Elasticsearch, focusing on content similarity using the “more_like_this” query. Elasticsearch offers powerful features for finding related documents based on the content of the documents. We have explored the use of various parameters such as min_term_freq, max_query_terms, and min_doc_freq to fine-tune our similarity search.
By creating an index, storing data in fields, and applying the “more_like_this” query, we were able to effectively find similar reviews in our Elasticsearch index. Elasticsearch is a versatile tool that can be used for a wide range of applications, including recommendation systems, content similarity analysis, and more.
This post serves as a starting point for those looking to implement content-based recommendation systems and leverage Elasticsearch’s capabilities for similar document searches. Further exploration of Elasticsearch’s capabilities and parameter tuning can lead to more refined and accurate results in real-world applications.