Semantic search tutorial
When users type words into a search bar, they are looking for documents relevant to a topic of interest. However, the search words often appear even more frequently in documents unrelated to that topic. Users therefore usually add words that narrow down the expected group of documents. Such search refinement is often iterative: users keep adding and removing words until the expected results show up.
So, the question comes up: How can we help users to refine the search query?
The easiest way is offered by search engines with millions of users, where algorithms can analyze giant datasets and the most popular search results. However, this is not particularly helpful when something ‘exotic’ is searched for.
This route is not an option for document collections used by limited groups of users, such as narrow professional communities or the employees of a particular corporation. Another similar case is a brand-new, rapidly growing domain that makes breakthroughs by the hour.
We offer a different solution: using sophisticated Natural Language Processing (NLP) methods, our system creates a set of semantic filters (classes of documents). A semantic filter helps users narrow down a search query by inviting them to select a context for their search. A context is formed by a collection of documents focusing on one subject. By selecting a context, a user may dramatically decrease the number of documents in the search results.
Contexts are determined automatically with the Semantic Map technology during preprocessing of documents. This preprocessing highlights documents with related meanings. For search queries with a large number of results, users can select a context that matches their target topic. Selecting a context is a faster alternative to guessing an extra word to refine the search query. By using predefined search contexts, users may save a significant amount of time.
Below, we explain the functionality of the Semantic Search based on English Wikipedia. Searching within other document collections (or domain areas) will have similar functionality, but topics or keywords will be different.
Let’s have a short tour of our search application. Type “leaf” in the search bar and click “Search” and you’ll have search results on the screen:
These results are quite similar to what a conventional corporate search engine will produce. It looks for the word “leaf” in all documents, sorts them by relevance (i.e. how often it encounters the word in each document), and returns the entire list of documents to a user. The bottom of the screenshot shows the number of pages in this list — more than 10,000 documents in total (our application shows 10 results per page)! Finding what you need in this list is very complicated. The query should be narrowed down to reach an acceptable number of search results. So, in what context do we want our “leaf”? Which extra word should be put in the search bar to define the necessary context?
The semantic filter built into our application saves you the time of guessing the relevant search terms. The contexts have been identified automatically during preprocessing of the texts. Each article is placed in exactly one context, which is indicated in the search results after the article itself. We can see in the screenshot that the first three articles belong to the contexts “school”, “botany” and “cars”. For all the articles in the search results, the number of occurrences of each context is counted. The contexts are presented as a list, sorted in descending order of occurrence.
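The context list described above can be illustrated with a minimal sketch. It assumes each search result carries exactly one context label. The article titles, context assignments, and the function name `context_list` are made up for illustration and are not taken from the application.

```python
from collections import Counter

# Hypothetical search results for "leaf": (article title, context) pairs.
# Titles and context assignments are invented sample data.
results = [
    ("Leaf School", "school"),
    ("Leaf", "botany"),
    ("Nissan Leaf", "cars"),
    ("Banana leaf", "national cuisines"),
    ("Leaf vegetable", "national cuisines"),
    ("Leaf shape", "botany"),
    ("Bay leaf", "national cuisines"),
]


def context_list(results):
    """Count how many results fall into each context, most frequent first."""
    counts = Counter(context for _title, context in results)
    return counts.most_common()


for context, count in context_list(results):
    print(f"{context}: {count}")
```

Because every article belongs to a single context, the counts add up to the total number of search results.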
To use the semantic filter, click on the suitable context either at the end of the article or in the contexts list. Selecting the “national cuisines” context provides the following results:
Now the results list contains only the articles in the selected context. The list has slimmed down dramatically to about 670 articles, markedly fewer than in the first search!
The current status of the semantic filter appears under the search bar. Now it shows the button associated with the selected context. To dismiss context selection, either click on it a second time in the list or click on the cross within the context button in the semantic filter under the search bar.
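Selecting and dismissing a context can be sketched as a simple toggle. This is an illustration only: it assumes one selected context at a time, and the names `toggle_context` and `apply_filter` as well as the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Result = Tuple[str, str]  # (article title, context); invented sample structure


def toggle_context(selected: Optional[str], clicked: str) -> Optional[str]:
    """Clicking the already selected context dismisses it; otherwise select it."""
    return None if selected == clicked else clicked


def apply_filter(results: List[Result], selected: Optional[str]) -> List[Result]:
    """Keep only results in the selected context; with no selection keep all."""
    if selected is None:
        return list(results)
    return [r for r in results if r[1] == selected]


results = [
    ("Leaf", "botany"),
    ("Banana leaf", "national cuisines"),
    ("Nissan Leaf", "cars"),
    ("Bay leaf", "national cuisines"),
]

selected = toggle_context(None, "national cuisines")      # first click selects
print(apply_filter(results, selected))

selected = toggle_context(selected, "national cuisines")  # second click dismisses
print(len(apply_filter(results, selected)))
```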
There is a document map below the list of contexts, illustrating the location of the selected context. The map enables us to further explore the semantic relations between the articles in the domain.
Document similarity search
We have another feature for finding necessary documents faster: search for similar documents.
Let us repeat the “leaf” search and apply the semantic filter with the “national cuisines” context. The search engine sorts the articles by the frequency of occurrence of the word “leaf.” The first article in the list concerns the use of banana leaves in cooking. Can the search be narrowed down further to find articles similar to the first one?
Click the “find similar” link on the bottom right of the article, which returns the following results:
Now the list includes articles on banana leaf applications in cuisine, i.e. articles thematically similar to the first one. Note the status of the semantic filter: besides the “national cuisines” context, it includes the “Banana leaf” article. This is how the search works now: articles semantically similar to the chosen one are retrieved, those falling into the “national cuisines” context are selected, and afterward the search engine returns the results sorted by the occurrence of the word “leaf”.
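The three steps just described (similarity, context, word frequency) can be sketched as a small pipeline. The sketch rests on several assumptions: similarity to the reference article is given as a precomputed, made-up score, "similar" means that score reaches a hypothetical threshold, and word frequency is counted with naive whitespace tokenization. None of this is the application's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    context: str
    text: str
    similarity: float  # made-up similarity to the reference article, in [0, 1]


def term_frequency(text, word):
    """Naive count of a word in whitespace-separated, lowercased text."""
    return text.lower().split().count(word.lower())


def search(articles, word, context=None, min_similarity=None):
    hits = articles
    if min_similarity is not None:   # step 1: keep semantically similar articles
        hits = [a for a in hits if a.similarity >= min_similarity]
    if context is not None:          # step 2: keep the selected context
        hits = [a for a in hits if a.context == context]
    # step 3: keep articles containing the word, most occurrences first
    hits = [a for a in hits if term_frequency(a.text, word) > 0]
    return sorted(hits, key=lambda a: term_frequency(a.text, word), reverse=True)


articles = [
    Article("Banana leaf", "national cuisines", "banana leaf rice served on a leaf", 1.0),
    Article("Banana leaf rice", "national cuisines",
            "leaf wrapped rice on a leaf plate with leaf garnish", 0.9),
    Article("Leaf vegetable", "national cuisines", "leaf vegetable salad", 0.3),
    Article("Plantain", "botany", "plantain leaf and fruit", 0.8),
    Article("Tamale", "national cuisines", "steamed corn dough", 0.7),
]

for a in search(articles, "leaf", context="national cuisines", min_similarity=0.5):
    print(a.title)
```

Calling `search` with `context=None` drops the second step, so similar articles from other contexts can appear in the results as well.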
Let’s lift some of the limitations and remove the “national cuisines” context from the semantic filter. This can be done by clicking the cross on the context button within the semantic filter. The list now shows articles similar to the one chosen earlier, although they may belong to different contexts.
Another important feature of this application allows users to receive a list of articles similar to a given one without engaging the search engine.
As before, let us type “leaf” into the search bar, select the “national cuisines” context, and click “find similar” on the first article (“Banana leaf”). The search results list features only articles that contain the word “leaf,” and their order is defined by the frequency of occurrence of this word. Say we would like to remove this influence. Maybe the word “leaf” is not that important, and sorting the articles by semantic proximity is the priority instead.
Clear the search bar and press Enter:
Now the results feature articles that appear to be closest to the given one semantically, with the top one being the closest one to it.
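Ranking by semantic proximity alone can be sketched as follows. This is only an illustration: it assumes that each article is represented by a numeric vector and that proximity is measured with cosine similarity, neither of which the tutorial confirms. The toy vectors, the exclusion of the reference article from its own results, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(reference, vectors, top_n=10):
    """Rank all other articles by similarity to the reference, closest first."""
    ref = vectors[reference]
    scored = [
        (title, cosine(ref, vec))
        for title, vec in vectors.items()
        if title != reference  # assumption: the reference itself is not listed
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 3-dimensional vectors; real document vectors would be much larger.
vectors = {
    "Banana leaf": [0.9, 0.1, 0.0],
    "Banana leaf rice": [0.8, 0.2, 0.0],
    "Plantain": [0.5, 0.5, 0.1],
    "Nissan Leaf": [0.0, 0.1, 0.9],
}

for title, score in most_similar("Banana leaf", vectors):
    print(f"{title}: {score:.3f}")
```

No search word is involved here: the order depends only on how close each vector is to that of the reference article.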
Similar documents search enables further exploratory search across articles: contexts can be removed from or added to the semantic filter and another reference article may be selected by clicking “find similar” next to it. This mode opens up a completely different dimension of search by considering only the semantic proximity of documents without using any search words.
Semantic Search is a new text search technology that factors in the content of documents. With the help of the Semantic Map technology, the search tool finds the most important contexts for a specific collection of documents and helps users narrow down their search queries easily. This boosts the efficiency and relevance of the search. Semantic Search is a component of the Silk Data Semantic Framework.