In our case the "queries" are also the index creation components. Every time someone discusses something, we index it, so you can search media, documents, and people by context. We hint at how this works here:
https://austingwalters.com/fast-full-text-search-in-postgres...
The downside of our approach is that it needs lots of conversation data. From their TLDR version:
"""
- Our model of a web page is based on queries only. These queries could either be observed in the query logs or could be synthetic, i.e. we generate them. In other words, during the recall phase, we do not try to match query words directly with the content of the page. This is a crucial differentiating factor – it is the reason we are able to build a search engine with dramatically less resources in comparison to our competitors.
- Given a query, we first look for similar queries using a multitude of keyword and word vector based matching techniques.
- We pick the most similar queries and fetch the pages associated with them.
- At this point, we start considering the content of the page. We utilize it for feature extraction during ranking, filtering and dynamic snippet generation.
"""
It appears 0x65 has similarly figured this out: the name of the game is forming proper search queries. In their case, results should be good as soon as they start indexing and generating synthetic queries. IMO that might work better for documents and whatnot.
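For anyone who wants to see the shape of it, here is a rough sketch of the quoted recall-then-rank steps. To be clear, this is my own toy version, not their implementation: the `QueryIndex` class, its method names, and the sample pages are all made up, and plain Jaccard token overlap stands in for their "keyword and word vector based matching".

```python
from collections import defaultdict


def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class QueryIndex:
    """Hypothetical index where a page is modeled only by its queries."""

    def __init__(self):
        self.query_to_pages = defaultdict(set)  # known query -> page ids
        self.page_content = {}                  # page id -> raw content

    def add_page(self, page_id, content, queries):
        # Queries may come from logs or be synthetic; the index does not care.
        self.page_content[page_id] = content
        for q in queries:
            self.query_to_pages[q].add(page_id)

    def recall(self, query, top_k=3):
        # Recall phase: match against known queries only, never page content.
        q_tokens = tokens(query)
        scored = [(jaccard(q_tokens, tokens(known)), known)
                  for known in self.query_to_pages]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(reverse=True)
        pages = set()
        for _, known in scored[:top_k]:
            pages |= self.query_to_pages[known]
        return pages

    def search(self, query, top_k=3):
        # Ranking phase: only now is the page content consulted.
        q_tokens = tokens(query)

        def score(page_id):
            return len(q_tokens & tokens(self.page_content[page_id]))

        candidates = self.recall(query, top_k)
        return sorted(candidates, key=lambda p: (-score(p), p))


index = QueryIndex()
index.add_page("pg1", "Postgres full text search with tsvector and GIN indexes",
               ["fast postgres search", "postgres full text search"])
index.add_page("pg2", "Baking sourdough bread at home",
               ["sourdough recipe", "how to bake bread"])
results = index.search("postgres text search")
```

The point of the split is that `recall` only ever compares short queries with other short queries, which is much cheaper than matching against full page text. The content is touched only for the handful of candidates that survive recall.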
Either way, interesting to compare notes! Kudos on the work.
If you're ever looking for something to write about for a new blog post, I would love to learn more about how you implemented that estimate_count function.