Can you elaborate a little on why it's not a distributed solution if sites share their local indexes with the routing search engine?
Yes, sure. I use "distribution" in the sense of splitting a process across different entities (the search router and the local searches).
In the distributed case the "search router" queries other sites to determine the best results. For example, searching for code samples involves querying Stack Overflow, Code Project, forums, etc. This approach is clearly expensive: you depend on each site's speed, web-service availability, and so on.
The non-distributed approach instead receives their algorithms and data and processes everything in the router search engine. Of course this solution can be implemented in a distributed way inside the search engine, but it is not distributed in the sense of splitting the process across different entities.
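To make the first (distributed) case concrete, here is a minimal sketch of the fan-out router. The per-site searchers are hypothetical stubs standing in for real site APIs (Stack Overflow, Code Project, ...); in practice each call would be a live web-service request, which is exactly why the router inherits every remote site's latency and availability.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-site searchers: each returns (url, local_score) pairs.
# In the distributed approach these would be remote web-service calls.
SITE_SEARCHERS = {
    "stackoverflow": lambda q: [("so/" + q, 0.9)],
    "codeproject":   lambda q: [("cp/" + q, 0.7)],
}

def route_query(query):
    """Fan the query out to every site concurrently, then merge in the router."""
    with ThreadPoolExecutor() as pool:
        futures = {site: pool.submit(search, query)
                   for site, search in SITE_SEARCHERS.items()}
        results = [(site, url, score)
                   for site, fut in futures.items()
                   for url, score in fut.result()]
    # The router's own ranking step: here simply a sort by local score.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

The non-distributed variant would replace the concurrent remote calls with lookups into indexes the router already holds locally; the merge step stays the same.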
In both cases you are distributing effort, which is one of the key goals of this approach, because it is really difficult to compete with Google. Google "knows" how to give good results in diverse areas, while in the proposed attack vector you rely on others for part of that optimization. A movie site should know how to give good results about movies, while a site about books knows about books.
It's important to note that the vast majority of search results end at relatively few sites, so if the top visited sites implemented this approach, Google's search market share could be challenged. Of course, we won't really know whether this approach works in practice until we see it.
Thank you for your explanation; now I understand better. But a distributed system is usually defined in your second sense, distributed inside the search engine, rather than across different websites.
I'm very interested in this topic because I've proposed my own solution for improving search engines, not from the algorithmic point of view but from the systems point of view. Making it fully distributed is the key.
Regarding the 2-level search and "receiving their algorithms and data ...", I don't think that's very feasible. Do you agree? So a vertically distributed architecture across various industries is not a feasible solution. But we can build a horizontally distributed architecture that collects data by geographic location, with many different verticals inside each location. If Google cannot find a better solution, it's only a matter of time before search engines are improved in some such way.
Because to my understanding, 2-tiered search means that the routing search engine scrapes data from the second-level search engines and returns it to the users. The second-level search engines, e.g., Stack Overflow, are usually run by entities separate from the routing search engine, say DDG. If DDG does not own all the second-level search engines, how can it get the local indexes and ranking algorithms from them? And even if DDG does get them, it's no longer decentralized. So what's the difference from Google?
1) Google's quality of indexing has no competition yet.
2) Google can calculate a page rank across different domains.
3) No single entity can make the same effort, or is smart enough, to build a similar thing.
If you follow the 2-tier route:
1) Each entity takes responsibility for optimizing the quality of search locally.
2) They know their own domain, or they can learn to optimize their page rank at a local level instead of a global level.
So, in the end you have distributed the work of local optimization across different intelligent entities. For example, when you look at the Linux kernel or other open-source projects, you can count millions of man-hours that are difficult to muster within a single entity.
Yes, I agree with you on using a 2-tier search, which will increase relevancy and improve the quality of search. And Google's search quality is not unbeatable.
I also agree with using distributed sites to optimize results locally. What I actually proposed is to distribute search along both the geographic and the vertical-market dimensions, as opposed to your dedicated sites, but the two are complementary. The dedicated sites will definitely provide better and more relevant results than a global search engine whose search is not limited to a particular site.
However, the one thing I don't agree with is your claim that it does not have to be distributed, i.e., that the search router can integrate the algorithms from the dedicated sites. I don't think that's feasible, since it's not possible for Stack Overflow or Wikipedia to share their algorithms with DDG.
Let me know if I misunderstood you. If you'd like to take it offline, I'll be happy to discuss with you via email; see my profile.
What? DDG does not own Stack Overflow, but it can run Stack Overflow's algorithm locally? Do you mean that Stack Overflow has its search algorithm open-sourced? I'm sorry, I don't quite get it.
Usually, the router search engine queries data from the second-tier websites to get high-quality results without having the other websites' algorithms. There is also another problem: how do you know which websites to query for an arbitrary keyword? For example, when a user searches for "cookie" on your search engine, where do you send the query? How do you know whether they are looking for a food cookie or a browser cookie?
The issue of sharing algorithms is minor in this case. Most search engines use standard frameworks; Google doesn't share its algorithms, but Stack Overflow doesn't add any special magic on top of them.
As for knowing where to route a query, that is an issue, but not a big one here. The article doesn't propose a two-tiered search over every website. A two-tiered search over the top 100 sites is enough to challenge Google (the main point of the article), and making 100 searches and filtering them in the second tier is not difficult.
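A minimal sketch of that second-tier filtering step, assuming each of the ~100 sites returns hits with its own local scores. Since local scores from different sites are not directly comparable, the router can normalize per site before merging; real routers would additionally weight sites by query relevance and historical quality.

```python
def merge_second_tier(per_site_results, k=10):
    """Merge per-site result lists in the router's second tier.

    per_site_results: {site_name: [(url, local_score), ...]}
    Normalizes each site's scores to [0, 1] so they are comparable,
    then returns the top-k (site, url, score) triples overall.
    """
    merged = []
    for site, hits in per_site_results.items():
        if not hits:
            continue
        top = max(score for _, score in hits)
        for url, score in hits:
            merged.append((site, url, score / top if top else 0.0))
    merged.sort(key=lambda r: r[2], reverse=True)
    return merged[:k]
```

Even a naive merge like this over 100 concurrent site queries is cheap compared to crawling and ranking the whole web centrally, which is the point of pushing ranking quality down to the sites themselves.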