Posted to java-user@lucene.apache.org by Yu Zhou <j_...@yahoo.com> on 2013/08/28 20:37:28 UTC

Unifying IDF for unbalanced shards?

Hello,

We have a large collection of documents split across multiple balanced shards, and each shard is quickly approaching its limit. We would therefore like to explore adding unbalanced shards to the mix. However, IDF would then differ from shard to shard, so scores from different shards would no longer be comparable and relevance would take a hit.

Several days ago, I asked about relevance across unbalanced shards in the #lucene IRC channel. Somebody pointed me to a Solr JIRA issue about distributed IDF (SOLR-1632).

After some thinking and research, I found that some new Lucene 4 features may help unify IDF across shards: calculate docFreq across all shards at index time, then at query time supply or modify the TermStatistics in the IndexSearcher. I am running some experiments with this approach.
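
To make the idea concrete, here is a minimal sketch of the arithmetic only, with no Lucene dependency. It sums docFreq and maxDoc over the shards and compares the per-shard IDF with the unified one, using the classic formula 1 + ln(numDocs / (docFreq + 1)). The class GlobalIdfSketch, the holder ShardStats, and the sample numbers are hypothetical and exist only for illustration. In a real setup the merged values would presumably be returned from overridden IndexSearcher.termStatistics and IndexSearcher.collectionStatistics methods; that wiring is an assumption and is not shown.

```java
import java.util.Arrays;
import java.util.List;

public class GlobalIdfSketch {

    /** Per-shard statistics for one term. Hypothetical holder, not a Lucene class. */
    public static final class ShardStats {
        public final long docFreq; // documents in the shard that contain the term
        public final long maxDoc;  // total documents in the shard

        public ShardStats(long docFreq, long maxDoc) {
            this.docFreq = docFreq;
            this.maxDoc = maxDoc;
        }
    }

    /** Classic TF-IDF inverse document frequency: 1 + ln(numDocs / (docFreq + 1)). */
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    /** Sums docFreq and maxDoc over all shards to get collection-wide statistics. */
    public static ShardStats merge(List<ShardStats> shards) {
        long docFreq = 0;
        long maxDoc = 0;
        for (ShardStats s : shards) {
            docFreq += s.docFreq;
            maxDoc += s.maxDoc;
        }
        return new ShardStats(docFreq, maxDoc);
    }

    public static void main(String[] args) {
        // Made-up numbers: the same term is equally frequent in absolute terms,
        // but the two shards differ in size by a factor of 100.
        ShardStats big = new ShardStats(1000, 10_000_000L);
        ShardStats small = new ShardStats(1000, 100_000L);
        ShardStats global = merge(Arrays.asList(big, small));

        System.out.println("idf on big shard:   " + idf(big.docFreq, big.maxDoc));
        System.out.println("idf on small shard: " + idf(small.docFreq, small.maxDoc));
        System.out.println("unified idf:        " + idf(global.docFreq, global.maxDoc));
    }
}
```

With local statistics the same term gets a very different weight on each shard, which is why merged result lists rank inconsistently. With the summed statistics every shard would score the term with the same IDF.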

Now the question is: is this really a good thing to try?

Best Regards,

Jerry Zhou