Posted to solr-dev@lucene.apache.org by Henrib <hb...@gmail.com> on 2007/10/18 14:16:00 UTC

query handling / multiple languages / multiple cores

We have an application that indexes documents which can exist in several (at
least 2) languages.
We have 1 SolrCore per language, using the same field names in their schemas
(with different stopwords, synonyms & stemmers), the benefits for content
maintenance outweighing (at least) the complexity.
Using EN & FR as an example, a document always exists in EN as a reference
and some of them - not all - are translated into FR; the same document unique
id is used for the reference & the translation.
If a user performs a query in FR, both FR documents and EN documents are
searched.
FR docs are searched first; the same query is then run against EN, after
removing from the EN document set those returned by the FR query. That is, if
document id 'AZ123' is retrieved through the FR query, it can't be retrieved
by the EN query. Removing the ids of the FR-returned documents from the EN
searchable document set guarantees that the 2 result sets are disjoint.
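
To make the intended behaviour concrete, here is a minimal sketch in plain Java (no Solr classes). The class and method names are hypothetical, and the two hit lists stand in for the unique ids returned by the FR and EN cores. Note that it post-filters the EN hits instead of restricting the EN searchable set up front; the resulting sets are the same, but paging (offset/len) would behave differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Solr: keeps FR and EN results disjoint.
final class DisjointLanguageSearch {

    /** The two result lists, kept separate (they are never merged by score). */
    static final class Result {
        final List<String> primary;   // ids found by the FR query
        final List<String> fallback;  // ids found by the EN query only

        Result(List<String> primary, List<String> fallback) {
            this.primary = primary;
            this.fallback = fallback;
        }
    }

    /**
     * Drops from the EN hits every id already returned by the FR query,
     * preserving the original order of both lists.
     */
    static Result search(List<String> frHits, List<String> enHits) {
        Set<String> seen = new HashSet<>(frHits);
        List<String> fallback = new ArrayList<>();
        for (String id : enHits) {
            if (!seen.contains(id)) {
                fallback.add(id);
            }
        }
        return new Result(new ArrayList<>(frHits), fallback);
    }
}
```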

1/ Anyone with the same kind of functional requirements? Is using multiple
cores a bad idea for this need ?

On the practical side, this led me to a handler that needs to restrict the
document set through an externally defined list of Solr unique ids (we also
need to deal with some upfront ACL management to top it all).
However, I'm missing a small method that would nicely complete the
SolrIndexSearcher.getDocList* family.

  public DocList getDocList(Query query, DocSet filter, Sort lsort,
                            int offset, int len, int flags) throws IOException {
    DocListAndSet answer = new DocListAndSet();
    getDocListC(answer,query,null,filter,lsort,offset,len,flags);
    return answer.docList;
  }

I intend to use this after intersecting any filter queries with the
restricted document set in the request handler; the Query filter version of
the method is already exposed, and this would be the DocSet version of it.
2/ Any reason not to do this? {Sh,C}ould this method be included, or should
I create an enhancement request?
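
The intersection step described above can be sketched as follows. This is only an illustration: java.util.BitSet stands in for a Solr DocSet so that the snippet is self-contained, and the class and method names are hypothetical. In the handler itself the same thing would presumably go through DocSet.intersection(DocSet).

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical helper; BitSet is a stand-in for DocSet (one bit per internal doc id).
final class FilterIntersection {

    /**
     * Returns the documents that are in the restricted set AND in every
     * filter set. The inputs are left unmodified.
     */
    static BitSet intersectAll(BitSet restricted, List<BitSet> filters) {
        BitSet result = (BitSet) restricted.clone();
        for (BitSet filter : filters) {
            result.and(filter);
        }
        return result;
    }
}
```

The resulting set is what would be passed as the DocSet filter argument of the proposed getDocList variant.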

My current idea to create the DocSet from the document ids is the following:

DocSet keyFilter(org.apache.lucene.index.IndexReader reader,
                 String keyField,
                 java.util.Iterator<String> ikeys) throws java.io.IOException {
    // One bit per internal Lucene document id.
    org.apache.solr.util.OpenBitSet bits =
        new org.apache.solr.util.OpenBitSet(reader.maxDoc());
    if (ikeys.hasNext()) {
        // Look up the first key; the same TermDocs is reused for the others.
        org.apache.lucene.index.Term term =
            new org.apache.lucene.index.Term(keyField, ikeys.next());
        org.apache.lucene.index.TermDocs termDocs = reader.termDocs(term);
        try {
            // A unique id matches at most one document, so one next() is enough.
            if (termDocs.next())
                bits.fastSet(termDocs.doc());
            while (ikeys.hasNext()) {
                // createTerm() keeps the field of the first term.
                termDocs.seek(term.createTerm(ikeys.next()));
                if (termDocs.next())
                    bits.fastSet(termDocs.doc());
            }
        } finally {
            termDocs.close();
        }
    }
    return new org.apache.solr.search.BitDocSet(bits);
}

3/ Any better/faster way to create a DocSet from a list of unique ids?
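
One variation that might be worth measuring (an untested assumption, not a known improvement) is to sort and de-duplicate the keys first, so that lookups walk the sorted term dictionary in one direction. The sketch below only illustrates the idea in plain Java: a SortedMap from unique id to doc id stands in for the term dictionary, BitSet stands in for OpenBitSet, and all names are hypothetical.

```java
import java.util.BitSet;
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeSet;

// Hypothetical stand-in for the keyFilter idea, without Lucene/Solr classes.
final class SortedKeyFilter {

    /**
     * Builds a bit set of doc ids for the given unique keys.
     * Keys are visited in sorted order and duplicates are skipped;
     * keys that are not in the dictionary are ignored.
     */
    static BitSet keyFilter(SortedMap<String, Integer> termDict,
                            int maxDoc,
                            Collection<String> keys) {
        BitSet bits = new BitSet(maxDoc);
        for (String key : new TreeSet<>(keys)) {
            Integer doc = termDict.get(key);
            if (doc != null) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```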

Comments & questions welcome.
Thanks


-- 
View this message in context: http://www.nabble.com/query-handling---multiple-languages---multiple-cores-tf4646246.html#a13272209
Sent from the Solr - Dev mailing list archive at Nabble.com.

Re: query handling / multiple languages / multiple cores

Posted by Henrib <hb...@gmail.com>.

I agree, the 2 document sets are not (should not be) mixed together; you get
a list of FR docs and a list of EN docs (each list can be sorted by
relevance).
However, not being able to compare result scores across different queries
is something a lot of people cannot (or don't want to) understand or hear.
Some will even argue that this *is* the same query, that the index data
obviously takes the language factor into account, and that normalizing by the
highest score of the 2 lists is "ok"... And they'll add "the formula might
be a little off but end-users like the result"...
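
For the record, here is roughly what that "normalization" amounts to, as a plain-Java sketch with hypothetical names. It assumes the reading where each list is divided by its own top score before merging. It shows the practice being argued for, not a recommendation: the normalized values still come from different queries against different indexes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of per-list max-score normalization.
final class ScoreNormalization {

    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Divides every score by the highest score of the same list. */
    static List<Hit> normalize(List<Hit> hits) {
        float max = 0f;
        for (Hit h : hits) {
            max = Math.max(max, h.score);
        }
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(new Hit(h.id, max > 0f ? h.score / max : 0f));
        }
        return out;
    }

    /** Normalizes both lists, then merges them by descending score (stable on ties). */
    static List<Hit> mergeNormalized(List<Hit> fr, List<Hit> en) {
        List<Hit> merged = new ArrayList<>(normalize(fr));
        merged.addAll(normalize(en));
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }
}
```

With this scheme the top hit of each list always ends up with a score of 1.0, whatever its raw score was, which is exactly why the merged ranking is debatable.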
Cheers

Daniel Naber-10 wrote:
> 
> On Thursday 18 October 2007 14:16, Henrib wrote:
> 
>> 1/ Anyone with the same kind of functional requirements? Is using
>> multiple cores a bad idea for this need ?
> 
> Are documents sorted by relevance? Then this approach is problematic as
> you 
> cannot compare the result's score across different queries.
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de
> 
> 


Re: query handling / multiple languages / multiple cores

Posted by Daniel Naber <lu...@danielnaber.de>.

On Thursday 18 October 2007 14:16, Henrib wrote:

> 1/ Anyone with the same kind of functional requirements? Is using
> multiple cores a bad idea for this need ?

Are documents sorted by relevance? Then this approach is problematic as you 
cannot compare the result's score across different queries.

Regards
 Daniel

-- 
http://www.danielnaber.de