
Posted to java-user@lucene.apache.org by Melanie Langlois <Me...@tradingscreen.com> on 2007/04/10 10:55:44 UTC

distinct results

Hi,

I'm indexing documents, some of which are provided in several languages. Thanks to the participants on this mailing list, I know that I have two choices for indexing these multiple versions of a document: either I create language-specific fields, or I index each translation as a separate document with an added language field.

I chose the second solution for two reasons. First, translated documents will not be the majority of the documents I need to index. Second, at search time, if I don't want to restrict the search to one language, solution one gives me a query with potentially a lot of fields to cover all languages. The second option also makes it faster to filter the results by language, if one is specified.

However, with this solution, when the query is not filtered by language and the user searches on fields common to all languages, such as author, I get as many results as there are translations. I'm wondering if there is a way to have a "distinct filter". For instance, the translations of one document share a common "docId" field, and I don't want two documents with the same "docId" in my results.

Also, even if the user hasn't put a restriction on language, I want to return the results in the user's default language when it is available, but I don't want to use a filter query, because I don't want to restrict the search to that language only.

So basically, if the user's default language is English and an English translation of a matching document exists, only that translation should be returned; otherwise, the first available translation of the document should be returned.

Any hints on how I could do this?

Thanks,

Mélanie

Re: index the whole plain text file's content

Posted by karl wettin <ka...@gmail.com>.

On 10 Apr 2007, at 17:58, Chen Li wrote:

> What is interesting is that, for some larger files (around 500 KB),
> only query terms near the top of the file are searchable. Once a
> term is at the end of the file, or past some unknown point in it, I
> can't find it with SearchFiles.java, which also came with the demo
> code.
>
> I even tried converting the file to a String and indexing it with
> Store.YES, but no luck: the same result set was returned.
>
> Has anybody had the same experience and could share a solution? I
> would really appreciate it.

Could it perhaps be this:

<http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)>

?

-- 
karl
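
For context: in Lucene releases of this period, IndexWriter indexes only the first 10,000 terms of a field by default, which would explain why terms late in a 500 KB file cannot be found. Calling the linked method on the IndexWriter before adding documents, for example writer.setMaxFieldLength(Integer.MAX_VALUE), should lift that cap. The sketch below is a toy model of the effect in plain Java, not Lucene code; the class and method names are made up for illustration, and whitespace splitting stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a per-field term cap. This is NOT Lucene code; it only
// illustrates why terms past the cap become unsearchable.
final class FieldLengthCapDemo {

    // Returns the tokens that would be kept if at most maxFieldLength
    // tokens are indexed for a field (hypothetical helper).
    static List<String> indexedTokens(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (kept.size() >= maxFieldLength) {
                break; // everything past the cap is silently dropped
            }
            if (!token.isEmpty()) {
                kept.add(token.toLowerCase());
            }
        }
        return kept;
    }

    // A term is "searchable" only if it survived the cap.
    static boolean searchable(String text, String term, int maxFieldLength) {
        return indexedTokens(text, maxFieldLength).contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        // Build a long text: "alpha", 20000 filler words, then "omega".
        StringBuilder sb = new StringBuilder("alpha");
        for (int i = 0; i < 20000; i++) {
            sb.append(" filler");
        }
        sb.append(" omega");
        String text = sb.toString();

        System.out.println(searchable(text, "alpha", 10000));             // true
        System.out.println(searchable(text, "omega", 10000));             // false
        System.out.println(searchable(text, "omega", Integer.MAX_VALUE)); // true
    }
}
```

Note that an uncapped field length means very large files are tokenized in full, so indexing time and memory use can grow accordingly.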

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

index the whole plain text file's content

Posted by Chen Li <ch...@ebi.ac.uk>.

Hello,

I used the demo code (IndexFiles.java) from Lucene to index around 100 text
files.

doc.add(new Field("contents", new FileReader(f)));

What is interesting is that, for some larger files (around 500 KB), only
query terms near the top of the file are searchable. Once a term is at
the end of the file, or past some unknown point in it, I can't find it
with SearchFiles.java, which also came with the demo code.

I even tried converting the file to a String and indexing it with
Store.YES, but no luck: the same result set was returned.

Has anybody had the same experience and could share a solution? I would
really appreciate it.

Cheers,
Chen


Re: distinct results

Posted by Doron Cohen <DO...@il.ibm.com>.

> > I'm indexing documents, some of which are provided in several
> > languages.   ...   either I create language-specific fields, or I
> > index each translation as a separate document with an added
> > language field.
> >
> > I chose the second solution for two reasons. First, translated
> > documents will not be the majority of the documents I need to
> > index. Second, at search time, if I don't want to restrict the
> > search to one language, solution one gives me a query with
> > potentially a lot of fields to cover all languages. The second
> > option also makes it faster to filter the results by language, if
> > one is specified.
> >
> > However, with this solution, when the query is not filtered by
> > language and the user searches on fields common to all languages,
> > such as author, I get as many results as there are translations.

If space can be afforded, perhaps a simple setup is: one Lucene doc per
"page", with N+1 fields: one for each existing translation of the page,
and an additional ALL field == union of all the translations of the page.
Then, if only language L is requested, search in field L only; if there is
no language specification and the user locale is unknown, search in the ALL
field; and if there is no language specification but the user locale is
known to be L, search in both L and ALL, optionally boosting the L part of
the query.

HTH, Doron
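
The three cases above can be sketched as a small field-selection routine. This is a minimal illustration in plain Java rather than Lucene code: the class and method names are hypothetical, language codes such as "en" are assumed to double as field names, and the rendered string assumes the classic query parser syntax (field:term^boost) with a single, already-escaped search term.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: picks the fields to search and their boosts.
final class LanguageFieldChooser {

    // Name of the union field from the suggestion above.
    static final String ALL_FIELD = "ALL";

    // requestedLang: language explicitly requested in the query, or null.
    // localeLang:    language of the user's locale, or null if unknown.
    // localeBoost:   boost for the locale-language field (an assumption).
    static Map<String, Float> fieldsToSearch(String requestedLang,
                                             String localeLang,
                                             float localeBoost) {
        Map<String, Float> fields = new LinkedHashMap<String, Float>();
        if (requestedLang != null) {
            fields.put(requestedLang, 1.0f);   // only language L requested
        } else if (localeLang == null) {
            fields.put(ALL_FIELD, 1.0f);       // no language, locale unknown
        } else {
            fields.put(localeLang, localeBoost); // no language, locale known
            fields.put(ALL_FIELD, 1.0f);
        }
        return fields;
    }

    // Renders the choice as space-separated field:term clauses,
    // adding ^boost only when the boost differs from 1.
    static String toQueryString(Map<String, Float> fields, String term) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(':').append(term);
            if (e.getValue() != 1.0f) {
                sb.append('^').append(e.getValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> fields = fieldsToSearch(null, "en", 2.0f);
        System.out.println(toQueryString(fields, "lucene")); // en:lucene^2.0 ALL:lucene
    }
}
```

Because each page is a single Lucene document in this layout, a match in either field returns the page once, which also sidesteps the duplicate-results problem.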



Re: distinct results

Posted by Erick Erickson <er...@gmail.com>.

You might get some good pointers by searching the mail archive for
"faceted search", or perhaps just "faceted". I vaguely remember that
the whole notion of sub-dividing result sets into bags of documents
was discussed under that heading, quite an extensive discussion
as I remember, and certainly not a term that jumps to mind <G>.

The other thing you might be able to do is combine a HitCollector with
a FieldSortedHitQueue. The idea here is to use a HitCollector to
gather the hits, and put the results in a FieldSortedHitQueue whose
comparator is sensitive to your unique doc ID (Not Lucene's id, but
the one it looks like you've assigned to your docs) and the user's
preferred language.

One caution about the second approach: you may slow your search
down dramatically if you go out and fetch each document to get
its ID and language. But if the fields are indexed, you can use TermDocs/
TermEnum to get them quickly.

Best
Erick
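
The de-duplication step of the second approach might look roughly like the sketch below. It is plain Java, not Lucene code: the Hit class and the distinct method are hypothetical stand-ins for whatever a HitCollector would gather, and it assumes each hit's docId and language are already available cheaply (for example, cached per Lucene document number rather than fetched from stored documents).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep one hit per application-level docId.
final class DistinctByDocId {

    // Stand-in for a collected hit; not a Lucene class.
    static final class Hit {
        final String docId;
        final String language;
        final float score;

        Hit(String docId, String language, float score) {
            this.docId = docId;
            this.language = language;
            this.score = score;
        }
    }

    // rankedHits must already be in the desired result order.
    // For each docId, keeps the hit in preferredLanguage if one matched,
    // otherwise the first hit seen for that docId.
    static List<Hit> distinct(List<Hit> rankedHits, String preferredLanguage) {
        Map<String, Hit> best = new LinkedHashMap<String, Hit>();
        for (Hit hit : rankedHits) {
            Hit current = best.get(hit.docId);
            if (current == null) {
                best.put(hit.docId, hit);
            } else if (!current.language.equals(preferredLanguage)
                    && hit.language.equals(preferredLanguage)) {
                // Replacing the value keeps the docId's original position.
                best.put(hit.docId, hit);
            }
        }
        return new ArrayList<Hit>(best.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("d1", "fr", 0.9f),
                new Hit("d1", "en", 0.8f),
                new Hit("d2", "de", 0.7f));
        for (Hit h : distinct(hits, "en")) {
            System.out.println(h.docId + " " + h.language);
        }
    }
}
```

One limitation to keep in mind: this only chooses among translations that actually matched the query, so a preferred-language version that did not match is never substituted in.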
