Posted to java-user@lucene.apache.org by oramas martín <jl...@gmail.com> on 2007/02/26 00:56:19 UTC

A solution to HitCollector-based searches problems

Hello,

As you probably know, the HitCollector-based search API is not meant to work
remotely, because it generates an RPC callback for every non-zero score.
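
To make that cost concrete, here is a minimal sketch in plain Java. The
HitCollector class below is only a stand-in that mirrors the shape of the
collect(int doc, float score) callback; CountingCollector, the simulated
scores and the search() helper are assumptions made for illustration, not
Lucene or lu-collector code.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in that only mirrors the shape of the HitCollector callback. */
abstract class HitCollector {
    abstract void collect(int doc, float score);
}

/** Hypothetical collector that counts how often it is called back. */
class CountingCollector extends HitCollector {
    int callbacks = 0;
    final List<Integer> docs = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        callbacks++;      // with a remote collector, each call is one RPC callback
        docs.add(doc);
    }
}

class CallbackCostDemo {
    /** Simulates a searcher that reports every document with a non-zero score. */
    static void search(float[] scores, HitCollector collector) {
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] > 0.0f) {
                collector.collect(doc, scores[doc]);
            }
        }
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.0f, 0.4f, 0.7f, 0.0f, 0.1f};
        CountingCollector collector = new CountingCollector();
        search(scores, collector);
        System.out.println("callbacks = " + collector.callbacks);
    }
}
```

The number of callbacks grows with the number of matching documents, not
with the number of results the caller finally keeps.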

There is another problem with HitCollector-based searches through
MultiSearcher: it knows nothing about how to merge HitCollector-based
searches (not to mention that it hardcodes the way TopDocs are merged for
the score-based and the Sort-based searches). ParallelMultiSearcher inherits
these problems and is unable to parallelize HitCollector-based searches.
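
As a rough illustration of what a non-hardcoded merge could look like, the
sketch below merges per-subsearcher hit lists with a caller-supplied
ordering. ScoredHit, SubSearcherMerge, docBase and BY_SCORE are hypothetical
names invented for this example; this is not how MultiSearcher or
lu-collector is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical value type: a document id plus its score. */
class ScoredHit {
    final int doc;
    final float score;
    ScoredHit(int doc, float score) { this.doc = doc; this.score = score; }
}

class SubSearcherMerge {
    /** Highest score first; ties broken by ascending document id. */
    static final Comparator<ScoredHit> BY_SCORE =
        Comparator.<ScoredHit>comparingDouble(h -> -h.score).thenComparingInt(h -> h.doc);

    /**
     * Merges the hits of several sub-searchers. docBase[i] is the offset added
     * to the local ids of sub-searcher i so that ids stay unique after merging.
     */
    static List<ScoredHit> merge(List<List<ScoredHit>> perSearcher, int[] docBase,
                                 Comparator<ScoredHit> order, int limit) {
        List<ScoredHit> all = new ArrayList<>();
        for (int i = 0; i < perSearcher.size(); i++) {
            for (ScoredHit h : perSearcher.get(i)) {
                all.add(new ScoredHit(h.doc + docBase[i], h.score));
            }
        }
        all.sort(order);   // the merge policy is a parameter, not hardcoded
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }
}
```

Passing a different Comparator changes how the sub-results are mixed without
touching the merge loop.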

A final problem with HitCollector-based search is the loss of a limit on
the results, which the Hits class implements through its getMoreDocs()
function, together with the lazy loading and caching of documents that Hits
does.
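
One way to get a limit back on the collector side is to keep only the best N
hits while collecting. The sketch below does that with a min-heap;
TopNCollector and its methods are hypothetical names for illustration, not
part of Lucene or lu-collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical collector that keeps only the `limit` best-scoring documents. */
class TopNCollector {
    private static final class Entry {
        final int doc;
        final float score;
        Entry(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int limit;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<Entry> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopNCollector(int limit) {
        if (limit <= 0) throw new IllegalArgumentException("limit must be positive");
        this.limit = limit;
    }

    /** Same shape as a HitCollector callback: one call per matching document. */
    void collect(int doc, float score) {
        if (heap.size() < limit) {
            heap.add(new Entry(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();                       // evict the current weakest hit
            heap.add(new Entry(doc, score));
        }
    }

    /** Returns the retained document ids, best score first. */
    List<Integer> topDocs() {
        List<Entry> sorted = new ArrayList<>(heap);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        List<Integer> docs = new ArrayList<>();
        for (Entry e : sorted) docs.add(e.doc);
        return docs;
    }
}
```

Memory stays bounded by the limit regardless of how many documents match.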


Solving those problems requires a factory (HitCollectorSource) able to
generate collectors for single (SingleHitCollector) and multi
(MultiHitCollector) searches, and a new search method in the Searchable
interface that accepts it. To avoid modifying the Lucene core, the latter
requirement is NOT IMPLEMENTED in the library I have just created. Instead,
an ugly workaround is provided: a wrapper for those searchers
(SearcherHCSourceWrapper) and a Filter wrapper (FilterHitCollectorSource)
that carries the factory-based searches.

Each collector works in two steps: one for collecting hits or subsearcher
results, and another for generating the final result.
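
The factory and the two-step collectors can be pictured roughly as follows.
The three interface names are borrowed from this message, but every method
signature below is a guess made for illustration and may not match the real
lu-collector API; CountSource and TwoStepDemo are hypothetical.

```java
/** Step 1 for a single index: collect hits. Step 2: produce the result. */
interface SingleHitCollector<R> {
    void collect(int doc, float score);   // assumed signature
    R result();                           // assumed signature
}

/** Step 1 for a multi-searcher: collect sub-searcher results. Step 2: merge. */
interface MultiHitCollector<R> {
    void collectSubResult(R partial);     // assumed signature
    R result();                           // assumed signature
}

/** Factory that creates both kinds of collectors for one search operation. */
interface HitCollectorSource<R> {
    SingleHitCollector<R> newSingleCollector();   // assumed signature
    MultiHitCollector<R> newMultiCollector();     // assumed signature
}

/** Example operation: count the matching documents across sub-indexes. */
class CountSource implements HitCollectorSource<Integer> {
    public SingleHitCollector<Integer> newSingleCollector() {
        return new SingleHitCollector<Integer>() {
            private int count = 0;
            public void collect(int doc, float score) { count++; }
            public Integer result() { return count; }
        };
    }

    public MultiHitCollector<Integer> newMultiCollector() {
        return new MultiHitCollector<Integer>() {
            private int total = 0;
            public void collectSubResult(Integer partial) { total += partial; }
            public Integer result() { return total; }
        };
    }
}

class TwoStepDemo {
    /** Runs one single collector per sub-index, then merges the partial results. */
    static <R> R searchAll(float[][] subIndexScores, HitCollectorSource<R> source) {
        MultiHitCollector<R> multi = source.newMultiCollector();
        for (float[] scores : subIndexScores) {
            SingleHitCollector<R> single = source.newSingleCollector();
            for (int doc = 0; doc < scores.length; doc++) {
                if (scores[doc] > 0.0f) single.collect(doc, scores[doc]);
            }
            multi.collectSubResult(single.result());  // one result per sub-index
        }
        return multi.result();
    }
}
```

In this shape only one partial result per sub-searcher has to cross the
searcher boundary, and the single collectors are independent of each other,
which is what would let them run remotely or in parallel.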

Also, in case you don't want to add a wrapper to each searcher in your
project, there are adapted versions of IndexSearcher, MultiSearcher and
ParallelMultiSearcher (only for version 2.1), modified in exactly the same
way as the wrapper class SearcherHCSourceWrapper does it. Just put them in
your classpath (before the Lucene core jar) and you will be using the new
collector interfaces without modifying your code.

There are some unit tests (copied and adapted from the Lucene 2.1
distribution).

See http://sourceforge.net/projects/lucollector/ for the jar files and the
code.

If you think it would be a useful complement to the Lucene project, tell me
how to put it in the contribution area.

Regards,
José L. Oramas

Re: A solution to HitCollector-based searches problems

Posted by oramas martín <jl...@gmail.com>.
Hello Mohammad,

You don't have to remove the lucene-core jar file; just adjust the order in
which the jar files appear in the classpath.

You have two possibilities:

    1.- Use the searcher wrapper (valid for Lucene 2.0 or 2.1): Put
"lu-collector-0.8.jar" AFTER "lucene-core-2.1.0.jar" in the classpath and
wrap any IndexSearcher, MultiSearcher or ParallelMultiSearcher with
SearcherHCSourceWrapper.

    2.- Use the modified Lucene core searcher classes (valid for Lucene 2.1):
Put "lu-collector-0.8.jar" BEFORE "lucene-core-2.1.0.jar" in the classpath,
and set "IndexSearcher.USE_NEW_COLLECTOR = true;" at the beginning of your
program.
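
If you are unsure which copy of a class wins with a given classpath order,
a small check like the following can help. It uses only standard JDK calls;
ClassOrigin is a hypothetical helper name, and the class to inspect is passed
as an argument so the sketch does not depend on either jar.

```java
import java.net.URL;
import java.security.CodeSource;

class ClassOrigin {
    /** Returns the jar or directory a class was loaded from, or a note if unknown. */
    static String of(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";   // typical for JDK classes
        }
        URL location = source.getLocation();
        return location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass e.g. org.apache.lucene.search.IndexSearcher as the argument
        // when both jars are on the classpath.
        String name = args.length > 0 ? args[0] : ClassOrigin.class.getName();
        System.out.println(name + " <- " + of(Class.forName(name)));
    }
}
```

With option 1 the reported location for IndexSearcher should be the Lucene
core jar, and with option 2 it should be the lu-collector jar.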

If you take a look at the test files, you can see the usage of both methods.

Hope this helps,
José L. Oramas


On 3/11/07, Mohammad Norouzi <mn...@gmail.com> wrote:
>
> Hi Oramas
> if I use that jar file, it conflicts with lucene-core.jar file. for example,
> IndexSearcher class that you defined is different from the original one. Do
> I have to remove the lucene-core jar file?
> if yes, how about the other original classes

Re: A solution to HitCollector-based searches problems

Posted by Mohammad Norouzi <mn...@gmail.com>.
Hi Oramas,
If I use that jar file, it conflicts with the lucene-core.jar file. For
example, the IndexSearcher class that you defined is different from the
original one. Do I have to remove the lucene-core jar file?
If yes, what about the other original classes?

On 3/8/07, oramas martín <jl...@gmail.com> wrote:
>
> Hello,
>
> I have just added some sample search implementations based on this collector
> solution, to ease its use and understanding:
>
>     - KeywordSearch: Extracts the terms (and their frequencies) found in a
>                      list of fields from the results of a query/filter search.
>
>     - GoogleSearch: Returns an ordered search result grouped a la Google,
>                     based on the terms found in a list of fields.
>
>     - GetFieldNamesOp: Operation that mimics the getFieldNames method of
>                        IndexReader but uses a searcher. With it, it is
>                        possible to explore the fields of remote indexes.
>
> See http://sourceforge.net/projects/lucollector/ for the source code
> (lu-collector-src-sampleop-0.8.zip).
>
> Regards,
> José L. Oramas



-- 
Regards,
Mohammad

Re: A solution to HitCollector-based searches problems

Posted by oramas martín <jl...@gmail.com>.
Hello,

I have just added some sample search implementations based on this collector
solution, to ease its use and understanding:

    - KeywordSearch: Extracts the terms (and their frequencies) found in a
                     list of fields from the results of a query/filter search.

    - GoogleSearch: Returns an ordered search result grouped a la Google,
                    based on the terms found in a list of fields.

    - GetFieldNamesOp: Operation that mimics the getFieldNames method of
                       IndexReader but uses a searcher. With it, it is
                       possible to explore the fields of remote indexes.

See http://sourceforge.net/projects/lucollector/ for the source code
(lu-collector-src-sampleop-0.8.zip).
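
To give an idea of the kind of aggregation KeywordSearch performs, here is a
simplified sketch in plain Java. It is not the sample's actual code:
KeywordCountSketch is a hypothetical name, documents are modeled as
field-name to text maps, and splitting on whitespace is an assumption of the
example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class KeywordCountSketch {
    /**
     * Counts how often each whitespace-separated, lower-cased term occurs in
     * the given fields of the matching documents.
     */
    static Map<String, Integer> termFrequencies(List<Map<String, String>> hits,
                                                List<String> fields) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : hits) {
            for (String field : fields) {
                String text = doc.get(field);
                if (text == null) continue;          // field absent in this hit
                for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                    if (!term.isEmpty()) counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Only the fields named in the list contribute terms; other fields of a hit
are ignored.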

Regards,
José L. Oramas


Re: A solution to HitCollector-based searches problems

Posted by Grant Ingersoll <gs...@apache.org>.
However, see Wiki HowToContribute:
http://wiki.apache.org/jakarta-lucene/HowToContribute if you wish to donate
your code.

-Grant

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


