
Posted to java-user@lucene.apache.org by Antony Sequeira <an...@gmail.com> on 2005/03/30 02:01:16 UTC

pre computing possible search results narrowing and hit counts on those

Hi,
I have the requirement described in the subject line, and I could not
find a good way to implement it.
I think the best way to explain my problem is to give an example.

I have documents where each document represents a real estate property
for sale in the US.
So, each document would have a city associated with it.
(We index the document and maybe index a city field.)

A user does a search for, say, "condominium", and I show him the 50,000
properties that meet that description.

I need two other pieces of information for display -
1. I want to show a "select" box on the UI, which contains all the
cities that appear in those 50,000 documents
2. Against each city I want to show the count of matching documents.

For example the drop down might look like
"Los Angeles"  10000
"San Francisco" 5000

(But, I do not want to show "San Jose" if none of the 50,000 documents
contain it)
Now, the user will be able to narrow down the search using one of the
selections (which I can turn into a boolean query).

My problem is, I do not know how to generate that 'select' list
without having to actually access each of those  50,000 documents.

Thanks,
-Antony

P.S.: The above app is fictional, since my employer won't like it if I
expose the actual stuff I am working on.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: pre computing possible search results narrowing and hit counts on those

Posted by Antony Sequeira <an...@gmail.com>.

On Wed, 30 Mar 2005 09:42:32 -0800, Doug Cutting <cu...@apache.org> wrote:
> Antony Sequeira wrote:
> > A user does a search for say "condominium", and i show him the 50,000
> > properties that meet that description.
> >
> > I need two other pieces of information for display -
> > 1. I want to show a "select" box on the UI, which contains all the
> > cities that appear in those 50,000 documents
> > 2. Against each city I want to show the count of matching documents.
> >
> > For example the drop down might look like
> > "Los Angeles"  10000
> > "San Francisco" 5000
> >
> > (But, I do not want to show "San Jose" if none of the 50,000 documents
> > contain it)
> 
> You can use the FieldCache & HitCollector:
> 
> private class Count { int value; }
> 
> // one city per document, indexed by document number
> final String[] docToCity = FieldCache.DEFAULT.getStrings(indexReader, "city");
> final Map cityToCount = new HashMap();
> 
> searcher.search(query, new HitCollector() {
>    public void collect(int doc, float score) {
>      String city = docToCity[doc];
>      Count count = (Count) cityToCount.get(city);
>      if (count == null) {
>        count = new Count();
>        cityToCount.put(city, count);
>      }
>      count.value++;
>    }
> });
> 
> // sort & display entries in cityToCount
> 
> Doug
> 
Based on a previous reply, I went through the javadocs and came up with

 public class PreFilterCollector extends HitCollector {
        final BitVector bits = new BitVector(reader.maxDoc());
        java.util.HashMap<String,Integer> statemap = new java.util.HashMap<String,Integer>();

        public void collect(int id, float score) {
            bits.set(id);
        }

        public java.util.HashMap<String,Integer> getStateCounts() {
            try {
                int k = bits.size();
                int j = 0;
                for (int i =0; i < k; i++) {
                    if (!bits.get(i))
                        continue;
                    Document doc = reader.document(i); 
                    j++;
                    String state = doc.get("state"); // we assume one state for now
                    if (statemap.containsKey(state)) {
                        statemap.put(state,statemap.get(state) + 1); 
                    } else {
                        statemap.put(state,1);
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            return statemap;
        }
  }

But I have the following questions:
1. My code first collects all the doc ids and then iterates over them
to collect field info. I did this because
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html
says "This is called in an inner search loop. For good search
performance, implementations of this method should not call
Searchable.doc(int) or IndexReader.document(int) on every document
number encountered".
Have I misunderstood this, or am I doing it wrong?

2. Would your code be faster, and if so under what circumstances?

3. One problem I see with my current solution is that it accesses
every doc of the result set.
One of the previous responses pointed to a solution in
http://www.mail-archive.com/java-dev@lucene.apache.org/msg00034.html
After reading it, it looked to me like that solution won't be any
better. (It looks like it walks the values of terms that do not even
occur in the current search result set.) Have I got this right?


I am a newbie to Lucene. Thanks for all the replies. I appreciate them very much.

-Antony


Re: pre computing possible search results narrowing and hit counts on those

Posted by Doug Cutting <cu...@apache.org>.

Antony Sequeira wrote:
> A user does a search for say "condominium", and i show him the 50,000
> properties that meet that description.
> 
> I need two other pieces of information for display -
> 1. I want to show a "select" box on the UI, which contains all the
> cities that appear in those 50,000 documents
> 2. Against each city I want to show the count of matching documents.
> 
> For example the drop down might look like
> "Los Angeles"  10000
> "San Francisco" 5000
> 
> (But, I do not want to show "San Jose" if none of the 50,000 documents
> contain it)

You can use the FieldCache & HitCollector:

private class Count { int value; }

// one city per document, indexed by document number
final String[] docToCity = FieldCache.DEFAULT.getStrings(indexReader, "city");
final Map cityToCount = new HashMap();

searcher.search(query, new HitCollector() {
   public void collect(int doc, float score) {
     String city = docToCity[doc];
     Count count = (Count) cityToCount.get(city);
     if (count == null) {
       count = new Count();
       cityToCount.put(city, count);
     }
     count.value++;
   }
});

// sort & display entries in cityToCount

Doug
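
[Editorial sketch, not part of the original message.] The counting and
the final "sort & display" step can be tried without an index. The
following is only a minimal illustration under stated assumptions: it
does not call Lucene at all, the docToCity array stands in for what the
FieldCache would return, the matchingDocs array stands in for the doc
ids passed to collect(), and the names CityCounts, countByCity and
sortByCountDescending are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CityCounts {

    // Tally how many matching documents fall in each city.
    // docToCity[doc] is the city of document number doc (assumed stand-in
    // for the per-document field values a field cache would supply).
    static Map<String, Integer> countByCity(String[] docToCity, int[] matchingDocs) {
        Map<String, Integer> cityToCount = new HashMap<String, Integer>();
        for (int doc : matchingDocs) {
            String city = docToCity[doc];
            Integer count = cityToCount.get(city);
            cityToCount.put(city, count == null ? 1 : count + 1);
        }
        return cityToCount;
    }

    // Order entries by count, largest first; ties are broken by city name.
    static List<Map.Entry<String, Integer>> sortByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        String[] docToCity = {
            "Los Angeles", "San Jose", "San Francisco",
            "Los Angeles", "San Francisco", "Los Angeles"
        };
        int[] matchingDocs = {0, 2, 3, 5}; // hypothetical hits for one query
        for (Map.Entry<String, Integer> e : sortByCountDescending(countByCity(docToCity, matchingDocs))) {
            System.out.println("\"" + e.getKey() + "\" " + e.getValue());
        }
    }
}
```

Only cities that occur among the hits end up in the map, so "San Jose"
(document 1, not a hit) is never listed, and the work done is
proportional to the number of hits.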


Re: pre computing possible search results narrowing and hit counts on those

Posted by Chris Hostetter <ho...@fucit.org>.

: I need two other pieces of information for display -
: 1. I want to show a "select" box on the UI, which contains all the
: cities that appear in those 50,000 documents
: 2. Against each city I want to show the count of matching documents.

: My problem is, I do not know how to generate that 'select' list
: without having to actually access each of those  50,000 documents.

the straightforward way to do this is to use a TermEnum to get a
list of all the "cities" in your collection, and then for each one
construct a Filter.  you can then either issue every search N+1 times
(once with no filter for real results, and once with each filter to get
the counts) or you can use the Filter.bits(IndexReader) method directly
with each Filter, and compute the AND of a clone of each with the BitSet
generated using a HitCollector when you do your search -- this assumes you
have access to the IndexReader.  in both cases you can improve performance
by using CachingWrapperFilter.

If your list of values for that field is too vast to generate, then you
might want to take a look at this thread, which i have not read in its
entirety, but what i did read led me to believe it's the exact same
problem you describe...

http://www.mail-archive.com/java-dev@lucene.apache.org/msg00034.html

-Hoss
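
[Editorial sketch, not part of the original message.] The second
variant above (AND a clone of each city's bits with the bits collected
for the search) can be illustrated with plain java.util.BitSet. This is
a sketch under assumptions, not Lucene code: the per-city BitSets stand
in for what Filter.bits(IndexReader) would return for each city filter,
the hits BitSet stands in for the one filled by a HitCollector, and the
names BitSetFacets and countPerCity are invented for the example.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class BitSetFacets {

    // For each city, intersect its document bits with the search hits
    // and keep the city only if the intersection is non-empty.
    static Map<String, Integer> countPerCity(Map<String, BitSet> cityBits, BitSet hits) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : cityBits.entrySet()) {
            // Clone first so the (possibly cached) per-city bits are not modified.
            BitSet overlap = (BitSet) e.getValue().clone();
            overlap.and(hits);
            int n = overlap.cardinality();
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }

    // Helper for building small example bit sets.
    static BitSet bitsOf(int... docs) {
        BitSet b = new BitSet();
        for (int d : docs) {
            b.set(d);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> cityBits = new LinkedHashMap<String, BitSet>();
        cityBits.put("Los Angeles", bitsOf(0, 3, 5));
        cityBits.put("San Francisco", bitsOf(2, 4));
        cityBits.put("San Jose", bitsOf(1));

        BitSet hits = bitsOf(0, 2, 3, 5); // hypothetical hits for one query
        System.out.println(countPerCity(cityBits, hits));
    }
}
```

Note that the loop performs one AND per distinct city, including cities
with no hits at all, so the cost grows with the number of distinct
field values rather than with the number of hits.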

