Posted to java-user@lucene.apache.org by JMA <mr...@comcast.net> on 2005/09/15 11:00:14 UTC

Terms given a filter?

Greetings -

I know I can get all the fields in an index: reader.getFieldNames()
and also all the terms:  reader.terms()

However, I need to be able to get all the terms and fields given a search
filter. For example, say I have an index that has crawled 5000 pdf files
(books) and I have the following fields:

content, author (not tokenized), and publish_date

I can easily find all the *distinct* authors in the index using
'reader.terms()'.  But say I want to list all the *distinct* authors that
have published books in 2002?  I can do a simple search to get all the books
filtered by publish_date:2002.  But then I have to do my own scan of the
results and pull out the author, removing duplicates.

Is there an easier way to do this?

Thanks in advance!

JMA





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


HitCollector with RemoteSearchable

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello,

Can I use HitCollector with RemoteSearchable?

I am trying to use it, but I got the following error:

java.rmi.MarshalException: error marshalling arguments; nested exception is: 
        java.io.NotSerializableException: org.apache.lucene.search.MultiSearcher$1
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:129)
        at com.nannet.fulcrum.lucene.util.RefinedRemoteSearchable_Stub.search(Unknown Source)
        at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:168)
        at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:168)
        at org.apache.lucene.search.Searcher.search(Searcher.java:67)
....
....
Caused by: java.io.NotSerializableException: org.apache.lucene.search.MultiSearcher$1
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1054)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:278)
        at sun.rmi.server.UnicastRef.marshalValue(UnicastRef.java:265)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:124)
        ... 160 more

Thanks,

Youngho

Re: Terms given a filter?

Posted by mark harwood <ma...@yahoo.co.uk>.
Erik,
It may be worth looking at the code here:

http://issues.apache.org/jira/browse/LUCENE-328

The BitSets in your example are likely to be very
sparse (I imagine you know only too well how long it
takes to write a book and therefore how many books
there are likely to be per author! :)). With such a
sparse set per author, BitSets could use a lot of
memory. In this example I imagine a SortedVIntList per
author would be a much more compact format.
The code in the link contains a standard interface for
a sorted list of ints, with BitSet, int array and VInt
encoded implementations. The AndDocNrSkipper and
OrDocNrSkipper classes can be used to perform set
intersections on any combination of these int sets.
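
To illustrate the sparse-set idea, here is a minimal sketch in plain Java.
It is not the LUCENE-328 code: SortedDocList and intersect() are
hypothetical names, and each list is assumed to hold distinct doc numbers.
Each author keeps only the doc numbers of his or her books, and the
"books in 2002" set is intersected with a two-pointer merge.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not the API from LUCENE-328.
final class SortedDocList {
    private final int[] docs; // kept in ascending order; assumed distinct

    SortedDocList(int... docs) {
        this.docs = docs.clone();
        Arrays.sort(this.docs);
    }

    /** Two-pointer intersection of two sorted doc number lists. */
    int[] intersect(SortedDocList other) {
        int[] out = new int[Math.min(docs.length, other.docs.length)];
        int i = 0, j = 0, n = 0;
        while (i < docs.length && j < other.docs.length) {
            if (docs[i] < other.docs[j]) {
                i++;                    // advance whichever side is behind
            } else if (docs[i] > other.docs[j]) {
                j++;
            } else {
                out[n++] = docs[i];     // doc number present in both lists
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

An author with three books costs three ints here, whereas a BitSet sized
for a 5000-document index costs roughly 625 bytes per author no matter how
few bits are set.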




Cheers,
Mark



		



Re: Terms given a filter?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 15, 2005, at 5:00 AM, JMA wrote:
> I know I can get all the fields in an index: reader.getFieldNames()
> and also all the terms:  reader.terms()
>
> However, I need to be able to get all the terms and fields given a  
> search
> filter. For example, say I have an index that has crawled 5000 pdf  
> files
> (books) and I have the following fields:
>
> content, author (not tokenized), and publish_date
>
> I can easily find all the *distinct* authors in the index using
> 'reader.terms()'.  But say I want to list all the *distinct*  
> authors that
> have published books in 2002?  I can do a simple search to get all  
> the books
> filtered by publish_date:2002.  But then I have to do my own scan  
> of the
> results and pull out the author, removing duplicates.
>
> Is there an easier way to do this?

I'm currently building a faceted navigation system (think Google for
nineteenth-century literature, except with browsing navigation by
author, date range, genre, and probably some others as it evolves).
This is very much like the CNET implementation that Chris detailed
here: http://www.lucenebook.com/blog/announcements/2005/08/31/cnet.html

My index is pretty static after it is built, so I cache a lot.  The  
first thing I do is walk all the unique terms (using reader.terms())  
for the faceted fields, and for each one I create a BitSet that has  
set bits corresponding to each document that has that term.  I allow  
the user to build up constraints while navigating with any number of  
these facets, and simply AND the BitSets together to find the  
matching documents.  I also allow for full-text search to occur  
within those constraints, and leverage QueryFilter.bits() in that  
case.  The BitSets allow me to display how many documents, based on
the constraints, are in each of the "buckets".

So more to your question - using the scheme I just described, you
could build up a BitSet for each of the authors.  Then build a BitSet for
2002 (this could be a simple QueryFilter with a
TermQuery("publish_date", "2002"), for example).  AND the 2002 BitSet
with each of the author BitSets, and any result with a cardinality > 0
means that author has documents in 2002.
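
A minimal sketch of that scheme using only java.util.BitSet follows.
AuthorFacets, add() and countsWithin() are hypothetical names; in a real
index the per-author BitSets would be filled by walking reader.terms() for
the author field, and the filter BitSet would come from something like
QueryFilter.bits().

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class AuthorFacets {
    // One cached BitSet per author, keyed in sorted author order.
    private final Map<String, BitSet> bitsByAuthor = new TreeMap<>();

    /** Records that the given document has the given author. */
    void add(String author, int docId) {
        bitsByAuthor.computeIfAbsent(author, a -> new BitSet()).set(docId);
    }

    /** Authors with at least one document in the filter, with hit counts. */
    Map<String, Integer> countsWithin(BitSet filter) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : bitsByAuthor.entrySet()) {
            // BitSet.and() modifies the receiver, so copy the cached set first.
            BitSet hits = (BitSet) e.getValue().clone();
            hits.and(filter);
            if (hits.cardinality() > 0) {
                counts.put(e.getKey(), hits.cardinality());
            }
        }
        return counts;
    }
}
```

If only the list of authors is needed and not the counts,
cached.intersects(filter) answers the same question without the copy.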

Make sense?

     Erik




Re: Cannot sort results with multisearcher when mismatched field names

Posted by Chris Hostetter <ho...@fucit.org>.
I've never dealt with MultiSearchers before, so I'm not sure what caveats
there are when doing Sorts with them, but you should be able to make your
own SortComparatorSource which knows about any special fields you have
that might go by multiple names. When it's requested to sort on one of
those fields, it can check the values for either field name before doing
the comparison.

For an example of writing your own SortComparatorSource, take a look at
this PATCH...

	http://issues.apache.org/jira/browse/LUCENE-406
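
The comparison logic itself might look like the sketch below. It is plain
Java and deliberately avoids the Lucene sort API: AliasedFieldComparator is
a hypothetical name, and documents are modelled as simple Maps, whereas a
real SortComparatorSource would read the field values from the index.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the logic a custom sort comparator would need.
final class AliasedFieldComparator implements Comparator<Map<String, String>> {
    private final List<String> aliases;

    AliasedFieldComparator(String... aliases) {
        this.aliases = Arrays.asList(aliases);
    }

    /** First non-null value among the aliases, or "" if none is present. */
    private String sortValue(Map<String, String> doc) {
        for (String name : aliases) {
            String value = doc.get(name);
            if (value != null) {
                return value;
            }
        }
        return "";
    }

    @Override
    public int compare(Map<String, String> a, Map<String, String> b) {
        // "author" and "writer" values end up in one combined ordering.
        return sortValue(a).compareTo(sortValue(b));
    }
}
```

Sorting a list of such maps with new AliasedFieldComparator("author",
"writer") interleaves documents from both indices by name, rather than
sorting by author first and then by writer.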



: Date: Mon, 26 Sep 2005 04:16:17 -0400
: From: JMA <mr...@comcast.net>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Cannot  sort results with multisearcher when mismatched field
:     names
:
:
: Greetings!
: I have a relatively simple problem:
: I want to sort a set of search results by a field, say "author".
: Fine for one index, or more than one if the field "author" is the same.
:
: However, say I want to use a multisearcher (2+ indices), but the second
: index uses field name "writer".
:
: If I set a sortField array, that sorts first by author THEN writer.
: That's not what I want - I want "author" and "writer" to be treated
: as if they are the same field.
:
: I could re-index using a common field name, but would rather not.
: I could do my own custom sort after I get back the hits, but would rather
: not.
:
: Ideas?
:
: Thanks,
: JMA
:
:



-Hoss




Cannot sort results with multisearcher when mismatched field names

Posted by JMA <mr...@comcast.net>.
Greetings!
I have a relatively simple problem:
I want to sort a set of search results by a field, say "author".
Fine for one index, or more than one if the field "author" is the same.

However, say I want to use a multisearcher (2+ indices), but the second
index uses field name "writer".

If I set a sortField array, that sorts first by author THEN writer.
That's not what I want - I want "author" and "writer" to be treated
as if they are the same field.

I could re-index using a common field name, but would rather not.
I could do my own custom sort after I get back the hits, but would rather
not.

Ideas?

Thanks,
JMA




Re: Terms given a filter?

Posted by mark harwood <ma...@yahoo.co.uk>.
This sounds like another "group by" totalling
question.

See the generic "group by" totalling code I posted
here:
http://marc.theaimsgroup.com/?l=lucene-dev&m=111044178212335&w=2

In your example there is no quality threshold (just a
filter bitset of "books in 2002"), so you can replace
the "scores" array in the code with a simple bitset
lookup. Also, you do not need the GroupKeyFactory,
which can be used to adjust term values (e.g. truncate
a 20050101 date value to 2005 for grouping by year).

The code works best when your group field (in your
case "author") doesn't have a large number of unique
values. It is fast because it uses TermDocs rather
than reading stored doc values. Reading stored fields
by calling reader.document() is often slow because ALL
of a doc's fields are read from disk, even if you only
want one of them. Not something to do in a tight loop.
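
As a rough sketch of the totalling loop (not the code from the linked
post): GroupTotals and total() are hypothetical names, the postings map
stands in for walking a field's terms and each term's TermDocs, and the
groupKey function plays the role described for GroupKeyFactory. It assumes
each document contributes at most one term per group.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch; plain collections replace the index structures.
final class GroupTotals {
    /**
     * @param postings term value to the doc numbers containing that term
     * @param filter   docs allowed by the search filter
     * @param groupKey maps a term value to the group it is counted under
     */
    static Map<String, Integer> total(Map<String, int[]> postings,
                                      BitSet filter,
                                      Function<String, String> groupKey) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int hits = 0;
            for (int doc : e.getValue()) {
                if (filter.get(doc)) {
                    hits++; // bitset lookup in place of a scores array
                }
            }
            if (hits > 0) {
                totals.merge(groupKey.apply(e.getKey()), hits, Integer::sum);
            }
        }
        return totals;
    }
}
```

For the author question, groupKey would be the identity function and the
keys of the returned map are the distinct authors. Passing
s -> s.substring(0, 4) instead groups full dates by year.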

Cheers,
Mark


		