Posted to java-user@lucene.apache.org by lu...@nitwit.de on 2004/04/09 21:18:26 UTC

ValueListHandler pattern with Lucene

Hi!

I implemented a VLH pattern over Lucene's search hits but noticed that
hits.doc() is quite slow (3000+ hits took about 500ms).

So, I want to ask people here for a solution. I thought about something like a
wrapper for the VO (value/transfer object), i.e. that the VO does not
actually contain the value but a reference to Lucene's Hits instance. But
this is somewhat of a hack...

Any ideas?

Timo

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Patterns are work-arounds for language deficiencies :)

Don't use patterns because some book said so - use them if they are the 
pragmatic choice.  Flattening data for reports or search results and 
perhaps being a little more coupled to Lucene between tiers in order to 
avoid performance problems seems a wise way to approach it.  Or go 
straight to Lucene from the presentation tier - no one said you had to 
proxy it through some other layer.

I would highly recommend *against* loading all documents from a search 
into a collection and passing it across tiers - you're only asking for 
trouble.

	Erik


On Apr 9, 2004, at 4:06 PM, lucene@nitwit.de wrote:

> On Friday 09 April 2004 21:30, Erik Hatcher wrote:
>> Do you really need *all* documents from Hits?  If not, then you should
>
> Only the user knows ;-) Well, no, I very likely only need one or a few but
> nevertheless I have to pull all hit results to the presentation tier...
>
> That's just the problem. Using a VLH I have to fetch all hits from the Hits
> instance and put them into the VLH - ordinarily you would lazily fetch only
> the hits you actually need - at the time you need them.
>
> That's just my question :-)
>
> So, to repeat, my idea was to use a wrapper for the VOs in order to fetch
> only some hits at a time...
>
> It's actually a VLH pattern drawback. Maybe I should ask the blueprint
> people ;-)
>
> Timo




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Friday 09 April 2004 21:30, Erik Hatcher wrote:
> Do you really need *all* documents from Hits?  If not, then you should

Only the user knows ;-) Well, no, I very likely only need one or a few but 
nevertheless I have to pull all hit results to the presentation tier...

That's just the problem. Using a VLH I have to fetch all hits from the Hits
instance and put them into the VLH - ordinarily you would lazily fetch only
the hits you actually need - at the time you need them.

That's just my question :-)

So, to repeat, my idea was to use a wrapper for the VOs in order to fetch only 
some hits at a time...

It's actually a VLH pattern drawback. Maybe I should ask the blueprint 
people ;-)

Timo



Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 9, 2004, at 3:18 PM, lucene@nitwit.de wrote:
> I implemented a VLH pattern over Lucene's search hits but noticed that
> hits.doc() is quite slow (3000+ hits took about 500ms).
>
> So, I want to ask people here for a solution. I thought about something
> like a wrapper for the VO (value/transfer object), i.e. that the VO does
> not actually contain the value but a reference to Lucene's Hits instance.
> But this is somewhat of a hack...
>
> Any ideas?

This is an interesting architecture question.  If you are trying to 
decouple things so much that you want to package up all documents in 
another data structure and ship them to another tier, you're asking for 
a heap of resources for a large Hits collection.

Do you really need *all* documents from Hits?  If not, then you should 
not be pulling them all with hits.doc().

If you truly do need all hits, use a HitCollector instead of Hits (see 
the other search() methods).

Packaging up a Hits instance could be problematic - you need to be sure 
the *same* IndexSearcher is around when you start navigating through 
the hits.

	Erik
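
A minimal sketch of the "only fetch what you need" advice above, in plain
Java. HitSource is a hypothetical stand-in for Lucene's Hits (only length()
and doc(int), the calls mentioned in this thread), and CountingHits is an
in-memory fake that counts loads so the cost is visible. Neither is a Lucene
class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a search result: only length() and doc(i). */
interface HitSource {
    int length();
    String doc(int i); // the expensive call in the real thing
}

/** Loads exactly one page of results instead of the whole result set. */
class PageFetcher {
    /** Returns the documents in [offset, offset + pageSize), clipped to the result length. */
    static List<String> fetchPage(HitSource hits, int offset, int pageSize) {
        int from = Math.max(0, Math.min(offset, hits.length()));
        int to = Math.min(hits.length(), from + Math.max(0, pageSize));
        List<String> page = new ArrayList<>(to - from);
        for (int i = from; i < to; i++) {
            page.add(hits.doc(i)); // only these documents are ever loaded
        }
        return page;
    }
}

/** In-memory HitSource (an assumption for the demo) that counts loaded documents. */
class CountingHits implements HitSource {
    private final int size;
    int loads = 0;

    CountingHits(int size) { this.size = size; }

    public int length() { return size; }

    public String doc(int i) {
        loads++;
        return "doc-" + i;
    }
}
```

With 3000 hits, PageFetcher.fetchPage(hits, 20, 10) touches ten documents and
leaves the other 2990 alone, which is the whole point of not copying the
result set into another structure up front.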




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Monday 12 April 2004 20:54, lucene@nitwit.de wrote:
> On Sunday 11 April 2004 17:46, Erik Hatcher wrote:
> > In other words, you need to invent your own "pattern" here?!  :)
>
> I just experimented a bit and came up with the ValueListSupplier which
> replaces the ValueList in the VLH. Seems to work so far... :-) Comments are
> greatly appreciated!

FYI http://www.nitwit.de/vlh2/



Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Sunday 11 April 2004 17:46, Erik Hatcher wrote:
> In other words, you need to invent your own "pattern" here?!  :)

I just experimented a bit and came up with the ValueListSupplier which 
replaces the ValueList in the VLH. Seems to work so far... :-) Comments are 
greatly appreciated!

Timo

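// IValueListIterator and SearchResultAdapter are application classes
// that are not shown here.
import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.search.Hits;

public class ValueListSupplier implements IValueListIterator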
{
	private final Log log = LogFactory.getLog(this.getClass());

	// TODO junit test case
	private Hits hits;
	protected BitSet fetched;
	protected List list;
	protected int index;
	
	public ValueListSupplier(Hits hits)
	{
		int size = hits.length();
		this.list = new ArrayList(size);
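		// ArrayList(int) only reserves capacity; pad with nulls so that
		// list.set(i, ...) works for every index before it is fetched.
		for (int i = 0; i < size; i++) list.add(null);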
		this.fetched = new BitSet();
		this.hits = hits;
		this.index = 0;
	}

	public List getList()
	{
		return list;
	}

	public int size()
	{
		return list.size();
	}

	public boolean hasPrevious()
	{
		return index > 0;
	}

	public boolean hasNext()
	{
		return index < size();
	}

	/**
	 * Moves the cursor without fetching anything.
	 *
	 * @param index
	 *                 new cursor position (0-based)
	 */
	public synchronized void move(int index)
	{
		this.index = index;
	}

	public void reset()
	{
		move(0);
	}

	public Object current()
	{
		validate(index, index + 1);
		return list.get(index);
	}

	public List previous(int count)
	{
		int from = Math.max(0, index - count);
		int to = index;

		validate(from, to);
		move(from);
		return list.subList(from, to);
	}

	public List next(int count)
	{
		int from = index;
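		// 'to' is exclusive, so the upper bound is size(), not size() - 1;
		// otherwise the last hit is never returned and hasNext() stays true.
		int to = Math.min(size(), index + count);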

		validate(from, to);
		move(to);
		return list.subList(from, to);
	}

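	/**
	 * Makes sure every entry in [from, to) has been fetched from Hits.
	 * Entries that were fetched before are skipped via the BitSet.
	 *
	 * @param from
	 *                 starting index (inclusive)
	 * @param to
	 *                 ending index (exclusive)
	 */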
	private void validate(int from, int to)
	{
		while ((from = fetched.nextClearBit(from)) < to)
		{
			log.debug("fetching #" + from);

			try
			{
				list.set(from, SearchResultAdapter.wrap(hits.doc(from)));
				fetched.set(from);
			}
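			catch (IOException e)
			{
				// Rethrow: swallowing the exception would leave the bit
				// clear, so the loop would retry the same index forever.
				throw new RuntimeException("could not fetch hit #" + from, e);
			}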
		}
	}

}
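
The lazy-fetch idea in validate() can also be tried without Lucene or the VLH
interfaces. The following is only a sketch under that assumption: LazySlots
and its loader parameter are hypothetical names, with the loader standing in
for hits.doc(i) plus the wrapping step.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

/** Lazily filled list: slot i is loaded on first access and cached afterwards. */
class LazySlots<T> {
    private final List<T> slots;
    private final BitSet fetched = new BitSet();
    private final IntFunction<T> loader; // hypothetical stand-in for hits.doc(i)

    LazySlots(int size, IntFunction<T> loader) {
        // Pre-fill with nulls so that set(i, ...) is valid for every index.
        this.slots = new ArrayList<>(Collections.<T>nCopies(size, null));
        this.loader = loader;
    }

    /** Returns a view of [from, to), loading only slots never fetched before. */
    List<T> range(int from, int to) {
        int end = Math.max(0, Math.min(to, slots.size()));
        int start = Math.max(0, Math.min(from, end));
        // nextClearBit jumps straight to the next slot that still needs loading.
        for (int i = fetched.nextClearBit(start); i < end; i = fetched.nextClearBit(i + 1)) {
            slots.set(i, loader.apply(i));
            fetched.set(i);
        }
        return slots.subList(start, end);
    }
}
```

Asking for range(0, 10) and then range(5, 15) loads fifteen slots in total,
not twenty, because slots 5 to 9 are already marked in the BitSet.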



Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 11, 2004, at 11:28 AM, lucene@nitwit.de wrote:
> On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
>> Well, yes.... the one we already discussed.  Let your presentation tier
>> talk directly to Hits, so you are as efficient as possible with access
>> to documents, and only fetch what you need.
>>
>> Again, don't let "patterns" get in your way.
>
> Well, the point of tiers and (BTW: language-independent) patterns is to
> modularize software and make things exchangeable. This way
> neither the presentation tier nor the search engine is exchangeable.
>
> The problem actually is that VLH is designed to have a static list of VOs.
> VLH needs to evolve to support something like a data provider that may add
> data dynamically. The problem so far is that an Iterator must throw a
> ConcurrentModificationException if the backing data is modified, but as
> data in a VLH is never removed, only added, this should be possible to
> implement.

In other words, you need to invent your own "pattern" here?!  :)

The benefit of agility is to know that any decision you make now is not 
something that prohibits you from change later.  Do you really think 
you're going to plug-and-play with search engines?  Or will you be 
sticking with Lucene for the foreseeable future?  Are you trying to 
plan for a future without Lucene when there is no use-case for doing 
so?  If you code with coupling to Lucene, do you see that as making 
life harder in the future, or are you smart enough and flexible enough 
to change your "soft"ware as times change?

Throw your patterns away when they don't solve the problem.  Be 
pragmatic _and_ agile.

	Erik




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
> Well, yes.... the one we already discussed.  Let your presentation tier
> talk directly to Hits, so you are as efficient as possible with access
> to documents, and only fetch what you need.
>
> Again, don't let "patterns" get in your way.

Well, the point of tiers and (BTW: language-independent) patterns is to
modularize software and make things exchangeable. This way
neither the presentation tier nor the search engine is exchangeable.

The problem actually is that VLH is designed to have a static list of VOs. VLH
needs to evolve to support something like a data provider that may add data
dynamically. The problem so far is that an Iterator must throw a
ConcurrentModificationException if the backing data is modified, but as data
in a VLH is never removed, only added, this should be possible to implement.

Timo
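
One way to picture the append-only case described above is a cursor that
keeps only a position instead of using a java.util.Iterator. This is a
sketch, not part of any VLH implementation: AppendOnlyList and Cursor are
hypothetical names, and it relies on the assumption that elements are never
removed or reordered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/** Append-only list with an index-based cursor that tolerates later appends. */
class AppendOnlyList<T> {
    private final List<T> items = new ArrayList<>();

    synchronized void append(T item) { items.add(item); }

    synchronized int size() { return items.size(); }

    synchronized T get(int i) { return items.get(i); }

    /** The cursor stores only a position, so appended data simply becomes visible. */
    Cursor cursor() { return new Cursor(); }

    class Cursor {
        private int position = 0;

        boolean hasNext() { return position < size(); }

        T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            // Safe without a modification check: existing indices never change.
            return get(position++);
        }
    }
}
```

A cursor that has reached the end reports hasNext() as false, and reports
true again once a data provider appends more elements, with no
ConcurrentModificationException involved.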



Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 11, 2004, at 10:00 AM, lucene@nitwit.de wrote:
> On Sunday 11 April 2004 15:56, Erik Hatcher wrote:
>> HitCollector was just an option - and apparently not the right one for
>> your use.
>
> So, any other option? :-)

Well, yes.... the one we already discussed.  Let your presentation tier 
talk directly to Hits, so you are as efficient as possible with access 
to documents, and only fetch what you need.

Again, don't let "patterns" get in your way.

	Erik




Query search syntax: abs_path

Posted by Rodrigo Baptista <rb...@criticalsoftware.com>.
Hello list,

When I do a search using the property abs_path, I only get results if
the path name is all in lower-case; if it has even one upper-case letter
the search returns nothing.
Must the path contain only lower-case letters?

Best regards,
Rodrigo Baptista.




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Sunday 11 April 2004 15:56, Erik Hatcher wrote:
> HitCollector was just an option - and apparently not the right one for
> your use.

So, any other option? :-)

I did sort the List via Collections.sort() but I really don't like that 
code :-(
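
For reference, sorting collected hits by score does not need much code. The
sketch below assumes each collected hit is kept as a (doc, score) pair;
ScoredHit and ScoreOrder are hypothetical names, not Lucene classes, and the
tie-break on document number is an arbitrary choice made here for a
repeatable order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One collected hit: the values passed to collect(int doc, float score). */
class ScoredHit {
    final int doc;
    final float score;

    ScoredHit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

class ScoreOrder {
    /** Highest score first; equal scores are ordered by ascending document number. */
    static final Comparator<ScoredHit> BY_SCORE_DESC = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
    };

    /** Returns a sorted copy and leaves the collected list untouched. */
    static List<ScoredHit> sortedByScore(List<ScoredHit> collected) {
        List<ScoredHit> copy = new ArrayList<>(collected);
        copy.sort(BY_SCORE_DESC);
        return copy;
    }
}
```

Sorting only the lightweight (doc, score) pairs keeps the documents
themselves out of the picture until a page of them is actually displayed.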



Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 11, 2004, at 9:32 AM, lucene@nitwit.de wrote:
> On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
>> That's the beauty.... it is up to you to load the doc iff you want it.
>
> Well, there's another problem with HitCollector: the list I build is not
> sorted by score :-(

HitCollector was just an option - and apparently not the right one for 
your use.

	Erik




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
> That's the beauty.... it is up to you to load the doc iff you want it.

Well, there's another problem with HitCollector: the list I build is not 
sorted by score :-(



Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Sunday 11 April 2004 13:40, Erik Hatcher wrote:
> using a HitCollector you are bypassing those mechanisms.  Whether it is
> measurably faster would depend on several other factors.

Well, it is hardly faster, so this is no real solution :-\



Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 11, 2004, at 5:25 AM, lucene@nitwit.de wrote:
> On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
>> That's the beauty.... it is up to you to load the doc iff you want it.
>
> As I "want" all of them I don't see why this should be faster at all...

Then have a look at the Hits class.  It is doing more work for caching 
and keeping a most recently used collection of documents around.  By 
using a HitCollector you are bypassing those mechanisms.  Whether it is 
measurably faster would depend on several other factors.

	Erik




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
> That's the beauty.... it is up to you to load the doc iff you want it.

As I "want" all of them I don't see why this should be faster at all...



Re: ValueListHandler pattern with Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 10, 2004, at 5:08 AM, lucene@nitwit.de wrote:
> On Friday 09 April 2004 23:59, Ype Kingma wrote:
>> When you need 3000 hits and their stored fields, you might
>> consider using the lower level search API with your own HitCollector.
>
> I apologize for the stupid question but ... where's the actual result in
> HitCollector? :-)
>
>   collect(int doc, float score)
>
> Where doc is the index and score is its score - and where's the Document?

That's the beauty.... it is up to you to load the doc iff you want it.
In many situations, loading the doc would slow things down
dramatically.  For example, QueryFilter uses a HitCollector internally,
but couldn't care less about the actual document object, just its id
(which you get from the int doc).  To get the doc:

	 Document document = searcher.doc(doc);

(I'd use 'id' for the int, personally).

	Erik
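
To illustrate the "just its id" case, here is a small sketch of a collector
that records matching document numbers and never loads a document. It is
written against ScoreCallback, a hypothetical interface that only mirrors the
collect(int doc, float score) signature quoted above, so it runs without
Lucene; MatchBits is likewise a made-up name.

```java
import java.util.BitSet;

/** Hypothetical stand-in mirroring the collect(int doc, float score) callback. */
interface ScoreCallback {
    void collect(int doc, float score);
}

/** Records which document numbers matched; no document is ever loaded. */
class MatchBits implements ScoreCallback {
    final BitSet bits = new BitSet();

    public void collect(int doc, float score) {
        bits.set(doc); // the score is ignored, only membership matters here
    }

    /** Matching document numbers in ascending order. */
    int[] docs() {
        return bits.stream().toArray();
    }
}
```

A document would then be loaded from the searcher only for the ids that are
actually needed, which is where the saving comes from.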




Re: ValueListHandler pattern with Lucene

Posted by lu...@nitwit.de.
On Friday 09 April 2004 23:59, Ype Kingma wrote:
> When you need 3000 hits and their stored fields, you might
> consider using the lower level search API with your own HitCollector.

I apologize for the stupid question but ... where's the actual result in
HitCollector? :-)

  collect(int doc, float score) 

Where doc is the index and score is its score - and where's the Document?

Timo



Re: ValueListHandler pattern with Lucene

Posted by Ype Kingma <yk...@xs4all.nl>.
On Friday 09 April 2004 21:18, lucene@nitwit.de wrote:
> Hi!
>
> I implemented a VLH pattern over Lucene's search hits but noticed that
> hits.doc() is quite slow (3000+ hits took about 500ms).
>
> So, I want to ask people here for a solution. I thought about something like
> a wrapper for the VO (value/transfer object), i.e. that the VO does not
> actually contain the value but a reference to Lucene's Hits instance. But
> this is somewhat of a hack...

Lucene's Hits already wraps quite a bit. Under the hood it will
redo your search in case you need more than 100 results.
Hits was designed for displaying a few web pages of search results.

When you need 3000 hits and their stored fields, you might
consider using the lower level search API with your own HitCollector.

This will allow you to do a single search, and retrieve the stored
document fields in order of document number after the search.
Documents are stored physically in document number order,
so retrieval in that order is normally close to optimal.

Actual savings depend a lot on the circumstances, though.

I checked the VLH pattern very briefly. The lower level search
API of Lucene seems to fit in quite well for the retrieval side
of it, ie. the DataAccessObject, for a larger number of results.
However, you'll have to throw some more RAM than Hits does
at the difference between the physical order of
Lucene and the order in which the client needs to iterate
the data.

Kind regards,
Ype
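
The trade-off described above (load in document number order, hand out in the
client's order, pay with RAM in between) might look roughly like this. It is
a sketch only: DocOrderLoader is a hypothetical helper, the loader parameter
stands in for reading stored fields by document number, and document numbers
are assumed to be unique within one result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

class DocOrderLoader {
    /**
     * @param docsInClientOrder document numbers in the order the client iterates (e.g. by score)
     * @param loader            hypothetical stand-in for loading stored fields by document number
     * @return loaded values, in the same order as docsInClientOrder
     */
    static <T> List<T> load(int[] docsInClientOrder, IntFunction<T> loader) {
        int[] physical = docsInClientOrder.clone();
        Arrays.sort(physical); // ascending document number, i.e. storage order

        // The extra RAM: every loaded value is held until the reordering is done.
        Map<Integer, T> byDoc = new HashMap<>();
        for (int doc : physical) {
            byDoc.put(doc, loader.apply(doc));
        }

        List<T> result = new ArrayList<>(docsInClientOrder.length);
        for (int doc : docsInClientOrder) {
            result.add(byDoc.get(doc));
        }
        return result;
    }
}
```

For the input {42, 7, 19} the loader is called for 7, 19, 42 in that order,
while the returned list follows the original 42, 7, 19 order.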

