Posted to java-user@lucene.apache.org by Dennis Thrysøe <dt...@conscius.com> on 2004/03/19 11:39:16 UTC

PrefixQuery and hierarchical queries problem

Hi,

I'm looking for advice on a problem I've run into while using Lucene.

I'm integrating Lucene as an alternative to other methods of indexing 
and searching that already exist in our product, so it would be best if 
the Lucene integration could meet the existing requirements.

What is indexed as lucene documents is structured in a tree (just like 
files in a filesystem), and the feature that I am working on is 
restricting a search to a certain part of this tree.

To implement this I used a PrefixQuery with the path to the folder to 
search below. Since the PrefixQuery creates a boolean query with a 
clause for each matching term, this is a problem if there are more than 
1024 subfolders below the selected folder.
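
To make the clause-count problem concrete, here is a minimal sketch that uses a sorted set of strings as a stand-in for the index's term dictionary. It is not Lucene code: the class name, the helper methods and the MAX_CLAUSES constant are assumptions chosen to mirror the 1024 limit mentioned above.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: a sorted set stands in for the term dictionary of the path field.
class PrefixExpansionSketch {
    // Assumed limit, mirroring the 1024-clause figure from the message above.
    static final int MAX_CLAUSES = 1024;

    // Counts the terms that start with the prefix, i.e. the number of
    // clauses a prefix-to-boolean rewrite would have to create.
    static int countMatchingTerms(SortedSet<String> terms, String prefix) {
        int count = 0;
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            count++;
        }
        return count;
    }

    // Builds a hypothetical term dictionary with the given number of subfolders of /foo.
    static SortedSet<String> sampleTerms(int subfolders) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < subfolders; i++) {
            terms.add("/foo/sub" + i);
        }
        terms.add("/other/folder");
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = sampleTerms(2000);
        int clauses = countMatchingTerms(terms, "/foo/");
        System.out.println(clauses + " clauses needed, limit is " + MAX_CLAUSES);
        System.out.println("exceeds limit: " + (clauses > MAX_CLAUSES));
    }
}
```

The number of clauses grows with the number of distinct paths below the folder, which is why no fixed limit fits every installation.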

One way of getting around this would be if maxClauseCount could be set 
for a PrefixQuery, but there are problems with this.

Picking a number for this would be hard. In order to support very large 
installations a value of a million or so would have to be used. This 
would probably not perform very well.

The only alternative I can think of would be to store a whitespace 
separated list of all ancestors along with a document:

/foo /foo/bar /foo/bar/baz

But this has two drawbacks: Index storage space used, and the cost of 
indexing (finding all ancestors).
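
As a rough sketch of what "finding all ancestors" could look like when the path string is already known, the following splits a path into its ancestor list using plain string operations. The class and method names are hypothetical, and it assumes '/' as the separator with a leading '/' on every path.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives the ancestor list from the path string alone.
class AncestorPaths {
    // Returns every ancestor of the path plus the path itself, shortest first.
    static List<String> ancestorsAndSelf(String path) {
        List<String> result = new ArrayList<>();
        int slash = path.indexOf('/', 1); // skip the leading separator
        while (slash != -1) {
            result.add(path.substring(0, slash));
            slash = path.indexOf('/', slash + 1);
        }
        result.add(path);
        return result;
    }

    public static void main(String[] args) {
        // prints: /foo /foo/bar /foo/bar/baz
        System.out.println(String.join(" ", ancestorsAndSelf("/foo/bar/baz")));
    }
}
```

Under these assumptions the work per document is proportional to the depth of its path; whether that is cheap enough depends on how the paths are obtained in the real system.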

So my question boils down to: Are there any alternatives to solve this 
scenario in an efficient way?


Thanks in advance,

Dennis Thrysøe



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: PrefixQuery and hierarchical queries problem

Posted by Matt Quail <ma...@ctx.com.au>.
Dennis Thrysøe wrote:
> The only alternative I can think of would be to store a whitespace 
> separated list of all ancestors along with a document:
> 
> /foo /foo/bar /foo/bar/baz

I think you will find that this kind of approach works very well (as it 
has for me). But instead of adding one field named "path" with 
space-separated values, just add multiple "path" fields to the one document.
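
A toy model may help show why this scales: if each ancestor folder is indexed as its own exact term, restricting a search to a subtree becomes a single term lookup instead of a prefix expansion. The sketch below is not Lucene code; the class, its map-based postings and the method names are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index for one multi-valued "path" field; not Lucene code.
class AncestorFieldIndex {
    // Maps an ancestor folder (one exact term) to the documents below it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexes a document once under each of its ancestor folders.
    void add(String docPath) {
        int slash = docPath.indexOf('/', 1);
        while (slash != -1) {
            String ancestor = docPath.substring(0, slash);
            postings.computeIfAbsent(ancestor, k -> new TreeSet<>()).add(docPath);
            slash = docPath.indexOf('/', slash + 1);
        }
    }

    // A subtree search is one exact-term lookup, however many subfolders exist.
    Set<String> below(String folder) {
        return postings.getOrDefault(folder, new TreeSet<>());
    }

    public static void main(String[] args) {
        AncestorFieldIndex index = new AncestorFieldIndex();
        index.add("/foo/bar/a.txt");
        index.add("/foo/b.txt");
        index.add("/baz/c.txt");
        System.out.println(index.below("/foo"));
        System.out.println(index.below("/foo/bar"));
    }
}
```

The trade-off is the one raised in the original question: every document is stored under as many terms as it has ancestors.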

=Matt



Re: PrefixQuery and hierarchical queries problem

Posted by Matt Quail <ma...@ctx.com.au>.
Dennis,

I've attached some sample code that works well for my needs. You should
find it scales very well, even when you search for a "parentpath" near
the root.

The output I get from running this program is:

query:  name:rt.jar
   D:\opt\j2sdk1.4.2\jre\lib\rt.jar

query:  name:LICENSE
   D:\opt\j2sdk1.4.2\jre\LICENSE
   D:\opt\j2sdk1.4.2\LICENSE

query:  fullpath:"D:\opt\j2sdk1.4.2\LICENSE"
   D:\opt\j2sdk1.4.2\LICENSE

query:  parentpath:"D:\opt\j2sdk1.4.2\include"
   D:\opt\j2sdk1.4.2\include\jawt.h
   D:\opt\j2sdk1.4.2\include\jni.h
   D:\opt\j2sdk1.4.2\include\jvmdi.h
   D:\opt\j2sdk1.4.2\include\jvmpi.h
   D:\opt\j2sdk1.4.2\include\win32\jawt_md.h
   D:\opt\j2sdk1.4.2\include\win32\jni_md.h



=Matt



Dennis Thrysøe wrote:
> Andrzej Bialecki wrote:
> 
>> What about using PhraseQuery, and store the path with all but first 
>> path separator replaced by whitespace (i.e. "/foo bar baz one two 
>> three"). Then you could query for "/foo bar", "/foo bar baz", and so 
>> on...
> 
> 
> That sounds like a really good suggestion. I'll try that. Thanks.
> 
> -dennis




Re: PrefixQuery and hierarchical queries problem

Posted by Dennis Thrysøe <dt...@conscius.com>.
Andrzej Bialecki wrote:
> What about using PhraseQuery, and store the path with all but first path 
> separator replaced by whitespace (i.e. "/foo bar baz one two three"). 
> Then you could query for "/foo bar", "/foo bar baz", and so on...

That sounds like a really good suggestion. I'll try that. Thanks.

-dennis




Re: PrefixQuery and hierarchical queries problem

Posted by Dennis Thrysøe <dt...@conscius.com>.
Andrzej Bialecki wrote:
> Dennis Thrysøe wrote:
> 
>> Andrzej Bialecki wrote:
>>
>>> What about using PhraseQuery, and store the path with all but first 
>>> path separator replaced by whitespace (i.e. "/foo bar baz one two 
>>> three"). Then you could query for "/foo bar", "/foo bar baz", and so 
>>> on...
>>
>>
>>
>> Hi,
>>
>> It doesn't seem to work though - unless I'm missing something.
>>
>> I've tried to index the field both as Keyword and as UnStored.
>>
>> I'm constructing a PhraseQuery myself (no query parser used), so I 
>> don't know if I should add a single or multiple terms to the PhraseQuery.
>>
>> The following (simplified) debug output gives no hits:
>>
>> ADDING: Document<org.apache.lucene.document.Field@7bc0ac 
>> Keyword<name:art> Keyword<uri:/dt art>>
>>
>> SEARCHING: +(name:art) +uri:"/dt "
> 
> 
> Why the trailing space?

Because in this case the paths to folders actually end with a path 
separator character. I guess I'll just remove it.

> Anyway.. I should've added that for Phrase Queries to work the text must 
> be tokenized. So, the best way in this case would be to use 
> WhitespaceAnalyzer for the uri field, and store it as Field.Text(...).

Right, thanks. I was just experimenting with that, but the stop analyzer 
seems to remove the leading slash.

I'll try your suggestion.


Thanks,

-dennis



Re: Lucene index - information

Posted by David Spencer <da...@tropo.com>.
Karl Koch wrote:

> If I create a standard index, what does Lucene store in this index?
> 
> What should be stored in an index at least? Just a link to the file and
> keywords? Or also word numbers? What else?
> 
> Does somebody know a paper which discusses this problem of "what to put in
> a good universal IR index"?


Well, if you want a textbook, I found "Managing Gigabytes" to have 
excellent coverage of the internals and messy details of search/indexes.

http://www.amazon.com/exec/obidos/ASIN/1558605703/tropoA
http://www.cs.mu.oz.au/mg/



> 
> Cheers,
> Karl
> 




Re: Cover density ranking?

Posted by Doug Cutting <cu...@apache.org>.
Boris Goldowsky wrote:
> How difficult would it be to implement something like Cover Density
> ranking for Lucene?  Has anyone tried it?  
> 
> Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
> and is supposed to be particularly good for short queries of the type
> that you get in many web applications.

I just glanced at the paper, so my analysis may be wrong, but I think 
one could implement cover density ranking in Lucene with spans (only in 
CVS, not in 1.3).  I think spans correspond to covers in this paper. 
But you'd need to alter SpanScorer.java to implement the cover scoring 
described in that paper.  And you'd probably need to use a custom 
Similarity implementation, which disables most other scoring (tf=1.0, 
idf=1.0, etc.), but exaggerates coordination.  Finally, you'd need to 
construct span queries.  Or something like that.

If someone tries this, please tell us how it works.

Doug



Cover density ranking?

Posted by Boris Goldowsky <bo...@alum.mit.edu>.
Since there have been a few discussions recently of overriding various
aspects of Lucene's ranking formula, I got to wondering how difficult it
might be to implement something that departs further from the base tf/idf
ranking system that Lucene has built in.

How difficult would it be to implement something like Cover Density
ranking for Lucene?  Has anyone tried it?  

Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
and is supposed to be particularly good for short queries of the type
that you get in many web applications.

Boris





Re: VSpace Model Index <-> Prob. Model Index - Difference?

Posted by Ype Kingma <yk...@xs4all.nl>.
Karl,

On Friday 19 March 2004 18:24, Karl Koch wrote:
> Hello group,
>
> coming back to the discussion about probabilistic and vector space model
> (which occurred here some time ago), I would like to ask something related.
>
> I only know the index structure Lucene offers. Does an IR system based on
> the probabilistic model (e.g. Okapi) look different from a VS model? If
> yes, why?
>
> I hope this question is not too stupid. I am mainly interested because of
> some theoretical background...
>
> Karl

First off: I don't know about the fine points between probabilistic and VS 
models.

Some time ago I made a quick comparison between
the default scoring method of Lucene and the Okapi model.
Off the top of my head I remember this (it is
not complete):

Similarities:
- both do term weighting by inverse document frequency,
- both normalize for document length, effectively using term density,
- both have a saturation for this term density.

Differences:
- Okapi can also use the document length by itself.
- Lucene has a factor (coord) for the overlap between a query and a document
  (i.e. the number of matching query terms present in a document).
- The term density saturation functions are different, too:
  Lucene uses a square root, Okapi uses an (increasing) reciprocal; however,
  in practice the limit of the reciprocal is far from reached.

When the overlap is ignored, from a practical viewpoint, I would
be surprised if the two methods ordered a given set of
docs very differently for the same query.
I'd expect most differences in the 'middle' due to the differences
in the form (2nd derivative) of the saturation functions.
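
To compare the two shapes side by side, here is a small sketch of both damping functions applied to a raw term frequency. The Okapi form used is the common BM25-style tf * (k1 + 1) / (tf + k1) with document length normalization left out, and k1 = 1.2 is an assumed, commonly quoted value; neither detail comes from this thread, and the class and method names are hypothetical.

```java
// Sketch of the two term-frequency damping shapes; not taken from either system's code.
class TfSaturationSketch {
    // Square-root damping of the raw term frequency.
    static double sqrtTf(double tf) {
        return Math.sqrt(tf);
    }

    // BM25-style increasing reciprocal; approaches (k1 + 1) as tf grows.
    static double okapiTf(double tf, double k1) {
        return tf * (k1 + 1) / (tf + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed parameter value
        for (int tf : new int[] {1, 2, 4, 8, 16}) {
            System.out.println("tf=" + tf
                    + " sqrt=" + sqrtTf(tf)
                    + " okapi=" + okapiTf(tf, k1));
        }
    }
}
```

Both functions increase with tf and flatten out, but only the reciprocal has an upper bound, which matches the expectation above that the orderings differ mostly in the middle range.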

Coming back to your question:

> I only know the index structure Lucene offers. Does an IR system based on
> the probabilistic model (e.g. Okapi) look different from a VS model? If
> yes, why?

My guess is that, in practice (ie. in the orderings of documents for queries),
the two systems are much more similar than different.

> I hope this question is not too stupid. I am mainly interested because of
> some theoretical background...

Do you intend to do a theoretical comparison of the scoring
functions of Lucene and Okapi? AFAIK this has not been investigated.

Kind regards,
Ype Kingma




VSpace Model Index <-> Prob. Model Index - Difference?

Posted by Karl Koch <Th...@gmx.net>.
Hello group,

coming back to the discussion about probabilistic and vector space model
(which occurred here some time ago), I would like to ask something related.

I only know the index structure Lucene offers. Does an IR system based on
the probabilistic model (e.g. Okapi) look different from a VS model? If yes,
why? 

I hope this question is not too stupid. I am mainly interested because of
some theoretical background...

Karl

> Uh, there are lots of ways to construct an inverted index.
> Citeseer will give you more than you can read on this topic.
> 
> As for Lucene, see File Formats section on the site.
> 
> Otis
> 
> --- Karl Koch <Th...@gmx.net> wrote:
> > If I create a standard index, what does Lucene store in this index?
> > 
> > What should be stored in an index at least? Just a link to the file
> > and keywords? Or also word numbers? What else?
> > 
> > Does somebody know a paper which discusses this problem of "what to
> > put in a good universal IR index"?
> > 
> > Cheers,
> > Karl



Re: Lucene index - information

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Uh, there are lots of ways to construct an inverted index.
Citeseer will give you more than you can read on this topic.

As for Lucene, see File Formats section on the site.

Otis

--- Karl Koch <Th...@gmx.net> wrote:
> If I create a standard index, what does Lucene store in this index?
> 
> What should be stored in an index at least? Just a link to the file
> and keywords? Or also word numbers? What else?
> 
> Does somebody know a paper which discusses this problem of "what to
> put in a good universal IR index"?
> 
> Cheers,
> Karl


Lucene index - information

Posted by Karl Koch <Th...@gmx.net>.
If I create a standard index, what does Lucene store in this index?

What should be stored in an index at least? Just a link to the file and
keywords? Or also word numbers? What else?

Does somebody know a paper which discusses this problem of "what to put in
a good universal IR index"?

Cheers,
Karl



Re: PrefixQuery and hierarchical queries problem

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Thrysøe wrote:

> Andrzej Bialecki wrote:
> 
>> Anyway.. I should've added that for Phrase Queries to work the text 
>> must be tokenized. So, the best way in this case would be to use
>> WhitespaceAnalyzer for the uri field, 
> 
> 
> I've figured out how to use the WhitespaceAnalyzer for creating the 
> PhraseQuery, but I suspect I should use the same analyzer when indexing 
> (so the leading slash isn't removed).
> 
> This is a problem though, because I'm using the StopAnalyzer. Have I 
> overlooked a way to specify a specific analyzer for a single field when 
> indexing?

  PerFieldAnalyzerWrapper.
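
For readers unfamiliar with the idea, the sketch below imitates per-field analyzer dispatch with plain Java functions: one default tokenizer, plus an override for a named field. None of these types are Lucene classes; the stop-word list, the letter-based splitting and all names are rough assumptions meant only to show why the uri field needs different treatment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Stand-in for the per-field dispatch idea; none of these types are Lucene classes.
class PerFieldAnalysisSketch {
    // Assumed stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of");

    // Rough imitation of a letter-based stop analyzer: non-letters such as '/' are dropped.
    static List<String> stopAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Splits on whitespace only, so "/dt" keeps its leading slash.
    static List<String> whitespaceAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers an override for one field; all other fields use the default.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch wrapper =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::stopAnalyze);
        wrapper.addAnalyzer("uri", PerFieldAnalysisSketch::whitespaceAnalyze);
        System.out.println(wrapper.analyze("uri", "/dt art"));   // slash survives
        System.out.println(wrapper.analyze("name", "The Art"));  // default analysis
    }
}
```

The same dispatch has to be used at index time and at query time, otherwise the query tokens will not line up with the indexed ones.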

> 
>> and store it as Field.Text(...).
> 
> 
> Or UnStored?

Yes - if you don't need to retrieve it later; because you cannot get 
back the content of an UnStored field, as the name itself suggests.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: PrefixQuery and hierarchical queries problem

Posted by Otis Gospodnetic <ot...@yahoo.com>.
PerFieldAnalyzerWrapper ?

Otis

--- Dennis Thrysøe <dt...@conscius.com> wrote:
> Andrzej Bialecki wrote:
> > Anyway.. I should've added that for Phrase Queries to work the text
> > must be tokenized. So, the best way in this case would be to use
> > WhitespaceAnalyzer for the uri field, 
> 
> I've figured out how to use the WhitespaceAnalyzer for creating the 
> PhraseQuery, but I suspect I should use the same analyzer when
> indexing (so the leading slash isn't removed).
> 
> This is a problem though, because I'm using the StopAnalyzer. Have I 
> overlooked a way to specify a specific analyzer for a single field
> when indexing?
> 
> > and store it as Field.Text(...).
> 
> Or UnStored?
> 
> -dennis


Re: PrefixQuery and hierarchical queries problem

Posted by Dennis Thrysøe <dt...@conscius.com>.
Andrzej Bialecki wrote:
> Anyway.. I should've added that for Phrase Queries to work the text must 
> be tokenized. So, the best way in this case would be to use
> WhitespaceAnalyzer for the uri field, 

I've figured out how to use the WhitespaceAnalyzer for creating the 
PhraseQuery, but I suspect I should use the same analyzer when indexing 
(so the leading slash isn't removed).

This is a problem though, because I'm using the StopAnalyzer. Have I 
overlooked a way to specify a specific analyzer for a single field when 
indexing?

> and store it as Field.Text(...).

Or UnStored?

-dennis



Re: PrefixQuery and hierarchical queries problem

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Thrysøe wrote:

> Andrzej Bialecki wrote:
> 
>> What about using PhraseQuery, and store the path with all but first 
>> path separator replaced by whitespace (i.e. "/foo bar baz one two 
>> three"). Then you could query for "/foo bar", "/foo bar baz", and so 
>> on...
> 
> 
> Hi,
> 
> It doesn't seem to work though - unless I'm missing something.
> 
> I've tried to index the field both as Keyword and as UnStored.
> 
> I'm constructing a PhraseQuery myself (no query parser used), so I don't 
> know if I should add a single or multiple terms to the PhraseQuery.
> 
> The following (simplified) debug output gives no hits:
> 
> ADDING: Document<org.apache.lucene.document.Field@7bc0ac 
> Keyword<name:art> Keyword<uri:/dt art>>
> 
> SEARCHING: +(name:art) +uri:"/dt "

Why the trailing space?

Anyway.. I should've added that for Phrase Queries to work the text must 
be tokenized. So, the best way in this case would be to use 
WhitespaceAnalyzer for the uri field, and store it as Field.Text(...).
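
The difference between an untokenized and a tokenized field can be sketched with plain lists: a phrase only matches when its terms appear as separate tokens at consecutive positions. This is a simplified model, not Lucene's matching code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of phrase matching over token positions; not Lucene code.
class PhraseMatchSketch {
    // True if the phrase tokens occur at consecutive positions in the field tokens.
    static boolean containsPhrase(List<String> fieldTokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= fieldTokens.size(); start++) {
            if (fieldTokens.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Untokenized (Keyword-style): the whole value is one single term.
        List<String> keyword = List.of("/dt art");
        // Whitespace-tokenized: one term per path element.
        List<String> tokenized = Arrays.asList("/dt art".split("\\s+"));

        System.out.println(containsPhrase(keyword, List.of("/dt")));    // false
        System.out.println(containsPhrase(tokenized, List.of("/dt")));  // true
    }
}
```

In this model the leading slash on the first token is what anchors a phrase to the root of the tree: the phrase "/foo bar" cannot match in the middle of a path, because only the first element carries the slash.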



-- 
Best regards,
Andrzej Bialecki



Re: PrefixQuery and hierarchical queries problem

Posted by Dennis Thrysøe <dt...@conscius.com>.
Andrzej Bialecki wrote:
> What about using PhraseQuery, and store the path with all but first path 
> separator replaced by whitespace (i.e. "/foo bar baz one two three"). 
> Then you could query for "/foo bar", "/foo bar baz", and so on...

Hi,

It doesn't seem to work though - unless I'm missing something.

I've tried to index the field both as Keyword and as UnStored.

I'm constructing a PhraseQuery myself (no query parser used), so I don't 
know if I should add a single or multiple terms to the PhraseQuery.

The following (simplified) debug output gives no hits:

ADDING: Document<org.apache.lucene.document.Field@7bc0ac 
Keyword<name:art> Keyword<uri:/dt art>>

SEARCHING: +(name:art) +uri:"/dt "


-dennis




Re: PrefixQuery and hierarchical queries problem

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Thrysøe wrote:
> Hi,

> The only alternative I can think of would be to store a whitespace 
> separated list of all ancestors along with a document:
> 
> /foo /foo/bar /foo/bar/baz
> 
> But this has two drawbacks: Index storage space used, and the cost of 
> indexing (finding all ancestors).
> 
> So my question boils down to: Are there any alternatives to solve this 
> scenario in an efficient way?

What about using PhraseQuery, and store the path with all but first path 
separator replaced by whitespace (i.e. "/foo bar baz one two three"). 
Then you could query for "/foo bar", "/foo bar baz", and so on...
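
A minimal sketch of the path transformation described above, using plain string operations. The class and method names are hypothetical, and it assumes '/' as the separator and folder names that contain no whitespace (a name with a space in it would be split into two tokens).

```java
// Hypothetical helper for turning a path into whitespace-separated phrase text.
class PathPhraseSketch {
    // Replaces every separator except the first with a space and drops a
    // trailing separator: "/foo/bar/baz" -> "/foo bar baz", "/dt/" -> "/dt".
    static String toPhraseText(String path) {
        if (path.isEmpty()) {
            return path;
        }
        return path.charAt(0) + path.substring(1).replace('/', ' ').trim();
    }

    public static void main(String[] args) {
        // Text to index for a document
        System.out.println(toPhraseText("/foo/bar/baz/one/two/three"));
        // Phrase to search for when restricting to the folder /foo/bar
        System.out.println(toPhraseText("/foo/bar"));
    }
}
```

The same function can be applied to the document path at index time and to the folder path at query time, so both sides agree on the token boundaries.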

-- 
Best regards,
Andrzej Bialecki
