You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Fredrik Lindner <Fr...@r5.com> on 2004/05/17 11:07:48 UTC

hierarchical search

Hi all!

I'm currently developing an application in which text searching is a
main component. Among other things, a document will contain a field
denoting hierarchical information. The information is stored as a string
using the common path syntax, /x/y/z/etc/...

I would like to be able to search documents based on the path field
using two different selection criteria's,

1. given a prefix path and a suffix path select all documents for which
the path start with the supplied prefix, ends with the suffix and has
"some path" in between.

2. like (1) but with the requirement that "some path" spans one and one
level only. i.e. it defines a strict grandparent/grandchild relationship
between the last path entry of the prefix and the first of the suffix.

For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
documents with the path filed values

a) /p1/p2/p3/x/s1/s2/s3/
b) /p1/p2/p3/y/s1/s2/s3/
c) /p1/p2/p3/x/y/s1/s2/s3/

case one should select them all whereas case two should select only a)
and b).

My problem is that I am uncertain on how to implement the second case. I
guess I have to extend the Lucene internals somehow but I am quite too
inexperienced regarding Lucene to do so directly. Any pointers, hints or
comments are most welcome.

Regards
/Fredrik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: hierarchical search

Posted by Matt Quail <ma...@ctx.com.au>.

Fredrik,

I would tackle your problem like this:

Say that that field you want to index is "path". I would turn this into
*three* indexed fields:
1) multiple path prefixes ("pre-paths")
2) multiple path suffixes ("post-paths")
3) the number of "components" in the path ("path-size").

For example, for a "path" of "/foo/bar/dog/cat/fish" I would index it
like this:

doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/fish/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/"));
doc.add(Field.Keyword("pre-paths", "/foo/"));
doc.add(Field.Keyword("pre-paths", "/"));
doc.add(Field.Keyword("post-paths", "/foo/bar/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/bar/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/fish/"));
doc.add(Field.Keyword("post-paths", "/"));
doc.add(Field.Keyword("path-size", "5"));

And to do your "type 2" search for (prefix="/p1/p2/p3/" and
suffix="/s1/s2/s3/") I would use a query like this:

Query q = QueryParser.parse("pre-paths:'/p1/p2/p3/' AND
post-paths:'/s1/s2/s3/ AND (path-size:7)'");

The trick is to lock down the prefix and suffix, then define the amount
of "slack" between the prefix and the suffix using the path-size. If you
wanted the "slack" between either end to be zero or one segments, then
change the size clause to something like (path-size:6 OR path-size:7)


I think that should work.

=Matt


Fredrik Lindner wrote:

> Hi all!
> 
> I'm currently developing an application in which text searching is a
> main component. Among other things, a document will contain a field
> denoting hierarchical information. The information is stored as a string
> using the common path syntax, /x/y/z/etc/...
> 
> I would like to be able to search documents based on the path field
> using two different selection criteria's,
> 
> 1. given a prefix path and a suffix path select all documents for which
> the path start with the supplied prefix, ends with the suffix and has
> "some path" in between.
> 
> 2. like (1) but with the requirement that "some path" spans one and one
> level only. i.e. it defines a strict grandparent/grandchild relationship
> between the last path entry of the prefix and the first of the suffix.
> 
> For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
> documents with the path filed values
> 
> a) /p1/p2/p3/x/s1/s2/s3/
> b) /p1/p2/p3/y/s1/s2/s3/
> c) /p1/p2/p3/x/y/s1/s2/s3/
> 
> case one should select them all whereas case two should select only a)
> and b).
> 
> My problem is that I am uncertain on how to implement the second case. I
> guess I have to extend the Lucene internals somehow but I am quite too
> inexperienced regarding Lucene to do so directly. Any pointers, hints or
> comments are most welcome.
> 
> Regards
> /Fredrik
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

SELECTIVE Indexing

Posted by Karthik N S <ka...@controlnet.co.in>.

Hi all

   Can Some Body tell me How to Index  CERTAIN PORTION OF THE HTML FILE Only

   ex:-
        <table .....>
               ....

         </table>


with regards
Karthik




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org