You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Shahid Faiz <de...@gmail.com> on 2009/12/23 14:35:52 UTC

solution for parent-child relationship

Hi,

Following are details of my problem and possible solutions which I can think
of. Please suggest which should I choose, or is there any other approach
better than these.

I want to index blog posts and their comments, in my database posts and
comments are stored in two different tables. Currently I am indexing posts
only and search works perfectly fine, but now I want to index comments as
well. From now on when user will search any word, all posts will be
displayed which meet search criteria including posts and only those comments
which meet the specified criteria along with posts. All paging is done on
posts, comments are not considered while calculating page contents.

e.g.

POST =>          Google launched Chrome Browser
COMMENT =>  Microsoft IE also has tab browsing.

Now searching (microsoft AND IE), should also display this post with along
with above comment.

One solution is: I should index both posts+comments in one lucene index and
search that index. But as i have to display 10 posts (not including
comments) per page so it gets complicated while finding records to be
displayed on any given page. I think, I can solve this problem by creating
two queries one to search posts and one to search comments only and then can
join them using OR operator.

Second solution could be: I can store comments in different index, and on
search first I will execute query on comments index and use result of
comments search to search blog posts.

Thanks in advance.

- deve

Re: solution for parent-child relationship

Posted by Erick Erickson <er...@gmail.com>.

First, if you don't need to distinguish between posts and comments,
you don't care about the increment gap. But if you do...

It's kind of arcane, but here's the general idea. You override your Analyzer
of choice and implement getPositionIncrementGap. Say your
getPositionIncrementGap
returns 100. When you do something like this:

doc.add(new Field("text", "post text here", Field.Store.NO,
Field.Index.ANALYZED);
doc.add(new Field("text", "commentone one", Field.Store.NO,
Field.Index.ANALYZED);
doc.add(new Field("text", "commenttwo two", Field.Store.NO,
Field.Index.ANALYZED);

then the offsets of your tokens in the field "text" are as follows (note,
may be off by one or so).
1 - post
2 - text
3 - here
104 - commentone
105 - one
206 - commenttwo
207 - two

If your getPositionIncrementGap returned 1, the positions would be 1, 2, 3,
4, 5, 6, 7....

Now, you can use the offsets of the tokens to determine what comment you go
the hit
in. By putting an increment of, say,  100 you have the advantage of NOT
hitting on any
phrase query across post/comment/comment boundaries (i.e. "here commentone"
would NOT hit) and when executing any of the "near" queries anything < 100
won't
match either. But all of your posts and comments are in the same document,
so you
can search/display them in one query.

You can get creative with this. Say you implemented your
getPositionIncrementGap
to always return an even, say, 10,000 boundary (assuming no comment was more
than, 10,000 words). Then the offset tells you exactly which comment you're
in,
or whether it's the original post, it's just a modulo operation.

Another possibility is to *store* (but not index) meta data with each
document that
contains information about your doc, like the offset of each comment. This
could
even be a POJO with the necessary serialization. You'd still want to put an
increment
gap in there to keep from inappropriately matching across boundaries.....

Another thing that's hard to feel good about is that documents do NOT have
to have the same fields. It's really tempting to think of "documents" as
being
"just like tables", but they're not. There's no penalty that I know of for
having
different documents have different fields. Theoretically, you could have a
Lucene
index in which no document had any field in common with any other document.
So putting meta-data in your index is do-able, although do NOT use Lucene
document IDs in your meta-data, they can change on you.....

But do remember that a hybrid solution may be best if you absolutely require
database-like capabilities and can't think of a way to use a Lucene index
efficiently. My personal tendency is to use a hybrid solution last, when
neither
a Lucene index nor a database is adequate just on a "less complex is better"
perspective...

HTH
Erick

On Wed, Dec 23, 2009 at 11:41 AM, Deve <de...@gmail.com> wrote:

> @Erick, you are right i will have to stop thinking in terms of databases,
> thats why i wanted to discuss this.
>
> i don't get how can i use getPositionIncrementGap, could you provide little
> more details.
>
> thanks,
>
> On Wed, Dec 23, 2009 at 8:45 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Ya just gotta stop thinking like a database guy, man <G>. Lucene searches
> > lots and lots of text very well. It doesn't do joins worth a darn. The
> > moment
> > you star thinking in terms of sub-queries, you're probably starting down
> > the
> > wrong track.
> >
> > Here's a possibility. Index each post and all associated comments as a
> > single
> > document, with the post and all comments put in the same field. Something
> > like
> > doc.add("text", <original post text>);
> > doc.add("text", <first comment>);
> > doc.add("text", <second comment>);
> > add the document to the index.....
> >
> > Now, all you need to do is search the "text" field and you get
> everything.
> > If
> > you set the position increment gap to something other than 1 (see the
> > docs),
> > you can distinguish between the individual calls to doc.add(), see
> > getPositionIncrementGap.
> >
> > In general, when indexing anything database-related, the advice is to
> > flatten
> > then data as you put it into a Lucene index. Think about making all the
> > data
> > available as the result of a single query. This leads to data
> replication,
> > which
> > goes against all your database instincts, but....
> >
> > HTH
> > Erick
> >
> >
> > On Wed, Dec 23, 2009 at 8:35 AM, Shahid Faiz <developer.inbox@gmail.com
> > >wrote:
> >
> > > Hi,
> > >
> > > Following are details of my problem and possible solutions which I can
> > > think
> > > of. Please suggest which should I choose, or is there any other
> approach
> > > better than these.
> > >
> > > I want to index blog posts and their comments, in my database posts and
> > > comments are stored in two different tables. Currently I am indexing
> > posts
> > > only and search works perfectly fine, but now I want to index comments
> as
> > > well. From now on when user will search any word, all posts will be
> > > displayed which meet search criteria including posts and only those
> > > comments
> > > which meet the specified criteria along with posts. All paging is done
> on
> > > posts, comments are not considered while calculating page contents.
> > >
> > > e.g.
> > >
> > > POST =>          Google launched Chrome Browser
> > > COMMENT =>  Microsoft IE also has tab browsing.
> > >
> > > Now searching (microsoft AND IE), should also display this post with
> > along
> > > with above comment.
> > >
> > > One solution is: I should index both posts+comments in one lucene index
> > and
> > > search that index. But as i have to display 10 posts (not including
> > > comments) per page so it gets complicated while finding records to be
> > > displayed on any given page. I think, I can solve this problem by
> > creating
> > > two queries one to search posts and one to search comments only and
> then
> > > can
> > > join them using OR operator.
> > >
> > > Second solution could be: I can store comments in different index, and
> on
> > > search first I will execute query on comments index and use result of
> > > comments search to search blog posts.
> > >
> > > Thanks in advance.
> > >
> > > - deve
> > >
> >
>

Re: solution for parent-child relationship

Posted by Deve <de...@gmail.com>.

@Erick, you are right i will have to stop thinking in terms of databases,
thats why i wanted to discuss this.

i don't get how can i use getPositionIncrementGap, could you provide little
more details.

thanks,

On Wed, Dec 23, 2009 at 8:45 PM, Erick Erickson <er...@gmail.com>wrote:

> Ya just gotta stop thinking like a database guy, man <G>. Lucene searches
> lots and lots of text very well. It doesn't do joins worth a darn. The
> moment
> you star thinking in terms of sub-queries, you're probably starting down
> the
> wrong track.
>
> Here's a possibility. Index each post and all associated comments as a
> single
> document, with the post and all comments put in the same field. Something
> like
> doc.add("text", <original post text>);
> doc.add("text", <first comment>);
> doc.add("text", <second comment>);
> add the document to the index.....
>
> Now, all you need to do is search the "text" field and you get everything.
> If
> you set the position increment gap to something other than 1 (see the
> docs),
> you can distinguish between the individual calls to doc.add(), see
> getPositionIncrementGap.
>
> In general, when indexing anything database-related, the advice is to
> flatten
> then data as you put it into a Lucene index. Think about making all the
> data
> available as the result of a single query. This leads to data replication,
> which
> goes against all your database instincts, but....
>
> HTH
> Erick
>
>
> On Wed, Dec 23, 2009 at 8:35 AM, Shahid Faiz <developer.inbox@gmail.com
> >wrote:
>
> > Hi,
> >
> > Following are details of my problem and possible solutions which I can
> > think
> > of. Please suggest which should I choose, or is there any other approach
> > better than these.
> >
> > I want to index blog posts and their comments, in my database posts and
> > comments are stored in two different tables. Currently I am indexing
> posts
> > only and search works perfectly fine, but now I want to index comments as
> > well. From now on when user will search any word, all posts will be
> > displayed which meet search criteria including posts and only those
> > comments
> > which meet the specified criteria along with posts. All paging is done on
> > posts, comments are not considered while calculating page contents.
> >
> > e.g.
> >
> > POST =>          Google launched Chrome Browser
> > COMMENT =>  Microsoft IE also has tab browsing.
> >
> > Now searching (microsoft AND IE), should also display this post with
> along
> > with above comment.
> >
> > One solution is: I should index both posts+comments in one lucene index
> and
> > search that index. But as i have to display 10 posts (not including
> > comments) per page so it gets complicated while finding records to be
> > displayed on any given page. I think, I can solve this problem by
> creating
> > two queries one to search posts and one to search comments only and then
> > can
> > join them using OR operator.
> >
> > Second solution could be: I can store comments in different index, and on
> > search first I will execute query on comments index and use result of
> > comments search to search blog posts.
> >
> > Thanks in advance.
> >
> > - deve
> >
>

Re: solution for parent-child relationship

Posted by Erick Erickson <er...@gmail.com>.

Ya just gotta stop thinking like a database guy, man <G>. Lucene searches
lots and lots of text very well. It doesn't do joins worth a darn. The
moment
you star thinking in terms of sub-queries, you're probably starting down the
wrong track.

Here's a possibility. Index each post and all associated comments as a
single
document, with the post and all comments put in the same field. Something
like
doc.add("text", <original post text>);
doc.add("text", <first comment>);
doc.add("text", <second comment>);
add the document to the index.....

Now, all you need to do is search the "text" field and you get everything.
If
you set the position increment gap to something other than 1 (see the docs),
you can distinguish between the individual calls to doc.add(), see
getPositionIncrementGap.

In general, when indexing anything database-related, the advice is to
flatten
then data as you put it into a Lucene index. Think about making all the data
available as the result of a single query. This leads to data replication,
which
goes against all your database instincts, but....

HTH
Erick

On Wed, Dec 23, 2009 at 8:35 AM, Shahid Faiz <de...@gmail.com>wrote:

> Hi,
>
> Following are details of my problem and possible solutions which I can
> think
> of. Please suggest which should I choose, or is there any other approach
> better than these.
>
> I want to index blog posts and their comments, in my database posts and
> comments are stored in two different tables. Currently I am indexing posts
> only and search works perfectly fine, but now I want to index comments as
> well. From now on when user will search any word, all posts will be
> displayed which meet search criteria including posts and only those
> comments
> which meet the specified criteria along with posts. All paging is done on
> posts, comments are not considered while calculating page contents.
>
> e.g.
>
> POST =>          Google launched Chrome Browser
> COMMENT =>  Microsoft IE also has tab browsing.
>
> Now searching (microsoft AND IE), should also display this post with along
> with above comment.
>
> One solution is: I should index both posts+comments in one lucene index and
> search that index. But as i have to display 10 posts (not including
> comments) per page so it gets complicated while finding records to be
> displayed on any given page. I think, I can solve this problem by creating
> two queries one to search posts and one to search comments only and then
> can
> join them using OR operator.
>
> Second solution could be: I can store comments in different index, and on
> search first I will execute query on comments index and use result of
> comments search to search blog posts.
>
> Thanks in advance.
>
> - deve
>

RE: solution for parent-child relationship

Posted by "Rao, Vaijanath" <va...@corp.aol.com>.

Hi Shahid,

This is just one of the ways to get it.

You can have an id to your post and comment for every post get's an
subID so for example

Post 1 get id 2566 and comment 1 in that post get id 2566-64. You can
use various methods to get this ID.  Then you can write your own Hit
Collector in which you can collect the results and display the top-10 or
all unique posts.

--Thanks and Regards
Vaijanath N. Rao

-----Original Message-----
From: Shahid Faiz [mailto:developer.inbox@gmail.com] 
Sent: Wednesday, December 23, 2009 7:06 PM
To: java-user@lucene.apache.org
Subject: solution for parent-child relationship

Hi,

Following are details of my problem and possible solutions which I can
think of. Please suggest which should I choose, or is there any other
approach better than these.

I want to index blog posts and their comments, in my database posts and
comments are stored in two different tables. Currently I am indexing
posts only and search works perfectly fine, but now I want to index
comments as well. From now on when user will search any word, all posts
will be displayed which meet search criteria including posts and only
those comments which meet the specified criteria along with posts. All
paging is done on posts, comments are not considered while calculating
page contents.

e.g.

POST =>          Google launched Chrome Browser
COMMENT =>  Microsoft IE also has tab browsing.

Now searching (microsoft AND IE), should also display this post with
along with above comment.

One solution is: I should index both posts+comments in one lucene index
and search that index. But as i have to display 10 posts (not including
comments) per page so it gets complicated while finding records to be
displayed on any given page. I think, I can solve this problem by
creating two queries one to search posts and one to search comments only
and then can join them using OR operator.

Second solution could be: I can store comments in different index, and
on search first I will execute query on comments index and use result of
comments search to search blog posts.

Thanks in advance.

- deve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org