
Posted to java-user@lucene.apache.org by "snehal.chennuru" <sn...@gmail.com> on 2012/07/23 01:17:07 UTC

Usage of NoMergePolicy and its potential implications

Hello Everyone,

We have a legacy system that uses Lucene 2.4.1. Back then we patched the
Lucene source so that the segment merger would not reuse deleted docids,
which let us use Lucene docids as persistent DB ids as well. We now want to
upgrade to Lucene 3.6, but it is nearly impossible to "hack" Lucene to get
the same behavior.

I looked at NoMergePolicy, and it seems to give similar behavior by never
letting Lucene reuse deleted docids. I assume it would also increase the
number of segments in the index. Any idea how many segments we are talking
about here? Also, can we configure how many documents Lucene keeps in a
given segment? Each Lucene index in this system can hold at most 1M
documents. Is there an alternative that I can consider?

Thanks,
Snehal
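
A rough way to reason about the segment-count question, as a sketch only: with merging disabled, every flush leaves one segment behind, so the count is roughly total docs divided by docs per flush. The class and method names below are made up for illustration, and the files-per-segment figure is an assumed input, not a Lucene constant. In Lucene 3.x the count-based flush trigger is set with IndexWriterConfig.setMaxBufferedDocs; a commit or close also flushes whatever is buffered, so real counts will tend to be higher than this estimate.

```java
// Back-of-the-envelope estimate; this does not call Lucene at all.
final class SegmentEstimate {

    /** Segments left behind if every flush holds docsPerFlush docs and nothing is ever merged. */
    static long segmentsWithoutMerging(long totalDocs, long docsPerFlush) {
        if (totalDocs < 0 || docsPerFlush <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and docsPerFlush > 0");
        }
        return (totalDocs + docsPerFlush - 1) / docsPerFlush; // ceiling division
    }

    /** Files a reader holds open, given an assumed number of files per segment. */
    static long openFiles(long segments, int filesPerSegment) {
        return segments * filesPerSegment;
    }

    public static void main(String[] args) {
        long segments = segmentsWithoutMerging(1_000_000L, 1_000L);
        System.out.println(segments + " segments");
        // 8 files per segment is an assumption for illustration only.
        System.out.println(openFiles(segments, 8) + " open files");
    }
}
```

Under these assumptions a 1M-document index flushed every 1,000 documents ends up with about 1,000 segments, which is where the file descriptor concern raised in the replies comes from.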



--
View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-NoMergePolicy-and-its-potential-implications-tp3996630.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Usage of NoMergePolicy and its potential implications

Posted by Trejkaz <tr...@trypticon.org>.

On Thu, Jul 26, 2012 at 5:38 AM, Simon Willnauer
<si...@gmail.com> wrote:
> You really shouldn't do that! If you use Lucene as a primary key
> generator, why don't you build your own on top? Just add one layer that
> accepts the document and returns the PID, and internally puts it in an
> ID field. Using no merge policy is not a good idea either: you will
> very likely hit system limits (number of file descriptors) and suffer
> from bad search performance and low compression.
>
> I think you should really consider fixing your app instead of hacking Lucene.

I can understand how they would end up in this situation since we
ended up in it as well.

We tried using our own ID (which we still have in Lucene and still use
for other purposes), and it slows down some things.

For example, when building bit sets for filters based on the external
database, now you have to look up every ID you get back. Because you
don't know if the last row returned from the query might be Lucene doc
ID 0, you can't build the filter at all unless you process every row
returned from the query.

If you had a million docs returned by the SQL query, you had to do a
million term lookups in Lucene. We didn't have enough memory to store
the mapping from our ID back to Lucene's (OOME as soon as you tried to
make a map to look things up faster), which made it impossible to
cache the information at the time. I'm not sure if it's getting easier
or harder - memory sizes are increasing but the number of docs people
are putting into the indexes is increasing as well.
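
To make the cost concrete, here is a minimal sketch of building such a filter. It is not our actual code: the class name and the array-backed mapping are assumptions for illustration, and the binary search stands in for the per-ID term lookup in Lucene. Two sorted primitive arrays cost about 12 bytes per document (8 for the long key, 4 for the int doc ID), so roughly 12 MB for a million docs, which is far less than a boxed map but still has to be rebuilt whenever doc IDs change.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: maps external (database) IDs to index doc IDs.
final class ExternalIdFilter {
    private final long[] sortedExternalIds; // ascending order
    private final int[] docIds;             // docIds[i] belongs to sortedExternalIds[i]

    ExternalIdFilter(long[] sortedExternalIds, int[] docIds) {
        if (sortedExternalIds.length != docIds.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.sortedExternalIds = sortedExternalIds;
        this.docIds = docIds;
    }

    /** Returns the doc ID for an external ID, or -1 if it is unknown. */
    int lookup(long externalId) {
        int pos = Arrays.binarySearch(sortedExternalIds, externalId);
        return pos >= 0 ? docIds[pos] : -1;
    }

    /** One lookup per row: the bits land at arbitrary positions, so no row can be skipped. */
    BitSet build(long[] rowsFromQuery, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (long externalId : rowsFromQuery) {
            int doc = lookup(externalId);
            if (doc >= 0) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

When the doc ID itself is the database key, the whole lookup step disappears and each row sets its bit directly, which is why the approach is tempting.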

At the time, Lucene developers were adamant that we shouldn't be using
the doc ID, because deleted doc IDs eventually get reused (or rather,
all the IDs shift downwards). But since we never physically delete
docs (we want a history of item modification including deletion, so
doing that would be undesirable anyway), it was never a problem until
the new merging came along.
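
The downward shift can be shown with a toy model. This is a sketch of the renumbering idea only, with a made-up class name, and not Lucene's merger: surviving documents are numbered consecutively, so everything after a dropped document moves down.

```java
// Toy model of doc ID renumbering when deleted docs are dropped during a merge.
final class DocIdShift {

    /** For each old doc ID, returns the new doc ID after compaction, or -1 if the doc was deleted. */
    static int[] remapAfterMerge(boolean[] deleted) {
        int[] newIds = new int[deleted.length];
        int next = 0; // next free doc ID in the merged segment
        for (int oldId = 0; oldId < deleted.length; oldId++) {
            newIds[oldId] = deleted[oldId] ? -1 : next++;
        }
        return newIds;
    }

    public static void main(String[] args) {
        // Doc 1 is deleted, so docs 2 and 3 become 1 and 2.
        int[] remapped = remapAfterMerge(new boolean[] {false, true, false, false});
        System.out.println(java.util.Arrays.toString(remapped));
    }
}
```

With no deletions the mapping is the identity within a segment, which matches why it worked for us as long as nothing was physically removed.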

I guess while the doc ID is still available, people will continue to
use it. If it disappeared from the API completely, this would be good
encouragement to migrate to a different approach. :)

TX


Re: Usage of NoMergePolicy and its potential implications

Posted by Simon Willnauer <si...@gmail.com>.

On Mon, Jul 23, 2012 at 7:00 PM, snehal.chennuru <sn...@gmail.com> wrote:
> Thanks for the heads up Ian. I know it is highly discouraged. But, like I
> said, it is a legacy application and it is very hard to go back and re-do
> it.

You really shouldn't do that! If you use Lucene as a primary key
generator, why don't you build your own on top? Just add one layer that
accepts the document and returns the PID, and internally puts it in an
ID field. Using no merge policy is not a good idea either: you will
very likely hit system limits (number of file descriptors) and suffer
from bad search performance and low compression.

I think you should really consider fixing your app instead of hacking Lucene.

simon
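
A minimal sketch of such a layer, under stated assumptions: the class name, the Indexer interface, and the "id" field name are all hypothetical, and documents are modeled as plain string maps instead of Lucene Document objects. The Indexer stands in for whatever code hands the document to the IndexWriter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper that assigns persistent IDs independently of Lucene doc IDs.
final class PersistentIdLayer {

    /** Stand-in for the code that actually adds the document to the index. */
    interface Indexer {
        void add(Map<String, String> fields);
    }

    private final AtomicLong nextId;
    private final Indexer indexer;

    PersistentIdLayer(long firstFreeId, Indexer indexer) {
        this.nextId = new AtomicLong(firstFreeId);
        this.indexer = indexer;
    }

    /** Accepts a document, stores the new persistent ID in an "id" field, and returns that ID. */
    long add(Map<String, String> fields) {
        long pid = nextId.getAndIncrement();
        Map<String, String> withId = new LinkedHashMap<>(fields); // leave the caller's map untouched
        withId.put("id", Long.toString(pid));
        indexer.add(withId);
        return pid;
    }
}
```

The counter has to survive restarts, for example by persisting the last issued value alongside the index; that part is left out here.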


Re: Usage of NoMergePolicy and its potential implications

Posted by "snehal.chennuru" <sn...@gmail.com>.

Thanks for the heads up Ian. I know it is highly discouraged. But, like I
said, it is a legacy application and it is very hard to go back and re-do
it.  





Re: Usage of NoMergePolicy and its potential implications

Posted by Ian Lea <ia...@gmail.com>.

I can't answer your questions, but use of lucene's document ids as
persistent ids is strongly discouraged, particularly in version 4.x
where I think it just won't work at all.  There was a related thread a
couple of weeks ago.  See Uwe's message at
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201207.mbox/%3C033f01cd606a$1ad94960$508bdc20$%40thetaphi.de%3E
where he says "To uniquely identify documents later you *have* to use
a own key field."


--
Ian.



Re: Usage of NoMergePolicy and its potential implications

Posted by Snehal Chennuru <sn...@gmail.com>.

Thanks Simon and Trejkaz! You are right, we used this approach exactly for
the reasons you mentioned above. As we have to support old lucene indices,
we are not sure what approach to take. 



