Posted to java-user@lucene.apache.org by Duke DAI <du...@gmail.com> on 2013/08/13 10:54:23 UTC

problem found with DiskDocValuesFormat

Hi experts,

I'm upgrading to Lucene 4.4 and trying to use DocValues instead of stored
fields for performance reasons. Because the size of the index is unknown (it
depends on the customer), I will use DiskDocValuesFormat, especially for some
binary fields. So I wrote my customized Codec:

      final Codec codec = new Lucene42Codec() {

        // In-memory format for most fields, disk-based format for the large ones.
        private final Lucene42DocValuesFormat memoryDVFormat = new Lucene42DocValuesFormat();
        private final DiskDocValuesFormat diskDVFormat = new DiskDocValuesFormat();

        @Override
        public DocValuesFormat getDocValuesFormatForField(String field) {
          if (LucenePluginConstants.INDEX_STORED_RETURNABLE_FIELD.equals(field)
              || LucenePluginConstants.PAYLOAD_FIELD_NAME.equals(field)
              || LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE.equals(field)) {
            return diskDVFormat;
          } else {
            return memoryDVFormat;
          }
        }
      };
      iwc.setCodec(codec);

Here the field LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE is a numeric
field of long type; the others are binary.

Then I consume the DocValues as in the pseudo-code below:
    NumericDocValues nodeIDDocValuesSource =
            MultiDocValues.getNumericValues(searcher.getIndexReader(),
                LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE);

   ......
   long nodeId = nodeIDDocValuesSource.get(scoreDoc.doc);

I'm sure the nodeId I get back is wrong: the upper-level logic verifies it
and treats it as data corruption.


But if I switch the long field to memoryDVFormat, then everything
is OK.

Also, for upgrading legacy data, I keep two index formats, DV or stored
field, selected by version. If I use stored fields, everything is OK.
So I guess there is a bug in DiskDocValuesFormat with the numeric data type.
Is it related to byte-aligned numeric compression?
Or am I not using DiskDocValuesFormat correctly? It seems to take no other
parameters.

Sorry that I have no pure Lucene test case yet. I hope someone can shed some
light on this.




Best regards,
Duke
If not now, when? If not me, who?

Re: problem found with DiskDocValuesFormat

Posted by Sean Bridges <se...@gmail.com>.

Thanks for the answers, and thanks for the changes to load doc values to
disk. It will be nice to use a supported codec.

Upgrading our indexes is not an option, as they are very large.

Sean


On Wed, Aug 21, 2013 at 11:15 PM, Robert Muir <rc...@gmail.com> wrote:

> On Thu, Aug 22, 2013 at 1:48 AM, Sean Bridges <se...@gmail.com>
> wrote:
> > Is there a supported DocValuesFormat that doesn't load all the values
> into
> > ram?
>
> Not with any current release, but in lucene 4.5 if all goes well, the
> official implementation will work that way (I spent essentially the
> last entire week on this and committed it yesterday).
>
> Integrating new ideas into the official format takes a good amount of
> effort: lots of documentation and testing and so on, because we have
> to live with supporting that format for a long time, all the way until
> 5.9.
>
> >
> > We can't reindex every time we upgrade lucene since our indexes are too
> > large.  Should we copy the code from DiskDocValuesFormat and call it
> > CustomDiskDocValuesFormat, and give CustomDiskDocValuesFormat a new name
> so
> > that when we upgrade lucene, we won't use an incompatible version of
> > DiskDocValuesFormat?
> >
>
> You can certainly maintain your own codec components: you can even
> name them the same thing as long as you put your .jar file first in
> the classpath (thats how SPI works: first one wins).
>
> But its not really a cure-all, its some work either way: codec APIs
> themselves change too, so you have to deal with that on upgrade (e.g.
> DocValuesProducer gets a new method in 4.5 as its now capable of
> representing missing values, and the iterators in DocValuesConsumer
> now provide null when a document is missing a value).
>
> Or, you can avoid reindexing by using addIndexes as I suggested (just
> buy a few big chips of RAM for the upgrade).

Re: problem found with DiskDocValuesFormat

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Aug 22, 2013 at 1:48 AM, Sean Bridges <se...@gmail.com> wrote:
> Is there a supported DocValuesFormat that doesn't load all the values into
> ram?

Not with any current release, but in lucene 4.5 if all goes well, the
official implementation will work that way (I spent essentially the
last entire week on this and committed it yesterday).

Integrating new ideas into the official format takes a good amount of
effort: lots of documentation and testing and so on, because we have
to live with supporting that format for a long time, all the way until
5.9.

>
> We can't reindex every time we upgrade lucene since our indexes are too
> large.  Should we copy the code from DiskDocValuesFormat and call it
> CustomDiskDocValuesFormat, and give CustomDiskDocValuesFormat a new name so
> that when we upgrade lucene, we won't use an incompatible version of
> DiskDocValuesFormat?
>

You can certainly maintain your own codec components: you can even
name them the same thing as long as you put your .jar file first in
the classpath (that's how SPI works: first one wins).

But it's not really a cure-all, it's some work either way: codec APIs
themselves change too, so you have to deal with that on upgrade (e.g.
DocValuesProducer gets a new method in 4.5 as it's now capable of
representing missing values, and the iterators in DocValuesConsumer
now provide null when a document is missing a value).

Or, you can avoid reindexing by using addIndexes as I suggested (just
buy a few big chips of RAM for the upgrade).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: problem found with DiskDocValuesFormat

Posted by Sean Bridges <se...@gmail.com>.

Is there a supported DocValuesFormat that doesn't load all the values into
ram?

Our use case is that we have 16 byte ids for all our documents.  We used
to store the ids in stored fields, and look up the stored field for each
search hit.  We got much better performance when we switched to storing our
ids in DiskDocValues rather than StoredFields, especially when we had a lot
of search hits.  We could use the Lucene42DocValuesFormat, but that loads
all the values into ram.

We can't reindex every time we upgrade lucene since our indexes are too
large.  Should we copy the code from DiskDocValuesFormat and call it
CustomDiskDocValuesFormat, and give CustomDiskDocValuesFormat a new name so
that when we upgrade lucene, we won't use an incompatible version of
DiskDocValuesFormat?

Thanks,
Sean

On Wed, Aug 21, 2013 at 8:44 AM, Robert Muir <rc...@gmail.com> wrote:

> On Wed, Aug 21, 2013 at 11:30 AM, Sean Bridges <se...@gmail.com>
> wrote:
> > What is the recommended way to use DiskDocValuesFormat in production if
> we
> > can't reindex when we upgrade?
>
> I'm not going to recommend using any experimental codecs in production,
> but...
>
> 1. with 4.3 jar file: IWC.setCodec(Codec.getDefault()) +
> IndexWriter.addIndexes(IndexReader) -> converts index to official 4.3
> format
> 2. with 4.4 jar file: IWC.setCodec(MyExperimentalCodec) +
> IndexWriter.addIndexes(IndexReader) -> converts index to customized
> codec on 4.4
>
> >
> > Will the 4.4 version of DDVF be backwards compatible
>
> no.

Re: problem found with DiskDocValuesFormat

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Aug 21, 2013 at 11:30 AM, Sean Bridges <se...@gmail.com> wrote:
> What is the recommended way to use DiskDocValuesFormat in production if we
> can't reindex when we upgrade?

I'm not going to recommend using any experimental codecs in production, but...

1. with 4.3 jar file: IWC.setCodec(Codec.getDefault()) +
IndexWriter.addIndexes(IndexReader) -> converts index to official 4.3
format
2. with 4.4 jar file: IWC.setCodec(MyExperimentalCodec) +
IndexWriter.addIndexes(IndexReader) -> converts index to customized
codec on 4.4

>
> Will the 4.4 version of DDVF be backwards compatible

no.

Re: problem found with DiskDocValuesFormat

Posted by Sean Bridges <se...@gmail.com>.

What is the recommended way to use DiskDocValuesFormat in production if we
can't reindex when we upgrade?

Will the 4.4 version of DDVF be backwards compatible, or should we make our
own copy of DDVF and give it a different codec name to protect ourselves
against incompatible changes?

Thanks,

Sean


On Tue, Aug 13, 2013 at 4:34 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> DiskDVFormat does not have index back compatibility between minor
> releases; maybe that's what you are seeing?  So, you must fully
> re-index after any DiskDVFormat field after upgrading ...
>
> Only the default formats support index back compatibility between releases.
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com

Re: problem found with DiskDocValuesFormat

Posted by Duke DAI <du...@gmail.com>.

Hi Mike,

Thanks for your quick response.

All data was newly indexed, so compatibility is not the culprit.

Could it be a multi-threading issue? I share IndexReaders between
different IndexSearchers. I have no evidence for this guess: my many
multi-threaded test cases pass, and the one that fails is not a
multi-threaded scenario for the index.


Best regards,
Duke
If not now, when? If not me, who?


On Tue, Aug 13, 2013 at 7:34 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> DiskDVFormat does not have index back compatibility between minor
> releases; maybe that's what you are seeing?  So, you must fully
> re-index after any DiskDVFormat field after upgrading ...
>
> Only the default formats support index back compatibility between releases.
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com

Re: problem found with DiskDocValuesFormat

Posted by Michael McCandless <lu...@mikemccandless.com>.

DiskDVFormat does not have index back compatibility between minor
releases; maybe that's what you are seeing?  So, you must fully
re-index any DiskDVFormat field after upgrading ...

Only the default formats support index back compatibility between releases.


Mike McCandless

http://blog.mikemccandless.com


Re: problem found with DiskDocValuesFormat

Posted by Duke DAI <du...@gmail.com>.

Thanks, Mike.

I finally figured out the root cause. I use threads from Thread-Pool-1 to
probe the indexes of multiple collections in parallel, but consume the
documents on threads from Thread-Pool-2, while holding on to the same
DocValues object reference to get the values. Once I took the thread switch
into account, the problem was resolved.
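
The pattern described above (obtain the values object on the thread that
will read it, rather than handing one instance across pools) can be sketched
in plain Java. This is only an illustration under assumptions: StatefulReader,
PerThreadReaderSketch and countWrongReads are hypothetical names, and the
class merely imitates a reader with a mutable cursor. It is not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadReaderSketch {

    /** Hypothetical stand-in for a disk-backed reader: get() is a seek plus a read. */
    static final class StatefulReader {
        private final long[] data;
        private int position; // mutable cursor, loosely analogous to a file offset

        StatefulReader(long[] data) {
            this.data = data;
        }

        long get(int docId) {
            position = docId;      // "seek"
            return data[position]; // "read": unsafe if another thread moved the cursor
        }
    }

    private final ThreadLocal<StatefulReader> perThread;

    PerThreadReaderSketch(long[] data) {
        // Each thread lazily gets its own reader over the same underlying data.
        this.perThread = ThreadLocal.withInitial(() -> new StatefulReader(data));
    }

    /** Call this on the thread that will actually read the values. */
    StatefulReader readerForCurrentThread() {
        return perThread.get();
    }

    /** Reads from several pooled threads and counts values that differ from the source data. */
    static int countWrongReads(int threads, int readsPerThread) {
        long[] data = new long[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i * 31L;
        }
        PerThreadReaderSketch source = new PerThreadReaderSketch(data);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int seed = t;
                results.add(pool.submit(() -> {
                    // Acquire the reader here, on the consuming thread.
                    StatefulReader reader = source.readerForCurrentThread();
                    int wrong = 0;
                    for (int i = 0; i < readsPerThread; i++) {
                        int doc = (seed * 7 + i) % data.length;
                        if (reader.get(doc) != data[doc]) {
                            wrong++;
                        }
                    }
                    return wrong;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every task fetches its reader inside the task body, no two threads
ever share a cursor. Fetching the reader once on one pool and passing that
reference to tasks on another pool would reintroduce the sharing.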

Thank you guys for building this feature into lucene-core.jar; it dispels
my worry about compatibility when using lucene-codecs.jar.

Best regards,
Duke
If not now, when? If not me, who?


On Tue, Oct 22, 2013 at 12:48 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> It's perfectly fine, and recommended, to reuse a thread across
> different queries (ie, use a thread pool in your app, up above
> Lucene).
>
> The ThreadLocals used in SegmentCoreReaders should not interfere or
> cause problems with that: they can easily be re-used across queries.
>
> Maybe you can boil down the issue you are seeing into a small test case?
>
> Mike McCandless
>
> http://blog.mikemccandless.com

Re: problem found with DiskDocValuesFormat

Posted by Michael McCandless <lu...@mikemccandless.com>.

It's perfectly fine, and recommended, to reuse a thread across
different queries (ie, use a thread pool in your app, up above
Lucene).

The ThreadLocals used in SegmentCoreReaders should not interfere or
cause problems with that: they can easily be re-used across queries.
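
A minimal plain-Java sketch of why thread reuse and a ThreadLocal cache get
along: a pooled thread that comes back for a second task simply sees the
instance it cached the first time, while a different thread gets its own.
The names ThreadLocalReuseSketch and CachedValues are hypothetical and only
imitate the idea; this is not the SegmentCoreReaders code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalReuseSketch {

    /** Hypothetical per-thread object; stands in for a cached doc values instance. */
    static final class CachedValues {}

    private static final ThreadLocal<CachedValues> LOCAL =
            ThreadLocal.withInitial(CachedValues::new);

    /** Fetches the calling pool thread's cached instance. */
    private static CachedValues fetchOn(ExecutorService pool)
            throws InterruptedException, ExecutionException {
        Callable<CachedValues> task = LOCAL::get;
        return pool.submit(task).get();
    }

    /** Two unrelated tasks on one pooled thread see the same cached instance. */
    static boolean pooledThreadReusesInstance() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return fetchOn(pool) == fetchOn(pool);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** The pool thread and the calling thread each hold a separate instance. */
    static boolean otherThreadGetsOwnInstance() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return fetchOn(pool) != LOCAL.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Reuse across queries is harmless here because the cached object is only ever
touched by the thread that owns it.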

Maybe you can boil down the issue you are seeing into a small test case?

Mike McCandless

http://blog.mikemccandless.com


On Mon, Oct 21, 2013 at 10:35 AM, Duke DAI <du...@gmail.com> wrote:
> Hi Mike,
>
> My scenario, query thread from a ThreadPool will be used to execute query.
> So thread must have to be reused to handle various queries. Now that
> SegmentCoreReaders
> uses ThreadLocal to hold per-thread instance, I think some private
> variables must belong to the given thread(file offset? I didn't find any
> other thread-dependent status), otherwise object-level instance is enough.
> And ThreadPool is very common to facilitate heavy load queries, does the
> ThreadLocal mechanism support thread reuse for different queries? You know,
> either thread creation is heavy or ThreadLocal cleanup from outside is
> complicated.
> My test shows NumericDocValues will return wrong value, but sure that it's
> a long value, upper logic can verify whether the value is valid or not.
>
> As I described in earlier mail, in Lucene4.4 Lucene42DocValuesFormat(in-memory)
> has no problem, DiskDocValuesFormat(in-disk) has problem. Now in
> Lucene4.5, MemoryDocValuesFormat(in-memory)
> has no problem, but Lucene45DocValuesFormat(in-disk) has problem.
> Coincidency? My test is far more complex than I described, two ThreadPool,
> one is used to handle main query, one is used to query sub collections
> parallelly with proper RejectedExecutionHandler(now one sub rejected,
> cancel and fail all subs).
>
> For simple, what's the private status of per-thread NumericDocValues
> instance? The private status can be re-used for different queries?
>
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>
>
> On Mon, Oct 21, 2013 at 7:26 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Can you describe what problem you are actually hitting?
>>
>> The purpose of docValuesLocal is to hold the per-Thread instance of
>> each doc values, and re-use it when that thread comes back again
>> asking for the same doc values.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Oct 21, 2013 at 6:28 AM, Duke DAI <du...@gmail.com> wrote:
>> > Hi guys,
>> >
>> > Seems I have the same problem with Lucene45DocValuesFormat, no problem
>> with
>> > MemoryDocValuesFormat. The problem I encountered with Lucene4.4 is with
>> > DiskDocValuesFormat, no with Lucene42DocValuesFormat.
>> >
>> > I dig into a little and found the superficial cause. In
>> SegmentCoreReaders,
>> > there is a ThreadLocal variable, docValuesLocal. Its purpose is avoid
>> > building data structure repeatedly by query thread . But how about the
>> > query thread is from thread pool, and reused for different query?
>> > I removed docValuesLocal and built a lucene-core.jar, it works with my
>> > multi-threads(thread pool) test cases.
>> >
>> > Do you have any idea about this? Information is enough?
>> >
>> >
>> > Thanks,
>> > Duke
>> >
>> >
>> > Best regards,
>> > Duke
>> > If not now, when? If not me, who?

Re: problem found with DiskDocValuesFormat

Posted by Duke DAI <du...@gmail.com>.

Hi Mike,

My scenario: queries are executed by threads taken from a thread pool, so
each thread is necessarily reused for many different queries. Since
SegmentCoreReaders uses a ThreadLocal to hold a per-thread instance, I assume
that instance carries some state private to the thread (a file offset? I
didn't find any other thread-dependent state); otherwise one object-level
instance would be enough.
Thread pools are very common for handling heavy query load, so does the
ThreadLocal mechanism support reusing a thread for different queries?
Creating a thread per query is expensive, and cleaning up the ThreadLocal
from outside is complicated.
In my test, NumericDocValues returns a wrong value. It is certainly a long
value, and my upper-level logic can verify whether it is valid or not.

As I described in my earlier mail, in Lucene 4.4 Lucene42DocValuesFormat
(in-memory) has no problem, while DiskDocValuesFormat (on-disk) does. Now in
Lucene 4.5, MemoryDocValuesFormat (in-memory) has no problem, but
Lucene45DocValuesFormat (on-disk) does. Coincidence? My test is far more
complex than I described: there are two thread pools, one to handle the main
query and one to query sub-collections in parallel, with a proper
RejectedExecutionHandler (currently, if one sub-query is rejected, all
sub-queries are cancelled and failed).

Put simply: what private state does a per-thread NumericDocValues instance
hold? Can that state be reused across different queries?
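
To make the question concrete, here is a minimal plain-Java sketch of the
thread reuse I mean. It uses no Lucene classes: CachedReader, LOCAL and
positionSeenBySecondTask are hypothetical names, and the "position" field is
only my guess at the kind of state a per-thread instance might hold. It shows
only that whatever one task leaves in a ThreadLocal is still there for the
next task on the same pooled thread; it does not reproduce the wrong values
I see.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PooledThreadLocalDemo {
    // Hypothetical stand-in for a per-thread cached instance; not a Lucene class.
    static final class CachedReader {
        long position; // mutable per-thread state, e.g. a file offset (assumption)
    }

    static final ThreadLocal<CachedReader> LOCAL =
            ThreadLocal.withInitial(CachedReader::new);

    // Runs two tasks ("queries") back to back on one pooled thread and
    // returns the position that the second task observes.
    static long positionSeenBySecondTask() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // First "query" moves the per-thread state.
            pool.submit(() -> { LOCAL.get().position = 42L; }).get();
            // Second "query" runs on the same thread and reads that state.
            return pool.submit(() -> LOCAL.get().position).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(positionSeenBySecondTask()); // prints 42
    }
}
```

So any code that relies on the per-thread instance starting from a fresh
state for each query would be affected by a pool, which is why I am asking
what state the instance keeps.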


Best regards,
Duke
If not now, when? If not me, who?


On Mon, Oct 21, 2013 at 7:26 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Can you describe what problem you are actually hitting?
>
> The purpose of docValuesLocal is to hold the per-Thread instance of
> each doc values, and re-use it when that thread comes back again
> asking for the same doc values.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

Re: problem found with DiskDocValuesFormat

Posted by Michael McCandless <lu...@mikemccandless.com>.

Can you describe what problem you are actually hitting?

The purpose of docValuesLocal is to hold the per-Thread instance of
each doc values, and re-use it when that thread comes back again
asking for the same doc values.
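
As a rough illustration of that per-thread caching pattern (a simplified
sketch only, not the actual SegmentCoreReaders code; PerThreadCacheSketch and
FieldValues are made-up names):

```java
import java.util.HashMap;
import java.util.Map;

class PerThreadCacheSketch {
    // Hypothetical placeholder for a doc-values instance; not a Lucene class.
    static final class FieldValues {
        final String field;
        FieldValues(String field) { this.field = field; }
    }

    // Each thread lazily gets its own map from field name to instance.
    private final ThreadLocal<Map<String, FieldValues>> local =
            ThreadLocal.withInitial(HashMap::new);

    // Returns the calling thread's instance for the field, creating it on first use.
    FieldValues get(String field) {
        return local.get().computeIfAbsent(field, FieldValues::new);
    }

    // The same thread asking twice gets the same instance back.
    static boolean sameThreadReuses() {
        PerThreadCacheSketch cache = new PerThreadCacheSketch();
        return cache.get("nodeId") == cache.get("nodeId");
    }

    // A different thread gets its own, separate instance.
    static boolean otherThreadGetsOwnInstance() {
        PerThreadCacheSketch cache = new PerThreadCacheSketch();
        FieldValues mine = cache.get("nodeId");
        FieldValues[] theirs = new FieldValues[1];
        Thread t = new Thread(() -> theirs[0] = cache.get("nodeId"));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return theirs[0] != null && theirs[0] != mine;
    }

    public static void main(String[] args) {
        System.out.println(sameThreadReuses());           // prints true
        System.out.println(otherThreadGetsOwnInstance()); // prints true
    }
}
```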

Mike McCandless

http://blog.mikemccandless.com


On Mon, Oct 21, 2013 at 6:28 AM, Duke DAI <du...@gmail.com> wrote:
> Hi guys,
>
> Seems I have the same problem with Lucene45DocValuesFormat, no problem with
> MemoryDocValuesFormat. The problem I encountered with Lucene4.4 is with
> DiskDocValuesFormat, no with Lucene42DocValuesFormat.
>
> I dig into a little and found the superficial cause. In SegmentCoreReaders,
> there is a ThreadLocal variable, docValuesLocal. Its purpose is avoid
> building data structure repeatedly by query thread . But how about the
> query thread is from thread pool, and reused for different query?
> I removed docValuesLocal and built a lucene-core.jar, it works with my
> multi-threads(thread pool) test cases.
>
> Do you have any idea about this? Information is enough?
>
>
> Thanks,
> Duke
>
>
> Best regards,
> Duke
> If not now, when? If not me, who?

Re: problem found with DiskDocValuesFormat

Posted by Duke DAI <du...@gmail.com>.

Hi guys,

It seems I have the same problem with Lucene45DocValuesFormat, and no problem
with MemoryDocValuesFormat. The problem I encountered with Lucene 4.4 was with
DiskDocValuesFormat, not with Lucene42DocValuesFormat.

I dug into it a little and found the superficial cause. In SegmentCoreReaders
there is a ThreadLocal variable, docValuesLocal. Its purpose is to avoid
building the data structure repeatedly in a query thread. But what if the
query thread comes from a thread pool and is reused for different queries?
I removed docValuesLocal and built a lucene-core.jar, and it works with my
multi-threaded (thread pool) test cases.

Do you have any idea about this? Is this information enough?


Thanks,
Duke


Best regards,
Duke
If not now, when? If not me, who?


On Tue, Aug 13, 2013 at 4:54 PM, Duke DAI <du...@gmail.com> wrote:

> Hi experts,
>
> I'm upgrading Lucene 4.4 and trying to use DocValues instead of store
> field for performance reason. But due to unknown size of index(depends on
> customer), so I will use DiskDocValuesFormat, especially for some binary
> field. Then I wrote my customized Codec:
>
>       final Codec codec = new Lucene42Codec() {
>
>         private final Lucene42DocValuesFormat memoryDVFormat = new
> Lucene42DocValuesFormat();
>         private final DiskDocValuesFormat diskDVFormat = new
> DiskDocValuesFormat();
>
>         @Override
>         public DocValuesFormat getDocValuesFormatForField(String field) {
>           if
> (LucenePluginConstants.INDEX_STORED_RETURNABLE_FIELD.equals(field)
>               || LucenePluginConstants.PAYLOAD_FIELD_NAME.equals(field) ||
> LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE.equals(field)) {
>             return diskDVFormat;
>           } else {
>             return memoryDVFormat
>           }
>         }
>       };
>       iwc.setCodec(codec);
>
> Here field LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE is numeric field,
> long type. And others are binary.
>
> Then I consume DV like below pseudo-code:
>     nodeIDDocValuesSource =
>             MultiDocValues.getNumericValues(searcher.getIndexReader(),
>                 LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE);
>
>    ......
>    long nodeId= nodeIDDocValuesSource.get(scoreDoc.doc);
>
> Then I'm sure I get a wrong nodeId, which will be verified by upper logic
> and treated as data corruption.
>
>
> But if I change to memoryDVFormat for the long type field, then everything
> is OK.
>
> Also for upgrading legacy data, I keep two index format, DV or stored
> field, controlled by version. If I use stored field, everything is OK.
> So I guess there is a bug with  DiskDocValuesFormat, numeric data type,
> does it relate to byte-aligned numeric compression?
> Or I didn't use DiskDocValuesFormat correctly? Seems no other parameters
> for it.
>
> Sorry that I have no pure Lucene test case yet. Hope someone shed some
> light on this.
>
>
>
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>