Posted to dev@lucene.apache.org by Mikhail Khludnev <mk...@griddynamics.com> on 2014/01/07 19:09:35 UTC

Re: Iterating BinaryDocValues

Joel,

I tried to hack it straightforwardly, but found no free gain there. The
only thing I can suggest is to try reusing the bytes in
https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401
Right now it allocates bytes on every call, which, besides the GC cost, can
also hurt memory access locality. Could you try fixing that memory waste and
repeat the performance test?
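
To make the suggestion concrete, here is a minimal sketch of the two read
paths. It does not use the Lucene API: ToyBinaryColumn, getFresh(),
getReused() and scratch() are hypothetical names invented for illustration,
and the real producer code may be organized differently. getFresh() allocates
a new array per lookup, while getReused() copies into one buffer owned by the
reader.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReuseBufferSketch {
    /** Toy column (hypothetical): values are concatenated; offsets[i]..offsets[i+1] delimit doc i. */
    static final class ToyBinaryColumn {
        private final byte[] data;
        private final int[] offsets;
        // One scratch buffer owned by this reader; deliberately not thread-safe.
        private byte[] scratch = new byte[0];

        ToyBinaryColumn(String... values) {
            offsets = new int[values.length + 1];
            byte[][] encoded = new byte[values.length][];
            for (int i = 0; i < values.length; i++) {
                encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
                offsets[i + 1] = offsets[i] + encoded[i].length;
            }
            data = new byte[offsets[values.length]];
            for (int i = 0; i < values.length; i++) {
                System.arraycopy(encoded[i], 0, data, offsets[i], encoded[i].length);
            }
        }

        /** Allocates a fresh array per call: safe to keep, but creates garbage. */
        byte[] getFresh(int doc) {
            return Arrays.copyOfRange(data, offsets[doc], offsets[doc + 1]);
        }

        /** Copies into the shared scratch buffer and returns the value length. */
        int getReused(int doc) {
            int len = offsets[doc + 1] - offsets[doc];
            if (scratch.length < len) {
                scratch = new byte[len]; // grows only when a longer value shows up
            }
            System.arraycopy(data, offsets[doc], scratch, 0, len);
            return len;
        }

        /** The scratch buffer; only the first getReused() bytes are meaningful. */
        byte[] scratch() {
            return scratch;
        }
    }

    public static void main(String[] args) {
        ToyBinaryColumn col = new ToyBinaryColumn("a", "bcd", "");
        int len = col.getReused(1);
        System.out.println(new String(col.scratch(), 0, len, StandardCharsets.UTF_8));
        System.out.println(new String(col.getFresh(1), StandardCharsets.UTF_8));
    }
}
```

In this sketch the reused bytes are valid only until the next getReused()
call, so a caller that wants to keep a value has to copy it.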

Have a good hack!

On Mon, Dec 23, 2013 at 9:51 PM, Joel Bernstein <jo...@gmail.com> wrote:

>
> Hi,
>
> I'm looking for a faster way to perform large scale docId -> bytesRef
> lookups for BinaryDocValues.
>
> I'm finding that I can't get the performance that I need from the random
> access seek in the BinaryDocValues interface.
>
> I'm wondering if sequentially scanning the docValues would be a faster
> approach. I have a BitSet of matching docs, so if I sequentially moved
> through the docValues I could test each one against that bitset.
>
> Wondering if that approach would be faster for bulk extracts and how
> tricky it would be to add an iterator to the BinaryDocValues interface?
>
> Thanks,
> Joel
>

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Iterating BinaryDocValues

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

> I don't think we should add such a method. Doc values are commonly
> read from collectors, so why do we need a method that works on top of
> a DocIdSetIterator?
In collect(docnum), collectors can access doc values via the existing
get(), but that is not the most efficient way to access them. Let me refer
to Shai's talk in Dublin. If you need to calculate DV facets for a few fields
(my assumption), the best way is to store a bitset, then loop over the DV
fields and, for every field, scan its column by looping over that docset,
rather than looping over docs. Solr does it the same way: the old
UnInvertedField for sure, and the recent DVFacets too, I guess. So such a
column-scan method makes sense from a performance standpoint.
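
A rough sketch of that field-by-field scan, assuming each field can be modeled
as a plain array indexed by docId. ColumnScanSketch, countColumn() and
facetByColumn() are hypothetical names, not Solr or Lucene APIs; only
java.util.BitSet and the collection classes are real.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class ColumnScanSketch {
    /** Counts value occurrences in one column, visiting only the matching docs in docId order. */
    static Map<String, Integer> countColumn(String[] column, BitSet matching) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            counts.merge(column[doc], 1, Integer::sum);
        }
        return counts;
    }

    /** Field-major loop: finish scanning one column before touching the next one. */
    static Map<String, Map<String, Integer>> facetByColumn(Map<String, String[]> fields,
                                                           BitSet matching) {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (Map.Entry<String, String[]> field : fields.entrySet()) {
            result.put(field.getKey(), countColumn(field.getValue(), matching));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String[]> fields = new HashMap<>();
        fields.put("color", new String[] {"red", "blue", "red", "green"});
        fields.put("size", new String[] {"s", "m", "l", "m"});

        BitSet matching = new BitSet();
        matching.set(0);
        matching.set(2);
        matching.set(3);

        System.out.println(facetByColumn(fields, matching));
    }
}
```

The docset is walked once per field, so each column is read in increasing
docId order instead of jumping between columns for every document.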

> I'm also curious how specialized implementations
> could make this method faster than the default implementation?
It can own the bytes[] and write into it (without ThreadLocal). Let me send
you a rough sketch later.

btw,
I repeated my micro benchmark with the random() call removed from the loop.

..
reuse: true took:117944 ms
...
reuse: false took:131507 ms

It seems the gain from reusing bytes[] is about 10% (I understand that it's
an incorrect impl; let's think about how to improve it). Let me attach a
screenshot of the samples. It's an mmap directory and the 4.5 codec.

Joel, this answers your concern: seek is not hot at all.

I'm attaching the test and the impl diff in case you want to reproduce the
measurement.
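
Why naive reuse is incorrect can be illustrated with a toy example. As the
code comment quoted further down this thread notes, some consumers assume
they own the returned bytes. The sketch below is a simplified model under
that assumption, not Lucene code: Slice, ReusingReader, collectWithoutCopy()
and collectWithCopy() are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OwnershipSketch {
    /** Hypothetical view over a byte buffer: only bytes[0..length) are valid. */
    static final class Slice {
        byte[] bytes = new byte[0];
        int length;
    }

    /** Hypothetical reader that returns the same Slice instance on every call. */
    static final class ReusingReader {
        private final byte[][] values;
        private final Slice shared = new Slice();

        ReusingReader(String... values) {
            this.values = new byte[values.length][];
            for (int i = 0; i < values.length; i++) {
                this.values[i] = values[i].getBytes(StandardCharsets.UTF_8);
            }
        }

        int size() {
            return values.length;
        }

        Slice get(int doc) {
            byte[] v = values[doc];
            if (shared.bytes.length < v.length) {
                shared.bytes = new byte[v.length];
            }
            System.arraycopy(v, 0, shared.bytes, 0, v.length);
            shared.length = v.length;
            return shared; // contents are overwritten by the next get()
        }
    }

    /** Buggy consumer: keeps references, so every entry aliases the last value read. */
    static List<String> collectWithoutCopy(ReusingReader reader) {
        List<Slice> kept = new ArrayList<>();
        for (int doc = 0; doc < reader.size(); doc++) {
            kept.add(reader.get(doc));
        }
        List<String> out = new ArrayList<>();
        for (Slice s : kept) {
            out.add(new String(s.bytes, 0, s.length, StandardCharsets.UTF_8));
        }
        return out;
    }

    /** Correct consumer: copies the bytes it wants to keep before the next get(). */
    static List<String> collectWithCopy(ReusingReader reader) {
        List<byte[]> kept = new ArrayList<>();
        for (int doc = 0; doc < reader.size(); doc++) {
            Slice s = reader.get(doc);
            kept.add(Arrays.copyOf(s.bytes, s.length));
        }
        List<String> out = new ArrayList<>();
        for (byte[] b : kept) {
            out.add(new String(b, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        ReusingReader reader = new ReusingReader("a", "b", "c");
        System.out.println(collectWithoutCopy(reader)); // prints [c, c, c]
        System.out.println(collectWithCopy(reader));    // prints [a, b, c]
    }
}
```

A consumer that only looks at each value once, inside the loop, needs no
copy at all, which is where the reuse gain would come from.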



On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <jp...@gmail.com> wrote:

> I don't think we should add such a method. Doc values are commonly
> read from collectors, so why do we need a method that works on top of
> a DocIdSetIterator? I'm also curious how specialized implementations
> could make this method faster than the default implementation?
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Iterating BinaryDocValues

Posted by Joel Bernstein <jo...@gmail.com>.

Bulk extracting full unsorted result sets from Solr. You give Solr a query
and it dumps the full result in a single call. The result set streaming is
in place, but the throughput is not as good as I would like.

Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <rc...@gmail.com> wrote:

> what are you doing with the data?
>
>
> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <jo...@gmail.com> wrote:
>
>> I'll provide a little more context. I'm working on bulk extracting
>> BinaryDocValues. My initial performance test was with in-memory
>> binaryDocValues, but I think the end game is actually disk-based
>> binaryDocValues.
>>
>> I was able to perform around 1 million docId->BytesRef lookups per-second
>> with in-memory BinaryDocValues. Since I need to get the values for multiple
>> fields for each document, this bogs down pretty quickly.
>>
>> I'm wondering if there is a way to increase this throughput. Since
>> filling a BytesRef is pretty fast, I was assuming it was the seek that was
>> taking the time, but I didn't verify this. The first thing that came to
>> mind is iterating the docValues in such a way that the next docValue could
>> be loaded without a seek. But I haven't dug into how the BinaryDocValues
>> are formatted so I'm not sure if this would help or not. Also there could
>> be something else besides the seek that is limiting the throughput.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <rc...@gmail.com> wrote:
>>
>>> Yeah, i dont think its from newer docvalues-using code like yours shai.
>>>
>>> instead the problems i had doing this are historical, because e.g.
>>> fieldcache pointed to large arrays and consumers were lazy about it,
>>> knowing that their reference pointed to bytes that would remain valid
>>> across invocations.
>>>
>>> we just have to remove these assumptions. I don't apologize for not
>>> doing this, as you show, its some small % improvement (which we should go
>>> and get back!), but i went with safety first initially rather than bugs.
>>>
>>>
>>>
>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <se...@gmail.com> wrote:
>>>
>>>> I agree with Robert. We should leave cloning BytesRefs to whoever needs
>>>> that, and not penalize everyone else who doesn't need it. I must say I didn't
>>>> know I can "own" those BytesRefs and I clone them whenever I need to. I
>>>> think I was bitten by one of the other APIs, so I assumed returned
>>>> BytesRefs are not "mine" across all the APIs.
>>>>
>>>> Shai
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <rc...@gmail.com> wrote:
>>>>
>>>>> the problem is really simpler to solve actually.
>>>>>
>>>>> Look at the comments in the code, it tells you why it is this way:
>>>>>
>>>>>           // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
>>>>>           // assume "they" own the bytes after calling this!
>>>>>
>>>>> That is what we should fix. There is no need to make bulk APIs or even
>>>>> change the public api in any way (other than javadocs).
>>>>>
>>>>> We just move the clone'ing out of the codec, and require the consumer
>>>>> to do it, same as termsenum or other apis. The codec part is extremely
>>>>> simple here, its even the way i had it initially.
>>>>>
>>>>> But at the time (and even still now) this comes with some risk of
>>>>> bugs. So initially I removed the reuse and went with a more conservative
>>>>> approach to start with.
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <
>>>>> mkhludnev@griddynamics.com> wrote:
>>>>>
>>>>>> Adrien,
>>>>>>
>>>>>> Please find the bulkGet() sketch. It's an ugly copy-paste that just
>>>>>> reuses the BytesRef, which gives a 10% gain.
>>>>>> ...
>>>>>> bulkGet took:101630 ms
>>>>>> ...
>>>>>> get took:114422 ms
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <jp...@gmail.com> wrote:
>>>>>>
>>>>>>> I don't think we should add such a method. Doc values are commonly
>>>>>>> read from collectors, so why do we need a method that works on top of
>>>>>>> a DocIdSetIterator? I'm also curious how specialized implementations
>>>>>>> could make this method faster than the default implementation?
>>>>>>>
>>>>>>> --
>>>>>>> Adrien
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>> Principal Engineer,
>>>>>> Grid Dynamics
>>>>>>
>>>>>> <http://www.griddynamics.com>
>>>>>>  <mk...@griddynamics.com>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>