
Posted to user@accumulo.apache.org by Ara Ebrahimi <ar...@argyledata.com> on 2015/10/29 20:30:27 UTC

pre-sorting row keys vs not pre-sorting row keys

Hi,

We just did a simple test:

- insert a batch of 10k rows in unsorted order
- sort the same 10k-row batch by row key and insert it

So basically the batch writer in the first test has items in non-sorted order and in the second one in sorted order. We noticed 50% better performance in the sorted version! Why is that the case? Is this something we need to consider doing for live ingest scenarios?

Thanks,
Ara.



________________________________

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Thank you in advance for your cooperation.

________________________________

Re: pre-sorting row keys vs not pre-sorting row keys

Posted by Ara Ebrahimi <ar...@argyledata.com>.

10k rows of 40 columns each. The table has 9 tablets in total, spread over 9 nodes (1 tablet per node).

Ara.

> On Oct 29, 2015, at 12:51 PM, Christopher <ct...@apache.org> wrote:
>
> How many tablets were these batches going to?
>
> How much were the column updates spread across mutations? 1 mutation
> per update? or grouped by row?
>
> 10k also seems like a very small number. I'd be curious to know where
> the error bars are around that 50% value.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii

Re: pre-sorting row keys vs not pre-sorting row keys

Posted by Christopher <ct...@apache.org>.

How many tablets were these batches going to?

How much were the column updates spread across mutations? 1 mutation
per update? or grouped by row?

10k also seems like a very small number. I'd be curious to know where
the error bars are around that 50% value.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: pre-sorting row keys vs not pre-sorting row keys

Posted by Keith Turner <ke...@deenlo.com>.

I think the batch writer does sort mutations to bin them by tablet.

Did you account for JIT compilation in your testing?  If one part of the
test ran after the JIT had compiled the hot code, it would be much faster
for that reason alone.

Also, are you measuring the sort time and adding it to the result for the
test where you pass sorted data?
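
The two points above (JIT warm-up and counting the sort time) can be sketched
as a small timing harness. This is only an illustration under stated
assumptions: IngestTimingSketch, timeOnce, compare and the `ingest` callback
are hypothetical names, and the callback stands in for whatever actually
writes the batch. No Accumulo classes are used here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical harness: `ingest` is a placeholder for the real write path
// (for example, adding mutations to a batch writer and flushing it).
class IngestTimingSketch {

    // Times one ingest of `rows`. When `presort` is true the sort runs inside
    // the timed region, so its cost is charged to the sorted variant.
    static long timeOnce(List<String> rows, boolean presort, Consumer<List<String>> ingest) {
        List<String> copy = new ArrayList<>(rows); // copy made outside the timed region
        long start = System.nanoTime();
        if (presort) {
            Collections.sort(copy);
        }
        ingest.accept(copy);
        return System.nanoTime() - start;
    }

    // Runs untimed warm-up rounds of both variants first, then alternates the
    // timed rounds so neither variant always runs after the other.
    // Returns { average unsorted ns, average sorted-plus-sort ns }.
    static long[] compare(List<String> rows, int warmups, int rounds, Consumer<List<String>> ingest) {
        for (int i = 0; i < warmups; i++) {
            timeOnce(rows, false, ingest);
            timeOnce(rows, true, ingest);
        }
        long unsorted = 0;
        long sorted = 0;
        for (int i = 0; i < rounds; i++) {
            unsorted += timeOnce(rows, false, ingest);
            sorted += timeOnce(rows, true, ingest);
        }
        return new long[] { unsorted / rounds, sorted / rounds };
    }

    public static void main(String[] args) {
        Random random = new Random(7);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            rows.add(String.format("row%08d", random.nextInt(100_000_000)));
        }
        long[] sink = new long[1]; // keeps the fake ingest from being optimized away
        Consumer<List<String>> fakeIngest = batch -> {
            for (String row : batch) {
                sink[0] += row.hashCode();
            }
        };
        long[] avg = compare(rows, 5, 20, fakeIngest);
        System.out.println("avg unsorted ns: " + avg[0] + ", avg sorted (incl. sort) ns: " + avg[1]);
    }
}
```

In a real test the callback would write to the table, and keeping the
per-round times instead of only the averages would also show how much the
measurements vary from run to run.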


Re: pre-sorting row keys vs not pre-sorting row keys

Posted by Adam Fuchs <af...@apache.org>.

I bet what you're seeing is more efficient batching in the latter case.
BatchWriter goes through a binning phase whenever it fills up half of its
buffer, binning everything in the buffer into tablets. If you give it
sorted data, each binning pass will probably touch only a subset of the
tablets, whereas with random data it is likely to touch all of them. Fewer
batches translate into fewer RPC calls and less general overhead.

This generally indicates that if your data starts roughly partitioned it
will load faster, and that becomes more important as you scale up.
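
The binning effect described above can be illustrated with a toy model. This
is a sketch under assumptions, not the BatchWriter implementation: the class
BinningSketch, the fixed flush threshold, and the equal-width mapping of row
keys to tablets are all made up for illustration. It counts one batch per
tablet touched in each binning pass, as a stand-in for one RPC.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Toy model only: no Accumulo classes are used, and the flush threshold and
// key-to-tablet mapping are assumptions made for illustration.
class BinningSketch {

    // Maps a row key in [0, keySpace) to one of `tablets` equal-width ranges.
    static int tabletFor(int rowKey, int keySpace, int tablets) {
        return (int) ((long) rowKey * tablets / keySpace);
    }

    // Counts the per-tablet batches produced when the buffer is binned every
    // `flushEvery` mutations, plus once more for any remainder at the end.
    static int countBatches(List<Integer> rowKeys, int keySpace, int tablets, int flushEvery) {
        int batches = 0;
        Set<Integer> touched = new HashSet<>();
        for (int i = 0; i < rowKeys.size(); i++) {
            touched.add(tabletFor(rowKeys.get(i), keySpace, tablets));
            boolean bufferFull = (i + 1) % flushEvery == 0;
            boolean lastMutation = i == rowKeys.size() - 1;
            if (bufferFull || lastMutation) {
                batches += touched.size(); // one batch per tablet seen in this pass
                touched.clear();
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        int keySpace = 10_000;
        int tablets = 9;
        int flushEvery = 500;
        List<Integer> sorted = new ArrayList<>();
        for (int k = 0; k < keySpace; k++) {
            sorted.add(k);
        }
        List<Integer> shuffled = new ArrayList<>(sorted);
        Collections.shuffle(shuffled, new Random(42));
        System.out.println("sorted batches:   " + countBatches(sorted, keySpace, tablets, flushEvery));
        System.out.println("shuffled batches: " + countBatches(shuffled, keySpace, tablets, flushEvery));
    }
}
```

In this model a sorted input touches one or two tablets per pass, while a
shuffled input tends to touch most or all of them, so the shuffled run
produces several times as many batches. How closely that tracks a real
cluster depends on the buffer size and on how the tablets are split.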

Adam