Posted to dev@accumulo.apache.org by Jeff Kubina <je...@gmail.com> on 2015/10/07 18:46:27 UTC

How to best measure how the lack of data-locality affects query performance

Per my thread "How does Accumulo process r-files for bulk ingesting?"
on the user@ list, I would like to test/measure how a lack of
data-locality in bulk-ingested files affects query performance. I seek
comments/suggestions on this outline of the test design (a rough code
sketch follows the outline):

Outline:
1. Create a table and pre-split it into m tablets, where m is the number of tservers.
2. Create one r-file containing m*n records that distribute evenly
across the m tablets.
3. Bulk ingest the r-file.
4. Query each of the split ranges in the table and log their times.
5. Compact the table and wait for the compaction to complete.
6. Query each of the split ranges in the table and log their times.
7. Compute the ratio of the median times from steps 4 and 6.
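
For reference, a rough sketch of steps 1 and 4-7 against the Accumulo
1.x Connector API (the class name and the way splits and ranges are
passed in are my own placeholders, not part of the proposal):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map.Entry;
import java.util.SortedSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class LocalityTest {

  // Step 1: create the table and pre-split it into m tablets.
  static void createAndSplit(Connector conn, String table,
      SortedSet<Text> splits) throws Exception {
    conn.tableOperations().create(table);
    conn.tableOperations().addSplits(table, splits);
  }

  // Steps 4 and 6: scan each split range, recording wall-clock time.
  static List<Long> timeRanges(Connector conn, String table,
      List<Range> ranges) throws Exception {
    List<Long> times = new ArrayList<>();
    for (Range r : ranges) {
      Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
      scanner.setRange(r);
      long start = System.currentTimeMillis();
      for (Entry<Key,Value> entry : scanner) {
        // Drain the range; iterating the entries is what we time.
      }
      times.add(System.currentTimeMillis() - start);
    }
    return times;
  }

  // Step 5: force a full major compaction and block until it is done.
  static void compactAndWait(Connector conn, String table)
      throws Exception {
    conn.tableOperations().compact(table, null, null, true, true);
  }

  // Step 7: ratio of the median times from steps 4 and 6.
  static double medianRatio(List<Long> before, List<Long> after) {
    return median(before) / median(after);
  }

  static double median(List<Long> times) {
    List<Long> sorted = new ArrayList<>(times);
    Collections.sort(sorted);
    int n = sorted.size();
    return n % 2 == 1 ? sorted.get(n / 2)
        : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
  }
}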

Questions:
1. Instead of compacting the table, should I create a new table by
generating m r-files, each of whose range intersects only one of the
tablets, and bulk ingest them? (A sketch follows these questions.)

2. What is a good size for n, the number of records per tablet server?
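
For question 1, the per-tablet files could be written with the RFile
builder added in Accumulo 1.8 (org.apache.accumulo.core.client.rfile.RFile);
on older versions AccumuloFileOutputFormat would be needed instead. A
minimal sketch, with made-up paths and record source:

import java.util.Map.Entry;
import java.util.SortedMap;

import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.client.rfile.RFileWriter;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.fs.FileSystem;

public class PerTabletFiles {

  // Write the records that fall in one tablet's range into that
  // tablet's own r-file. Keys must be appended in sorted order, hence
  // the SortedMap. The path should end in .rf so bulk import will
  // recognize the file.
  static void writeTabletFile(FileSystem fs, String path,
      SortedMap<Key,Value> tabletRecords) throws Exception {
    RFileWriter writer =
        RFile.newWriter().to(path).withFileSystem(fs).build();
    for (Entry<Key,Value> e : tabletRecords.entrySet()) {
      writer.append(e.getKey(), e.getValue());
    }
    writer.close();
  }
}

The m files would then go into a single bulk-import directory and be
loaded with tableOperations().importDirectory(table, dir, failDir,
false).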

Re: How to best measure how the lack of data-locality affects query performance

Posted by Sean Busbey <bu...@cloudera.com>.
To see the impact on different kinds of workloads, it would be good to add
a bulk load option to something like YCSB, then run the normal workloads.

-- 
Sean

Re: How to best measure how the lack of data-locality affects query performance

Posted by Josh Elser <jo...@gmail.com>.

Jeff Kubina wrote:
> Per my thread "How does Accumulo process r-files for bulk ingesting?"
> on the user@ list, I would like to test/measure how a lack of
> data-locality in bulk-ingested files affects query performance. I seek
> comments/suggestions on this outline of the test design:
>
> Outline:
> 1. Create a table and pre-split it into m tablets, where m is the number of tservers.
> 2. Create one r-file containing m*n records that distribute evenly
> across the m tablets.
> 3. Bulk ingest the r-file.
> 4. Query each of the split ranges in the table and log their times.
> 5. Compact the table and wait for the compaction to complete.
> 6. Query each of the split ranges in the table and log their times.
> 7. Compute the ratio of the median times from steps 4 and 6.
>
> Questions:
> 1. Instead of compacting the table, should I create a new table by
> generating m r-files, each of whose range intersects only one of the
> tablets, and bulk ingest them?

If you can be tricky in your non-data-local case and balance the data 
evenly, you could just do one table import followed by a compaction and 
rerun on the same table.

You'd just want to make sure you have a decent distribution of the data 
across all servers in both the data-local and non-data-local cases.
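
One way to eyeball that distribution (my own sketch, assuming
permission to read the metadata table): count the tablets each tserver
hosts by scanning the "loc" entries in accumulo.metadata.

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TabletSpread {

  // Count how many tablets of the given table each tserver hosts by
  // reading the "loc" column family from accumulo.metadata.
  static Map<String,Integer> tabletsPerServer(Connector conn,
      String table) throws Exception {
    String tableId = conn.tableOperations().tableIdMap().get(table);
    Scanner s = conn.createScanner("accumulo.metadata",
        Authorizations.EMPTY);
    // Metadata rows for a table run from "<id>;" through "<id><".
    s.setRange(new Range(tableId + ";", tableId + "<"));
    s.fetchColumnFamily(new Text("loc"));
    Map<String,Integer> counts = new HashMap<>();
    for (Entry<Key,Value> e : s) {
      String server = e.getValue().toString(); // host:port of tserver
      counts.merge(server, 1, Integer::sum);
    }
    return counts;
  }
}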

> 2. What is a good size for n, the number of records per tablet server?

I'm wondering if it depends on the type of workload that you're looking 
to run. Does it make a difference if you're just running randomized 
point queries? Or doing scans over the entire table?
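
For example, the two workload shapes differ only in the Range handed
to the scanner (a fragment reusing conn and the imports from the
earlier sketch; the table name and row id are hypothetical):

// Randomized point query: one exact row per lookup.
Scanner point = conn.createScanner("locality_test", Authorizations.EMPTY);
point.setRange(Range.exact("row_0000042"));

// Full-table scan: a single unbounded range.
Scanner full = conn.createScanner("locality_test", Authorizations.EMPTY);
full.setRange(new Range());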

Assuming you're just doing one tablet per server for your table (it's 
not apparent to me if there's a reason that would result in a lesser 
test), I'd guess a couple hundred MB worth of records per tablet would 
be good. Enough to get a few HDFS blocks per RFile, but not enough that 
Accumulo would automatically split it from underneath you. You could 
also try increasing the split threshold and putting more data per file.
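
The knob for that last suggestion is the per-table property
table.split.threshold (default 1G); a fragment reusing conn and the
hypothetical table name from the earlier sketches:

// Raise the split threshold so larger tablets don't auto-split.
conn.tableOperations().setProperty("locality_test",
    "table.split.threshold", "10G");
// Shell equivalent: config -t locality_test -s table.split.threshold=10G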