Posted to user@accumulo.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2014/05/16 19:32:29 UTC

MR Data Locality with AccumuloInputFormat?

Hi, folks,

When I execute an MR job with AccumuloInputFormat, are there any guarantees
about which mappers process which rows? I'm trying to minimize crosstalk in
my cluster but either I haven't split my table properly or I'm expecting
too much, because I'm only seeing 1 or 2 nodes running MR tasks that should
be reading data from tablet servers on 8 different nodes.

Thanks,
-Russ

Re: MR Data Locality with AccumuloInputFormat?

Posted by Corey Nolet <cj...@gmail.com>.
Has the table been compacted since loading the data?

Re: MR Data Locality with AccumuloInputFormat?

Posted by Russ Weeks <rw...@newbrightidea.com>.
Thanks, Josh. I'll take a look through the Hadoop web UI.
-Russ


Re: MR Data Locality with AccumuloInputFormat?

Posted by Josh Elser <jo...@gmail.com>.
Hi Russ,

I believe AccumuloInputFormat uses the splits on the table you're reading
to generate the MR InputSplits. The InputFormat should then try to run each
Mapper on the same machine as the tserver that serves its data.
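
To illustrate why the number of table splits bounds the number of mappers, here is a minimal sketch in plain Java. It assumes one InputSplit per tablet and models each tablet as the row range (previous split, split]. The class and method names (TabletRanges, describe, tabletIndexFor) are hypothetical, and rows are compared as Java strings rather than as bytes, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model, not Accumulo code: shows how N split points
// divide the row space into N + 1 tablet ranges.
class TabletRanges {
    // Each tablet covers (previousSplit, split]; the last one is unbounded above.
    static List<String> describe(SortedSet<String> splits) {
        List<String> ranges = new ArrayList<>();
        String prev = "-inf";
        for (String split : splits) {
            ranges.add("(" + prev + ", " + split + "]");
            prev = split;
        }
        ranges.add("(" + prev + ", +inf)");
        return ranges;
    }

    // The tablet index for a row is the count of split points strictly below it.
    static int tabletIndexFor(String row, SortedSet<String> splits) {
        return splits.headSet(row).size();
    }

    public static void main(String[] args) {
        SortedSet<String> splits = new TreeSet<>(Arrays.asList("g", "n", "t"));
        for (String range : describe(splits)) {
            System.out.println(range);
        }
        System.out.println("row 'h' -> tablet " + tabletIndexFor("h", splits));
    }
}
```

Under this model a table with no splits is a single range, so it would yield a single InputSplit no matter how many tservers the cluster has.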

If you're only getting a few mappers, adding more splits to your table
should help. As your job runs, you can verify locality by checking the
counters that your Job creates in the JobTracker/ResourceManager web UI.
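
As a rough sketch of choosing split points, the plain-Java helper below generates evenly spaced splits. It assumes, hypothetically, that row IDs start with a uniformly distributed two-digit lowercase hex prefix ("00" to "ff"); the class and method names are made up for illustration. The resulting strings would still need to be converted to Text and passed to something like TableOperations.addSplits(tableName, splits), which is not shown here.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: evenly spaced split points over two-hex-digit row prefixes.
class EvenSplits {
    // Returns (tablets - 1) split points, so the table ends up with `tablets` ranges.
    static SortedSet<String> hexSplits(int tablets) {
        if (tablets < 1 || tablets > 256) {
            throw new IllegalArgumentException("tablets must be between 1 and 256");
        }
        SortedSet<String> splits = new TreeSet<>();
        for (int i = 1; i < tablets; i++) {
            // Integer division spreads the boundaries across the 0..255 prefix space.
            splits.add(String.format("%02x", i * 256 / tablets));
        }
        return splits;
    }

    public static void main(String[] args) {
        // One tablet per tserver on an 8-node cluster needs 7 split points.
        System.out.println(hexSplits(8));
    }
}
```

For 8 tablets this yields the 7 split points 20, 40, 60, 80, a0, c0 and e0. Whether the resulting tablets are spread across all 8 tservers is up to Accumulo's tablet balancing, not this helper.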
