Posted to user@accumulo.apache.org by Marc Reichman <mr...@pixelforensics.com> on 2013/09/02 16:12:10 UTC

accessing accumulo row in mapper setup method?

Hello,

I am running a job that searches for a single piece of query data among
potential targets in an Accumulo table, using AccumuloRowInputFormat. In
most cases, the query data itself is also in the same Accumulo table.

To date, my client program has pulled the query data from Accumulo using a
basic scanner, written it to HDFS, and added the file(s) in question to the
distributed cache. My mapper then loads the data from the distributed cache
into a private class member in its setup method and uses it in all of the
map calls.

It occurred to me that I may be incurring too much overhead on the client
side, and that my job submission is slow because of all the HDFS I/O and
distributed cache handling for arguably small files, in the 100-200k range
at most.

Does it seem reasonable to skip the client-side preparation and have the
mapper pull the data directly from Accumulo in its setup method instead?

Questions related to this:
1. Does this put a lot of pressure on the tabletserver which contains the
data, to have many mappers hitting at once during setup for the first wave?
2. Is there any way whatsoever for the mapper to use the existing client
connection already being made? Or would I have to do the usual setup with
my own zookeeper connection, and if so does that make for a much worse
performance impact?

Thanks,
Marc

Re: accessing accumulo row in mapper setup method?

Posted by Eric Newton <er...@gmail.com>.

> Questions related to this:
> 1. Does this put a lot of pressure on the tabletserver which contains the
> data, to have many mappers hitting at once during setup for the first wave?
>

Yes.  However, it depends on the scale.  If it's 8K mappers, that might be
too many.  If it's <100, don't worry about it.
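
To put rough numbers on "scale", here is a back-of-the-envelope sketch. It
assumes the worst case, where every mapper reads the full query payload from
the same tablet server at about the same time, and it takes 200 KiB as the
payload size (the upper end of the 100-200k range mentioned in the question).
The class and method names are made up for illustration.

```java
// Hypothetical helper: estimates how many bytes a single tablet server
// would have to serve if every mapper fetched the query row in setup().
class FirstWaveLoad {

    // Total bytes = number of mappers * bytes read by each mapper.
    static long totalBytes(long mappers, long payloadBytes) {
        return mappers * payloadBytes;
    }

    // Converts a byte count to MiB for easier reading.
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long payload = 200L * 1024L; // assumed 200 KiB query payload
        for (long mappers : new long[] {100L, 8000L}) {
            long total = totalBytes(mappers, payload);
            System.out.println(mappers + " mappers -> " + toMiB(total) + " MiB");
        }
    }
}
```

Under these assumptions, 100 mappers pull roughly 20 MiB in total, while 8K
mappers pull roughly 1.5 GiB from one server. In practice not all mappers
start at once, so the real burst is likely smaller.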


> 2. Is there any way whatsoever for the mapper to use the existing client
> connection already being made? Or would I have to do the usual setup with
> my own zookeeper connection, and if so does that make for a much worse
> performance impact?
>

It won't make any impact on performance; the connections to remote servers
are cached and reused under the covers anyhow.  The Connector object
doesn't hold a connection; it just holds the username/credentials for
creating/using a connection.
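
For reference, a minimal sketch of what the mapper-side lookup could look
like, assuming the Accumulo 1.5 client API and the Hadoop mapreduce API. The
configuration property names ("query.accumulo.*"), the class name, and the
Text output types are placeholders invented for this example, not anything
from the original job. Passing a password through the job configuration is
done here only to keep the sketch short; it is visible to anyone who can read
the job configuration.

```java
import java.io.IOException;
import java.util.Map.Entry;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.util.PeekingIterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for use with AccumuloRowInputFormat.
public class QueryLookupMapper
    extends Mapper<Text, PeekingIterator<Entry<Key, Value>>, Text, Text> {

  // Query row, loaded once per task in setup() and reused by every map() call.
  private final SortedMap<Key, Value> queryRow = new TreeMap<Key, Value>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Placeholder property names; the client would set these before submitting.
    String instanceName = conf.get("query.accumulo.instance");
    String zooKeepers = conf.get("query.accumulo.zookeepers");
    String user = conf.get("query.accumulo.user");
    String password = conf.get("query.accumulo.password");
    String table = conf.get("query.accumulo.table");
    String rowId = conf.get("query.accumulo.row");

    try {
      Instance instance = new ZooKeeperInstance(instanceName, zooKeepers);
      // The Connector holds the instance and credentials, not a live connection.
      Connector connector = instance.getConnector(user, new PasswordToken(password));
      // Empty authorizations; supply real ones if the query row has visibility labels.
      Scanner scanner = connector.createScanner(table, new Authorizations());
      scanner.setRange(new Range(rowId)); // restrict the scan to the one query row
      for (Entry<Key, Value> entry : scanner) {
        queryRow.put(entry.getKey(), entry.getValue());
      }
    } catch (AccumuloException | AccumuloSecurityException | TableNotFoundException e) {
      throw new IOException("Could not load query row " + rowId, e);
    }
  }

  @Override
  protected void map(Text row, PeekingIterator<Entry<Key, Value>> columns, Context context)
      throws IOException, InterruptedException {
    // Compare the columns of this candidate row against queryRow here.
  }
}
```

This replaces the scanner/HDFS/distributed cache steps on the client with one
small scan per map task, which is where the scale question above comes in.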

-Eric