
Posted to user@accumulo.apache.org by Denis <de...@camfex.cz> on 2012/10/17 15:46:48 UTC

Accumulo Direct Reader

    Hi.

    I am thinking about creating a Direct Reader for Accumulo.

    A library that has an API compatible with the Accumulo client but
reads .rf files directly from HDFS, bypassing tservers.

    The motivation is:

    1. To be able to quickly read (possibly stale) data when the
tserver is busy (with re-balancing, reading logs, etc.) or has just
gone down and its tablets have not been redistributed yet.

    2. If the table is read-only or can tolerate eventual consistency,
many readers can work in parallel without the tserver being a
bottleneck. Also, the table's data becomes local to three servers
(the number of HDFS replicas) instead of one.

    3. Distribution of data: analysts can download .rf files (even to
a laptop) and run their software locally.

    Any suggestions?

    Thanks.

Re: Accumulo Direct Reader

Posted by Keith Turner <ke...@deenlo.com>.

On Wed, Oct 17, 2012 at 10:57 AM, Eric Newton <er...@gmail.com> wrote:
> See InputFormatBase#setScanOffline.

This uses o.a.a.c.client.impl.OfflineScanner.  OfflineScanner will
scan an offline table by going directly to the files.  It does the
exact same thing the tablet server does when reading a tablet's files.
I was thinking of making OfflineScanner available through Connector
somehow when adding setScanOffline to the M/R code, but did not for
some reason.  If there is interest, we could revisit this.
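
To make "the same thing the tablet server does" concrete, here is a
rough sketch of the core of such a read: a k-way merge over several
sorted files in which the newest version of each key wins and a delete
marker hides older versions. This is a simplified stand-alone model,
not Accumulo code. The class MergedRead, its Entry type, and the
single-string key are all hypothetical; real keys also carry column
family, qualifier, and visibility, and the real read path applies
further iterators.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class MergedRead {
    /** One stored entry; a delete marker hides older versions of the same key. */
    static final class Entry {
        final String key;
        final long timestamp;
        final boolean deleted;
        final String value;

        Entry(String key, long timestamp, boolean deleted, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.deleted = deleted;
            this.value = value;
        }
    }

    /** Cursor over one file: the current entry plus the rest of the file. */
    static final class Cursor {
        Entry head;
        final Iterator<Entry> rest;

        Cursor(Iterator<Entry> it) {
            this.rest = it;
            this.head = it.next();
        }

        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }
    }

    /**
     * Merges files that are each sorted by key (newest timestamp first
     * within a key) and returns only the newest live value per key.
     */
    static Map<String, String> scan(List<List<Entry>> files) {
        // Order cursors by key, then by newest timestamp first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int c = a.head.key.compareTo(b.head.key);
            return c != 0 ? c : Long.compare(b.head.timestamp, a.head.timestamp);
        });
        for (List<Entry> file : files) {
            if (!file.isEmpty()) {
                heap.add(new Cursor(file.iterator()));
            }
        }

        Map<String, String> out = new LinkedHashMap<>();
        String lastKey = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            Entry e = c.head;
            if (!e.key.equals(lastKey)) {
                // First entry seen for this key is its newest version.
                lastKey = e.key;
                if (!e.deleted) {
                    out.put(e.key, e.value);
                }
            }
            if (c.advance()) {
                heap.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> older = List.of(
            new Entry("a", 1, false, "a1"),
            new Entry("b", 1, false, "b1"),
            new Entry("c", 1, false, "c1"));
        List<Entry> newer = List.of(
            new Entry("b", 2, false, "b2"),
            new Entry("c", 3, true, null));
        System.out.println(scan(List.of(older, newer))); // prints {a=a1, b=b2}
    }
}
```

The point of the sketch is that reading a single file in isolation
would still show "b1" and "c1"; a correct view only comes from merging
all of a tablet's files.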

>
> Clone a table, take it offline and then use it as your map/reduce
> input format.  This will preserve a consistent view of the underlying
> files, without going through the tablet servers.
>
> -Eric

Re: Accumulo Direct Reader

Posted by Eric Newton <er...@gmail.com>.

See InputFormatBase#setScanOffline.

Clone a table, take it offline and then use it as your map/reduce
input format.  This will preserve a consistent view of the underlying
files, without going through the tablet servers.

-Eric


Re: Accumulo Direct Reader

Posted by Marc Parisi <ma...@accumulo.net>.

RFileOperations.getInstance() will return an instance of FileOperations,
which will allow you to call the open reader method and open any arbitrary
RFile. The issue might be locating the RFiles that hold a given row;
however, this would be quite simple: go through the metadata table and
look for the RFiles associated with the tablet that contains the row. By
doing this you can bypass the entire iterator stack. I have an example of
this on my GitHub, but in reality, the methods I mentioned above are all
you really need.
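
The lookup step described above can be sketched roughly as follows.
This is an in-memory model only, not the Accumulo API: the class
TabletFiles and the file paths are made up for illustration, and a real
reader would get the same information by scanning the metadata table.
It relies on two facts: a tablet is identified by its end row, which is
inclusive, and the last tablet of a table has no end row.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TabletFiles {
    // Tablets keyed by their inclusive end row, in sorted order.
    private final TreeMap<String, List<String>> byEndRow = new TreeMap<>();
    // Files of the last tablet, which has no end row.
    private List<String> lastTabletFiles = List.of();

    /** Registers a tablet; a null end row marks the last tablet. */
    void addTablet(String endRow, List<String> files) {
        if (endRow == null) {
            lastTabletFiles = files;
        } else {
            byEndRow.put(endRow, files);
        }
    }

    /** Returns the files of the tablet whose range contains the row. */
    List<String> filesFor(String row) {
        // The owning tablet is the first one whose end row is >= the row.
        Map.Entry<String, List<String>> e = byEndRow.ceilingEntry(row);
        return e != null ? e.getValue() : lastTabletFiles;
    }

    public static void main(String[] args) {
        TabletFiles tablets = new TabletFiles();
        // Hypothetical file paths, for illustration only.
        tablets.addTablet("g", List.of("/t-0001/F0000a.rf"));
        tablets.addTablet("p", List.of("/t-0002/F0000b.rf", "/t-0002/C0000c.rf"));
        tablets.addTablet(null, List.of("/default_tablet/F0000d.rf"));

        System.out.println(tablets.filesFor("hello"));
        System.out.println(tablets.filesFor("zebra"));
    }
}
```

Note that this uses Java String ordering for brevity, whereas rows are
compared as bytes. Also, since reading the files directly skips the
iterator stack, the reader sees raw entries: old versions and delete
markers are not filtered out unless the reader does that itself.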
