Posted to user@accumulo.apache.org by ameet kini <am...@gmail.com> on 2012/10/10 16:22:21 UTC

Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

I have a related problem where I need to do a 1-1 join (every row in
table A joins with a unique row in table B and vice versa). My join
key is the row id of the table. In the past, I've used Hadoop's
CompositeInputFormat to do a map-side join over data in HDFS
(described here
http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/). My
tables in Accumulo seem to fit the eligibility criteria of
CompositeInputFormat: both tables are sorted by the join key, since
the join key is the row id in my case, and the tables are partitioned
the same way (i.e., same split points).
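
For concreteness, here's roughly how I've wired up CompositeInputFormat
over plain HDFS data in the past (old mapred API). The class name, the
paths, and the identity-mapper output setup below are made up for
illustration, so treat this as a sketch rather than my exact job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class HdfsMapSideJoin {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(HdfsMapSideJoin.class);
    conf.setJobName("hdfs-map-side-join");
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" keeps only keys present in both inputs; each input must be
    // sorted by key and partitioned identically (same split points).
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/data/tableA"), new Path("/data/tableB")));
    // With the default identity mapper, each output record is the join
    // key plus a TupleWritable holding one value per joined input.
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));
    JobClient.runJob(conf);
  }
}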

Has anyone tried using CompositeInputFormat over Accumulo tables? Is
it possible to configure CompositeInputFormat with
AccumuloInputFormat?

Thanks,
Ameet


On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <ke...@deenlo.com> wrote:
> Yeah, that would certainly work.
>
> You could run two map-only jobs (they could run concurrently): a job
> that reads D1 and writes to Table3, and a job that reads D2 and writes
> to Table3. A regular map reduce join may be faster, unless you want the
> final result in Accumulo, in which case this approach may be faster.
> The two jobs could also produce files to bulk import into Table3.
>
> Keith
>
> On Mon, Aug 20, 2012 at 8:26 PM, David Medinets
> <da...@gmail.com> wrote:
>> Can you use a new table to join and then scan the new table? Use the foreign
>> key as the rowid. Basically create your own materialized view.

Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

Posted by ameet kini <am...@gmail.com>.
My previous post had some stale text after my signature - sorry. Reposting
after chopping the stale text off.

Turns out that my assumption of the tables being partitioned the same way
may be too restrictive. I need to account for join partitions not being
co-located on the same tablet server, so CompositeInputFormat is not
applicable after all. That said, I hadn't gotten very far with it anyway;
in particular, I couldn't for the life of me figure out how to configure
the mapred.join.expr to work on Accumulo's rfile directory structure.

I ended up extending AccumuloInputFormat to do the join. The record reader
would read table A using AccumuloInputFormat's scannerIterator and issue
BatchScanner lookups to get table B's matching records, similar to Keith's
suggestion above.
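
In spirit, the lookup side of the record reader looks something like the
sketch below. I've left out the InputFormat/RecordReader plumbing and the
connector setup, and the class name, table name, and thread count are
just placeholders:

import java.util.Collections;
import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TableBLookup {
  private final BatchScanner tableB;

  public TableBLookup(Connector conn) throws Exception {
    // One BatchScanner per record reader, reused for every lookup.
    tableB = conn.createBatchScanner("tableB", new Authorizations(), 4);
  }

  // Called for each row id the scannerIterator yields from table A:
  // fetch table B's entries for that same row.
  public Iterable<Entry<Key,Value>> lookup(Text rowId) {
    tableB.setRanges(Collections.singleton(new Range(rowId)));
    return tableB;
  }

  public void close() {
    tableB.close();
  }
}

Batching a block of table A row ids into a single setRanges() call,
rather than looking up one row at a time, should cut down on round trips
to the tablet servers.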

Thanks,
Ameet

On Wed, Oct 17, 2012 at 10:10 AM, ameet kini <am...@gmail.com> wrote:

>
> Turns out that my assumption of tables being partitioned the same way may
> be too restrictive. I need to account for join partitions not being
> co-located on the same tablet server. So the CompositeInputFormat is not
> applicable as I'd initially thought. That said, I hadn't gotten very far
> with it, and in particular, couldn't for the life of me figure out how to
> configure the mapred.join.expr to work on Accumulo's rfile directory
> structure.
>
> I ended up extending AccumuloInputFormat to do the join. The record reader
> would read table A using AccumuloInputFormat's scannerIterator and issue
> BatchScanner lookups to get table B's matching records, similar to Keith's
> suggestion above.
>
> Thanks,
> Ameet
>
>
>
>
>
>
> On Thu, Oct 11, 2012 at 2:57 PM, Billie Rinaldi <bi...@apache.org> wrote:
>
>> On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <am...@gmail.com> wrote:
>>
>>> I have a related problem where I need to do a 1-1 join (every row in
>>> table A joins with a unique row in table B and vice versa). My join
>>> key is the row id of the table. In the past, I've used Hadoop's
>>> CompositeInputFormat to do a map-side join over data in HDFS
>>> (described here
>>> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/). My
>>> tables in Accumulo seem to fit the eligibility criteria of
>>> CompositeInputFormat: both tables are sorted by the join key, since
>>> the join key is the row id in my case, and the tables are partitioned
>>> the same way (i.e., same split points).
>>>
>>> Has anyone tried using CompositeInputFormat over Accumulo tables? Is
>>> it possible to configure CompositeInputFormat with
>>> AccumuloInputFormat?
>>>
>>
>> I haven't tried it.  If you do, let us know how it works out.
>>
>> Billie
>>
>>
>>>
>>> Thanks,
>>> Ameet
>>>
>>>
>>> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <ke...@deenlo.com> wrote:
>>> > Yeah, that would certainly work.
>>> >
>>> > You could run two map-only jobs (they could run concurrently): a job
>>> > that reads D1 and writes to Table3, and a job that reads D2 and writes
>>> > to Table3. A regular map reduce join may be faster, unless you want the
>>> > final result in Accumulo, in which case this approach may be faster.
>>> > The two jobs could also produce files to bulk import into Table3.
>>> >
>>> > Keith
>>> >
>>> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets
>>> > <da...@gmail.com> wrote:
>>> >> Can you use a new table to join and then scan the new table? Use the
>>> foreign
>>> >> key as the rowid. Basically create your own materialized view.
>>>
>>
>>
>

Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

Posted by Billie Rinaldi <bi...@apache.org>.
On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <am...@gmail.com> wrote:

> I have a related problem where I need to do a 1-1 join (every row in
> table A joins with a unique row in table B and vice versa). My join
> key is the row id of the table. In the past, I've used Hadoop's
> CompositeInputFormat to do a map-side join over data in HDFS
> (described here
> http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/). My
> tables in Accumulo seem to fit the eligibility criteria of
> CompositeInputFormat: both tables are sorted by the join key, since
> the join key is the row id in my case, and the tables are partitioned
> the same way (i.e., same split points).
>
> Has anyone tried using CompositeInputFormat over Accumulo tables? Is
> it possible to configure CompositeInputFormat with
> AccumuloInputFormat?
>

I haven't tried it.  If you do, let us know how it works out.

Billie


>
> Thanks,
> Ameet
>
>
> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <ke...@deenlo.com> wrote:
> > Yeah, that would certainly work.
> >
> > You could run two map-only jobs (they could run concurrently): a job
> > that reads D1 and writes to Table3, and a job that reads D2 and writes
> > to Table3. A regular map reduce join may be faster, unless you want the
> > final result in Accumulo, in which case this approach may be faster.
> > The two jobs could also produce files to bulk import into Table3.
> >
> > Keith
> >
> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets
> > <da...@gmail.com> wrote:
> >> Can you use a new table to join and then scan the new table? Use the
> foreign
> >> key as the rowid. Basically create your own materialized view.
>