Posted to user@accumulo.apache.org by Kurt Christensen <ho...@hoodel.com> on 2013/05/04 16:15:08 UTC
Re: joining accumulo tables with mapreduce
How about three scanners, one for each table? Advance the one with the
least value (sort-wise) and combine when they match.
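
A minimal sketch of that idea, not tested against Accumulo itself: plain Java
iterators over sorted (row id, value) entries stand in for the three scanners.
The class and method names are hypothetical, and the sketch assumes each table
contributes at most one entry per row id and that String ordering is an
acceptable stand-in for Accumulo's key ordering.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a three-way sort-merge join over sorted (rowId, value) streams.
// Plain iterators stand in for Accumulo scanners; all names are hypothetical.
class ThreeWayMergeJoin {

    // Returns one "rowId: a, b, c" string per row id present in all three inputs.
    // Assumes each iterator yields unique row ids in ascending String order.
    static List<String> join(Iterator<Map.Entry<String, String>> a,
                             Iterator<Map.Entry<String, String>> b,
                             Iterator<Map.Entry<String, String>> c) {
        List<String> joined = new ArrayList<>();
        Map.Entry<String, String> ea = next(a), eb = next(b), ec = next(c);
        // Once any input is exhausted, no further three-way match is possible.
        while (ea != null && eb != null && ec != null) {
            String ka = ea.getKey(), kb = eb.getKey(), kc = ec.getKey();
            if (ka.equals(kb) && kb.equals(kc)) {
                // All three cursors sit on the same row id: combine and move on.
                joined.add(ka + ": " + ea.getValue() + ", " + eb.getValue() + ", " + ec.getValue());
                ea = next(a);
                eb = next(b);
                ec = next(c);
            } else if (ka.compareTo(kb) <= 0 && ka.compareTo(kc) <= 0) {
                ea = next(a);   // a holds the least key, so advance it
            } else if (kb.compareTo(kc) <= 0) {
                eb = next(b);   // b holds the least key
            } else {
                ec = next(c);   // c holds the least key
            }
        }
        return joined;
    }

    // Returns the next element, or null when the iterator is exhausted.
    private static <T> T next(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        // TreeMap keeps entries sorted by key, like a scan over a table.
        Map<String, String> t1 = new TreeMap<>();
        t1.put("rowA", "1a"); t1.put("rowB", "1b"); t1.put("rowD", "1d");
        Map<String, String> t2 = new TreeMap<>();
        t2.put("rowB", "2b"); t2.put("rowC", "2c"); t2.put("rowD", "2d");
        Map<String, String> t3 = new TreeMap<>();
        t3.put("rowA", "3a"); t3.put("rowB", "3b"); t3.put("rowD", "3d");
        System.out.println(join(t1.entrySet().iterator(),
                                t2.entrySet().iterator(),
                                t3.entrySet().iterator()));
    }
}
```

Each table is read once, sequentially, so this avoids the per-row lookups
discussed below. With real scanners the comparison would be on the row portion
of each Key, and rows with several entries would need to be grouped first.
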
On 4/17/13 4:43 PM, Aji Janis wrote:
> Keith,
>
> You hit the problem that I purposely didn't ask about.
> -Accumulo inputformat doesn't support multiple tables at this point and
> -I can't run three mappers in parallel on different tables and
> combine/send their output to a reducer (that I know of).
>
> If all three tables had the same rowid (eg: rowA exists in table 1, 2
> and 3) then we can write the row from each table w/a different
> family/qualifier/value to a new table. So it will be three mappers run
> sequentially and end result is a join... this is the best I came up
> with so far. If rowids are different across three tables then I would
> have to reformat my rowid from all three tables (normalize) prior to
> writing the fourth/final table.
>
> Is calling a scanner on the other two tables from within a mapper
> (that takes the first table as the input) bad? Any clues on how that
> could be done in mapreduce?
>
>
> On Wed, Apr 17, 2013 at 10:59 AM, Keith Turner <keith@deenlo.com> wrote:
>
> If I am understanding you correctly, you are proposing that, for each row a
> mapper gets, it looks that row up in two other tables? This would
> result in a lot of little round trip RPC calls and random disk
> accesses.
>
> I think a better solution would be to read all three tables into your
> mappers, and do the join in the reduce. This solution will avoid all
> of the little RPC calls and do lots of sequential I/O instead of
> random accesses. Between the map and reduce, you could track which
> table each row came from. Any filtering could be done in the mapper
> or by iterators. Unfortunately Accumulo does not have the needed
> input format for this out of the box. There is a ticket,
> ACCUMULO-391.
>
>
>
> On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <aji1705@gmail.com> wrote:
> > Hello,
> >
> > I am interested in learning what the best solution/practices might be to
> > join 3 accumulo tables by running a map reduce job. Interested in getting
> > feedback on best practices and such. Here's some pseudocode for what I want
> > to accomplish:
> >
> >
> > AccumuloInputFormat accepts tableA
> > Global variable <table_list> has table names: tableB, tableC
> >
> > In a mapper, for example, you would do something like this:
> >
> > for each row in TableA {
> >   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
> >   if (foundvalue) {
> >
> >     for each table in table_list
> >       scan table with (this rowid && family = "def")
> >       for each entry found in scan
> >         write to final_table (rowid, value_as_family,
> >                               tablename_as_qualifier, entry_as_value_string)
> >
> >   }//end if foundvalue
> >
> > }//end for loop
> >
> >
> > This is a simple version of what I want to do. In my non-mapreduce java code
> > I would do this by using a different scanner for each table in the list.
> > A couple of questions:
> >
> >
> > - how bad/good is performance when using scanners within mappers?
> > - if I get one mapper per range in tableA, do I reset scanners? how? or
> > would I set up a scanner in the setup() of mapper? --> I have no clue how
> > this will play out so thinking out loud here.
> > - any optimization suggestions? or examples of creating join_tables/indexes
> > out there that I can refer to?
> >
> >
> > Thank you for all suggestions.
>
>
--
Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811
------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors."
--- Plato