You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@accumulo.apache.org by Jeff Kubina <je...@gmail.com> on 2012/10/12 17:38:25 UTC

How to efficiently compute a list-ranking of rows in Accumulo?

I have an Accumulo table and I want to efficiently compute a
list-ranking of the rowIds, more precisely, let r(i) for i=0 to n-1 be
the ordered (sorted) rowids in the table, I want to create a column,
say numId, such that for each row r(i) the value in column numId is i.

I would like to do this in one map round. To do so I need to get the
total number of rows in each tablet and an id for the tablet that has
an ordering (sorting) consistent with the rowIds ordering. Assume
there are t tablets and (tabId(j),size(j)) for j=0 to t-1 is the
tablet id and row size, then I can compute a prefix sum of size(j),
such as (tabId(0), 0), (tabId(1), size(0), (tabId(2),
size(0)+size(1)), and so on; let offset(j) be the offset computed for
tabId(j). So I can then use t mappers to set the numId in the rows to
offset(j), offset(j) + 1, …

I can do all of this in two map rounds by using the first rowId in
each tablet as the tabId and simply counting the number of rows, but I
was hoping to avoid having to do this.

Can I get the size of each tablet and an id for it consistent with the
row ordering? If so, how?

Re: How to efficiently compute a list-ranking of rows in Accumulo?

Posted by Keith Turner <ke...@deenlo.com>.
I think you can do this in one map round, with something like the following.

class MyMapper{
  Text mapperId; // an id that uniquley identifies this mapper
  Text firstRow;
  long count = 0;

  //map called once per row
  void mapper(Text row,  Context context){

     Mutation m = new Mutation(row);
     m.put("count",mapperId, count.toString());
     count++;

     context.write(m);

     if(firstRow == null){
        firstRow = row;
     }

  }

  close(){
     //write to a secondary table that keeps a little bit of info
about each map range
     BatchWriter bw = conn.createBatchWriter("mapInfoTable");
     Mutation m = new Mutation(mapperId);

     m.put("rangeInfo","totalCount",count.toString());
     m.put("rangeInfo","firstRow", firstRow);

    bw.add(m);
    bw.close();
  }
}


After running the above you can do some post processing on the
mapInfoTable and determine what the absolute offset of each mapper
was.  When you scan the table you can lookup up the absolute offset of
a mapper id and add this to the relative count stored in the table.
You can cache the lookups into the mapInfoTable.  This approach makes
no assumptions about underlying tablets.  Its completely based on the
ranges used to start mappers.

Keith

On Fri, Oct 12, 2012 at 11:38 AM, Jeff Kubina <je...@gmail.com> wrote:
> I have an Accumulo table and I want to efficiently compute a
> list-ranking of the rowIds, more precisely, let r(i) for i=0 to n-1 be
> the ordered (sorted) rowids in the table, I want to create a column,
> say numId, such that for each row r(i) the value in column numId is i.
>
> I would like to do this in one map round. To do so I need to get the
> total number of rows in each tablet and an id for the tablet that has
> an ordering (sorting) consistent with the rowIds ordering. Assume
> there are t tablets and (tabId(j),size(j)) for j=0 to t-1 is the
> tablet id and row size, then I can compute a prefix sum of size(j),
> such as (tabId(0), 0), (tabId(1), size(0), (tabId(2),
> size(0)+size(1)), and so on; let offset(j) be the offset computed for
> tabId(j). So I can then use t mappers to set the numId in the rows to
> offset(j), offset(j) + 1, …
>
> I can do all of this in two map rounds by using the first rowId in
> each tablet as the tabId and simply counting the number of rows, but I
> was hoping to avoid having to do this.
>
> Can I get the size of each tablet and an id for it consistent with the
> row ordering? If so, how?