Posted to user@hbase.apache.org by Rohit Kelkar <ro...@gmail.com> on 2013/06/22 18:42:52 UTC

running MR job and puts on the same table

I have a use case where I push data into my HTable in waves, followed by
mapper-only processing. Currently, once a row is processed in the map, I
immediately mark it as processed=true. For this, inside the map I execute a
table.put(isprocessed=true). I am not sure if modifying the table like this
is a good idea. I am also concerned that I am modifying the same table that
I am running the MR job on.
So I am thinking of another approach where I accumulate the processed rows
in a list (or a more compact data structure) and use the cleanup method of
the MR job to execute all the table.put(isprocessed=true) calls at once.
What is the suggested best practice?
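
For reference, here is roughly what the map does today (a simplified
sketch; the table name "mytable" and the "d"/"isprocessed" family and
qualifier are placeholders, not my real schema):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MarkProcessedMapper extends TableMapper<ImmutableBytesWritable, Result> {

      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        // Writes go to the same table the job is scanning.
        table = new HTable(context.getConfiguration(), "mytable");
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // ... actual per-row processing happens here ...

        // Mark the row as processed right away, one put per row.
        Put put = new Put(row.get());
        put.add(Bytes.toBytes("d"), Bytes.toBytes("isprocessed"), Bytes.toBytes(true));
        table.put(put);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.close();
      }
    }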

- R

Re: running MR job and puts on the same table

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Rohit,

It will always be consistent. I don't see why there would be any
inconsistency with the scenario you described below.

JM

2013/6/22 Rohit Kelkar <ro...@gmail.com>:
> Thanks JM, I am not so concerned about holding those rows in memory because
> they are mostly ordered integers and I would be using a BitSet, so I have
> some leeway in that sense. My dilemma was:
> 1. updating instantly within the map
> 2. bulk updating at the end of the map
> Yes, I do understand the drawback of 2 if the map crashes. I am ready to
> incur that penalty if it avoids any inconsistent behaviour in HBase.
>
> - R
>
>
> On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Rohit,
>>
>> The list is a bad idea. When you have millions of rows per region, are
>> you going to put millions of them in memory in your list?
>>
>> Your MR job will scan the entire table, row by row. If you modify the
>> current row, the scanner will not look at it again when it moves on to
>> the next one. So there is no real issue with that.
>>
>> Also, instead of doing the puts one by one, I would recommend you
>> buffer them (let's say, 100 at a time) and put them as a batch. Don't
>> forget to push the remaining ones at the end of the job. The drawback
>> is that if the MR job crashes, you will have some rows already
>> processed but not marked as processed...
>>
>> JM
>>
>> 2013/6/22 Rohit Kelkar <ro...@gmail.com>:
>> > I have a use case where I push data into my HTable in waves, followed by
>> > mapper-only processing. Currently, once a row is processed in the map, I
>> > immediately mark it as processed=true. For this, inside the map I execute
>> > a table.put(isprocessed=true). I am not sure if modifying the table like
>> > this is a good idea. I am also concerned that I am modifying the same
>> > table that I am running the MR job on.
>> > So I am thinking of another approach where I accumulate the processed
>> > rows in a list (or a more compact data structure) and use the cleanup
>> > method of the MR job to execute all the table.put(isprocessed=true)
>> > calls at once.
>> > What is the suggested best practice?
>> >
>> > - R
>>

Re: running MR job and puts on the same table

Posted by Rohit Kelkar <ro...@gmail.com>.
Thanks JM, I am not so concerned about holding those rows in memory because
they are mostly ordered integers and I would be using a BitSet, so I have
some leeway in that sense. My dilemma was:
1. updating instantly within the map
2. bulk updating at the end of the map
Yes, I do understand the drawback of 2 if the map crashes. I am ready to
incur that penalty if it avoids any inconsistent behaviour in HBase.
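
Concretely, the bookkeeping inside the mapper would look roughly like this
(fragment only, not a full class; it assumes the row keys are plain
integers and reuses the table field and placeholder column names from the
sketch in my first mail):

    // java.util.BitSet of processed row ids; compact even for millions of rows.
    private final BitSet processed = new BitSet();

    // In map(), after a row is handled (rowId is the integer key of that row):
    processed.set(rowId);

    // In cleanup(), walk the set and issue all the marker puts as one batch:
    List<Put> puts = new ArrayList<Put>();
    for (int id = processed.nextSetBit(0); id >= 0; id = processed.nextSetBit(id + 1)) {
      Put put = new Put(Bytes.toBytes(id));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("isprocessed"), Bytes.toBytes(true));
      puts.add(put);
    }
    table.put(puts);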

- R


On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Rohit,
>
> The list is a bad idea. When you have millions of rows per region, are
> you going to put millions of them in memory in your list?
>
> Your MR job will scan the entire table, row by row. If you modify the
> current row, the scanner will not look at it again when it moves on to
> the next one. So there is no real issue with that.
>
> Also, instead of doing the puts one by one, I would recommend you
> buffer them (let's say, 100 at a time) and put them as a batch. Don't
> forget to push the remaining ones at the end of the job. The drawback
> is that if the MR job crashes, you will have some rows already
> processed but not marked as processed...
>
> JM
>
> 2013/6/22 Rohit Kelkar <ro...@gmail.com>:
> > I have a use case where I push data into my HTable in waves, followed by
> > mapper-only processing. Currently, once a row is processed in the map, I
> > immediately mark it as processed=true. For this, inside the map I execute
> > a table.put(isprocessed=true). I am not sure if modifying the table like
> > this is a good idea. I am also concerned that I am modifying the same
> > table that I am running the MR job on.
> > So I am thinking of another approach where I accumulate the processed
> > rows in a list (or a more compact data structure) and use the cleanup
> > method of the MR job to execute all the table.put(isprocessed=true)
> > calls at once.
> > What is the suggested best practice?
> >
> > - R
>

Re: running MR job and puts on the same table

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Rohit,

The list is a bad idea. When you have millions of rows per region, are
you going to put millions of them in memory in your list?

Your MR job will scan the entire table, row by row. If you modify the
current row, the scanner will not look at it again when it moves on to
the next one. So there is no real issue with that.

Also, instead of doing the puts one by one, I would recommend you
buffer them (let's say, 100 at a time) and put them as a batch. Don't
forget to push the remaining ones at the end of the job. The drawback
is that if the MR job crashes, you will have some rows already
processed but not marked as processed...
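
Something like this (a quick, untested sketch, reusing the placeholder
table/family/qualifier names from your mail):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedMarkProcessedMapper
        extends TableMapper<ImmutableBytesWritable, Result> {

      private static final int BATCH_SIZE = 100;

      private HTable table;
      private final List<Put> buffer = new ArrayList<Put>();

      @Override
      protected void setup(Context context) throws IOException {
        table = new HTable(context.getConfiguration(), "mytable");
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // ... actual per-row processing happens here ...

        Put put = new Put(row.get());
        put.add(Bytes.toBytes("d"), Bytes.toBytes("isprocessed"), Bytes.toBytes(true));
        buffer.add(put);
        if (buffer.size() >= BATCH_SIZE) {
          table.put(buffer); // one batched round trip instead of 100 single puts
          buffer.clear();
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // Push whatever is left in the buffer before the task finishes.
        if (!buffer.isEmpty()) {
          table.put(buffer);
          buffer.clear();
        }
        table.close();
      }
    }

You can also let the client buffer for you with table.setAutoFlush(false)
plus a write buffer size, but an explicit list makes it obvious when the
puts actually go out.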

JM

2013/6/22 Rohit Kelkar <ro...@gmail.com>:
> I have a use case where I push data into my HTable in waves, followed by
> mapper-only processing. Currently, once a row is processed in the map, I
> immediately mark it as processed=true. For this, inside the map I execute a
> table.put(isprocessed=true). I am not sure if modifying the table like this
> is a good idea. I am also concerned that I am modifying the same table that
> I am running the MR job on.
> So I am thinking of another approach where I accumulate the processed rows
> in a list (or a more compact data structure) and use the cleanup method of
> the MR job to execute all the table.put(isprocessed=true) calls at once.
> What is the suggested best practice?
>
> - R