Posted to user@hbase.apache.org by Slava Gorelik <sl...@gmail.com> on 2009/03/03 17:45:35 UTC

MR Job question

Hi. I have a small question about MR jobs. Is it possible to run an MR job on
part of a table?
For example, I have an MR job that runs on a table, and the next time I run
this job I want to get only the newly added or updated rows.

Thank You and Best Regards.

Re: MR Job question

Posted by Slava Gorelik <sl...@gmail.com>.

Hi. I understand that I need to write code, but I don't have any direction
on how to do what I need. Do you have an example of creating an MR job that
passes over a subset of rows?

Thank You and Best Regards.


On Wed, Mar 4, 2009 at 5:27 PM, schubert zhang <zs...@gmail.com> wrote:

> Hi Slava, I mean you should write by yourself, the mapreduce code in HBase
> is just example. Please study how to code mapreduce job.
> You should implement yourself:
> 1. how to split the input dataset, InputSplit
> 2. how to read each record of each split in each mapper, RecordReader
> 3. Implement yourself InputFormat
> 4. mapper and reducer class
> 5. how to write output record, RecordWriter
> 6. implement yourself OutputFormat
> ........

Re: MR Job question

Posted by schubert zhang <zs...@gmail.com>.

Hi Slava, I mean that you should write it yourself; the MapReduce code in
HBase is just an example. Please study how to code a MapReduce job.
You should implement the following yourself:
1. how to split the input dataset: InputSplit
2. how to read each record of each split in each mapper: RecordReader
3. your own InputFormat
4. the mapper and reducer classes
5. how to write each output record: RecordWriter
6. your own OutputFormat
........
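
To make the first few items of the list above concrete, here is a minimal
sketch that uses only the Java standard library. It does not use the Hadoop
or HBase classes: a TreeMap stands in for the table, and RangeJobSketch,
RowRangeSplit, getSplits, recordReader and runJob are hypothetical names
chosen for illustration, loosely mirroring InputSplit, InputFormat,
RecordReader and the mapper.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class RangeJobSketch {

    // 1. A split is a half-open row range [startRow, stopRow).
    static final class RowRangeSplit {
        final String startRow;
        final String stopRow;

        RowRangeSplit(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // 3. The "input format" cuts the requested range at the given boundaries,
    //    which are assumed to be sorted in ascending order.
    static List<RowRangeSplit> getSplits(String startRow, String stopRow, List<String> boundaries) {
        List<RowRangeSplit> splits = new ArrayList<>();
        String current = startRow;
        for (String boundary : boundaries) {
            if (boundary.compareTo(current) > 0 && boundary.compareTo(stopRow) < 0) {
                splits.add(new RowRangeSplit(current, boundary));
                current = boundary;
            }
        }
        splits.add(new RowRangeSplit(current, stopRow));
        return splits;
    }

    // 2. The "record reader" iterates only over the row keys inside one split.
    static Iterator<String> recordReader(NavigableMap<String, String> table, RowRangeSplit split) {
        return table.subMap(split.startRow, true, split.stopRow, false).keySet().iterator();
    }

    // 4. The "mapper" here just records each row key it is handed.
    static List<String> runJob(NavigableMap<String, String> table, List<RowRangeSplit> splits) {
        List<String> mapped = new ArrayList<>();
        for (RowRangeSplit split : splits) {
            Iterator<String> reader = recordReader(table, split);
            while (reader.hasNext()) {
                mapped.add(reader.next());
            }
        }
        return mapped;
    }

    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String key : new String[] {"a", "b", "c", "d", "e", "f"}) {
            table.put(key, "value-" + key);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> boundaries = new ArrayList<>();
        boundaries.add("c");
        boundaries.add("e");
        List<RowRangeSplit> splits = getSplits("b", "e", boundaries);
        System.out.println(splits.size() + " splits, rows mapped: " + runJob(sampleTable(), splits));
    }
}
```

The point of the sketch is that the splits, not the mapper, decide which rows
are visited: rows outside the requested range never reach the map step.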


On Wed, Mar 4, 2009 at 8:45 PM, Slava Gorelik <sl...@gmail.com>wrote:

> How can you tell that ? There no interface in MR Job definition that allows
> that.Every sample of MR Job in Hbase is works like that (this is a map from
> RowCounter):
>
> public void map(ImmutableBytesWritable row, RowResult value,
>    OutputCollector<ImmutableBytesWritable, RowResult> output,
>    @SuppressWarnings("unused") Reporter reporter)
>  throws IOException {
>    boolean content = false;
>    for (Map.Entry<byte [], Cell> e: value.entrySet()) {
>      Cell cell = e.getValue();
>      if (cell != null && cell.getValue().length > 0) {
>        content = true;
>        break;
>      }
>    }
>    if (!content) {
>      return;
>    }
>
> You can't say which rows you want to get.
>
> Best Regards.
> Slava.

Re: MR Job question

Posted by Slava Gorelik <sl...@gmail.com>.

How can you tell it that? There is no interface in the MR job definition that
allows that. Every sample MR job in HBase works like this (this is the map
method from RowCounter):

public void map(ImmutableBytesWritable row, RowResult value,
    OutputCollector<ImmutableBytesWritable, RowResult> output,
    @SuppressWarnings("unused") Reporter reporter)
  throws IOException {
    boolean content = false;
    for (Map.Entry<byte [], Cell> e: value.entrySet()) {
      Cell cell = e.getValue();
      if (cell != null && cell.getValue().length > 0) {
        content = true;
        break;
      }
    }
    if (!content) {
      return;
    }
    // ... (the rest of the method was not included in this message)
}

You can't say which rows you want to get.

Best Regards.
Slava.


On Wed, Mar 4, 2009 at 1:31 PM, schubert zhang <zs...@gmail.com> wrote:

> In my job, I can tell the MR job the startRow and endRow, i.e. a row
> range. Then my MR job can only scan the region(s) in the range, and should
> not scan from begin of table or tablet/region to the end.
>
> So,  Slava, you should modify you code of MR job to do what you want.
>
> Schubert

Re: MR Job question

Posted by schubert zhang <zs...@gmail.com>.

In my job, I can tell the MR job the startRow and endRow, i.e. a row
range. Then my MR job scans only the region(s) in that range, and does
not scan from the beginning of the table or tablet/region to the end.
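
As a rough illustration of restricting a job to the regions in a row range,
the sketch below keeps only the regions whose key range overlaps
[startRow, stopRow). It uses plain Java with String keys; the Region class
and the regionsInRange helper are hypothetical and are not part of the HBase
API. It also assumes that an empty start or end key marks the first or last
region of the table.

```java
import java.util.ArrayList;
import java.util.List;

class RegionPruningSketch {

    // A region covers the half-open key range [startKey, endKey).
    // An empty startKey means "from the beginning of the table";
    // an empty endKey means "to the end of the table".
    static final class Region {
        final String startKey;
        final String endKey;

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + "," + endKey + ")";
        }
    }

    // Keeps only the regions that overlap the requested range [startRow, stopRow).
    static List<Region> regionsInRange(List<Region> regions, String startRow, String stopRow) {
        List<Region> selected = new ArrayList<>();
        for (Region region : regions) {
            boolean endsBeforeRange = !region.endKey.isEmpty() && region.endKey.compareTo(startRow) <= 0;
            boolean startsAfterRange = region.startKey.compareTo(stopRow) >= 0;
            if (!endsBeforeRange && !startsAfterRange) {
                selected.add(region);
            }
        }
        return selected;
    }

    static List<Region> sampleRegions() {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("", "g"));
        regions.add(new Region("g", "n"));
        regions.add(new Region("n", "t"));
        regions.add(new Region("t", ""));
        return regions;
    }

    public static void main(String[] args) {
        // Only the two middle regions can hold rows in [h, p).
        System.out.println(regionsInRange(sampleRegions(), "h", "p"));
    }
}
```

Each selected region would then become one input split, so map tasks are
started only for the part of the table that can contain rows in the range.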

So, Slava, you should modify the code of your MR job to do what you want.

Schubert

On Wed, Mar 4, 2009 at 4:58 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi.I'm confused a little bit.
>
> Please correct me if I wrong, but MR Job is it self is "scanning" all rows
> in the table. The job is spread into each region server, into
> multiple threads. Each thread get some part of the rows that are placed in
> particular region server. So, the MR jobs is finished when all
> threads are passed over all rows. Filtering will help the MR job only to
> filter out non-relevant rows, but any way those rows will be checked (passed
> to the filter), this not helps a lot, job still passing over all rows in the
> table. Calling a scanner inside MR Job, will not
> prevent from the job to pass over all rows, it simple will make job
> more heavy(as i understand that). Is it correct, Michael ?
>
> So, my question is how can I tell to MR Job to pass over some rows and not
> all rows.
>
> Thank You and Best Regards.
> Slava.

Re: MR Job question

Posted by Slava Gorelik <sl...@gmail.com>.

Hi. I'm a little confused.

Please correct me if I'm wrong, but the MR job itself is "scanning" all rows
in the table. The job is spread across the region servers and runs in
multiple threads. Each thread gets some part of the rows that are stored on a
particular region server. So the MR job is finished when all the
threads have passed over all the rows. Filtering helps the MR job only by
filtering out non-relevant rows, but those rows are still checked (passed
to the filter), so this does not help much: the job still passes over all
rows in the table. Calling a scanner inside the MR job will not
prevent the job from passing over all rows; it will simply make the job
heavier (as I understand it). Is that correct, Michael?
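
The difference described above can be shown with a small stand-alone sketch.
It is plain Java, not HBase code: a TreeMap plays the role of the sorted
table, and FilterVersusRangeSketch, ScanStats, fullScanWithFilter and
rangeScan are hypothetical names. It counts how many rows each approach has
to examine in order to return the same 100 rows.

```java
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

class FilterVersusRangeSketch {

    // How many rows a scan looked at, and how many it returned.
    static final class ScanStats {
        final int examined;
        final int returned;

        ScanStats(int examined, int returned) {
            this.examined = examined;
            this.returned = returned;
        }
    }

    // Full scan with a prefix filter: every row is examined, few are returned.
    static ScanStats fullScanWithFilter(NavigableMap<String, String> table, String prefix) {
        int examined = 0;
        int returned = 0;
        for (String key : table.keySet()) {
            examined++;
            if (key.startsWith(prefix)) {
                returned++;
            }
        }
        return new ScanStats(examined, returned);
    }

    // Range scan: only the rows in [startRow, stopRow) are ever touched.
    static ScanStats rangeScan(NavigableMap<String, String> table, String startRow, String stopRow) {
        int examined = 0;
        for (String key : table.subMap(startRow, true, stopRow, false).keySet()) {
            examined++;
        }
        return new ScanStats(examined, examined);
    }

    // 1000 rows with keys row-0000 .. row-0999.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format(Locale.ROOT, "row-%04d", i), "value");
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        ScanStats filtered = fullScanWithFilter(table, "row-09");
        ScanStats ranged = rangeScan(table, "row-09", "row-10");
        System.out.println("filter: examined " + filtered.examined + ", returned " + filtered.returned);
        System.out.println("range:  examined " + ranged.examined + ", returned " + ranged.returned);
    }
}
```

Both calls return the same rows, but the filter version walks the whole
table, which is the cost the question above is about.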

So, my question is: how can I tell the MR job to pass over only some rows and
not all rows?

Thank You and Best Regards.
Slava.

On Wed, Mar 4, 2009 at 8:57 AM, stack <st...@duboce.net> wrote:

> On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <zs...@gmail.com> wrote:
>
> > Yes, we can tell HBase API only scan rows start with a key.
> >
>
> Would the filtering feature help here?
>
>
> http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
>
> Scanners can be passed a filter (Read the description section on the above
> url).
>
>
> Can any expert share your ideas about:
> > 1. If the rowkey is not chronological, how can I only process the newly
> > added/updated rows?
>
>
> We don't have a means of asking for versions before a timestamp, only older
> (Can you add timestamp to your row key if you need this?)
>
>
> > 2. How can I remove the old rows which are inserted three months ago?
> >
>
> See above.
>
> St.Ack
>

Re: MR Job question

Posted by stack <st...@duboce.net>.

On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <zs...@gmail.com> wrote:

> Yes, we can tell HBase API only scan rows start with a key.
>

Would the filtering feature help here?

http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description

Scanners can be passed a filter (read the description section at the above
URL).


> Can any expert share your ideas about:
> 1. If the rowkey is not chronological, how can I only process the newly
> added/updated rows?


We don't have a means of asking for versions newer than a timestamp, only
older. (Can you add a timestamp to your row key if you need this?)
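
A small sketch of what a timestamp upper bound gives you and what it does
not, written in plain Java rather than against the HBase API. Each row is
modelled as a map from version timestamp to value; VersionTimestampSketch,
readAtOrBefore and rowsChangedSince are hypothetical helpers. Finding rows
changed since the last run is done here on the client side, which still
visits every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class VersionTimestampSketch {

    // Reading with an upper bound: the newest version with timestamp <= upperBound.
    static String readAtOrBefore(NavigableMap<Long, String> versions, long upperBound) {
        Map.Entry<Long, String> entry = versions.floorEntry(upperBound);
        return entry == null ? null : entry.getValue();
    }

    // Client-side check for "changed since the last run":
    // a row qualifies if its newest version is newer than lastRun.
    static List<String> rowsChangedSince(Map<String, NavigableMap<Long, String>> table, long lastRun) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, NavigableMap<Long, String>> row : table.entrySet()) {
            NavigableMap<Long, String> versions = row.getValue();
            if (!versions.isEmpty() && versions.lastKey() > lastRun) {
                changed.add(row.getKey());
            }
        }
        return changed;
    }

    static Map<String, NavigableMap<Long, String>> sampleTable() {
        Map<String, NavigableMap<Long, String>> table = new TreeMap<>();
        NavigableMap<Long, String> row1 = new TreeMap<>();
        row1.put(100L, "a1");
        row1.put(300L, "a2");
        NavigableMap<Long, String> row2 = new TreeMap<>();
        row2.put(150L, "b1");
        table.put("row1", row1);
        table.put("row2", row2);
        return table;
    }

    public static void main(String[] args) {
        Map<String, NavigableMap<Long, String>> table = sampleTable();
        System.out.println(readAtOrBefore(table.get("row1"), 200L));
        System.out.println(rowsChangedSince(table, 200L));
    }
}
```

Putting the time into the row key, as suggested above, turns the "changed
since" question into a key range instead of a per-row check.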


> 2. How can I remove the old rows which are inserted three months ago?
>

See above.

St.Ack

Re: MR Job question

Posted by schubert zhang <zs...@gmail.com>.

For an MR job, you should write your code to call the HBase API.

On Wed, Mar 4, 2009 at 1:14 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Yes, but as I understand this is a not MR  Job. This is a scanner usage.
> Best Regards.
> Slava.

Re: MR Job question

Posted by Slava Gorelik <sl...@gmail.com>.

Yes, but as I understand it, this is not an MR job. This is scanner usage.
Best Regards.
Slava.

On Wed, Mar 4, 2009 at 4:17 AM, schubert zhang <zs...@gmail.com> wrote:

> Yes, we can tell HBase API only scan rows start with a key.
> // get rows start from startRow to table end
> HTable.getScanner(final byte[][] columns, final byte [] startRow)
>
> // get rows start from startRow to table end, only the cells time stamp
> <= timestamp are retrieved
> HTable.getScanner(final byte[][] columns, final byte [] startRow, long
> timestamp)
>
> // get row range [startRow, endRow )
> HTable.getScanner(final byte [][] columns, final byte [] startRow, final
> byte [] stopRow)
>
> // get row range [startRow, endRow ), only the cells time stamp <= timestamp
> // are retrieved
> HTable.getScanner(final byte [][] columns, final byte [] startRow, final
> byte [] stopRow, final long timestamp)
>
> Can any expert share your ideas about:
> 1. If the rowkey is not chronological, how can I only process the newly
> added/updated rows?
> 2. How can I remove the old rows which are inserted three months ago?
>
> Schubert

Re: MR Job question

Posted by schubert zhang <zs...@gmail.com>.

Yes, we can tell the HBase API to scan only rows starting from a given key.
// get rows from startRow to the end of the table
HTable.getScanner(final byte [][] columns, final byte [] startRow)

// get rows from startRow to the end of the table; only cells with
// time stamp <= timestamp are retrieved
HTable.getScanner(final byte [][] columns, final byte [] startRow,
    long timestamp)

// get the row range [startRow, stopRow)
HTable.getScanner(final byte [][] columns, final byte [] startRow,
    final byte [] stopRow)

// get the row range [startRow, stopRow); only cells with
// time stamp <= timestamp are retrieved
HTable.getScanner(final byte [][] columns, final byte [] startRow,
    final byte [] stopRow, final long timestamp)
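
To illustrate the half-open range semantics of the variants listed above
without needing an HBase cluster, here is a minimal stand-alone sketch. A
TreeMap stands in for a table sorted by row key, and String keys are used
instead of byte arrays for readability; RowRangeScanSketch and its scan
helper are hypothetical and only mimic the [startRow, stopRow) behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class RowRangeScanSketch {

    // Returns the row keys in [startRow, stopRow).
    // A null stopRow means "scan to the end of the table".
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        NavigableMap<String, String> range = (stopRow == null)
            ? table.tailMap(startRow, true)
            : table.subMap(startRow, true, stopRow, false);
        return new ArrayList<>(range.keySet());
    }

    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String key : new String[] {"row-a", "row-b", "row-c", "row-d"}) {
            table.put(key, "value of " + key);
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        System.out.println(scan(table, "row-b", "row-d")); // the stop row itself is excluded
        System.out.println(scan(table, "row-c", null));    // open-ended scan to the table end
    }
}
```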

Can any experts share their ideas about:
1. If the row key is not chronological, how can I process only the newly
added/updated rows?
2. How can I remove old rows that were inserted three months ago?

Schubert

On Wed, Mar 4, 2009 at 3:10 AM, Slava Gorelik <sl...@gmail.com>wrote:

> Thank You for the answer.How can you tell to MR jobs which rows you want to
> get ? Is it possible to tell to MR Job give me only rows that starts with
> some key ?
>
> Best Regards.
> Slava

Re: MR Job question

Posted by Slava Gorelik <sl...@gmail.com>.

Thank you for the answer. How can you tell an MR job which rows you want to
get? Is it possible to tell an MR job to give me only the rows that start with
some key?

Best Regards.
Slava

On Tue, Mar 3, 2009 at 7:33 PM, schubert zhang <zs...@gmail.com> wrote:

> In my practice, I define the 'time' as the first part of rowkey, then I can
> only process the newly added rows.
> I think my practice is not good and not appropriate for other cases, since
> the rowkey definition is so important.
> And I also want to know any good ideas.
>
> Another question is, how can I remove all rows which are inserted three
> months ago?

Re: MR Job question

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

And if you go with the timestamp, there is an open issue that deals with this
problem:
HBASE-1170

If there is a set time for which you want to keep the data, there is always
the TTL option on the table's columns.

Billy


"stack" <st...@duboce.net> wrote in message 
news:7c962aed0903032253n50753c66q57a0c8c4fef2d303@mail.gmail.com...
>I think time as part of the row key will be a fairly common practise; if it
> suits your access pattern, go for it.
>
> Regards how to get rid of all rows inserted three months ago, since your
> keys have timestamp embedded, can you not scan your table deleting all
> timestamps older than 3months?   Or, alter your table adding a timeout on
> the column of 3 months and then bring your table back on line.  At the next
> major compaction, once a day if default, cells older than 3 months will be
> deleted.
>
> St.Ack

Re: MR Job question

Posted by stack <st...@duboce.net>.

I think time as part of the row key will be a fairly common practice; if it
suits your access pattern, go for it.

Regarding how to get rid of all rows inserted three months ago: since your
keys have the timestamp embedded, can you not scan your table, deleting all
rows with timestamps older than 3 months? Or alter your table, adding a
timeout of 3 months on the column, and then bring your table back online.
At the next major compaction (once a day by default), cells older than
3 months will be deleted.
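
The first option (scan and delete by key) could look roughly like the sketch
below. It is plain Java with a TreeMap standing in for the table. The key
layout, a zero-padded epoch-millisecond prefix followed by "-" and a suffix,
is an assumption made for this example, as are the names ExpireOldRowsSketch,
rowKey and deleteOlderThan.

```java
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

class ExpireOldRowsSketch {

    // Row keys start with a zero-padded, non-negative epoch-millis timestamp,
    // so that lexicographic key order matches chronological order.
    static String rowKey(long epochMillis, String suffix) {
        return String.format(Locale.ROOT, "%013d-%s", epochMillis, suffix);
    }

    // Deletes every row whose key timestamp is older than cutoffMillis
    // and returns the number of rows removed.
    static int deleteOlderThan(NavigableMap<String, String> table, long cutoffMillis) {
        NavigableMap<String, String> old = table.headMap(rowKey(cutoffMillis, ""), false);
        int removed = old.size();
        old.clear(); // clearing the view removes the rows from the backing table
        return removed;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        NavigableMap<String, String> table = new TreeMap<>();
        table.put(rowKey(now - TimeUnit.DAYS.toMillis(120), "old"), "stale");
        table.put(rowKey(now - TimeUnit.DAYS.toMillis(10), "new"), "fresh");
        // 90 days is used here as an approximation of "three months".
        int removed = deleteOlderThan(table, now - TimeUnit.DAYS.toMillis(90));
        System.out.println(removed + " row(s) removed, " + table.size() + " kept");
    }
}
```

Because the old rows form one contiguous key range, the delete only has to
touch that range rather than the whole table.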

St.Ack

On Tue, Mar 3, 2009 at 9:33 AM, schubert zhang <zs...@gmail.com> wrote:

> In my practice, I define the 'time' as the first part of rowkey, then I can
> only process the newly added rows.
> I think my practice is not good and not appropriate for other cases, since
> the rowkey definition is so important.
> And I also want to know any good ideas.
>
> Another question is, how can I remove all rows which are inserted three
> months ago?

Re: MR Job question

Posted by schubert zhang <zs...@gmail.com>.

In my practice, I define the 'time' as the first part of the row key, so that
I can process only the newly added rows.
I think my practice is not good and not appropriate for other cases, since
the row key definition is so important.
I would also like to hear any good ideas.
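
A minimal sketch of the time-first row key idea, in plain Java with a TreeMap
standing in for the table. The exact key layout (a zero-padded
epoch-millisecond prefix, then "-" and an id) is an assumption for
illustration, and TimePrefixedKeySketch, rowKey and rowsAddedSince are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

class TimePrefixedKeySketch {

    // Time comes first in the key, zero-padded so that string order equals time order.
    // Timestamps are assumed to be non-negative.
    static String rowKey(long epochMillis, String id) {
        return String.format(Locale.ROOT, "%013d-%s", epochMillis, id);
    }

    // Rows added at or after sinceMillis form the tail of the sorted table,
    // so the next run can start its scan at this key.
    static List<String> rowsAddedSince(NavigableMap<String, String> table, long sinceMillis) {
        return new ArrayList<>(table.tailMap(rowKey(sinceMillis, ""), true).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put(rowKey(1000L, "a"), "first run");
        table.put(rowKey(2000L, "b"), "second run");
        table.put(rowKey(3000L, "c"), "second run");
        // Rows with a time component of 2000 or later.
        System.out.println(rowsAddedSince(table, 2000L));
    }
}
```

This only finds newly added rows; an update to an existing row keeps its old
key, so it would not show up in the tail.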

Another question: how can I remove all rows that were inserted three
months ago?

On Wed, Mar 4, 2009 at 12:45 AM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi.I have a small question about MR jobs. Is it possible to run MR job on
> part of the table ?
> For example I have MR job running on table and next time when run this
> job, I want to get only newly added or updated rows.
>
> Thank You and Best Regards.
>