Posted to user@hbase.apache.org by Felipe Sodré Silva <fs...@gmail.com> on 2015/03/04 02:23:20 UTC

Different time ranges for different cfs when using TableInputFormat

When using TableInputFormat to make HBase data available to map/reduce
jobs we can use the settings SCAN_TIMERANGE_START and
SCAN_TIMERANGE_END to specify a time range during scan.
Is it possible to somehow have different time ranges for different
column families?

This is my problem:
I have table X with column families cf1, cf2 and cf3. I want to run a
map/reduce job on it using the most recent versions of columns in cf1
and cf2, but I want to use yesterday's data from cf3. Is this
possible?

Felipe

Re: Different time ranges for different cfs when using TableInputFormat

Posted by Nick Dimiduk <nd...@gmail.com>.
Have a look at the overloads of TableMapReduceUtil#initTableMapperJob that
take a List<Scan>. Does that provide what you're looking for?

-n
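
With the List<Scan> overloads, each Scan can carry its own column families (Scan#addFamily) and its own time range (Scan#setTimeRange), and each Scan also has to name its table through the Scan.SCAN_ATTRIBUTES_TABLE_NAME attribute. As far as I understand, the scans are read independently, so the same row key may reach the job once per scan and the pieces would still have to be combined, for example in a reducer keyed by row. The sketch below shows only that combining step in plain Java, with no HBase dependency; ScanResultMerger, mergeByRow and the "family:qualifier" string keys are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class ScanResultMerger {

    /**
     * Combines the output of two scans over the same table.
     * Outer key: row key. Inner key: "family:qualifier". Inner value: cell value.
     * Assumes the two scans cover disjoint column families, so no column collides;
     * if they did overlap, the value from the second scan would win.
     */
    static Map<String, Map<String, String>> mergeByRow(
            Map<String, Map<String, String>> firstScan,
            Map<String, Map<String, String>> secondScan) {
        Map<String, Map<String, String>> merged = new TreeMap<>();
        for (Map<String, Map<String, String>> scan : Arrays.asList(firstScan, secondScan)) {
            for (Map.Entry<String, Map<String, String>> row : scan.entrySet()) {
                // Create the row's column map on first sight, then add this scan's columns.
                merged.computeIfAbsent(row.getKey(), k -> new TreeMap<>())
                      .putAll(row.getValue());
            }
        }
        return merged;
    }
}
```

In the scenario from the original question, the first map would hold the current cf1 and cf2 columns and the second would hold yesterday's cf3 columns for the same rows.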

On Wed, Mar 4, 2015 at 6:05 AM, Dave Latham <la...@davelink.net> wrote:

> That's not possible with HBase today.  The simplest thing may be to set
> your Scan time range to include both today's and yesterday's data and then
> filter down to only the data you want inside your map task.  Other
> possibilities would be creating a custom filter to do the filtering on the
> server side or even changing your input format or map task to run two
> concurrent scans with different families/time ranges and merging the
> results.
>
> Being able to specify different time ranges for different column families
> is something I'd like to do as well.  Perhaps we'll get that into HBase at
> some point.
>
> Dave
>

Re: Different time ranges for different cfs when using TableInputFormat

Posted by Dave Latham <la...@davelink.net>.
That's not possible with HBase today.  The simplest thing may be to set
your Scan time range to include both today's and yesterday's data and then
filter down to only the data you want inside your map task.  Other
possibilities would be creating a custom filter to do the filtering on the
server side or even changing your input format or map task to run two
concurrent scans with different families/time ranges and merging the results.
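
The first option (one wide time range, then filtering inside the map task) could look roughly like the sketch below. It is plain Java with no HBase dependency: Cell, TimeRange and newestPerColumn are hypothetical stand-ins for illustration, not HBase classes. It assumes the scan returns more than one version per column (TableInputFormat has a SCAN_MAXVERSIONS setting for that), since otherwise yesterday's cf3 value would be hidden behind a newer one. The range is treated as half-open, [min, max), which is how Scan#setTimeRange bounds its arguments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FamilyTimeRangeFilter {

    /** Hypothetical stand-in for one cell of a row: family, qualifier, timestamp (ms), value. */
    static final class Cell {
        final String family;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String family, String qualifier, long timestamp, String value) {
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Half-open time range: min is inclusive, max is exclusive. */
    static final class TimeRange {
        final long min;
        final long max;

        TimeRange(long min, long max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(long ts) {
            return ts >= min && ts < max;
        }
    }

    /**
     * For each column, keeps the newest cell whose timestamp falls inside the
     * range configured for that column's family. Cells from families without a
     * configured range are dropped.
     */
    static List<Cell> newestPerColumn(List<Cell> rowCells, Map<String, TimeRange> rangeByFamily) {
        Map<String, Cell> newest = new LinkedHashMap<>();
        for (Cell cell : rowCells) {
            TimeRange range = rangeByFamily.get(cell.family);
            if (range == null || !range.contains(cell.timestamp)) {
                continue; // outside this family's window
            }
            String column = cell.family + ":" + cell.qualifier;
            Cell current = newest.get(column);
            if (current == null || cell.timestamp > current.timestamp) {
                newest.put(column, cell);
            }
        }
        return new ArrayList<>(newest.values());
    }
}
```

For the case in the question, cf1 and cf2 would map to an open-ended range and cf3 to a range covering only yesterday. The trade-off is that every version in the wide range is shipped to the mapper before being discarded, which is why a server-side filter is listed as an alternative.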

Being able to specify different time ranges for different column families
is something I'd like to do as well.  Perhaps we'll get that into HBase at
some point.

Dave

On Tue, Mar 3, 2015 at 5:23 PM, Felipe Sodré Silva <fs...@gmail.com> wrote:

> When using TableInputFormat to make HBase data available to map/reduce
> jobs we can use the settings SCAN_TIMERANGE_START and
> SCAN_TIMERANGE_END to specify a time range during scan.
> Is it possible to somehow have different time ranges for different
> column families?
>
> This is my problem:
> I have table X with column families cf1, cf2 and cf3. I want to run a
> map/reduce job on it using the most recent versions of columns in cf1
> and cf2, but I want to use yesterday's data from cf3. Is this
> possible?
>
> Felipe
>