Posted to user@hbase.apache.org by Nishanth S <ni...@gmail.com> on 2014/10/08 17:50:40 UTC

Loading hbase from parquet files

Hey folks,

I am evaluating loading an HBase table from Parquet files, based on some
rules that would be applied to the Parquet file records. Could someone
help me with the best way to do this?


Thanks,
Nishan

Re: Loading hbase from parquet files

Posted by Nishanth S <ni...@gmail.com>.
Thank you guys for the information.

-cheers
Nishan


Re: Loading hbase from parquet files

Posted by Andrey Stepachev <oc...@gmail.com>.
For that use case I'd prefer to write new, filtered HFiles with MapReduce
and then import that data into HBase using bulk import. Keep in mind that
the incremental load tool moves files rather than copying them, so once
written you will not do any additional writes (except for regions that
were split while you were filtering the data). If the imported data is
small, that would not be a problem.
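
Concretely, that pipeline might look roughly like the sketch below (an
illustration only, not code from this thread): a MapReduce job reads the
Parquet records, applies the filtering rule in the mapper, writes HFiles
with HFileOutputFormat2, and then hands them to LoadIncrementalHFiles.
The table name, the column family "d", the qualifier "field", the
"needle" rule and the Parquet column names are invented placeholders,
and the HTable-based signatures are the ones from the HBase 0.98 line
current when this thread was written (newer releases take a Table and a
RegionLocator instead).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

import parquet.example.data.Group;                     // org.apache.parquet.* in newer parquet-mr
import parquet.hadoop.example.ExampleInputFormat;

public class ParquetFilterBulkLoad {

  /** Keeps only Parquet records that match the (placeholder) rule and turns them into Puts. */
  public static class FilterMapper
      extends Mapper<Void, Group, ImmutableBytesWritable, Put> {
    @Override
    protected void map(Void key, Group record, Context ctx)
        throws IOException, InterruptedException {
      String field = record.getString("field", 0);      // "field" is a placeholder column
      if (field == null || !field.contains("needle")) {
        return;                                         // placeholder rule: keep rows containing "needle"
      }
      byte[] row = Bytes.toBytes(record.getString("rowkey", 0));
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("field"),   // addColumn(...) in newer HBase
          Bytes.toBytes(field));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "parquet-filter-bulkload");
    job.setJarByClass(ParquetFilterBulkLoad.class);
    job.setMapperClass(FilterMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setInputFormatClass(ExampleInputFormat.class);
    ExampleInputFormat.addInputPath(job, new Path(args[0]));   // Parquet input dir
    HFileOutputFormat2.setOutputPath(job, new Path(args[1]));  // HFile staging dir

    HTable table = new HTable(conf, args[2]);
    // Wires in the partitioner and sort reducer so a properly ordered set of
    // HFiles is produced for each region of the target table.
    HFileOutputFormat2.configureIncrementalLoad(job, table);

    boolean ok = job.waitForCompletion(true);
    if (ok) {
      // Moves (not copies) the staged HFiles into the table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
    table.close();
    System.exit(ok ? 0 : 1);
  }
}

The filtering cost is paid once in the mapper; the region servers only
see the finished HFiles that the incremental load moves into place.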

-- 
Andrey.

Re: Loading hbase from parquet files

Posted by Ted Yu <yu...@gmail.com>.
Since storage is your primary concern, take a look at Doug Meil's blog 'The
Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size':
http://blogs.apache.org/hbase/

Cheers
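
As a rough illustration of the blog's point (every cell stores its full
row key, family and qualifier, so short names, compression and block
encoding all shrink HFiles), a table definition might look like the
sketch below. The table name "archive", the one-letter family "d" and
the particular codec/encoding choices are only examples, not
recommendations from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class CreateCompactTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("archive"));
    HColumnDescriptor cf = new HColumnDescriptor("d");     // short family name keeps per-cell coordinates small
    cf.setCompressionType(Compression.Algorithm.SNAPPY);   // on-disk compression
    cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);  // encodes repeated key parts compactly
    desc.addFamily(cf);
    try (HBaseAdmin admin = new HBaseAdmin(conf)) {        // Admin/ConnectionFactory in newer HBase
      admin.createTable(desc);
    }
  }
}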


Re: Loading hbase from parquet files

Posted by Nishanth S <ni...@gmail.com>.
Thanks Andrey. In the current system the HBase column families have a TTL
of 30 days and the data gets deleted after that (they use Snappy
compression). Below is what I am trying to achieve:

1. Export the data from the HBase table before it gets deleted.
2. Store it in a format that supports maximum compression (storage cost is
my primary concern here), so I am looking at Parquet.
3. Load a subset of this data back into HBase based on certain rules (say
I want to load all rows that have a particular string in one of the
fields).

I was thinking of bulk loading this data back into HBase, but I am not
sure how I can load a subset of the data using the
org.apache.hadoop.hbase.mapreduce.Driver import.
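
Steps 1 and 2 could be combined into a single map-only job that scans the
slice of data about to expire and rewrites it as compressed Parquet. The
sketch below is only an illustration: the two-column schema, the
family/qualifier names ("d", "field"), the 29-30 day time window and the
choice of GZIP as a stand-in for "maximum compression" are all
assumptions.

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

import parquet.example.data.Group;
import parquet.example.data.simple.SimpleGroupFactory;
import parquet.hadoop.ParquetOutputFormat;
import parquet.hadoop.example.ExampleOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class ExpiringRowsToParquet {

  static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message row { required binary rowkey (UTF8); required binary field (UTF8); }");

  /** Turns each scanned HBase row into a two-column Parquet record. */
  public static class ExportMapper extends TableMapper<Void, Group> {
    private final SimpleGroupFactory groups = new SimpleGroupFactory(SCHEMA);

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
        throws IOException, InterruptedException {
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("field"));
      if (value == null) {
        return;
      }
      Group g = groups.newGroup()
          .append("rowkey", Bytes.toString(row.copyBytes()))
          .append("field", Bytes.toString(value));
      ctx.write(null, g);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "expiring-rows-to-parquet");
    job.setJarByClass(ExpiringRowsToParquet.class);

    long now = System.currentTimeMillis();
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("d"));
    // Only the oldest slice: cells written 29-30 days ago, just before the TTL fires.
    scan.setTimeRange(now - TimeUnit.DAYS.toMillis(30), now - TimeUnit.DAYS.toMillis(29));
    scan.setCaching(500);
    scan.setCacheBlocks(false);            // recommended for MapReduce scans

    TableMapReduceUtil.initTableMapperJob(
        args[0], scan, ExportMapper.class, Void.class, Group.class, job);

    job.setNumReduceTasks(0);              // map-only export
    job.setOutputFormatClass(ExampleOutputFormat.class);
    ExampleOutputFormat.setSchema(job, SCHEMA);
    ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
    ExampleOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether GZIP-compressed Parquet actually beats Snappy-compressed HFiles
for this data is worth measuring rather than assuming.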







Re: Loading hbase from parquet files

Posted by Andrey Stepachev <oc...@gmail.com>.
Hi Nishanth.

It is not clear exactly what you are building. Can you share a more
detailed description of what you are building and how the Parquet files
are supposed to be ingested? A couple of questions arise:
1. Is this an online import or a bulk load?
2. Why do the rules need to be deployed to the cluster? Do you plan to
do the reading inside the HBase region server?

As for deploying filters, you can try to use coprocessors instead. They
can be configurable and loadable (but not unloadable, so you need to
think about some class-loading magic like ClassWorlds).
For bulk imports you can create HFiles directly and add them
incrementally:
http://hbase.apache.org/book/arch.bulk.load.html
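
As a very rough sketch of the coprocessor idea (not working code from
this thread), a RegionObserver could read its rule from configuration
and drop incoming Puts that do not match, so the rule text can change
without shipping a new filter class. The names "d", "field" and
"import.rule.substring" are invented, and the API shown is the pre-2.0
BaseRegionObserver that was current at the time.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class RuleFilteringObserver extends BaseRegionObserver {

  private String needle = "";

  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    // The rule is plain configuration, so it can be changed by reloading the
    // coprocessor with a different value instead of deploying new code.
    needle = env.getConfiguration().get("import.rule.substring", "");
  }

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
      WALEdit edit, Durability durability) throws IOException {
    List<Cell> cells = put.get(Bytes.toBytes("d"), Bytes.toBytes("field"));
    for (Cell cell : cells) {
      if (Bytes.toString(CellUtil.cloneValue(cell)).contains(needle)) {
        return;               // rule matched, let the Put through
      }
    }
    c.bypass();               // silently drop Puts that do not match the rule
  }
}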

-- 
Andrey.

Re: Loading hbase from parquet files

Posted by Nishanth S <ni...@gmail.com>.
I was thinking of using the org.apache.hadoop.hbase.mapreduce.Driver
import. I could see that we can pass filters to this utility, but that
looks less flexible since you need to deploy a new filter every time the
rules for processing records change. Is there some way that we could
define a rules engine?


Thanks,
-Nishan
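
For the concrete rule mentioned earlier (keep rows whose field contains
a particular string), HBase's built-in, parameterized filters already
cover it, so nothing new has to be deployed when the string changes;
only a genuinely custom rule would need its own Filter class or a rules
engine inside the MapReduce job that prepares the data. A minimal
client-side sketch, assuming a placeholder table "archive", family "d"
and qualifier "field", with the 0.98-era HTable API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class SubstringRuleScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // The string to match is just a runtime argument; no new class to deploy.
    SingleColumnValueFilter rule = new SingleColumnValueFilter(
        Bytes.toBytes("d"), Bytes.toBytes("field"),
        CompareOp.EQUAL, new SubstringComparator(args[0]));
    rule.setFilterIfMissing(true);           // skip rows that lack the column entirely

    Scan scan = new Scan();
    scan.setFilter(rule);

    try (HTable table = new HTable(conf, "archive");    // placeholder table name
         ResultScanner results = table.getScanner(scan)) {
      for (Result r : results) {
        System.out.println(Bytes.toString(r.getRow())); // each matching row key
      }
    }
  }
}

If the rules get more complicated than a substring match, keeping them
in the MapReduce job that writes the HFiles (as in the earlier sketch)
keeps them ordinary Java code rather than cluster-deployed filters.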
