Posted to user@hbase.apache.org by Oleg Ruchovets <or...@gmail.com> on 2011/03/19 23:00:32 UTC

hbase insertion optimisation:

   We want to insert into HBase on a daily basis (HBase 0.90.1, Hadoop append).
Currently we have ~10 million records per day. We use map/reduce to prepare
the data and write it to HBase in chunks (5000 puts per chunk).
   The whole process takes 1h 20 minutes; some tests showed that the writes
to HBase take ~1 hour of that.

I have a couple of questions:
  1) The reducers write data whose keys look like <date>_<some_text>, and the
strange thing is that all records were written to a single node.

    Is this the correct behaviour? What is the way to get a better distribution
across the cluster? During the insertion process I saw that most of the load
went to the one node where all the data was being inserted, while the other
nodes had almost no resource utilisation (CPU, I/O, ...).

Oleg.

Re: hbase insertion optimisation:

Posted by Oleg Ruchovets <or...@gmail.com>.
On Sun, Mar 20, 2011 at 5:58 PM, Ted Yu <yu...@gmail.com> wrote:

> For 1), if you apply hashing to <date>_<somedata>, the date prefix wouldn't
> be useful.
> You should evaluate the distribution of <somedata> as the row key. Assuming
> the distribution is uneven, you can apply a hashing function to the row key.
> Using MurmurHash is as simple as:
> MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed)
>


Thank you for the quick answer.

In the case of a key <date>_<somedata>, where the date is 20110310, the
prefix "20110310_" has length 9:

import org.apache.hadoop.hbase.util.Hash;
import org.apache.hadoop.hbase.util.MurmurHash;

public class MyHash extends MurmurHash {
  private static MyHash _instance = new MyHash();

  public static Hash getInstance() {
    return _instance;
  }

  // Skip the 9-byte "<yyyyMMdd>_" prefix so that only <somedata> is hashed.
  @Override
  public int hash(byte[] data, int offset, int length, int seed) {
    return super.hash(data, 9, length - 9, seed);
  }
}

Using either:

int rowKeyHash = MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed);

or

int rowKeyHash = MyHash.getInstance().hash(rowkey);

What should I do with the resulting rowKeyHash? How should the code be
written to make use of rowKeyHash?

Currently my code looks like this:

......

    put.setWriteToWAL(false);
    puts.add(put);
    counter++;

    if (counter > batchSize) {
      try {
        // push the accumulated puts into the table's write buffer and flush
        table.getWriteBuffer().addAll(puts);
        table.flushCommits();
        puts.clear();
      } finally {
        counter = 0;
      }
    }

......
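For comparison, a sketch of the more conventional way to batch the same puts
with the 0.90 client, letting HTable manage its own write buffer instead of
reaching into it via getWriteBuffer(); the class, method and variable names
are made up for the example:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class BatchedWriter {
  // Buffers puts client-side and ships them to the servers batchSize at a time.
  public static void write(HTable table, List<Put> allPuts, int batchSize)
      throws IOException {
    table.setAutoFlush(false);          // let the client buffer the puts
    List<Put> batch = new ArrayList<Put>(batchSize);
    for (Put put : allPuts) {
      put.setWriteToWAL(false);         // same durability trade-off as above
      batch.add(put);
      if (batch.size() >= batchSize) {
        table.put(batch);               // hands the whole batch to the client buffer
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits();               // push anything still buffered to the servers
  }
}

With autoFlush off, table.put(...) only fills the client-side write buffer;
the actual RPCs happen when the buffer grows past its configured size or when
flushCommits() is called.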

Thanks
Oleg.

Re: hbase insertion optimisation:

Posted by Ted Yu <yu...@gmail.com>.
For 1), if you apply hashing to <date>_<somedata>, the date prefix wouldn't
be useful.
You should evaluate the distribution of <somedata> as the row key. Assuming
the distribution is uneven, you can apply a hashing function to the row key.
Using MurmurHash is as simple as:
MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed)
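For illustration, a minimal sketch of one way to fold that hash into the row
key: hash only the <somedata> portion and put a short, bounded bucket prefix
in front of the original key. The bucket count, seed and key layout are
assumptions for the example, not something prescribed in this thread.

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MurmurHash;

public class SaltedKeys {
  private static final int SEED = 1;        // assumed fixed seed; any constant works
  private static final int BUCKETS = 16;    // assumed bucket count, e.g. ~number of regions

  // Builds a key of the form <bucket>_<date>_<somedata>, where the bucket is
  // derived from a MurmurHash of the <somedata> portion only.
  public static byte[] salted(String date, String someData) {
    byte[] data = Bytes.toBytes(someData);
    int h = MurmurHash.getInstance().hash(data, 0, data.length, SEED);
    int bucket = (h & 0x7fffffff) % BUCKETS;   // clear the sign bit, then bound it
    return Bytes.toBytes(String.format("%02d_%s_%s", bucket, date, someData));
  }
}

A region server then sees up to BUCKETS distinct key prefixes per day instead
of a single one; the cost is that a scan for one date becomes a handful of
smaller scans, one per bucket.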

For 2), you can evaluate MurmurHash and JenkinsHash. Using different hash
functions in your system entails storing metadata for each table about the
choice of hash function.
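As a sketch of what storing that metadata could look like, here is a purely
application-side registry keyed by table name, so every reader and writer
agrees on which hash a table's keys were built with; the table names are
hypothetical and nothing here is required by the HBase API:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.util.Hash;
import org.apache.hadoop.hbase.util.JenkinsHash;
import org.apache.hadoop.hbase.util.MurmurHash;

public class TableHashRegistry {
  private static final Map<String, Hash> HASHES = new HashMap<String, Hash>();
  static {
    HASHES.put("events_daily", MurmurHash.getInstance());     // hypothetical table
    HASHES.put("events_hourly", JenkinsHash.getInstance());   // hypothetical table
  }

  // Looks up the hash function that was agreed on for the given table.
  public static Hash forTable(String tableName) {
    Hash hash = HASHES.get(tableName);
    if (hash == null) {
      throw new IllegalArgumentException("no hash registered for table " + tableName);
    }
    return hash;
  }
}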

Cheers


Re: hbase insertion optimisation:

Posted by Oleg Ruchovets <or...@gmail.com>.
I took the org.apache.hadoop.hbase.util.MurmurHash class and want to use it
for my hashing. Until now I had key/value pairs (key format <date>_<somedata>);
using MurmurHash I get a hash of my key.
My questions are:
   1) What is the way to use the hashing? Meaning, how should the code be
written so that, instead of writing the key and value as they are, it uses
the hash too?
   2) Can a different hash function be used for different HBase tables? What
is the way to do it?

Thanks in advance
Oleg.

Re: hbase insertion optimisation:

Posted by Ted Yu <yu...@gmail.com>.
A timestamp is stored in every key-value pair.
Take a look at this method in Scan:
  public Scan setTimeRange(long minStamp, long maxStamp)
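For example, a small sketch of restricting a scan to one day via cell
timestamps rather than the row key; the day boundaries (in milliseconds) are
assumed to be computed by the caller:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class DayScan {
  // Scans only cells whose timestamps fall in [dayStartMillis, dayEndMillis);
  // setTimeRange treats maxStamp as exclusive.
  public static void scanOneDay(HTable table, long dayStartMillis, long dayEndMillis)
      throws IOException {
    Scan scan = new Scan();
    scan.setTimeRange(dayStartMillis, dayEndMillis);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process one row of that day
      }
    } finally {
      scanner.close();
    }
  }
}

Note that this relies on the cells having been written with timestamps that
correspond to the business date; with default (insert-time) timestamps it
only holds when each day's data is loaded on that day.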

Cheers


Re: hbase insertion optimisation:

Posted by Oleg Ruchovets <or...@gmail.com>.
Good point, let me explain the process. We chose the key <date>_<somedata>
because after insertion we run scans and want to analyse the data related to
a specific date.
Can you provide more details on using hashing, and how can I scan HBase data
for a specific date when it is used?
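For context, a sketch of how a per-date scan could still be expressed if the
row keys carried a small hash-derived bucket prefix, i.e. keys of the assumed
form <bucket>_<date>_<somedata>; the layout and the '~' stop-key trick are
illustrative assumptions, not something agreed in this thread:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedDateScan {
  // One bounded Scan per bucket; '~' (0x7E) sorts after '_' and the digits, so
  // the stop row covers every <bucket>_<date>_<somedata> key of that date.
  public static void scanDate(HTable table, String date, int buckets) throws IOException {
    for (int b = 0; b < buckets; b++) {
      byte[] start = Bytes.toBytes(String.format("%02d_%s", b, date));
      byte[] stop  = Bytes.toBytes(String.format("%02d_%s~", b, date));
      ResultScanner scanner = table.getScanner(new Scan(start, stop));
      try {
        for (Result result : scanner) {
          // rows of this date that landed in bucket b
        }
      } finally {
        scanner.close();
      }
    }
  }
}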

Oleg.


Re: hbase insertion optimisation:

Posted by Ted Yu <yu...@gmail.com>.
I guess you chose the date prefix for query considerations.
You should introduce hashing so that the row keys are not clustered together.
