Posted to user@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2010/01/14 01:30:36 UTC

HBase bulk load

Jonathan:
Since you implemented
https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HBASE-48.html,
maybe you can point me to some document how bulk load is used ?
I found bin/loadtable.rb and assume that can be used to import data back
into HBase.

Thanks

Re: HBase bulk load

Posted by stack <st...@duboce.net>.
On Thu, Jan 21, 2010 at 1:16 PM, Sriram Muthuswamy Chittathoor
<sr...@ivycomptech.com> wrote:
> I noticed one thing while running my sample mapreduce job -- it creates a lot of java processes on the slave nodes.  Even when I have the "reuse.tasks" property set, why does it not use a single JVM?  Sometimes I see almost 20 JVMs running on a single box.  What property can I use to keep it from spawning this huge number of JVMs?
>

Do a long listing to see what they all are.  Each daemon in
hadoop/hbase runs in its own JVM.  You can configure mapreduce to reuse
a JVM.  That saves some on startup costs but doesn't decrease the
total number.  How many tasks have you configured to run concurrently?
You could turn that down.
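
For what it's worth, a minimal sketch of the knobs involved (not from
the thread; it assumes the 0.20-era "mapred" API and a made-up class
name).  JVM reuse only amortizes startup cost; the number of task JVMs
you see per node is still bounded by the tasktracker slot settings:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskJvmSettings {
      public static JobConf configure(JobConf conf) {
        // Reuse one JVM for an unlimited number of this job's tasks
        // (equivalent to setting mapred.job.reuse.jvm.num.tasks to -1).
        conf.setNumTasksToExecutePerJvm(-1);
        // Concurrent tasks per node are capped cluster-side in mapred-site.xml:
        //   mapred.tasktracker.map.tasks.maximum
        //   mapred.tasktracker.reduce.tasks.maximum
        // Turning those down reduces how many task JVMs run on each box.
        return conf;
      }
    }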

St.Ack


>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of stack
> Sent: Friday, January 22, 2010 2:24 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> On Thu, Jan 21, 2010 at 12:35 PM, Sriram Muthuswamy Chittathoor
> <sr...@ivycomptech.com> wrote:
>>
>> The output of job 1, the one that parses the 4k files and outputs
>> user+day/value, if its ordered by user+day, then you can take the
>> outputs of
>> this first job and feed them to the second job one at a time.   HFiles
>> will
>> be written for some subset of all users but for this subset, all of
>> their
>> activity over the 8 years will be processed.  You'll then move on to the
>> next set of users....
>>
>>
>> ---  I am assuming here I will mark  that a certain set of users data
>> (for all 8 years) goes into a certain hfile and this hfile will just
>> keep getting appended to for this same set of users as I progress
>> through different years data for the same set of users
>>
>
> No.  Not if you are feeding all user data for all years to this first MR job.
>
> Because the key is user+date, the reducer will get all for a
> particular user ordered by date.  This will be filled into an hfile.
> A user might not all fit in a single hfile.  Thats fine too.  You
> should not have to keep track of anything.  The framework does this
> for you.
>
>
>> ---  I will have to completely process this set of users data year at a
>> time in order (2000, 2001 etc)
>>
>
> If you go this route, you'll have to have separate tables for each
> year, I believe.
>
>>
>> Maybe write this back to hdfs as sequencefiles rather than as hfiles and
>> then take the output of this jobs reducer and feed these to your
>> hfileoutputformat job one at a time if you want to piecemeal the
>> creation of hfiles
>>
>> ---  I light of the above where does this sequence files fit in ?
>>
>
> They'll be input for a second mapreduce job whose output is hfiles.
>
> St.Ack
>
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> stack
>> Sent: Saturday, January 16, 2010 5:58 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: HBase bulk load
>>
>> On Thu, Jan 14, 2010 at 11:05 PM, Sriram Muthuswamy Chittathoor <
>> sriramc@ivycomptech.com> wrote:
>>
>>>
>>> --- We need to bulk load 8 years worth of data from our archives.
>> That
>>> will 8 * 12 months  of data.
>>>
>>
>> Have you run mapreduce jobs over this archive in the past?  I ask
>> because if
>> you have, you may have an input on how long it'll take to do the big or
>> part
>> of the import.
>>
>>
>>
>>> Whats your original key made of?
>>> --  Each Data files is a 4K text data which has 6 players data on an
>>> average.  We will parse it and extract per userid/day data (so many
>> each
>>> of this would be < .5K)
>>>
>>
>>
>> Is your archive in HDFS now?  Are the 4k files concatenated into some
>> kinda
>> archive format?  Gzip or something?  Is it accessible with http?
>>
>>
>>>
>>> Would you do this step in multiple stages or feed this mapreduce job
>> all
>>> 10
>>> years of data?
>>>
>>>
>>> Either way I can do.  Since I have 8 years worth of archived data I
>> need
>>> to get them onto to the system as a one time effort.  If I proceed in
>>> this year order will it be fine --  2000 , 2001 , 2002.  The only
>>> requirement is at the end these individual years data (in hfiles)
>> needs
>>> to be loaded in Hbase.
>>>
>>
>>
>> If I were to guess, the data in the year 2000 is < 2001 and so on?
>>
>> Is a table per year going to cut it for you?  Don't you want to see the
>> user
>> data over the whole 8 years?  It'll be a pain doing 8 different queries
>> and
>> aggregating instead of doing one query against a single table?
>>
>>
>>>
>>> --  Can u give me some link to doing this.   If I am getting u right
>> is
>>> this the sequence
>>>
>>> 1.  Start with say year 2000 (1 billion 4k files to be processed and
>>> loaded)
>>> 2.  Divide it into splits initially based on just filename ranges
>>> (user/day data is hidden inside the file)
>>> 3.  Each mappers gets a bunch of file (if it is 20 mappers then each
>> one
>>> will have to process 50 million 4k files (Seems too much even for a
>>> single year ?? --  should I go to a single month processing at a time
>>> ??)
>>>
>>
>> For 50 million 4k files, you'd want more than 20 mappers.  You might
>> have 20
>> 'slots' for tasks on your cluster with each time a mapper runs, it might
>> process N files.  1 file only would probably be too little work to
>> justify
>> the firing up of the JVM to run the task. So, you should feed each map
>> task
>> 100 or 1000 4k files?  If 1k files per map task thats 50k map tasks?
>>
>>
>>
>>> 4.  Each mapper parses the file and extract the user/day records
>>> 5.  The custom parttioner sends range of users/day to a particular
>>> reducer
>>>
>>
>> Yes.  Your custom partitioner guarantees that a particular user only
>> goes to
>> one reducer.  How many users do you have do you think?  Maybe this job
>> is
>> too big to do all in the one go.  You need to come up with a process
>> with
>> more steps.  A MR job that runs for weeks will fail (smile).  Someone
>> will
>> for sure pull plug on the namenode just as the job is coming to an end.
>>
>>
>>
>>> 6.  reducer in parallel will generate sequence files -- multiple will
>> be
>>> there
>>>
>>>
>>> My question here is in each year there will be sequence files
>> containing
>>> a range of users data.  Do I need to identify these and put them
>>> together in one hfile as the user/day records for all the 10 years
>>> should be together in the final hfile ?
>>
>>
>> No.  As stuff flows into the hfiles, it just needs to be guaranteed
>> ordered.
>>  A particular user may span multiple hfiles.
>>
>>
>>> So some manual stuff is required
>>> here taking related sequence files (those containing the same range of
>>> users / day data) and feeding them to  hfileoutputformat job ?
>>>
>>
>>
>> The output of job 1, the one that parses the 4k files and outputs
>> user+day/value, if its ordered by user+day, then you can take the
>> outputs of
>> this first job and feed them to the second job one at a time.   HFiles
>> will
>> be written for some subset of all users but for this subset, all of
>> their
>> activity over the 8 years will be processed.  You'll then move on to the
>> next set of users....   Eventually you will have many hfiles to upload
>> into
>> an hbase instance.  You'll need to probably modify loadtable.rb some
>> (One
>> modification you should do I thought is NOT to load an hfile whose
>> length is
>> 0 bytes).
>>
>>
>>
>>
>>> -- - Could u also give some links to this multiput technique ??
>>>
>>>
>> Ryan should put up the patch soon (
>> https://issues.apache.org/jira/browse/HBASE-2066).
>>
>>
>> This seems like a pretty big job.  My guess is that its going to take a
>> bit
>> of time getting it all working.  Given your scale, my guess is that
>> you'll
>> run into some interesting issues.  For example, how many of those 4k
>> files
>> have corruption in them and how will your map tasks deal with the
>> corruption?
>>
>> You need to also figure out some things like how long each step is going
>> to
>> take, how big the resultant data is going to be, and so on so you can
>> guage
>> things like the amount of hardware you are going to need to get the job
>> done.
>>
>> The best way to get answers on the above is to start in with running a
>> few
>> mapreduce jobs passing subsets of the data to see how things work out.
>>
>> Yours,
>> St.Ack
>>
>>
>>
>>
>>>
>>> St.Ack
>>>
>>>
>>>
>>>
>>> >
>>> > -----Original Message-----
>>> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>>> > stack
>>> > Sent: Thursday, January 14, 2010 11:33 AM
>>> > To: hbase-user@hadoop.apache.org
>>> > Subject: Re: HBase bulk load
>>> >
>>> > On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
>>> > sriramc@ivycomptech.com> wrote:
>>> >
>>> > > I am trying to use this technique to say bulk load 20 billion
>> rows.
>>> I
>>> > > tried it on a smaller set 20 million rows. A few things I had to
>>> take
>>> > > care was to write a custom partitioning logic so that a range of
>>> keys
>>> > > only go to a particular reduce since there was some mention of
>>> global
>>> > > ordering.
>>> > > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
>>> > >
>>> > > Good.
>>> >
>>> >
>>> >
>>> > > My questions are:
>>> > > 1.  Can I divide the bulk loading into multiple runs  --  the
>>> existing
>>> > > bulk load bails out if it finds a HDFS output directory with the
>>> same
>>> > > name
>>> > >
>>> >
>>> > No.  Its not currently written to do that but especially if your
>> keys
>>> > are
>>> > ordered, it probably wouldn't take much to make the above work
>> (first
>>> > job
>>> > does the first set of keys, and so on).
>>> >
>>> >
>>> > > 2.  What I want to do is make multiple runs of 10 billion and then
>>> > > combine the output before running  loadtable.rb --  is this
>> possible
>>> ?
>>> > > I am thinking this may be required in case my MR bulk loading
>> fails
>>> in
>>> > > between and I need to start from where I crashed
>>> > >
>>> > > Well, MR does retries but, yeah, you could run into some issue at
>>> the
>>> > 10B
>>> > mark and want to then start over from there rather than start from
>> the
>>> > beginning.
>>> >
>>> > One thing that the current setup does not do is remove the task
>> hfile
>>> on
>>> > failure.  We should add this.  Would fix case where when speculative
>>> > execution is enabled, and the speculative tasks are kiled, we don't
>>> > leave
>>> > around half-made hfiles (Currently I believe they they show as
>>> > zero-length
>>> > files).
>>> >
>>> > St.Ack
>>> >
>>> >
>>> >
>>> > > Any tips with huge bulk loading experience ?
>>> > >
>>> > >
>>> > > -----Original Message-----
>>> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf
>> Of
>>> > > stack
>>> > > Sent: Thursday, January 14, 2010 6:19 AM
>>> > > To: hbase-user@hadoop.apache.org
>>> > > Subject: Re: HBase bulk load
>>> > >
>>> > > See
>>> > >
>>> >
>>>
>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
>>> > > mapreduce/package-summary.html#bulk
>>> > > St.Ack
>>> > >
>>> > > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com>
>> wrote:
>>> > >
>>> > > > Jonathan:
>>> > > > Since you implemented
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
>>> > > ASE-48.html
>>> > > > ,
>>> > > > maybe you can point me to some document how bulk load is used ?
>>> > > > I found bin/loadtable.rb and assume that can be used to import
>>> data
>>> > > back
>>> > > > into HBase.
>>> > > >
>>> > > > Thanks
>>> > > >
>>> > >
>>> >
>>>
>>
>

RE: HBase bulk load

Posted by Sriram Muthuswamy Chittathoor <sr...@ivycomptech.com>.
I noticed one thing while running my sample mapreduce job -- it creates a lot of java processes on the slave nodes.  Even when I have the "reuse.tasks" property set, why does it not use a single JVM?  Sometimes I see almost 20 JVMs running on a single box.  What property can I use to keep it from spawning this huge number of JVMs?



-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of stack
Sent: Friday, January 22, 2010 2:24 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase bulk load

On Thu, Jan 21, 2010 at 12:35 PM, Sriram Muthuswamy Chittathoor
<sr...@ivycomptech.com> wrote:
>
> The output of job 1, the one that parses the 4k files and outputs
> user+day/value, if its ordered by user+day, then you can take the
> outputs of
> this first job and feed them to the second job one at a time.   HFiles
> will
> be written for some subset of all users but for this subset, all of
> their
> activity over the 8 years will be processed.  You'll then move on to the
> next set of users....
>
>
> ---  I am assuming here I will mark  that a certain set of users data
> (for all 8 years) goes into a certain hfile and this hfile will just
> keep getting appended to for this same set of users as I progress
> through different years data for the same set of users
>

No.  Not if you are feeding all user data for all years to this first MR job.

Because the key is user+date, the reducer will get all for a
particular user ordered by date.  This will be filled into an hfile.
A user might not all fit in a single hfile.  Thats fine too.  You
should not have to keep track of anything.  The framework does this
for you.


> ---  I will have to completely process this set of users data year at a
> time in order (2000, 2001 etc)
>

If you go this route, you'll have to have separate tables for each
year, I believe.

>
> Maybe write this back to hdfs as sequencefiles rather than as hfiles and
> then take the output of this jobs reducer and feed these to your
> hfileoutputformat job one at a time if you want to piecemeal the
> creation of hfiles
>
> ---  I light of the above where does this sequence files fit in ?
>

They'll be input for a second mapreduce job whose output is hfiles.

St.Ack

>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> stack
> Sent: Saturday, January 16, 2010 5:58 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> On Thu, Jan 14, 2010 at 11:05 PM, Sriram Muthuswamy Chittathoor <
> sriramc@ivycomptech.com> wrote:
>
>>
>> --- We need to bulk load 8 years worth of data from our archives.
> That
>> will 8 * 12 months  of data.
>>
>
> Have you run mapreduce jobs over this archive in the past?  I ask
> because if
> you have, you may have an input on how long it'll take to do the big or
> part
> of the import.
>
>
>
>> Whats your original key made of?
>> --  Each Data files is a 4K text data which has 6 players data on an
>> average.  We will parse it and extract per userid/day data (so many
> each
>> of this would be < .5K)
>>
>
>
> Is your archive in HDFS now?  Are the 4k files concatenated into some
> kinda
> archive format?  Gzip or something?  Is it accessible with http?
>
>
>>
>> Would you do this step in multiple stages or feed this mapreduce job
> all
>> 10
>> years of data?
>>
>>
>> Either way I can do.  Since I have 8 years worth of archived data I
> need
>> to get them onto to the system as a one time effort.  If I proceed in
>> this year order will it be fine --  2000 , 2001 , 2002.  The only
>> requirement is at the end these individual years data (in hfiles)
> needs
>> to be loaded in Hbase.
>>
>
>
> If I were to guess, the data in the year 2000 is < 2001 and so on?
>
> Is a table per year going to cut it for you?  Don't you want to see the
> user
> data over the whole 8 years?  It'll be a pain doing 8 different queries
> and
> aggregating instead of doing one query against a single table?
>
>
>>
>> --  Can u give me some link to doing this.   If I am getting u right
> is
>> this the sequence
>>
>> 1.  Start with say year 2000 (1 billion 4k files to be processed and
>> loaded)
>> 2.  Divide it into splits initially based on just filename ranges
>> (user/day data is hidden inside the file)
>> 3.  Each mappers gets a bunch of file (if it is 20 mappers then each
> one
>> will have to process 50 million 4k files (Seems too much even for a
>> single year ?? --  should I go to a single month processing at a time
>> ??)
>>
>
> For 50 million 4k files, you'd want more than 20 mappers.  You might
> have 20
> 'slots' for tasks on your cluster with each time a mapper runs, it might
> process N files.  1 file only would probably be too little work to
> justify
> the firing up of the JVM to run the task. So, you should feed each map
> task
> 100 or 1000 4k files?  If 1k files per map task thats 50k map tasks?
>
>
>
>> 4.  Each mapper parses the file and extract the user/day records
>> 5.  The custom parttioner sends range of users/day to a particular
>> reducer
>>
>
> Yes.  Your custom partitioner guarantees that a particular user only
> goes to
> one reducer.  How many users do you have do you think?  Maybe this job
> is
> too big to do all in the one go.  You need to come up with a process
> with
> more steps.  A MR job that runs for weeks will fail (smile).  Someone
> will
> for sure pull plug on the namenode just as the job is coming to an end.
>
>
>
>> 6.  reducer in parallel will generate sequence files -- multiple will
> be
>> there
>>
>>
>> My question here is in each year there will be sequence files
> containing
>> a range of users data.  Do I need to identify these and put them
>> together in one hfile as the user/day records for all the 10 years
>> should be together in the final hfile ?
>
>
> No.  As stuff flows into the hfiles, it just needs to be guaranteed
> ordered.
>  A particular user may span multiple hfiles.
>
>
>> So some manual stuff is required
>> here taking related sequence files (those containing the same range of
>> users / day data) and feeding them to  hfileoutputformat job ?
>>
>
>
> The output of job 1, the one that parses the 4k files and outputs
> user+day/value, if its ordered by user+day, then you can take the
> outputs of
> this first job and feed them to the second job one at a time.   HFiles
> will
> be written for some subset of all users but for this subset, all of
> their
> activity over the 8 years will be processed.  You'll then move on to the
> next set of users....   Eventually you will have many hfiles to upload
> into
> an hbase instance.  You'll need to probably modify loadtable.rb some
> (One
> modification you should do I thought is NOT to load an hfile whose
> length is
> 0 bytes).
>
>
>
>
>> -- - Could u also give some links to this multiput technique ??
>>
>>
> Ryan should put up the patch soon (
> https://issues.apache.org/jira/browse/HBASE-2066).
>
>
> This seems like a pretty big job.  My guess is that its going to take a
> bit
> of time getting it all working.  Given your scale, my guess is that
> you'll
> run into some interesting issues.  For example, how many of those 4k
> files
> have corruption in them and how will your map tasks deal with the
> corruption?
>
> You need to also figure out some things like how long each step is going
> to
> take, how big the resultant data is going to be, and so on so you can
> guage
> things like the amount of hardware you are going to need to get the job
> done.
>
> The best way to get answers on the above is to start in with running a
> few
> mapreduce jobs passing subsets of the data to see how things work out.
>
> Yours,
> St.Ack
>
>
>
>
>>
>> St.Ack
>>
>>
>>
>>
>> >
>> > -----Original Message-----
>> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> > stack
>> > Sent: Thursday, January 14, 2010 11:33 AM
>> > To: hbase-user@hadoop.apache.org
>> > Subject: Re: HBase bulk load
>> >
>> > On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
>> > sriramc@ivycomptech.com> wrote:
>> >
>> > > I am trying to use this technique to say bulk load 20 billion
> rows.
>> I
>> > > tried it on a smaller set 20 million rows. A few things I had to
>> take
>> > > care was to write a custom partitioning logic so that a range of
>> keys
>> > > only go to a particular reduce since there was some mention of
>> global
>> > > ordering.
>> > > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
>> > >
>> > > Good.
>> >
>> >
>> >
>> > > My questions are:
>> > > 1.  Can I divide the bulk loading into multiple runs  --  the
>> existing
>> > > bulk load bails out if it finds a HDFS output directory with the
>> same
>> > > name
>> > >
>> >
>> > No.  Its not currently written to do that but especially if your
> keys
>> > are
>> > ordered, it probably wouldn't take much to make the above work
> (first
>> > job
>> > does the first set of keys, and so on).
>> >
>> >
>> > > 2.  What I want to do is make multiple runs of 10 billion and then
>> > > combine the output before running  loadtable.rb --  is this
> possible
>> ?
>> > > I am thinking this may be required in case my MR bulk loading
> fails
>> in
>> > > between and I need to start from where I crashed
>> > >
>> > > Well, MR does retries but, yeah, you could run into some issue at
>> the
>> > 10B
>> > mark and want to then start over from there rather than start from
> the
>> > beginning.
>> >
>> > One thing that the current setup does not do is remove the task
> hfile
>> on
>> > failure.  We should add this.  Would fix case where when speculative
>> > execution is enabled, and the speculative tasks are kiled, we don't
>> > leave
>> > around half-made hfiles (Currently I believe they they show as
>> > zero-length
>> > files).
>> >
>> > St.Ack
>> >
>> >
>> >
>> > > Any tips with huge bulk loading experience ?
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf
> Of
>> > > stack
>> > > Sent: Thursday, January 14, 2010 6:19 AM
>> > > To: hbase-user@hadoop.apache.org
>> > > Subject: Re: HBase bulk load
>> > >
>> > > See
>> > >
>> >
>>
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
>> > > mapreduce/package-summary.html#bulk
>> > > St.Ack
>> > >
>> > > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com>
> wrote:
>> > >
>> > > > Jonathan:
>> > > > Since you implemented
>> > > >
>> > > >
>> > >
>> >
>>
> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
>> > > ASE-48.html
>> > > > ,
>> > > > maybe you can point me to some document how bulk load is used ?
>> > > > I found bin/loadtable.rb and assume that can be used to import
>> data
>> > > back
>> > > > into HBase.
>> > > >
>> > > > Thanks
>> > > >
>> > >
>> >
>>
>



Re: HBase bulk load

Posted by stack <st...@duboce.net>.
On Thu, Jan 21, 2010 at 12:35 PM, Sriram Muthuswamy Chittathoor
<sr...@ivycomptech.com> wrote:
>
> The output of job 1, the one that parses the 4k files and outputs
> user+day/value, if its ordered by user+day, then you can take the
> outputs of
> this first job and feed them to the second job one at a time.   HFiles
> will
> be written for some subset of all users but for this subset, all of
> their
> activity over the 8 years will be processed.  You'll then move on to the
> next set of users....
>
>
> ---  I am assuming here I will mark  that a certain set of users data
> (for all 8 years) goes into a certain hfile and this hfile will just
> keep getting appended to for this same set of users as I progress
> through different years data for the same set of users
>

No.  Not if you are feeding all user data for all years to this first MR job.

Because the key is user+date, the reducer will get all of a particular
user's data ordered by date.  This will be filled into an hfile.
A user might not all fit in a single hfile.  That's fine too.  You
should not have to keep track of anything.  The framework does this
for you.
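
As an illustration (a sketch, not from the thread), one way to lay out
such a user+date key so a plain lexicographic sort groups each user and
orders that user's rows by day; the fixed field widths are assumptions:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeys {
      /** Fixed-width, zero-padded: user 12345 on day 14623 since epoch
       *  becomes "000001234514623", which sorts by user, then by day. */
      public static byte[] rowKey(long userId, int daysSinceEpoch) {
        return Bytes.toBytes(String.format("%010d%05d", userId, daysSinceEpoch));
      }
    }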


> ---  I will have to completely process this set of users data year at a
> time in order (2000, 2001 etc)
>

If you go this route, you'll have to have separate tables for each
year, I believe.

>
> Maybe write this back to hdfs as sequencefiles rather than as hfiles and
> then take the output of this jobs reducer and feed these to your
> hfileoutputformat job one at a time if you want to piecemeal the
> creation of hfiles
>
> ---  I light of the above where does this sequence files fit in ?
>

They'll be input for a second mapreduce job whose output is hfiles.
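
A rough sketch of what that second job's driver could look like,
assuming the 0.20-era org.apache.hadoop.hbase.mapreduce classes the
thread points at; the mapper/reducer names are hypothetical and left
commented out, and they must emit row keys and cells in total sorted
order since HFileOutputFormat writes them straight into hfiles:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SequenceFilesToHFilesJob {
      public static Job create(Configuration conf, Path in, Path out)
          throws Exception {
        Job job = new Job(conf, "sequencefiles-to-hfiles");
        job.setJarByClass(SequenceFilesToHFilesJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        // Hypothetical classes that turn each user+day record into cells:
        // job.setMapperClass(UserDayToKeyValueMapper.class);
        // job.setReducerClass(KeyValueSortingReducer.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
      }
    }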

St.Ack

>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> stack
> Sent: Saturday, January 16, 2010 5:58 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> On Thu, Jan 14, 2010 at 11:05 PM, Sriram Muthuswamy Chittathoor <
> sriramc@ivycomptech.com> wrote:
>
>>
>> --- We need to bulk load 8 years worth of data from our archives.
> That
>> will 8 * 12 months  of data.
>>
>
> Have you run mapreduce jobs over this archive in the past?  I ask
> because if
> you have, you may have an input on how long it'll take to do the big or
> part
> of the import.
>
>
>
>> Whats your original key made of?
>> --  Each Data files is a 4K text data which has 6 players data on an
>> average.  We will parse it and extract per userid/day data (so many
> each
>> of this would be < .5K)
>>
>
>
> Is your archive in HDFS now?  Are the 4k files concatenated into some
> kinda
> archive format?  Gzip or something?  Is it accessible with http?
>
>
>>
>> Would you do this step in multiple stages or feed this mapreduce job
> all
>> 10
>> years of data?
>>
>>
>> Either way I can do.  Since I have 8 years worth of archived data I
> need
>> to get them onto to the system as a one time effort.  If I proceed in
>> this year order will it be fine --  2000 , 2001 , 2002.  The only
>> requirement is at the end these individual years data (in hfiles)
> needs
>> to be loaded in Hbase.
>>
>
>
> If I were to guess, the data in the year 2000 is < 2001 and so on?
>
> Is a table per year going to cut it for you?  Don't you want to see the
> user
> data over the whole 8 years?  It'll be a pain doing 8 different queries
> and
> aggregating instead of doing one query against a single table?
>
>
>>
>> --  Can u give me some link to doing this.   If I am getting u right
> is
>> this the sequence
>>
>> 1.  Start with say year 2000 (1 billion 4k files to be processed and
>> loaded)
>> 2.  Divide it into splits initially based on just filename ranges
>> (user/day data is hidden inside the file)
>> 3.  Each mappers gets a bunch of file (if it is 20 mappers then each
> one
>> will have to process 50 million 4k files (Seems too much even for a
>> single year ?? --  should I go to a single month processing at a time
>> ??)
>>
>
> For 50 million 4k files, you'd want more than 20 mappers.  You might
> have 20
> 'slots' for tasks on your cluster with each time a mapper runs, it might
> process N files.  1 file only would probably be too little work to
> justify
> the firing up of the JVM to run the task. So, you should feed each map
> task
> 100 or 1000 4k files?  If 1k files per map task thats 50k map tasks?
>
>
>
>> 4.  Each mapper parses the file and extract the user/day records
>> 5.  The custom parttioner sends range of users/day to a particular
>> reducer
>>
>
> Yes.  Your custom partitioner guarantees that a particular user only
> goes to
> one reducer.  How many users do you have do you think?  Maybe this job
> is
> too big to do all in the one go.  You need to come up with a process
> with
> more steps.  A MR job that runs for weeks will fail (smile).  Someone
> will
> for sure pull plug on the namenode just as the job is coming to an end.
>
>
>
>> 6.  reducer in parallel will generate sequence files -- multiple will
> be
>> there
>>
>>
>> My question here is in each year there will be sequence files
> containing
>> a range of users data.  Do I need to identify these and put them
>> together in one hfile as the user/day records for all the 10 years
>> should be together in the final hfile ?
>
>
> No.  As stuff flows into the hfiles, it just needs to be guaranteed
> ordered.
>  A particular user may span multiple hfiles.
>
>
>> So some manual stuff is required
>> here taking related sequence files (those containing the same range of
>> users / day data) and feeding them to  hfileoutputformat job ?
>>
>
>
> The output of job 1, the one that parses the 4k files and outputs
> user+day/value, if its ordered by user+day, then you can take the
> outputs of
> this first job and feed them to the second job one at a time.   HFiles
> will
> be written for some subset of all users but for this subset, all of
> their
> activity over the 8 years will be processed.  You'll then move on to the
> next set of users....   Eventually you will have many hfiles to upload
> into
> an hbase instance.  You'll need to probably modify loadtable.rb some
> (One
> modification you should do I thought is NOT to load an hfile whose
> length is
> 0 bytes).
>
>
>
>
>> -- - Could u also give some links to this multiput technique ??
>>
>>
> Ryan should put up the patch soon (
> https://issues.apache.org/jira/browse/HBASE-2066).
>
>
> This seems like a pretty big job.  My guess is that its going to take a
> bit
> of time getting it all working.  Given your scale, my guess is that
> you'll
> run into some interesting issues.  For example, how many of those 4k
> files
> have corruption in them and how will your map tasks deal with the
> corruption?
>
> You need to also figure out some things like how long each step is going
> to
> take, how big the resultant data is going to be, and so on so you can
> guage
> things like the amount of hardware you are going to need to get the job
> done.
>
> The best way to get answers on the above is to start in with running a
> few
> mapreduce jobs passing subsets of the data to see how things work out.
>
> Yours,
> St.Ack
>
>
>
>
>>
>> St.Ack
>>
>>
>>
>>
>> >
>> > -----Original Message-----
>> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> > stack
>> > Sent: Thursday, January 14, 2010 11:33 AM
>> > To: hbase-user@hadoop.apache.org
>> > Subject: Re: HBase bulk load
>> >
>> > On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
>> > sriramc@ivycomptech.com> wrote:
>> >
>> > > I am trying to use this technique to say bulk load 20 billion
> rows.
>> I
>> > > tried it on a smaller set 20 million rows. A few things I had to
>> take
>> > > care was to write a custom partitioning logic so that a range of
>> keys
>> > > only go to a particular reduce since there was some mention of
>> global
>> > > ordering.
>> > > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
>> > >
>> > > Good.
>> >
>> >
>> >
>> > > My questions are:
>> > > 1.  Can I divide the bulk loading into multiple runs  --  the
>> existing
>> > > bulk load bails out if it finds a HDFS output directory with the
>> same
>> > > name
>> > >
>> >
>> > No.  Its not currently written to do that but especially if your
> keys
>> > are
>> > ordered, it probably wouldn't take much to make the above work
> (first
>> > job
>> > does the first set of keys, and so on).
>> >
>> >
>> > > 2.  What I want to do is make multiple runs of 10 billion and then
>> > > combine the output before running  loadtable.rb --  is this
> possible
>> ?
>> > > I am thinking this may be required in case my MR bulk loading
> fails
>> in
>> > > between and I need to start from where I crashed
>> > >
>> > > Well, MR does retries but, yeah, you could run into some issue at
>> the
>> > 10B
>> > mark and want to then start over from there rather than start from
> the
>> > beginning.
>> >
>> > One thing that the current setup does not do is remove the task
> hfile
>> on
>> > failure.  We should add this.  Would fix case where when speculative
>> > execution is enabled, and the speculative tasks are kiled, we don't
>> > leave
>> > around half-made hfiles (Currently I believe they they show as
>> > zero-length
>> > files).
>> >
>> > St.Ack
>> >
>> >
>> >
>> > > Any tips with huge bulk loading experience ?
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf
> Of
>> > > stack
>> > > Sent: Thursday, January 14, 2010 6:19 AM
>> > > To: hbase-user@hadoop.apache.org
>> > > Subject: Re: HBase bulk load
>> > >
>> > > See
>> > >
>> >
>>
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
>> > > mapreduce/package-summary.html#bulk
>> > > St.Ack
>> > >
>> > > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com>
> wrote:
>> > >
>> > > > Jonathan:
>> > > > Since you implemented
>> > > >
>> > > >
>> > >
>> >
>>
> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
>> > > ASE-48.html
>> > > > ,
>> > > > maybe you can point me to some document how bulk load is used ?
>> > > > I found bin/loadtable.rb and assume that can be used to import
>> data
>> > > back
>> > > > into HBase.
>> > > >
>> > > > Thanks
>> > > >
>> > >
>> >
>>
>

RE: HBase bulk load

Posted by Sriram Muthuswamy Chittathoor <sr...@ivycomptech.com>.
The output of job 1, the one that parses the 4k files and outputs
user+day/value: if it's ordered by user+day, then you can take the
outputs of this first job and feed them to the second job one at a
time.  HFiles will be written for some subset of all users but for
this subset, all of their activity over the 8 years will be
processed.  You'll then move on to the next set of users....


---  I am assuming here that I will mark that a certain set of users' data
(for all 8 years) goes into a certain hfile, and this hfile will just
keep getting appended to for this same set of users as I progress
through different years' data for the same set of users

---  I will have to completely process this set of users' data a year at a
time, in order (2000, 2001, etc.)


Maybe write this back to hdfs as sequencefiles rather than as hfiles and
then take the output of this jobs reducer and feed these to your
hfileoutputformat job one at a time if you want to piecemeal the
creation of hfiles

---  In light of the above, where do these sequence files fit in?


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
stack
Sent: Saturday, January 16, 2010 5:58 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase bulk load

On Thu, Jan 14, 2010 at 11:05 PM, Sriram Muthuswamy Chittathoor <
sriramc@ivycomptech.com> wrote:

>
> --- We need to bulk load 8 years worth of data from our archives.
That
> will 8 * 12 months  of data.
>

Have you run mapreduce jobs over this archive in the past?  I ask
because if
you have, you may have an input on how long it'll take to do the big or
part
of the import.



> Whats your original key made of?
> --  Each Data files is a 4K text data which has 6 players data on an
> average.  We will parse it and extract per userid/day data (so many
each
> of this would be < .5K)
>


Is your archive in HDFS now?  Are the 4k files concatenated into some
kinda
archive format?  Gzip or something?  Is it accessible with http?


>
> Would you do this step in multiple stages or feed this mapreduce job
all
> 10
> years of data?
>
>
> Either way I can do.  Since I have 8 years worth of archived data I
need
> to get them onto to the system as a one time effort.  If I proceed in
> this year order will it be fine --  2000 , 2001 , 2002.  The only
> requirement is at the end these individual years data (in hfiles)
needs
> to be loaded in Hbase.
>


If I were to guess, the data in the year 2000 is < 2001 and so on?

Is a table per year going to cut it for you?  Don't you want to see the
user
data over the whole 8 years?  It'll be a pain doing 8 different queries
and
aggregating instead of doing one query against a single table?


>
> --  Can u give me some link to doing this.   If I am getting u right
is
> this the sequence
>
> 1.  Start with say year 2000 (1 billion 4k files to be processed and
> loaded)
> 2.  Divide it into splits initially based on just filename ranges
> (user/day data is hidden inside the file)
> 3.  Each mappers gets a bunch of file (if it is 20 mappers then each
one
> will have to process 50 million 4k files (Seems too much even for a
> single year ?? --  should I go to a single month processing at a time
> ??)
>

For 50 million 4k files, you'd want more than 20 mappers.  You might
have 20
'slots' for tasks on your cluster with each time a mapper runs, it might
process N files.  1 file only would probably be too little work to
justify
the firing up of the JVM to run the task. So, you should feed each map
task
100 or 1000 4k files?  If 1k files per map task thats 50k map tasks?



> 4.  Each mapper parses the file and extract the user/day records
> 5.  The custom parttioner sends range of users/day to a particular
> reducer
>

Yes.  Your custom partitioner guarantees that a particular user only
goes to
one reducer.  How many users do you have do you think?  Maybe this job
is
too big to do all in the one go.  You need to come up with a process
with
more steps.  A MR job that runs for weeks will fail (smile).  Someone
will
for sure pull plug on the namenode just as the job is coming to an end.



> 6.  reducer in parallel will generate sequence files -- multiple will
be
> there
>
>
> My question here is in each year there will be sequence files
containing
> a range of users data.  Do I need to identify these and put them
> together in one hfile as the user/day records for all the 10 years
> should be together in the final hfile ?


No.  As stuff flows into the hfiles, it just needs to be guaranteed
ordered.
 A particular user may span multiple hfiles.


> So some manual stuff is required
> here taking related sequence files (those containing the same range of
> users / day data) and feeding them to  hfileoutputformat job ?
>


The output of job 1, the one that parses the 4k files and outputs
user+day/value, if its ordered by user+day, then you can take the
outputs of
this first job and feed them to the second job one at a time.   HFiles
will
be written for some subset of all users but for this subset, all of
their
activity over the 8 years will be processed.  You'll then move on to the
next set of users....   Eventually you will have many hfiles to upload
into
an hbase instance.  You'll need to probably modify loadtable.rb some
(One
modification you should do I thought is NOT to load an hfile whose
length is
0 bytes).




> -- - Could u also give some links to this multiput technique ??
>
>
Ryan should put up the patch soon (
https://issues.apache.org/jira/browse/HBASE-2066).


This seems like a pretty big job.  My guess is that its going to take a
bit
of time getting it all working.  Given your scale, my guess is that
you'll
run into some interesting issues.  For example, how many of those 4k
files
have corruption in them and how will your map tasks deal with the
corruption?

You need to also figure out some things like how long each step is going
to
take, how big the resultant data is going to be, and so on so you can
guage
things like the amount of hardware you are going to need to get the job
done.

The best way to get answers on the above is to start in with running a
few
mapreduce jobs passing subsets of the data to see how things work out.

Yours,
St.Ack




>
> St.Ack
>
>
>
>
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > stack
> > Sent: Thursday, January 14, 2010 11:33 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: HBase bulk load
> >
> > On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
> > sriramc@ivycomptech.com> wrote:
> >
> > > I am trying to use this technique to say bulk load 20 billion
rows.
> I
> > > tried it on a smaller set 20 million rows. A few things I had to
> take
> > > care was to write a custom partitioning logic so that a range of
> keys
> > > only go to a particular reduce since there was some mention of
> global
> > > ordering.
> > > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
> > >
> > > Good.
> >
> >
> >
> > > My questions are:
> > > 1.  Can I divide the bulk loading into multiple runs  --  the
> existing
> > > bulk load bails out if it finds a HDFS output directory with the
> same
> > > name
> > >
> >
> > No.  Its not currently written to do that but especially if your
keys
> > are
> > ordered, it probably wouldn't take much to make the above work
(first
> > job
> > does the first set of keys, and so on).
> >
> >
> > > 2.  What I want to do is make multiple runs of 10 billion and then
> > > combine the output before running  loadtable.rb --  is this
possible
> ?
> > > I am thinking this may be required in case my MR bulk loading
fails
> in
> > > between and I need to start from where I crashed
> > >
> > > Well, MR does retries but, yeah, you could run into some issue at
> the
> > 10B
> > mark and want to then start over from there rather than start from
the
> > beginning.
> >
> > One thing that the current setup does not do is remove the task
hfile
> on
> > failure.  We should add this.  Would fix case where when speculative
> > execution is enabled, and the speculative tasks are kiled, we don't
> > leave
> > around half-made hfiles (Currently I believe they they show as
> > zero-length
> > files).
> >
> > St.Ack
> >
> >
> >
> > > Any tips with huge bulk loading experience ?
> > >
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf
Of
> > > stack
> > > Sent: Thursday, January 14, 2010 6:19 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: Re: HBase bulk load
> > >
> > > See
> > >
> >
>
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
> > > mapreduce/package-summary.html#bulk
> > > St.Ack
> > >
> > > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com>
wrote:
> > >
> > > > Jonathan:
> > > > Since you implemented
> > > >
> > > >
> > >
> >
>
https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
> > > ASE-48.html
> > > > ,
> > > > maybe you can point me to some document how bulk load is used ?
> > > > I found bin/loadtable.rb and assume that can be used to import
> data
> > > back
> > > > into HBase.
> > > >
> > > > Thanks
> > > >
> > >
> >
>

Re: HBase bulk load

Posted by stack <st...@duboce.net>.
On Thu, Jan 14, 2010 at 11:05 PM, Sriram Muthuswamy Chittathoor <
sriramc@ivycomptech.com> wrote:

>
> --- We need to bulk load 8 years worth of data from our archives.   That
> will 8 * 12 months  of data.
>

Have you run mapreduce jobs over this archive in the past?  I ask because if
you have, you may have an idea of how long it'll take to do all or part
of the import.



> Whats your original key made of?
> --  Each Data files is a 4K text data which has 6 players data on an
> average.  We will parse it and extract per userid/day data (so many each
> of this would be < .5K)
>


Is your archive in HDFS now?  Are the 4k files concatenated into some kinda
archive format?  Gzip or something?  Is it accessible with http?


>
> Would you do this step in multiple stages or feed this mapreduce job all
> 10
> years of data?
>
>
> Either way I can do.  Since I have 8 years worth of archived data I need
> to get them onto to the system as a one time effort.  If I proceed in
> this year order will it be fine --  2000 , 2001 , 2002.  The only
> requirement is at the end these individual years data (in hfiles) needs
> to be loaded in Hbase.
>


If I were to guess, the amount of data in the year 2000 is < 2001, and so on?

Is a table per year going to cut it for you?  Don't you want to see the user
data over the whole 8 years?  It'll be a pain doing 8 different queries and
aggregating instead of doing one query against a single table?


>
> --  Can u give me some link to doing this.   If I am getting u right is
> this the sequence
>
> 1.  Start with say year 2000 (1 billion 4k files to be processed and
> loaded)
> 2.  Divide it into splits initially based on just filename ranges
> (user/day data is hidden inside the file)
> 3.  Each mappers gets a bunch of file (if it is 20 mappers then each one
> will have to process 50 million 4k files (Seems too much even for a
> single year ?? --  should I go to a single month processing at a time
> ??)
>

For 50 million 4k files, you'd want more than 20 mappers.  You might have 20
'slots' for tasks on your cluster; each time a mapper runs, it might
process N files.  One file only would probably be too little work to justify
firing up a JVM to run the task.  So, you should feed each map task
100 or 1000 4k files?  If 1k files per map task, that's 50k map tasks.
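
One simple way to do that batching (just a sketch of the idea, not the
thread's prescription): write "manifest" files of, say, 1000 source
paths each and make each manifest a map task's input, so one mapper
opens and parses its whole batch of 4k files:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ManifestWriter {
      public static void writeManifests(Configuration conf, List<Path> inputs,
          Path manifestDir, int filesPerTask) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = null;
        int manifest = 0;
        for (int i = 0; i < inputs.size(); i++) {
          if (i % filesPerTask == 0) {      // start a new manifest every N files
            if (out != null) out.close();
            out = fs.create(new Path(manifestDir, "manifest-" + (manifest++)));
          }
          out.writeBytes(inputs.get(i).toString() + "\n");
        }
        if (out != null) out.close();
      }
    }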



> 4.  Each mapper parses the file and extract the user/day records
> 5.  The custom parttioner sends range of users/day to a particular
> reducer
>

Yes.  Your custom partitioner guarantees that a particular user only goes to
one reducer.  How many users do you think you have?  Maybe this job is
too big to do all in one go.  You need to come up with a process with
more steps.  An MR job that runs for weeks will fail (smile).  Someone will
for sure pull the plug on the namenode just as the job is coming to an end.
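
To make the partitioner idea concrete, a sketch of the sort of thing
being described (assumptions, not from the thread: the map output key
is Text laid out as "<userid>|<yyyymmdd>" and user ids are numeric):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class UserRangePartitioner extends Partitioner<Text, Text> {
      // Hypothetical upper bound on user ids in the data set.
      private static final long MAX_USER_ID = 100000000L;

      @Override
      public int getPartition(Text key, Text value, int numPartitions) {
        long userId = Long.parseLong(key.toString().split("\\|", 2)[0]);
        // Contiguous ranges: the first block of users goes to reducer 0,
        // the next block to reducer 1, and so on, so every user lands on
        // exactly one reducer and reducer output stays globally ordered.
        long usersPerReducer = (MAX_USER_ID / numPartitions) + 1;
        return (int) Math.min(numPartitions - 1, userId / usersPerReducer);
      }
    }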



> 6.  reducer in parallel will generate sequence files -- multiple will be
> there
>
>
> My question here is in each year there will be sequence files containing
> a range of users data.  Do I need to identify these and put them
> together in one hfile as the user/day records for all the 10 years
> should be together in the final hfile ?


No.  As stuff flows into the hfiles, it just needs to be guaranteed ordered.
 A particular user may span multiple hfiles.


> So some manual stuff is required
> here taking related sequence files (those containing the same range of
> users / day data) and feeding them to  hfileoutputformat job ?
>


The output of job 1, the one that parses the 4k files and outputs
user+day/value: if it's ordered by user+day, then you can take the outputs of
this first job and feed them to the second job one at a time.  HFiles will
be written for some subset of all users, but for this subset, all of their
activity over the 8 years will be processed.  You'll then move on to the
next set of users....  Eventually you will have many hfiles to upload into
an hbase instance.  You'll probably need to modify loadtable.rb some (one
modification you should make, I think, is NOT to load an hfile whose length is
0 bytes).
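
That check is small; expressed in Java rather than in loadtable.rb
itself (just a sketch of the idea, class name made up), it amounts to
skipping any candidate hfile whose length is 0:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HFileFilter {
      public static List<Path> nonEmptyHFiles(Configuration conf, Path dir)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        List<Path> keep = new ArrayList<Path>();
        for (FileStatus status : fs.listStatus(dir)) {
          if (status.getLen() > 0) {   // drop half-made, zero-length hfiles
            keep.add(status.getPath());
          }
        }
        return keep;
      }
    }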


> -- - Could u also give some links to this multiput technique ??
>
>
Ryan should put up the patch soon (
https://issues.apache.org/jira/browse/HBASE-2066).


This seems like a pretty big job.  My guess is that it's going to take a bit
of time getting it all working.  Given your scale, my guess is that you'll
run into some interesting issues.  For example, how many of those 4k files
have corruption in them, and how will your map tasks deal with the
corruption?

You also need to figure out some things like how long each step is going to
take, how big the resultant data is going to be, and so on, so you can gauge
things like the amount of hardware you are going to need to get the job
done.

The best way to get answers on the above is to start in with running a few
mapreduce jobs passing subsets of the data to see how things work out.

Yours,
St.Ack




>
> St.Ack
>
>
>
>
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > stack
> > Sent: Thursday, January 14, 2010 11:33 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: HBase bulk load
> >
> > On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
> > sriramc@ivycomptech.com> wrote:
> >
> > > I am trying to use this technique to say bulk load 20 billion rows.
> I
> > > tried it on a smaller set 20 million rows. A few things I had to
> take
> > > care was to write a custom partitioning logic so that a range of
> keys
> > > only go to a particular reduce since there was some mention of
> global
> > > ordering.
> > > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
> > >
> > > Good.
> >
> >
> >
> > > My questions are:
> > > 1.  Can I divide the bulk loading into multiple runs  --  the
> existing
> > > bulk load bails out if it finds a HDFS output directory with the
> same
> > > name
> > >
> >
> > No.  Its not currently written to do that but especially if your keys
> > are
> > ordered, it probably wouldn't take much to make the above work (first
> > job
> > does the first set of keys, and so on).
> >
> >
> > > 2.  What I want to do is make multiple runs of 10 billion and then
> > > combine the output before running  loadtable.rb --  is this possible
> ?
> > > I am thinking this may be required in case my MR bulk loading fails
> in
> > > between and I need to start from where I crashed
> > >
> > > Well, MR does retries but, yeah, you could run into some issue at
> the
> > 10B
> > mark and want to then start over from there rather than start from the
> > beginning.
> >
> > One thing that the current setup does not do is remove the task hfile
> > on failure.  We should add this.  It would fix the case where, when
> > speculative execution is enabled and the speculative tasks are killed,
> > we don't leave around half-made hfiles (currently I believe they show
> > as zero-length files).
> >
> > St.Ack
> >
> >
> >
> > > Any tips with huge bulk loading experience ?
> > >
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > stack
> > > Sent: Thursday, January 14, 2010 6:19 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: Re: HBase bulk load
> > >
> > > See
> > >
> >
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
> > > mapreduce/package-summary.html#bulk
> > > St.Ack
> > >
> > > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Jonathan:
> > > > Since you implemented
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
> > > ASE-48.html
> > > > ,
> > > > maybe you can point me to some document how bulk load is used ?
> > > > I found bin/loadtable.rb and assume that can be used to import
> data
> > > back
> > > > into HBase.
> > > >
> > > > Thanks
> > > >
> > >
> > >
> > >
> >
>

RE: HBase bulk load

Posted by Sriram Muthuswamy Chittathoor <sr...@ivycomptech.com>.
See my replies 

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
stack
Sent: Friday, January 15, 2010 4:40 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase bulk load

On Wed, Jan 13, 2010 at 10:28 PM, Sriram Muthuswamy Chittathoor <
sriramc@ivycomptech.com> wrote:

> Some context
> ************
>
> 1. We generate 3 million files a day  --  3 * 365 =  1 billion / year
> (Final is say 10 years though)
> 2. Each file has data relating to a user and day
> 3. Every user will have some activity throughout the year (not
> necessarily on every day)
> 4. Our search is by  {user , date range combination} --  give me data
> for a given user between these date ranges
>
>
So, your key will be userid+day since epoch?   

--- Correct 
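
Roughly like this (illustration only -- assumes a numeric user id; Bytes is
org.apache.hadoop.hbase.util.Bytes):

import org.apache.hadoop.hbase.util.Bytes;

// Illustration: composite row key = userid + days-since-epoch, both as
// fixed-width 8-byte longs so rows sort by user first, then day
// (the byte ordering works as long as ids and days are non-negative).
public class RowKeys {
  public static byte[] rowKey(long userId, long timestampMillis) {
    long day = timestampMillis / (24L * 60 * 60 * 1000);
    return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(day));
  }
}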



> I am assuming this could be done
>
> 1.  The InputSplits will divide the #splits  based on say some range
of
> days  -- say for a year  we divide into 12 splits (All files for a
month
> go to the first mapper and so on)


> 2.  Each mapper processes the files and creates {Key , value}
> combinations --  key is composite  {user , date}
>


Mappers should run for about 5-10 minutes each.  How many months of data
do
you think this will be per mapper?
--   Looks like based on the granularity (processing 1 year's worth of
data vs 1 month's data) and the number of boxes I have, it may vary


--- We need to bulk load 8 years' worth of data from our archives.   That
will be 8 * 12 months of data.

What's your original key made of?
--  Each data file is 4K of text which has 6 players' data on
average.  We will parse it and extract per-userid/day data (so each
of these would be < .5K)

Would you do this step in multiple stages or feed this mapreduce job all
10
years of data? 


I can do it either way.  Since I have 8 years' worth of archived data I need
to get it onto the system as a one-time effort.  If I proceed in
year order -- 2000, 2001, 2002 -- will that be fine?  The only
requirement is that at the end these individual years' data (in hfiles) needs
to be loaded into HBase.



>
> 3.  Custom Partitioner (say has some scheme where it sends a range of
> users (and all their associated date info) to a particular reducer
>
>
Maybe write this back to hdfs as sequencefiles rather than as hfiles and
then take the output of this job's reducer and feed these to your
hfileoutputformat job one at a time if you want to piecemeal the
creation of
hfiles (many jobs rather than one big one).  In this case you'd have one
big
table rather than the one per year as you were suggesting.  You might
have
to move things around in hdfs after all jobs were done to put
directories
where loadtable.rb expects to find stuff (or better, just mod
loadtable.rb... it's a simple script).

--  Can u give me some link to doing this?   If I am getting u right, is
this the sequence:

1.  Start with say year 2000 (1 billion 4k files to be processed and
loaded)
2.  Divide it into splits initially based on just filename ranges
(user/day data is hidden inside the file)
3.  Each mapper gets a bunch of files (if it is 20 mappers then each one
will have to process 50 million 4k files -- seems too much even for a
single year?  Should I go to processing a single month at a
time?)
4.  Each mapper parses the file and extracts the user/day records
5.  The custom partitioner sends a range of users/days to a particular
reducer
6.  The reducers in parallel will generate sequence files -- there will be
multiple (sketch of steps 4-6 below)
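
Something along these lines is what I have in mind for steps 4-6 (very rough
and untested; the class name and the tab-separated parsing are placeholders
for our real file format, and UserRangePartitioner is the range partitioner I
mentioned -- a sketch of it is further down the thread):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Placeholder stage-1 mapper: parse a record out of one of the 4k files and
// emit {userid/day -> value}.  The split("\t") parsing is made up.
public class UserDayMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = record.toString().split("\t");   // assumption
    String user = fields[0];
    String day = fields[1];                             // e.g. days since epoch
    String value = fields[2];
    ctx.write(new Text(user + "/" + day), new Text(value));
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "stage1-parse");
    job.setJarByClass(UserDayMapper.class);
    job.setMapperClass(UserDayMapper.class);
    job.setPartitionerClass(UserRangePartitioner.class); // custom range partitioner
    job.setNumReduceTasks(20);                           // made up; must line up with the ranges
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}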


My question here is that for each year there will be sequence files containing
a range of users' data.  Do I need to identify these and put them
together in one hfile, since the user/day records for all 10 years
should be together in the final hfile?  So some manual stuff is required
here, taking related sequence files (those containing the same range of
user/day data) and feeding them to the hfileoutputformat job?


> 4.  U will have the output generated per reducer. Just need to
loadtable
> on this output


> This is what I am thinking  --  instead of loading all the 10 years
data
> into 1 table --  load it into tables per year.  That way I will have
to
> deal with failure of MR on a year granularity.  Because the whole load
> may take weeks
>


You could do this or do the above suggested route.

It shouldn't take weeks.  Ryan is claiming that he put 12B (small) rows
in
two days with his fancy new multiput.  Writing the hfiles should run at
least an order of magnitude faster -- unless your cells are large.

This goes without saying, but I'll say it anyway, please test first with
small datasets to ensure stuff works for you.  Use the head of the 0.20
branch.  It has a small fix for a silly bug in KeyValueSortReducer.

-- - Could u also give some links to this multiput technique ??


St.Ack




>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> stack
> Sent: Thursday, January 14, 2010 11:33 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
> sriramc@ivycomptech.com> wrote:
>
> > I am trying to use this technique to say bulk load 20 billion rows.
I
> > tried it on a smaller set of 20 million rows. A few things I had to
take
> > care was to write a custom partitioning logic so that a range of
keys
> > only go to a particular reducer since there was some mention of
global
> > ordering.
> > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
> >
> > Good.
>
>
>
> > My questions are:
> > 1.  Can I divide the bulk loading into multiple runs  --  the
existing
> > bulk load bails out if it finds a HDFS output directory with the
same
> > name
> >
>
> No.  Its not currently written to do that but especially if your keys
> are
> ordered, it probably wouldn't take much to make the above work (first
> job
> does the first set of keys, and so on).
>
>
> > 2.  What I want to do is make multiple runs of 10 billion and then
> > combine the output before running  loadtable.rb --  is this possible
?
> > I am thinking this may be required in case my MR bulk loading fails
in
> > between and I need to start from where I crashed
> >
> > Well, MR does retries but, yeah, you could run into some issue at
the
> 10B
> mark and want to then start over from there rather than start from the
> beginning.
>
> One thing that the current setup does not do is remove the task hfile
on
> failure.  We should add this.  Would fix the case where, when speculative
> execution is enabled, and the speculative tasks are killed, we don't
> leave
> around half-made hfiles (Currently I believe they show as
> zero-length
> files).
>
> St.Ack
>
>
>
> > Any tips with huge bulk loading experience ?
> >
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > stack
> > Sent: Thursday, January 14, 2010 6:19 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: HBase bulk load
> >
> > See
> >
>
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
> > mapreduce/package-summary.html#bulk
> > St.Ack
> >
> > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Jonathan:
> > > Since you implemented
> > >
> > >
> >
>
https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
> > ASE-48.html
> > > ,
> > > maybe you can point me to some document how bulk load is used ?
> > > I found bin/loadtable.rb and assume that can be used to import
data
> > back
> > > into HBase.
> > >
> > > Thanks
> > >
> >
> >
> >
>

Re: HBase bulk load

Posted by stack <st...@duboce.net>.
On Wed, Jan 13, 2010 at 10:28 PM, Sriram Muthuswamy Chittathoor <
sriramc@ivycomptech.com> wrote:

> Some context
> ************
>
> 1. We generate 3 million files a day  --  3 * 365 =  1 billion / year
> (Final is say 10 years though)
> 2. Each file has data relating to a user and day
> 3. Every user will have some activity throughout the year (not
> necessarily on every day)
> 4. Our search is by  {user , date range combination} --  give me data
> for a given user between these date ranges
>
>
So, your key will be userid+day since epoch?



> I am assuming this could be done
>
> 1.  The InputSplits will divide the #splits  based on say some range of
> days  -- say for a year  we divide into 12 splits (All files for a month
> go to the first mapper and so on)


> 2.  Each mapper processes the files and creates {Key , value}
> combinations --  key is composite  {user , date}
>


Mappers should run for about 5-10 minutes each.  How many months of data do
you think this will be per mapper?

What's your original key made of?

Would you do this step in multiple stages or feed this mapreduce job all 10
years of data?



>
> 3.  Custom Partitioner (say has some scheme where it sends a range of
> users (and all their associated date info) to a particular reducer
>
>
Maybe write this back to hdfs as sequencefiles rather than as hfiles and
then take the output of this job's reducer and feed these to your
hfileoutputformat job one at a time if you want to piecemeal the creation of
hfiles (many jobs rather than one big one).  In this case you'd have one big
table rather than the one per year as you were suggesting.  You might have
to move things around in hdfs after all jobs were done to put directories
where loadtable.rb expects to find stuff (or better, just mod
loadtable.rb... it's a simple script).
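
The second job would look something like the below (rough, untested sketch;
I'm assuming for the example that your first job wrote Text/Text
sequencefiles keyed userid/day and that the family/qualifier are 'd'/'v' --
adjust the little conversion mapper to whatever you actually write):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Rough sketch of the hfileoutputformat job: convert the stage-1
// {userid/day -> value} records into KeyValues and write hfiles.
// Family and qualifier names are made up.
public class HFileJob extends Mapper<Text, Text, ImmutableBytesWritable, KeyValue> {
  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final byte[] QUALIFIER = Bytes.toBytes("v");

  @Override
  protected void map(Text key, Text value, Context ctx)
      throws IOException, InterruptedException {
    byte[] row = Bytes.toBytes(key.toString());        // userid/day as the row key
    KeyValue kv = new KeyValue(row, FAMILY, QUALIFIER, Bytes.toBytes(value.toString()));
    ctx.write(new ImmutableBytesWritable(row), kv);
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "stage2-hfiles");
    job.setJarByClass(HFileJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(HFileJob.class);
    job.setReducerClass(KeyValueSortReducer.class);     // sorts the KeyValues within a row
    // Keep your custom range partitioner here too so each reducer (and so
    // each set of hfiles) covers a contiguous slice of the key space.
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setOutputFormatClass(HFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}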




> 4.  U will have the output generated per reducer. Just need to loadtable
> on this output


> This is what I am thinking  --  instead of loading all the 10 years data
> into 1 table --  load it into tables per year.  That way I will have to
> deal with failure of MR on a year granularity.  Because the whole load
> may take weeks
>


You could do this or do the above suggested route.

It shouldn't take weeks.  Ryan is claiming that he put 12B (small) rows in
two days with his fancy new multiput.  Writing the hfiles should run at
least an order of magnitude faster -- unless your cells are large.

This goes without saying, but I'll say it anyway, please test first with
small datasets to ensure stuff works for you.  Use the head of the 0.20
branch.  It has a small fix for a silly bug in KeyValueSortReducer.

St.Ack




>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> stack
> Sent: Thursday, January 14, 2010 11:33 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
> sriramc@ivycomptech.com> wrote:
>
> > I am trying to use this technique to say bulk load 20 billion rows.  I
> > tried it on a smaller set of 20 million rows. One thing I had to take
> > care of was to write custom partitioning logic so that a range of keys
> > only goes to a particular reducer since there was some mention of global
> > ordering.
> > For example  Users  (1 --  1mill) ---> Reducer 1 and so on
> >
> > Good.
>
>
>
> > My questions are:
> > 1.  Can I divide the bulk loading into multiple runs  --  the existing
> > bulk load bails out if it finds a HDFS output directory with the same
> > name
> >
>
> No.  Its not currently written to do that but especially if your keys
> are
> ordered, it probably wouldn't take much to make the above work (first
> job
> does the first set of keys, and so on).
>
>
> > 2.  What I want to do is make multiple runs of 10 billion and then
> > combine the output before running  loadtable.rb --  is this possible ?
> > I am thinking this may be required in case my MR bulk loading fails in
> > between and I need to start from where I crashed
> >
> > Well, MR does retries but, yeah, you could run into some issue at the
> 10B
> mark and want to then start over from there rather than start from the
> beginning.
>
> One thing that the current setup does not do is remove the task hfile on
> failure.  We should add this.  Would fix the case where, when speculative
> execution is enabled, and the speculative tasks are killed, we don't
> leave
> around half-made hfiles (Currently I believe they show as
> zero-length
> files).
>
> St.Ack
>
>
>
> > Any tips with huge bulk loading experience ?
> >
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > stack
> > Sent: Thursday, January 14, 2010 6:19 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: HBase bulk load
> >
> > See
> >
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
> > mapreduce/package-summary.html#bulk
> > St.Ack
> >
> > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Jonathan:
> > > Since you implemented
> > >
> > >
> >
> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
> > ASE-48.html
> > > ,
> > > maybe you can point me to some document how bulk load is used ?
> > > I found bin/loadtable.rb and assume that can be used to import data
> > back
> > > into HBase.
> > >
> > > Thanks
> > >
> >
> >
> >
>

RE: HBase bulk load

Posted by Sriram Muthuswamy Chittathoor <sr...@ivycomptech.com>.
Some context
************

1. We generate 3 million files a day  --  3 million * 365 is about 1.1 billion / year
(Final is say 10 years though)
2. Each file has data relating to a user and day 
3. Every user will have some activity throughout the year (not
necessarily on every day)
4. Our search is by  {user , date range combination} --  give me data
for a given user between these date ranges 

I am assuming this could be done

1.  The InputSplits will divide the #splits  based on say some range of
days  -- say for a year  we divide into 12 splits (All files for a month
go to the first mapper and so on)

2.  Each mapper processes the files and creates {Key , value}
combinations --  key is composite  {user , date}

3.  Custom Partitioner (say it has some scheme where it sends a range of
users (and all their associated date info) to a particular reducer)

4.  U will have the output generated per reducer. Just need to run loadtable
on this output

This is what I am thinking  --  instead of loading all the 10 years' data
into 1 table, load it into tables per year.  That way I will only have to
deal with MR failures at a year granularity, because the whole load
may take weeks.

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
stack
Sent: Thursday, January 14, 2010 11:33 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase bulk load

On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
sriramc@ivycomptech.com> wrote:

> I am trying to use this technique to say bulk load 20 billion rows.  I
> tried it on a smaller set of 20 million rows. One thing I had to take
> care of was to write custom partitioning logic so that a range of keys
> only goes to a particular reducer since there was some mention of global
> ordering.
> For example  Users  (1 --  1mill) ---> Reducer 1 and so on
>
> Good.



> My questions are:
> 1.  Can I divide the bulk loading into multiple runs  --  the existing
> bulk load bails out if it finds a HDFS output directory with the same
> name
>

No.  Its not currently written to do that but especially if your keys
are
ordered, it probably wouldn't take much to make the above work (first
job
does the first set of keys, and so on).


> 2.  What I want to do is make multiple runs of 10 billion and then
> combine the output before running  loadtable.rb --  is this possible ?
> I am thinking this may be required in case my MR bulk loading fails in
> between and I need to start from where I crashed
>
> Well, MR does retries but, yeah, you could run into some issue at the
10B
mark and want to then start over from there rather than start from the
beginning.

One thing that the current setup does not do is remove the task hfile on
failure.  We should add this.  Would fix the case where, when speculative
execution is enabled, and the speculative tasks are killed, we don't
leave
around half-made hfiles (Currently I believe they show as
zero-length
files).

St.Ack



> Any tips with huge bulk loading experience ?
>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> stack
> Sent: Thursday, January 14, 2010 6:19 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> See
>
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
> mapreduce/package-summary.html#bulk
> St.Ack
>
> On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Jonathan:
> > Since you implemented
> >
> >
>
https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
> ASE-48.html
> > ,
> > maybe you can point me to some document how bulk load is used ?
> > I found bin/loadtable.rb and assume that can be used to import data
> back
> > into HBase.
> >
> > Thanks
> >
>
>
>

Re: HBase bulk load

Posted by stack <st...@duboce.net>.
On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
sriramc@ivycomptech.com> wrote:

> I am trying to use this technique to say bulk load 20 billion rows.  I
> tried it on a smaller set of 20 million rows. One thing I had to take
> care of was to write custom partitioning logic so that a range of keys
> only goes to a particular reducer since there was some mention of global
> ordering.
> For example  Users  (1 --  1mill) ---> Reducer 1 and so on
>
> Good.



> My questions are:
> 1.  Can I divide the bulk loading into multiple runs  --  the existing
> bulk load bails out if it finds a HDFS output directory with the same
> name
>

No.  It's not currently written to do that but especially if your keys are
ordered, it probably wouldn't take much to make the above work (first job
does the first set of keys, and so on).
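
i.e. a driver along these lines (untested sketch; the paths and the per-year
ranges are made up, and the mapper/reducer/partitioner are whatever you
already have):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Untested sketch: one job per key range (here, per year), each with its own
// output directory.  A rerun skips ranges whose output already exists, so a
// crash partway through does not mean starting over from the beginning.
public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (int year = 2000; year <= 2007; year++) {
      Path in = new Path("/archive/" + year);          // made-up layout
      Path out = new Path("/bulkout/" + year);
      if (fs.exists(out)) {
        System.out.println("Skipping " + year + ", output already there");
        continue;                                      // already done (or inspect and delete it)
      }
      Job job = new Job(conf, "bulkload-" + year);
      // job.setMapperClass(...); job.setReducerClass(...);
      // job.setPartitionerClass(...);  -- same classes as your single big job
      FileInputFormat.addInputPath(job, in);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("Year " + year + " failed; rerun to resume here");
      }
    }
  }
}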


> 2.  What I want to do is make multiple runs of 10 billion and then
> combine the output before running  loadtable.rb --  is this possible ?
> I am thinking this may be required in case my MR bulk loading fails in
> between and I need to start from where I crashed
>
> Well, MR does retries but, yeah, you could run into some issue at the 10B
mark and want to then start over from there rather than start from the
beginning.

One thing that the current setup does not do is remove the task hfile on
failure.  We should add this.  Would fix the case where, when speculative
execution is enabled, and the speculative tasks are killed, we don't leave
around half-made hfiles (Currently I believe they show as zero-length
files).

St.Ack



> Any tips with huge bulk loading experience ?
>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> stack
> Sent: Thursday, January 14, 2010 6:19 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase bulk load
>
> See
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
> mapreduce/package-summary.html#bulk
> St.Ack
>
> On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Jonathan:
> > Since you implemented
> >
> >
> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
> ASE-48.html
> > ,
> > maybe you can point me to some document how bulk load is used ?
> > I found bin/loadtable.rb and assume that can be used to import data
> back
> > into HBase.
> >
> > Thanks
> >
>
>
>

RE: HBase bulk load

Posted by Sriram Muthuswamy Chittathoor <sr...@ivycomptech.com>.
I am trying to use this technique to say bulk load 20 billion rows.  I
tried it on a smaller set of 20 million rows. One thing I had to take
care of was to write custom partitioning logic so that a range of keys
only goes to a particular reducer since there was some mention of global
ordering.
For example  Users  (1 --  1mill) ---> Reducer 1 and so on
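
Roughly like this (simplified sketch of what I did; the "userid/day" Text key
layout and the range size are made up for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Simplified range partitioner: user ids 1..1,000,000 go to reducer 0,
// the next million to reducer 1, and so on, so each reducer sees one
// contiguous, sorted slice of the key space.
public class UserRangePartitioner extends Partitioner<Text, Text> {
  private static final long USERS_PER_REDUCER = 1000000L;

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Key is "userid/day"; pull the numeric user id off the front.
    String k = key.toString();
    long userId = Long.parseLong(k.substring(0, k.indexOf('/')));
    int partition = (int) ((userId - 1) / USERS_PER_REDUCER);
    return Math.min(partition, numPartitions - 1);     // clamp, keep ranges contiguous
  }
}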

My questions are:
1.  Can I divide the bulk loading into multiple runs  --  the existing
bulk load bails out if it finds a HDFS output directory with the same
name
2.  What I want to do is make multiple runs of 10 billion and then
combine the output before running  loadtable.rb --  is this possible ?
I am thinking this may be required in case my MR bulk loading fails in
between and I need to start from where I crashed

Any tips with huge bulk loading experience ?


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
stack
Sent: Thursday, January 14, 2010 6:19 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase bulk load

See
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/
mapreduce/package-summary.html#bulk
St.Ack

On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:

> Jonathan:
> Since you implemented
>
>
https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HB
ASE-48.html
> ,
> maybe you can point me to some document how bulk load is used ?
> I found bin/loadtable.rb and assume that can be used to import data
back
> into HBase.
>
> Thanks
>



Re: HBase bulk load

Posted by stack <st...@duboce.net>.
See
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
St.Ack

On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <yu...@gmail.com> wrote:

> Jonathan:
> Since you implemented
>
> https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HBASE-48.html
> ,
> maybe you can point me to some document how bulk load is used ?
> I found bin/loadtable.rb and assume that can be used to import data back
> into HBase.
>
> Thanks
>