Posted to user@hbase.apache.org by Shuja Rehman <sh...@gmail.com> on 2010/11/05 12:13:02 UTC

Best Way to Insert data into Hbase using Map Reduce

Hi

I am reading data from raw XML files and inserting it into HBase using
TableOutputFormat in a MapReduce job, but due to the heavy volume of Put
statements it takes many hours to process the data. Here is my sample code.

conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
conf.set("xmlinput.start", "<adc>");
conf.set("xmlinput.end", "</adc>");
conf.set("io.serializations",
    "org.apache.hadoop.io.serializer.JavaSerialization,"
        + "org.apache.hadoop.io.serializer.WritableSerialization");

Job job = new Job(conf, "Populate Table with Data");

FileInputFormat.setInputPaths(job, input);
job.setJarByClass(ParserDriver.class);
job.setMapperClass(MyParserMapper.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(XmlInputFormat.class);
job.setOutputFormatClass(TableOutputFormat.class);


*and mapper code*

public class MyParserMapper extends
        Mapper<LongWritable, Text, NullWritable, Writable> {

    @Override
    public void map(LongWritable key, Text value1, Context context)
            throws IOException, InterruptedException {
        // ...doing some processing...
        while (rItr.hasNext()) {
            // This put statement runs 132,622,560 times to insert the data.
            context.write(NullWritable.get(),
                new Put(rowId).add(Bytes.toBytes("CounterValues"),
                    Bytes.toBytes(counter.toString()),
                    Bytes.toBytes(rElement.getTextTrim())));
        }
    }
}

Is there any other way of doing this task so that I can improve the performance?


-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by Michael Segel <mi...@hotmail.com>.
Ok.

You have a couple of issues.
First, each file is a single record. That doesn't make for a good map/reduce job, although you can pass in the directory and then get a map task per file, assuming you're processing all of the files at the same time.

Having millions of fields... I'm not sure that you have well-structured data within your XML file, or whether you really want to create one row per record.

Part of your speed problem is that building a DOM tree with millions of fields is probably what is taking so long. (You have the cost of putting your entire document into memory, plus the time it takes to build the tree.) Then you have to determine your mapping from the JDOM object to your HBase table.

Using StAX will make your code more efficient.

With respect to the buffer caching:

What that will do is cache your writes on the client side. I'm not sure that makes sense when you're processing an entire file that is going to be larger than your cache.

I don't believe that is going to be your performance issue. Having both a bad XML schema and a bad HBase schema will be an issue.
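
For a sense of what the streaming approach looks like, here is a minimal, self-contained StAX sketch that walks an XML file without ever building a DOM/JDOM tree. The <counter name="...">value</counter> layout and the println standing in for the HBase write are assumptions for illustration only, not the actual schema:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCounterReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream(args[0]));
        String currentCounter = null;   // name of the <counter> element we are inside
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "counter".equals(reader.getLocalName())) {
                currentCounter = reader.getAttributeValue(null, "name");
            } else if (event == XMLStreamConstants.CHARACTERS && currentCounter != null) {
                String value = reader.getText().trim();
                if (!value.isEmpty()) {
                    // The document never sits in memory as a tree; emit the field
                    // (or build the Put) as soon as it is seen.
                    System.out.println(currentCounter + " = " + value);
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "counter".equals(reader.getLocalName())) {
                currentCounter = null;
            }
        }
        reader.close();
    }
}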

> Date: Mon, 8 Nov 2010 21:59:22 +0500
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> One more thing which i want to ask that i have found that people have given
> the following buffer size.
> 
>   table.setWriteBufferSize(1024*1024*24);
>   table.setAutoFlush(false);
> 
> Is there any specific reason of giving such buffer size? and how much ram is
> required for it. I have given 4 GB to each region server and I can see that
> used heap value for region server going increasing and increasing and region
> servers are crashing then.
> 
> On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <sh...@gmail.com> wrote:
> 
> > Ok
> > Well...i am getting hundred of files daily which all need to process thats
> > why i am using hadoop so it manage distribution of processing itself.
> > Yes, one record has millions of fields
> >
> > Thanks for comments.
> >
> >
> > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <mi...@hotmail.com>wrote:
> >
> >>
> >> Switch out the JDOM for a Stax parser.
> >>
> >> Ok, having said that...
> >> You said you have a single record per file. Ok that means you have a lot
> >> of fields.
> >> Because you have 1 record, this isn't a map/reduce problem. You're better
> >> off writing a single threaded app
> >> to read the file, parse the file using Stax, and then write the fields to
> >> HBase.
> >>
> >> I'm not sure why you have millions of put()s.
> >> Do you have millions of fields in this one record?
> >>
> >> Writing a good stax parser and then mapping the fields to your hbase
> >> column(s) will help.
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. A good stax implementation would be a recursive/re-entrant piece of
> >> code.
> >> While the code may look simple, it takes a skilled developer to write and
> >> maintain.
> >>
> >>
> >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > From: shujamughal@gmail.com
> >> > To: user@hbase.apache.org
> >> >
> >> > HI
> >> >
> >> > I have used JDOM library to parse the xml in mapper and in my case, one
> >> > single file consist of 1 record so i give one complete file to map
> >> process
> >> > and extract the information from it which i need. I have only 2 column
> >> > families in my schema and bottleneck was the put statements which run
> >> > millions of time for each file. when i comment this put statement then
> >> job
> >> > complete within minutes but with put statement, it was taking about 7
> >> hours
> >> > to complete the same job. Anyhow I have changed the code according to
> >> > suggestion given by Michael  and now using java api to dump data instead
> >> of
> >> > table output format and used the list of puts and then flush them at
> >> each
> >> > 1000 records and it reduces the time significantly. Now the whole job
> >> > process by 1 hour and 45 min approx but still not in minutes. So is
> >> there
> >> > anything left which i might apply and performance increase?
> >> >
> >> > Thanks
> >> >
> >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
> >> wrote:
> >> >
> >> > > Good points.
> >> > > Before we can make any rational suggestion, we need to know where the
> >> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> >> > > personally favor Michael's suggestion to split the ingest and the
> >> parsing
> >> > > parts of your job, and to switch to a parser that is faster than a DOM
> >> > > parser (SAX or Stax). But, without knowing what the bottleneck
> >> actually is,
> >> > > all of these suggestions are shots in the dark.
> >> > >
> >> > > What is the network load, the CPU load, the disk load?  Have you at
> >> least
> >> > > installed Ganglia or some equivalent so that you can see what the load
> >> is
> >> > > across the cluster?
> >> > >
> >> > > Dave
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> >> > > Sent: Friday, November 05, 2010 9:49 AM
> >> > > To: user@hbase.apache.org
> >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > >
> >> > >
> >> > > I don't think using the buffered client is going to help a lot w
> >> > > performance.
> >> > >
> >> > > I'm a little confused because it doesn't sound like Shuja is using a
> >> > > map/reduce job to parse the file.
> >> > > That is... he says he parses the file in to a dom tree. Usually your
> >> map
> >> > > job parses each record and then in the mapper you parse out the
> >> record.
> >> > > Within the m/r job we don't parse out the fields in the records
> >> because we
> >> > > do additional processing which 'dedupes' the data so we don't have to
> >> > > further process the data.
> >> > > The second job only has to parse a portion of the original records.
> >> > >
> >> > > So assuming that Shuja is actually using a map reduce job, and each
> >> xml
> >> > > record is being parsed within the mapper() there are a couple of
> >> things...
> >> > > 1) Reduce the number of column families that you are using. (Each
> >> column
> >> > > family is written to a separate file)
> >> > > 2) Set up the HTable instance in Mapper.setup()
> >> > > 3) Switch to a different dom class (not all java classes are equal) or
> >> > > switch to Stax.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > > From: buttler1@llnl.gov
> >> > > > To: user@hbase.apache.org
> >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Have you tried turning off auto flush, and managing the flush in
> >> your own
> >> > > code (say every 1000 puts?)
> >> > > > Dave
> >> > > >
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> >> > > > Sent: Friday, November 05, 2010 8:04 AM
> >> > > > To: user@hbase.apache.org
> >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Michael
> >> > > >
> >> > > > hum....so u are storing xml record in the hbase and in second job, u
> >> r
> >> > > > parsing. but in my case i am parsing it also in first phase. what i
> >> do, i
> >> > > > get xml file and i parse it using jdom and then putting data in
> >> hbase. so
> >> > > > parsing+putting both operations are in 1 phase and in mapper code.
> >> > > >
> >> > > > My actual problem is that after parsing file, i need to use put
> >> statement
> >> > > > millions of times and i think for each statement it connects to
> >> hbase and
> >> > > > then insert it and this might be the reason of slow processing. So i
> >> am
> >> > > > trying to figure out some way we i can first buffer data and then
> >> insert
> >> > > in
> >> > > > batch fashion. it means in one put statement, i can insert many
> >> records
> >> > > and
> >> > > > i think if i do in this way then the process will be very fast.
> >> > > >
> >> > > > secondly what does it means? "we write the raw record in via a
> >> single
> >> > > put()
> >> > > > so the map() method is a null writable."
> >> > > >
> >> > > > can u explain it more?
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > >
> >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
> >> michael_segel@hotmail.com
> >> > > >wrote:
> >> > > >
> >> > > > >
> >> > > > > Suja,
> >> > > > >
> >> > > > > Just did a quick glance.
> >> > > > >
> >> > > > > What is it that you want to do exactly?
> >> > > > >
> >> > > > > Here's how we do it... (at a high level.)
> >> > > > >
> >> > > > > Input is an XML file where we want to store the raw XML records in
> >> > > hbase,
> >> > > > > one record per row.
> >> > > > >
> >> > > > > Instead of using the output of the map() method, we write the raw
> >> > > record in
> >> > > > > via a single put() so the map() method is a null writable.
> >> > > > >
> >> > > > > Its pretty fast. However fast is relative.
> >> > > > >
> >> > > > > Another thing... we store the xml record as a string (converted to
> >> > > > > bytecode) rather than a serialized object.
> >> > > > >
> >> > > > > Then you can break it down in to individual fields in a second
> >> batch
> >> > > job.
> >> > > > > (You can start with a DOM parser, and later move to a Stax parser.
> >> > > > > Depending on which DOM parser you have and the size of the record,
> >> it
> >> > > should
> >> > > > > be 'fast enough'. A good implementation of Stax tends to be
> >> > > > > recursive/re-entrant code which is harder to maintain.)
> >> > > > >
> >> > > > > HTH
> >> > > > >
> >> > > > > -Mike
> >> > > > >
> >> > > > >
> >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> >> > > > > > From: shujamughal@gmail.com
> >> > > > > > To: user@hbase.apache.org
> >> > > > > >
> >> > > > > > Hi
> >> > > > > >
> >> > > > > > I am reading data from raw xml files and inserting data into
> >> hbase
> >> > > using
> >> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
> >> > > statements,
> >> > > > > it
> >> > > > > > takes many hours to process the data. here is my sample code.
> >> > > > > >
> >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >> > > > > >     conf.set("xmlinput.start", "<adc>");
> >> > > > > >     conf.set("xmlinput.end", "</adc>");
> >> > > > > >     conf
> >> > > > > >         .set(
> >> > > > > >           "io.serializations",
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > >
> >> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> >> > > > > >
> >> > > > > >       Job job = new Job(conf, "Populate Table with Data");
> >> > > > > >
> >> > > > > >     FileInputFormat.setInputPaths(job, input);
> >> > > > > >     job.setJarByClass(ParserDriver.class);
> >> > > > > >     job.setMapperClass(MyParserMapper.class);
> >> > > > > >     job.setNumReduceTasks(0);
> >> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
> >> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> >> > > > > >
> >> > > > > >
> >> > > > > > *and mapper code*
> >> > > > > >
> >> > > > > > public class MyParserMapper   extends
> >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> >> > > > > >
> >> > > > > >     @Override
> >> > > > > >     public void map(LongWritable key, Text value1,Context
> >> context)
> >> > > > > >
> >> > > > > > throws IOException, InterruptedException {
> >> > > > > > *//doing some processing*
> >> > > > > >  while(rItr.hasNext())
> >> > > > > >                     {
> >> > > > > > *                   //and this put statement runs for
> >> 132,622,560
> >> > > times
> >> > > > > to
> >> > > > > > insert the data.*
> >> > > > > >                     context.write(NullWritable.get(), new
> >> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> >> > > > > > Bytes.toBytes(counter.toString()),
> >> > > > > Bytes.toBytes(rElement.getTextTrim())));
> >> > > > > >
> >> > > > > >                     }
> >> > > > > >
> >> > > > > > }}
> >> > > > > >
> >> > > > > > Is there any other way of doing this task so i can improve the
> >> > > > > performance?
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Regards
> >> > > > > > Shuja-ur-Rehman Baig
> >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Regards
> >> > > > Shuja-ur-Rehman Baig
> >> > > > <http://pk.linkedin.com/in/shujamughal>
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Regards
> >> > Shuja-ur-Rehman Baig
> >> > <http://pk.linkedin.com/in/shujamughal>
> >>
> >>
> >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
> >
> >
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
 		 	   		  

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
Hi Oleg,

Yes, I have used HTablePool. Here is my basic code skeleton

public void setup(Context context) {
    HBaseConfiguration config = new HBaseConfiguration();
    config.set("hbase.zookeeper.quorum", Constants.HBASE_ZOOKEEPER_QUORUM);
    config.set("hbase.zookeeper.property.clientPort",
        Constants.HBASE_ZOOKEEPRR_PROPERTY_CLIENTPORT);
    HTablePool tablePool = new HTablePool(config, 50);
    table = (HTable) tablePool.getTable("myTable");
}

public void map(LongWritable key, Text value1, Context context) {

    List<Put> puts = new ArrayList<Put>();
    table.setWriteBufferSize(1024 * 1024 * 24);
    table.setAutoFlush(false);

    while (true) {                     // loop over the parsed records (termination elided)
        Put put = new Put(rowId);
        put.add( ... );                // family/qualifier/value elided
        put.setWriteToWAL(false);
        puts.add(put);

        if (cnt % 500 == 0) {          // flush a batch every 500 puts
            table.getWriteBuffer().addAll(puts);
            table.flushCommits();
            puts.clear();
        }
        cnt++;
    } // while

    if (puts.size() > 0) {             // flush whatever is left
        table.getWriteBuffer().addAll(puts);
        table.flushCommits();
        puts.clear();
    }
} // map
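
For comparison, a minimal self-contained sketch of the same mapper that leans on HTable's built-in write buffer instead of the hand-rolled List<Put> and getWriteBuffer() calls: autoflush off, a plain table.put() in the loop, one flushCommits() in cleanup(). The Constants class and pool size are taken from the skeleton above; the column family/qualifier and the one-put-per-map() body are only illustrative stand-ins for the real parsing loop.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferedPutMapper
        extends Mapper<LongWritable, Text, NullWritable, Writable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        HBaseConfiguration config = new HBaseConfiguration();
        config.set("hbase.zookeeper.quorum", Constants.HBASE_ZOOKEEPER_QUORUM);
        config.set("hbase.zookeeper.property.clientPort",
            Constants.HBASE_ZOOKEEPRR_PROPERTY_CLIENTPORT);
        HTablePool tablePool = new HTablePool(config, 50);
        table = (HTable) tablePool.getTable("myTable");
        table.setAutoFlush(false);                   // buffer puts on the client side
        table.setWriteBufferSize(1024 * 1024 * 24);  // flushed automatically once full
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // In the real job this put sits inside the record loop produced by parsing.
        Put put = new Put(Bytes.toBytes(key.get()));
        put.add(Bytes.toBytes("CounterValues"), Bytes.toBytes("counter"),
                Bytes.toBytes(value.toString()));
        table.put(put);              // goes into the write buffer, not one RPC per Put
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.flushCommits();        // push whatever is still buffered
        table.close();
    }
}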


On Tue, Nov 9, 2010 at 3:32 PM, Oleg Ruchovets <or...@gmail.com> wrote:

> Hi ,
> Do you use HTablePool?
> Changing the code to using HBasePool gives  me significat performance
> benefit.
>
>
> HBaseConfiguration conf = new HBaseConfiguration();
> HTablePool pool = new HTablePool(conf, 10);
> HTable table = pool.getTable(name);
>
> Actually disabling WAL ,
> Increasing pool size and rewriting code to using WriteBuffer
> Gives me a good improvement.
>



> I wonder : how can I check that my insertion process is optimized.
>  I mean if insertion took X time -- is it good or no? and how can I check
> it.
>
> Thanks Oleg.
>
>
>
> On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <sh...@gmail.com>
> wrote:
>
> > One more thing which i want to ask that i have found that people have
> given
> > the following buffer size.
> >
> >  table.setWriteBufferSize(1024*1024*24);
> >  table.setAutoFlush(false);
> >
> > Is there any specific reason of giving such buffer size? and how much ram
> > is
> > required for it. I have given 4 GB to each region server and I can see
> that
> > used heap value for region server going increasing and increasing and
> > region
> > servers are crashing then.
> >
> > On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <sh...@gmail.com>
> > wrote:
> >
> > > Ok
> > > Well...i am getting hundred of files daily which all need to process
> > thats
> > > why i am using hadoop so it manage distribution of processing itself.
> > > Yes, one record has millions of fields
> > >
> > > Thanks for comments.
> > >
> > >
> > > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <
> michael_segel@hotmail.com
> > >wrote:
> > >
> > >>
> > >> Switch out the JDOM for a Stax parser.
> > >>
> > >> Ok, having said that...
> > >> You said you have a single record per file. Ok that means you have a
> lot
> > >> of fields.
> > >> Because you have 1 record, this isn't a map/reduce problem. You're
> > better
> > >> off writing a single threaded app
> > >> to read the file, parse the file using Stax, and then write the fields
> > to
> > >> HBase.
> > >>
> > >> I'm not sure why you have millions of put()s.
> > >> Do you have millions of fields in this one record?
> > >>
> > >> Writing a good stax parser and then mapping the fields to your hbase
> > >> column(s) will help.
> > >>
> > >> HTH
> > >>
> > >> -Mike
> > >> PS. A good stax implementation would be a recursive/re-entrant piece
> of
> > >> code.
> > >> While the code may look simple, it takes a skilled developer to write
> > and
> > >> maintain.
> > >>
> > >>
> > >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> > >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > >> > From: shujamughal@gmail.com
> > >> > To: user@hbase.apache.org
> > >> >
> > >> > HI
> > >> >
> > >> > I have used JDOM library to parse the xml in mapper and in my case,
> > one
> > >> > single file consist of 1 record so i give one complete file to map
> > >> process
> > >> > and extract the information from it which i need. I have only 2
> column
> > >> > families in my schema and bottleneck was the put statements which
> run
> > >> > millions of time for each file. when i comment this put statement
> then
> > >> job
> > >> > complete within minutes but with put statement, it was taking about
> 7
> > >> hours
> > >> > to complete the same job. Anyhow I have changed the code according
> to
> > >> > suggestion given by Michael  and now using java api to dump data
> > instead
> > >> of
> > >> > table output format and used the list of puts and then flush them at
> > >> each
> > >> > 1000 records and it reduces the time significantly. Now the whole
> job
> > >> > process by 1 hour and 45 min approx but still not in minutes. So is
> > >> there
> > >> > anything left which i might apply and performance increase?
> > >> >
> > >> > Thanks
> > >> >
> > >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
> > >> wrote:
> > >> >
> > >> > > Good points.
> > >> > > Before we can make any rational suggestion, we need to know where
> > the
> > >> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> > >> > > personally favor Michael's suggestion to split the ingest and the
> > >> parsing
> > >> > > parts of your job, and to switch to a parser that is faster than a
> > DOM
> > >> > > parser (SAX or Stax). But, without knowing what the bottleneck
> > >> actually is,
> > >> > > all of these suggestions are shots in the dark.
> > >> > >
> > >> > > What is the network load, the CPU load, the disk load?  Have you
> at
> > >> least
> > >> > > installed Ganglia or some equivalent so that you can see what the
> > load
> > >> is
> > >> > > across the cluster?
> > >> > >
> > >> > > Dave
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > >> > > Sent: Friday, November 05, 2010 9:49 AM
> > >> > > To: user@hbase.apache.org
> > >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >> > >
> > >> > >
> > >> > > I don't think using the buffered client is going to help a lot w
> > >> > > performance.
> > >> > >
> > >> > > I'm a little confused because it doesn't sound like Shuja is using
> a
> > >> > > map/reduce job to parse the file.
> > >> > > That is... he says he parses the file in to a dom tree. Usually
> your
> > >> map
> > >> > > job parses each record and then in the mapper you parse out the
> > >> record.
> > >> > > Within the m/r job we don't parse out the fields in the records
> > >> because we
> > >> > > do additional processing which 'dedupes' the data so we don't have
> > to
> > >> > > further process the data.
> > >> > > The second job only has to parse a portion of the original
> records.
> > >> > >
> > >> > > So assuming that Shuja is actually using a map reduce job, and
> each
> > >> xml
> > >> > > record is being parsed within the mapper() there are a couple of
> > >> things...
> > >> > > 1) Reduce the number of column families that you are using. (Each
> > >> column
> > >> > > family is written to a separate file)
> > >> > > 2) Set up the HTable instance in Mapper.setup()
> > >> > > 3) Switch to a different dom class (not all java classes are
> equal)
> > or
> > >> > > switch to Stax.
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > > From: buttler1@llnl.gov
> > >> > > > To: user@hbase.apache.org
> > >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >> > > >
> > >> > > > Have you tried turning off auto flush, and managing the flush in
> > >> your own
> > >> > > code (say every 1000 puts?)
> > >> > > > Dave
> > >> > > >
> > >> > > >
> > >> > > > -----Original Message-----
> > >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > >> > > > Sent: Friday, November 05, 2010 8:04 AM
> > >> > > > To: user@hbase.apache.org
> > >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > >> > > >
> > >> > > > Michael
> > >> > > >
> > >> > > > hum....so u are storing xml record in the hbase and in second
> job,
> > u
> > >> r
> > >> > > > parsing. but in my case i am parsing it also in first phase.
> what
> > i
> > >> do, i
> > >> > > > get xml file and i parse it using jdom and then putting data in
> > >> hbase. so
> > >> > > > parsing+putting both operations are in 1 phase and in mapper
> code.
> > >> > > >
> > >> > > > My actual problem is that after parsing file, i need to use put
> > >> statement
> > >> > > > millions of times and i think for each statement it connects to
> > >> hbase and
> > >> > > > then insert it and this might be the reason of slow processing.
> So
> > i
> > >> am
> > >> > > > trying to figure out some way we i can first buffer data and
> then
> > >> insert
> > >> > > in
> > >> > > > batch fashion. it means in one put statement, i can insert many
> > >> records
> > >> > > and
> > >> > > > i think if i do in this way then the process will be very fast.
> > >> > > >
> > >> > > > secondly what does it means? "we write the raw record in via a
> > >> single
> > >> > > put()
> > >> > > > so the map() method is a null writable."
> > >> > > >
> > >> > > > can u explain it more?
> > >> > > >
> > >> > > > Thanks
> > >> > > >
> > >> > > >
> > >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
> > >> michael_segel@hotmail.com
> > >> > > >wrote:
> > >> > > >
> > >> > > > >
> > >> > > > > Suja,
> > >> > > > >
> > >> > > > > Just did a quick glance.
> > >> > > > >
> > >> > > > > What is it that you want to do exactly?
> > >> > > > >
> > >> > > > > Here's how we do it... (at a high level.)
> > >> > > > >
> > >> > > > > Input is an XML file where we want to store the raw XML
> records
> > in
> > >> > > hbase,
> > >> > > > > one record per row.
> > >> > > > >
> > >> > > > > Instead of using the output of the map() method, we write the
> > raw
> > >> > > record in
> > >> > > > > via a single put() so the map() method is a null writable.
> > >> > > > >
> > >> > > > > Its pretty fast. However fast is relative.
> > >> > > > >
> > >> > > > > Another thing... we store the xml record as a string
> (converted
> > to
> > >> > > > > bytecode) rather than a serialized object.
> > >> > > > >
> > >> > > > > Then you can break it down in to individual fields in a second
> > >> batch
> > >> > > job.
> > >> > > > > (You can start with a DOM parser, and later move to a Stax
> > parser.
> > >> > > > > Depending on which DOM parser you have and the size of the
> > record,
> > >> it
> > >> > > should
> > >> > > > > be 'fast enough'. A good implementation of Stax tends to be
> > >> > > > > recursive/re-entrant code which is harder to maintain.)
> > >> > > > >
> > >> > > > > HTH
> > >> > > > >
> > >> > > > > -Mike
> > >> > > > >
> > >> > > > >
> > >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > >> > > > > > From: shujamughal@gmail.com
> > >> > > > > > To: user@hbase.apache.org
> > >> > > > > >
> > >> > > > > > Hi
> > >> > > > > >
> > >> > > > > > I am reading data from raw xml files and inserting data into
> > >> hbase
> > >> > > using
> > >> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
> > >> > > statements,
> > >> > > > > it
> > >> > > > > > takes many hours to process the data. here is my sample
> code.
> > >> > > > > >
> > >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > >> > > > > >     conf.set("xmlinput.start", "<adc>");
> > >> > > > > >     conf.set("xmlinput.end", "</adc>");
> > >> > > > > >     conf
> > >> > > > > >         .set(
> > >> > > > > >           "io.serializations",
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > >
> > >>
> >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > >> > > > > >
> > >> > > > > >       Job job = new Job(conf, "Populate Table with Data");
> > >> > > > > >
> > >> > > > > >     FileInputFormat.setInputPaths(job, input);
> > >> > > > > >     job.setJarByClass(ParserDriver.class);
> > >> > > > > >     job.setMapperClass(MyParserMapper.class);
> > >> > > > > >     job.setNumReduceTasks(0);
> > >> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
> > >> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > *and mapper code*
> > >> > > > > >
> > >> > > > > > public class MyParserMapper   extends
> > >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >> > > > > >
> > >> > > > > >     @Override
> > >> > > > > >     public void map(LongWritable key, Text value1,Context
> > >> context)
> > >> > > > > >
> > >> > > > > > throws IOException, InterruptedException {
> > >> > > > > > *//doing some processing*
> > >> > > > > >  while(rItr.hasNext())
> > >> > > > > >                     {
> > >> > > > > > *                   //and this put statement runs for
> > >> 132,622,560
> > >> > > times
> > >> > > > > to
> > >> > > > > > insert the data.*
> > >> > > > > >                     context.write(NullWritable.get(), new
> > >> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > >> > > > > > Bytes.toBytes(counter.toString()),
> > >> > > > > Bytes.toBytes(rElement.getTextTrim())));
> > >> > > > > >
> > >> > > > > >                     }
> > >> > > > > >
> > >> > > > > > }}
> > >> > > > > >
> > >> > > > > > Is there any other way of doing this task so i can improve
> the
> > >> > > > > performance?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Regards
> > >> > > > > > Shuja-ur-Rehman Baig
> > >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Regards
> > >> > > > Shuja-ur-Rehman Baig
> > >> > > > <http://pk.linkedin.com/in/shujamughal>
> > >> > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Regards
> > >> > Shuja-ur-Rehman Baig
> > >> > <http://pk.linkedin.com/in/shujamughal>
> > >>
> > >>
> > >
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> > >
> > >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
> >
>



-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by Michael Segel <mi...@hotmail.com>.
OK,

I responded to a different question ...

You don't need to use a pool in this case. 
In the setup() method you can create a single instance of HTable and then use it in your map() method.
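
A minimal sketch of that, reusing the Constants and imports from the skeleton quoted below (the table name is illustrative; the point is one plain HTable per mapper, created once and flushed/closed at the end):

private HTable table;

@Override
protected void setup(Context context) throws IOException {
    HBaseConfiguration config = new HBaseConfiguration();
    config.set("hbase.zookeeper.quorum", Constants.HBASE_ZOOKEEPER_QUORUM);
    config.set("hbase.zookeeper.property.clientPort",
        Constants.HBASE_ZOOKEEPRR_PROPERTY_CLIENTPORT);
    table = new HTable(config, "myTable");   // single instance, no HTablePool
    table.setAutoFlush(false);               // still let the client buffer the puts
}

@Override
protected void cleanup(Context context) throws IOException {
    table.flushCommits();                    // flush the buffered puts
    table.close();
}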

Also I'd stay away from not writing to WAL.

Having said that... yes writing to WAL means you incur some overhead when compared to *not* writing to WAL.

However this is a map/reduce job and stopping the writing to WAL is probably one of the last things I would do to improve performance.
The reason is that if you don't write to WAL you risk losing data. Thinking down the road... couldn't you use the WAL to do log shipping to a different cluster/cloud?

But I digress. 

Turning off the WAL increases your potential risk of data loss if something goes wrong. There are a lot of options that could improve performance, including a design change, that could get you to a point where the batch process runs 'fast enough'.

And that's a crucial point.

How fast do you want to go and how much are you willing to spend?

To give you an extreme example... suppose I have a job that takes 2 hours to run. I figure I can shave 10 minutes off the job but it would take 40 hours of work. Does it make sense to do this work if my SLA (Service Level Agreement) with my users is that the job has to run within 3 hours?

The point is that my job runs 'fast enough' that I can't justify the hours required to get a marginal performance improvement. Of course I may still want to do the work if it means improving the quality of code and reducing my ongoing maintenance costs... but that's a different argument.

Specific to your example... switching from a DOM to a StAX implementation will do more to improve your performance and memory footprint.

JMHO... HTH

-Mike


> Date: Tue, 9 Nov 2010 20:46:02 +0500
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> Hi Oleg,
> 
> Yes, I have used HTablePool. Here is my basic code skeleton
> 
>  public void setup(Context context)
>     {
>      HBaseConfiguration config = new HBaseConfiguration();
>     config.set("hbase.zookeeper.quorum", Constants.HBASE_ZOOKEEPER_QUORUM);
> 
> config.set("hbase.zookeeper.property.clientPort",Constants.HBASE_ZOOKEEPRR_PROPERTY_CLIENTPORT);
>              HTablePool tablePool = new HTablePool(config, 50);
>             table = (HTable) tablePool.getTable("myTable");
>     }
> 
>  public void map(LongWritable key, Text value1,Context context)
> {
> 
> List<Put> puts = new ArrayList<Put>();
> table.setWriteBufferSize(1024*1024*24);
> table.setAutoFlush(false);
>  while(true)
>       {
>        Put put = new Put(rowId);
>                      put.add(;
>                      put.setWriteToWAL(false);
>                      puts.add(put);
>       }
> if(cnt % 500 == 0 )
>          {
>           table.getWriteBuffer().addAll(puts);
>            table.flushCommits();
>              puts.clear();
>            }//*/
>         cnt++;
> }//while
> 
> if(puts.size()>0 )
>              {
>              table.getWriteBuffer().addAll(puts);
>              table.flushCommits();
>              puts.clear();
>              }//*/
> 
> }//map
> 
> 
> On Tue, Nov 9, 2010 at 3:32 PM, Oleg Ruchovets <or...@gmail.com> wrote:
> 
> > Hi ,
> > Do you use HTablePool?
> > Changing the code to using HBasePool gives  me significat performance
> > benefit.
> >
> >
> > HBaseConfiguration conf = new HBaseConfiguration();
> > HTablePool pool = new HTablePool(conf, 10);
> > HTable table = pool.getTable(name);
> >
> > Actually disabling WAL ,
> > Increasing pool size and rewriting code to using WriteBuffer
> > Gives me a good improvement.
> >
> What does mean by *rewriting code to using WriteBuffer?*
> 
> 
> 
> > I wonder : how can I check that my insertion process is optimized.
> >  I mean if insertion took X time -- is it good or no? and how can I check
> > it.
> >
> *I am also not sure about it.*
> 
> >
> > Thanks Oleg.
> >
> >
> >
> > On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <sh...@gmail.com>
> > wrote:
> >
> > > One more thing which i want to ask that i have found that people have
> > given
> > > the following buffer size.
> > >
> > >  table.setWriteBufferSize(1024*1024*24);
> > >  table.setAutoFlush(false);
> > >
> > > Is there any specific reason of giving such buffer size? and how much ram
> > > is
> > > required for it. I have given 4 GB to each region server and I can see
> > that
> > > used heap value for region server going increasing and increasing and
> > > region
> > > servers are crashing then.
> > >
> > > On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <sh...@gmail.com>
> > > wrote:
> > >
> > > > Ok
> > > > Well...i am getting hundred of files daily which all need to process
> > > thats
> > > > why i am using hadoop so it manage distribution of processing itself.
> > > > Yes, one record has millions of fields
> > > >
> > > > Thanks for comments.
> > > >
> > > >
> > > > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <
> > michael_segel@hotmail.com
> > > >wrote:
> > > >
> > > >>
> > > >> Switch out the JDOM for a Stax parser.
> > > >>
> > > >> Ok, having said that...
> > > >> You said you have a single record per file. Ok that means you have a
> > lot
> > > >> of fields.
> > > >> Because you have 1 record, this isn't a map/reduce problem. You're
> > > better
> > > >> off writing a single threaded app
> > > >> to read the file, parse the file using Stax, and then write the fields
> > > to
> > > >> HBase.
> > > >>
> > > >> I'm not sure why you have millions of put()s.
> > > >> Do you have millions of fields in this one record?
> > > >>
> > > >> Writing a good stax parser and then mapping the fields to your hbase
> > > >> column(s) will help.
> > > >>
> > > >> HTH
> > > >>
> > > >> -Mike
> > > >> PS. A good stax implementation would be a recursive/re-entrant piece
> > of
> > > >> code.
> > > >> While the code may look simple, it takes a skilled developer to write
> > > and
> > > >> maintain.
> > > >>
> > > >>
> > > >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> > > >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > > >> > From: shujamughal@gmail.com
> > > >> > To: user@hbase.apache.org
> > > >> >
> > > >> > HI
> > > >> >
> > > >> > I have used JDOM library to parse the xml in mapper and in my case,
> > > one
> > > >> > single file consist of 1 record so i give one complete file to map
> > > >> process
> > > >> > and extract the information from it which i need. I have only 2
> > column
> > > >> > families in my schema and bottleneck was the put statements which
> > run
> > > >> > millions of time for each file. when i comment this put statement
> > then
> > > >> job
> > > >> > complete within minutes but with put statement, it was taking about
> > 7
> > > >> hours
> > > >> > to complete the same job. Anyhow I have changed the code according
> > to
> > > >> > suggestion given by Michael  and now using java api to dump data
> > > instead
> > > >> of
> > > >> > table output format and used the list of puts and then flush them at
> > > >> each
> > > >> > 1000 records and it reduces the time significantly. Now the whole
> > job
> > > >> > process by 1 hour and 45 min approx but still not in minutes. So is
> > > >> there
> > > >> > anything left which i might apply and performance increase?
> > > >> >
> > > >> > Thanks
> > > >> >
> > > >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
> > > >> wrote:
> > > >> >
> > > >> > > Good points.
> > > >> > > Before we can make any rational suggestion, we need to know where
> > > the
> > > >> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> > > >> > > personally favor Michael's suggestion to split the ingest and the
> > > >> parsing
> > > >> > > parts of your job, and to switch to a parser that is faster than a
> > > DOM
> > > >> > > parser (SAX or Stax). But, without knowing what the bottleneck
> > > >> actually is,
> > > >> > > all of these suggestions are shots in the dark.
> > > >> > >
> > > >> > > What is the network load, the CPU load, the disk load?  Have you
> > at
> > > >> least
> > > >> > > installed Ganglia or some equivalent so that you can see what the
> > > load
> > > >> is
> > > >> > > across the cluster?
> > > >> > >
> > > >> > > Dave
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > > >> > > Sent: Friday, November 05, 2010 9:49 AM
> > > >> > > To: user@hbase.apache.org
> > > >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > > >> > >
> > > >> > >
> > > >> > > I don't think using the buffered client is going to help a lot w
> > > >> > > performance.
> > > >> > >
> > > >> > > I'm a little confused because it doesn't sound like Shuja is using
> > a
> > > >> > > map/reduce job to parse the file.
> > > >> > > That is... he says he parses the file in to a dom tree. Usually
> > your
> > > >> map
> > > >> > > job parses each record and then in the mapper you parse out the
> > > >> record.
> > > >> > > Within the m/r job we don't parse out the fields in the records
> > > >> because we
> > > >> > > do additional processing which 'dedupes' the data so we don't have
> > > to
> > > >> > > further process the data.
> > > >> > > The second job only has to parse a portion of the original
> > records.
> > > >> > >
> > > >> > > So assuming that Shuja is actually using a map reduce job, and
> > each
> > > >> xml
> > > >> > > record is being parsed within the mapper() there are a couple of
> > > >> things...
> > > >> > > 1) Reduce the number of column families that you are using. (Each
> > > >> column
> > > >> > > family is written to a separate file)
> > > >> > > 2) Set up the HTable instance in Mapper.setup()
> > > >> > > 3) Switch to a different dom class (not all java classes are
> > equal)
> > > or
> > > >> > > switch to Stax.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > > From: buttler1@llnl.gov
> > > >> > > > To: user@hbase.apache.org
> > > >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > > >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > > >> > > >
> > > >> > > > Have you tried turning off auto flush, and managing the flush in
> > > >> your own
> > > >> > > code (say every 1000 puts?)
> > > >> > > > Dave
> > > >> > > >
> > > >> > > >
> > > >> > > > -----Original Message-----
> > > >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > > >> > > > Sent: Friday, November 05, 2010 8:04 AM
> > > >> > > > To: user@hbase.apache.org
> > > >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > > >> > > >
> > > >> > > > Michael
> > > >> > > >
> > > >> > > > hum....so u are storing xml record in the hbase and in second
> > job,
> > > u
> > > >> r
> > > >> > > > parsing. but in my case i am parsing it also in first phase.
> > what
> > > i
> > > >> do, i
> > > >> > > > get xml file and i parse it using jdom and then putting data in
> > > >> hbase. so
> > > >> > > > parsing+putting both operations are in 1 phase and in mapper
> > code.
> > > >> > > >
> > > >> > > > My actual problem is that after parsing file, i need to use put
> > > >> statement
> > > >> > > > millions of times and i think for each statement it connects to
> > > >> hbase and
> > > >> > > > then insert it and this might be the reason of slow processing.
> > So
> > > i
> > > >> am
> > > >> > > > trying to figure out some way we i can first buffer data and
> > then
> > > >> insert
> > > >> > > in
> > > >> > > > batch fashion. it means in one put statement, i can insert many
> > > >> records
> > > >> > > and
> > > >> > > > i think if i do in this way then the process will be very fast.
> > > >> > > >
> > > >> > > > secondly what does it means? "we write the raw record in via a
> > > >> single
> > > >> > > put()
> > > >> > > > so the map() method is a null writable."
> > > >> > > >
> > > >> > > > can u explain it more?
> > > >> > > >
> > > >> > > > Thanks
> > > >> > > >
> > > >> > > >
> > > >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
> > > >> michael_segel@hotmail.com
> > > >> > > >wrote:
> > > >> > > >
> > > >> > > > >
> > > >> > > > > Suja,
> > > >> > > > >
> > > >> > > > > Just did a quick glance.
> > > >> > > > >
> > > >> > > > > What is it that you want to do exactly?
> > > >> > > > >
> > > >> > > > > Here's how we do it... (at a high level.)
> > > >> > > > >
> > > >> > > > > Input is an XML file where we want to store the raw XML
> > records
> > > in
> > > >> > > hbase,
> > > >> > > > > one record per row.
> > > >> > > > >
> > > >> > > > > Instead of using the output of the map() method, we write the
> > > raw
> > > >> > > record in
> > > >> > > > > via a single put() so the map() method is a null writable.
> > > >> > > > >
> > > >> > > > > Its pretty fast. However fast is relative.
> > > >> > > > >
> > > >> > > > > Another thing... we store the xml record as a string
> > (converted
> > > to
> > > >> > > > > bytecode) rather than a serialized object.
> > > >> > > > >
> > > >> > > > > Then you can break it down in to individual fields in a second
> > > >> batch
> > > >> > > job.
> > > >> > > > > (You can start with a DOM parser, and later move to a Stax
> > > parser.
> > > >> > > > > Depending on which DOM parser you have and the size of the
> > > record,
> > > >> it
> > > >> > > should
> > > >> > > > > be 'fast enough'. A good implementation of Stax tends to be
> > > >> > > > > recursive/re-entrant code which is harder to maintain.)
> > > >> > > > >
> > > >> > > > > HTH
> > > >> > > > >
> > > >> > > > > -Mike
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > >> > > > > > From: shujamughal@gmail.com
> > > >> > > > > > To: user@hbase.apache.org
> > > >> > > > > >
> > > >> > > > > > Hi
> > > >> > > > > >
> > > >> > > > > > I am reading data from raw xml files and inserting data into
> > > >> hbase
> > > >> > > using
> > > >> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
> > > >> > > statements,
> > > >> > > > > it
> > > >> > > > > > takes many hours to process the data. here is my sample
> > code.
> > > >> > > > > >
> > > >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > >> > > > > >     conf.set("xmlinput.start", "<adc>");
> > > >> > > > > >     conf.set("xmlinput.end", "</adc>");
> > > >> > > > > >     conf
> > > >> > > > > >         .set(
> > > >> > > > > >           "io.serializations",
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > >
> > > >>
> > >
> > "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > >> > > > > >
> > > >> > > > > >       Job job = new Job(conf, "Populate Table with Data");
> > > >> > > > > >
> > > >> > > > > >     FileInputFormat.setInputPaths(job, input);
> > > >> > > > > >     job.setJarByClass(ParserDriver.class);
> > > >> > > > > >     job.setMapperClass(MyParserMapper.class);
> > > >> > > > > >     job.setNumReduceTasks(0);
> > > >> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
> > > >> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > *and mapper code*
> > > >> > > > > >
> > > >> > > > > > public class MyParserMapper   extends
> > > >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > >> > > > > >
> > > >> > > > > >     @Override
> > > >> > > > > >     public void map(LongWritable key, Text value1,Context
> > > >> context)
> > > >> > > > > >
> > > >> > > > > > throws IOException, InterruptedException {
> > > >> > > > > > *//doing some processing*
> > > >> > > > > >  while(rItr.hasNext())
> > > >> > > > > >                     {
> > > >> > > > > > *                   //and this put statement runs for
> > > >> 132,622,560
> > > >> > > times
> > > >> > > > > to
> > > >> > > > > > insert the data.*
> > > >> > > > > >                     context.write(NullWritable.get(), new
> > > >> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > >> > > > > > Bytes.toBytes(counter.toString()),
> > > >> > > > > Bytes.toBytes(rElement.getTextTrim())));
> > > >> > > > > >
> > > >> > > > > >                     }
> > > >> > > > > >
> > > >> > > > > > }}
> > > >> > > > > >
> > > >> > > > > > Is there any other way of doing this task so i can improve
> > the
> > > >> > > > > performance?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Regards
> > > >> > > > > > Shuja-ur-Rehman Baig
> > > >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Regards
> > > >> > > > Shuja-ur-Rehman Baig
> > > >> > > > <http://pk.linkedin.com/in/shujamughal>
> > > >> > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Regards
> > > >> > Shuja-ur-Rehman Baig
> > > >> > <http://pk.linkedin.com/in/shujamughal>
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Regards
> > > > Shuja-ur-Rehman Baig
> > > > <http://pk.linkedin.com/in/shujamughal>
> > > >
> > > >
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> > >
> >
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
 		 	   		  

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
Hi Oleg,

Yes, I have used HTablePool. Here is my basic code skeleton

public void setup(Context context) {
    HBaseConfiguration config = new HBaseConfiguration();
    config.set("hbase.zookeeper.quorum", Constants.HBASE_ZOOKEEPER_QUORUM);
    config.set("hbase.zookeeper.property.clientPort",
        Constants.HBASE_ZOOKEEPRR_PROPERTY_CLIENTPORT);
    HTablePool tablePool = new HTablePool(config, 50);
    table = (HTable) tablePool.getTable("myTable");
}

public void map(LongWritable key, Text value1, Context context) {

    List<Put> puts = new ArrayList<Put>();
    table.setWriteBufferSize(1024 * 1024 * 24);
    table.setAutoFlush(false);

    while (true) {                     // loop over the parsed records (termination elided)
        Put put = new Put(rowId);
        put.add( ... );                // family/qualifier/value elided
        put.setWriteToWAL(false);
        puts.add(put);

        if (cnt % 500 == 0) {          // flush a batch every 500 puts
            table.getWriteBuffer().addAll(puts);
            table.flushCommits();
            puts.clear();
        }
        cnt++;
    } // while

    if (puts.size() > 0) {             // flush whatever is left
        table.getWriteBuffer().addAll(puts);
        table.flushCommits();
        puts.clear();
    }
} // map


On Tue, Nov 9, 2010 at 3:32 PM, Oleg Ruchovets <or...@gmail.com> wrote:

> Hi ,
> Do you use HTablePool?
> Changing the code to using HBasePool gives  me significat performance
> benefit.
>
>
> HBaseConfiguration conf = new HBaseConfiguration();
> HTablePool pool = new HTablePool(conf, 10);
> HTable table = pool.getTable(name);
>
> Actually disabling WAL ,
> Increasing pool size and rewriting code to using WriteBuffer
> Gives me a good improvement.
>
What do you mean by *rewriting code to use WriteBuffer*?



> I wonder : how can I check that my insertion process is optimized.
>  I mean if insertion took X time -- is it good or no? and how can I check
> it.
>
*I am also not sure about it.*

>
> Thanks Oleg.
>
>
>
> On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <sh...@gmail.com>
> wrote:
>
> > One more thing which i want to ask that i have found that people have
> given
> > the following buffer size.
> >
> >  table.setWriteBufferSize(1024*1024*24);
> >  table.setAutoFlush(false);
> >
> > Is there any specific reason of giving such buffer size? and how much ram
> > is
> > required for it. I have given 4 GB to each region server and I can see
> that
> > used heap value for region server going increasing and increasing and
> > region
> > servers are crashing then.
> >
> > On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <sh...@gmail.com>
> > wrote:
> >
> > > Ok
> > > Well...i am getting hundred of files daily which all need to process
> > thats
> > > why i am using hadoop so it manage distribution of processing itself.
> > > Yes, one record has millions of fields
> > >
> > > Thanks for comments.
> > >
> > >
> > > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <
> michael_segel@hotmail.com
> > >wrote:
> > >
> > >>
> > >> Switch out the JDOM for a Stax parser.
> > >>
> > >> Ok, having said that...
> > >> You said you have a single record per file. Ok that means you have a
> lot
> > >> of fields.
> > >> Because you have 1 record, this isn't a map/reduce problem. You're
> > better
> > >> off writing a single threaded app
> > >> to read the file, parse the file using Stax, and then write the fields
> > to
> > >> HBase.
> > >>
> > >> I'm not sure why you have millions of put()s.
> > >> Do you have millions of fields in this one record?
> > >>
> > >> Writing a good stax parser and then mapping the fields to your hbase
> > >> column(s) will help.
> > >>
> > >> HTH
> > >>
> > >> -Mike
> > >> PS. A good stax implementation would be a recursive/re-entrant piece
> of
> > >> code.
> > >> While the code may look simple, it takes a skilled developer to write
> > and
> > >> maintain.
> > >>
> > >>
> > >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> > >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > >> > From: shujamughal@gmail.com
> > >> > To: user@hbase.apache.org
> > >> >
> > >> > HI
> > >> >
> > >> > I have used JDOM library to parse the xml in mapper and in my case,
> > one
> > >> > single file consist of 1 record so i give one complete file to map
> > >> process
> > >> > and extract the information from it which i need. I have only 2
> column
> > >> > families in my schema and bottleneck was the put statements which
> run
> > >> > millions of time for each file. when i comment this put statement
> then
> > >> job
> > >> > complete within minutes but with put statement, it was taking about
> 7
> > >> hours
> > >> > to complete the same job. Anyhow I have changed the code according
> to
> > >> > suggestion given by Michael  and now using java api to dump data
> > instead
> > >> of
> > >> > table output format and used the list of puts and then flush them at
> > >> each
> > >> > 1000 records and it reduces the time significantly. Now the whole
> job
> > >> > process by 1 hour and 45 min approx but still not in minutes. So is
> > >> there
> > >> > anything left which i might apply and performance increase?
> > >> >
> > >> > Thanks
> > >> >
> > >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
> > >> wrote:
> > >> >
> > >> > > Good points.
> > >> > > Before we can make any rational suggestion, we need to know where
> > the
> > >> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> > >> > > personally favor Michael's suggestion to split the ingest and the
> > >> parsing
> > >> > > parts of your job, and to switch to a parser that is faster than a
> > DOM
> > >> > > parser (SAX or Stax). But, without knowing what the bottleneck
> > >> actually is,
> > >> > > all of these suggestions are shots in the dark.
> > >> > >
> > >> > > What is the network load, the CPU load, the disk load?  Have you
> at
> > >> least
> > >> > > installed Ganglia or some equivalent so that you can see what the
> > load
> > >> is
> > >> > > across the cluster?
> > >> > >
> > >> > > Dave
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > >> > > Sent: Friday, November 05, 2010 9:49 AM
> > >> > > To: user@hbase.apache.org
> > >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >> > >
> > >> > >
> > >> > > I don't think using the buffered client is going to help a lot w
> > >> > > performance.
> > >> > >
> > >> > > I'm a little confused because it doesn't sound like Shuja is using
> a
> > >> > > map/reduce job to parse the file.
> > >> > > That is... he says he parses the file in to a dom tree. Usually
> your
> > >> map
> > >> > > job parses each record and then in the mapper you parse out the
> > >> record.
> > >> > > Within the m/r job we don't parse out the fields in the records
> > >> because we
> > >> > > do additional processing which 'dedupes' the data so we don't have
> > to
> > >> > > further process the data.
> > >> > > The second job only has to parse a portion of the original
> records.
> > >> > >
> > >> > > So assuming that Shuja is actually using a map reduce job, and
> each
> > >> xml
> > >> > > record is being parsed within the mapper() there are a couple of
> > >> things...
> > >> > > 1) Reduce the number of column families that you are using. (Each
> > >> column
> > >> > > family is written to a separate file)
> > >> > > 2) Set up the HTable instance in Mapper.setup()
> > >> > > 3) Switch to a different dom class (not all java classes are
> equal)
> > or
> > >> > > switch to Stax.
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > > From: buttler1@llnl.gov
> > >> > > > To: user@hbase.apache.org
> > >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >> > > >
> > >> > > > Have you tried turning off auto flush, and managing the flush in
> > >> your own
> > >> > > code (say every 1000 puts?)
> > >> > > > Dave
> > >> > > >
> > >> > > >
> > >> > > > -----Original Message-----
> > >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > >> > > > Sent: Friday, November 05, 2010 8:04 AM
> > >> > > > To: user@hbase.apache.org
> > >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > >> > > >
> > >> > > > Michael
> > >> > > >
> > >> > > > hum....so u are storing xml record in the hbase and in second
> job,
> > u
> > >> r
> > >> > > > parsing. but in my case i am parsing it also in first phase.
> what
> > i
> > >> do, i
> > >> > > > get xml file and i parse it using jdom and then putting data in
> > >> hbase. so
> > >> > > > parsing+putting both operations are in 1 phase and in mapper
> code.
> > >> > > >
> > >> > > > My actual problem is that after parsing file, i need to use put
> > >> statement
> > >> > > > millions of times and i think for each statement it connects to
> > >> hbase and
> > >> > > > then insert it and this might be the reason of slow processing.
> So
> > i
> > >> am
> > >> > > > trying to figure out some way we i can first buffer data and
> then
> > >> insert
> > >> > > in
> > >> > > > batch fashion. it means in one put statement, i can insert many
> > >> records
> > >> > > and
> > >> > > > i think if i do in this way then the process will be very fast.
> > >> > > >
> > >> > > > secondly what does it means? "we write the raw record in via a
> > >> single
> > >> > > put()
> > >> > > > so the map() method is a null writable."
> > >> > > >
> > >> > > > can u explain it more?
> > >> > > >
> > >> > > > Thanks
> > >> > > >
> > >> > > >
> > >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
> > >> michael_segel@hotmail.com
> > >> > > >wrote:
> > >> > > >
> > >> > > > >
> > >> > > > > Suja,
> > >> > > > >
> > >> > > > > Just did a quick glance.
> > >> > > > >
> > >> > > > > What is it that you want to do exactly?
> > >> > > > >
> > >> > > > > Here's how we do it... (at a high level.)
> > >> > > > >
> > >> > > > > Input is an XML file where we want to store the raw XML
> records
> > in
> > >> > > hbase,
> > >> > > > > one record per row.
> > >> > > > >
> > >> > > > > Instead of using the output of the map() method, we write the
> > raw
> > >> > > record in
> > >> > > > > via a single put() so the map() method is a null writable.
> > >> > > > >
> > >> > > > > Its pretty fast. However fast is relative.
> > >> > > > >
> > >> > > > > Another thing... we store the xml record as a string
> (converted
> > to
> > >> > > > > bytecode) rather than a serialized object.
> > >> > > > >
> > >> > > > > Then you can break it down in to individual fields in a second
> > >> batch
> > >> > > job.
> > >> > > > > (You can start with a DOM parser, and later move to a Stax
> > parser.
> > >> > > > > Depending on which DOM parser you have and the size of the
> > record,
> > >> it
> > >> > > should
> > >> > > > > be 'fast enough'. A good implementation of Stax tends to be
> > >> > > > > recursive/re-entrant code which is harder to maintain.)
> > >> > > > >
> > >> > > > > HTH
> > >> > > > >
> > >> > > > > -Mike
> > >> > > > >
> > >> > > > >
> > >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > >> > > > > > From: shujamughal@gmail.com
> > >> > > > > > To: user@hbase.apache.org
> > >> > > > > >
> > >> > > > > > Hi
> > >> > > > > >
> > >> > > > > > I am reading data from raw xml files and inserting data into
> > >> hbase
> > >> > > using
> > >> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
> > >> > > statements,
> > >> > > > > it
> > >> > > > > > takes many hours to process the data. here is my sample
> code.
> > >> > > > > >
> > >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > >> > > > > >     conf.set("xmlinput.start", "<adc>");
> > >> > > > > >     conf.set("xmlinput.end", "</adc>");
> > >> > > > > >     conf
> > >> > > > > >         .set(
> > >> > > > > >           "io.serializations",
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > >
> > >>
> >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > >> > > > > >
> > >> > > > > >       Job job = new Job(conf, "Populate Table with Data");
> > >> > > > > >
> > >> > > > > >     FileInputFormat.setInputPaths(job, input);
> > >> > > > > >     job.setJarByClass(ParserDriver.class);
> > >> > > > > >     job.setMapperClass(MyParserMapper.class);
> > >> > > > > >     job.setNumReduceTasks(0);
> > >> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
> > >> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > *and mapper code*
> > >> > > > > >
> > >> > > > > > public class MyParserMapper   extends
> > >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >> > > > > >
> > >> > > > > >     @Override
> > >> > > > > >     public void map(LongWritable key, Text value1,Context
> > >> context)
> > >> > > > > >
> > >> > > > > > throws IOException, InterruptedException {
> > >> > > > > > *//doing some processing*
> > >> > > > > >  while(rItr.hasNext())
> > >> > > > > >                     {
> > >> > > > > > *                   //and this put statement runs for
> > >> 132,622,560
> > >> > > times
> > >> > > > > to
> > >> > > > > > insert the data.*
> > >> > > > > >                     context.write(NullWritable.get(), new
> > >> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > >> > > > > > Bytes.toBytes(counter.toString()),
> > >> > > > > Bytes.toBytes(rElement.getTextTrim())));
> > >> > > > > >
> > >> > > > > >                     }
> > >> > > > > >
> > >> > > > > > }}
> > >> > > > > >
> > >> > > > > > Is there any other way of doing this task so i can improve
> the
> > >> > > > > performance?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Regards
> > >> > > > > > Shuja-ur-Rehman Baig
> > >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Regards
> > >> > > > Shuja-ur-Rehman Baig
> > >> > > > <http://pk.linkedin.com/in/shujamughal>
> > >> > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Regards
> > >> > Shuja-ur-Rehman Baig
> > >> > <http://pk.linkedin.com/in/shujamughal>
> > >>
> > >>
> > >
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> > >
> > >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
> >
>



-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Oleg Ruchovets <or...@gmail.com>.
Hi,
Do you use HTablePool?
Changing the code to use HTablePool gives me a significant performance
benefit.


HBaseConfiguration conf = new HBaseConfiguration();
HTablePool pool = new HTablePool(conf, 10);
HTable table = pool.getTable(name);

Disabling the WAL, increasing the pool size, and rewriting the code to use
the write buffer also gave me a good improvement.
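
Roughly, the write path now looks like the sketch below. It is only a
simplified illustration: the class name, table name, column family, row keys
and values are placeholders, and it assumes a client version that exposes
Put.setWriteToWAL (skipping the WAL means edits are lost if a region server
dies before a flush).

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PooledLoader {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTablePool pool = new HTablePool(conf, 10);       // pool of up to 10 HTable instances
    HTable table = pool.getTable("mytable");          // placeholder table name

    table.setAutoFlush(false);                        // buffer puts on the client side
    table.setWriteBufferSize(1024 * 1024 * 12);       // e.g. a 12 MB client-side write buffer

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));   // placeholder row key
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q" + i), Bytes.toBytes("v" + i));
      put.setWriteToWAL(false);                       // skip the WAL: faster, but not durable
      table.put(put);                                 // goes into the write buffer, not one RPC per put
    }

    table.flushCommits();                             // push whatever is still buffered
    pool.putTable(table);                             // hand the instance back to the pool
  }
}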

I wonder: how can I check whether my insertion process is optimized? I mean,
if insertion took X time, is that good or not, and how can I tell?

Thanks Oleg.



On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <sh...@gmail.com> wrote:

> One more thing which i want to ask that i have found that people have given
> the following buffer size.
>
>  table.setWriteBufferSize(1024*1024*24);
>  table.setAutoFlush(false);
>
> Is there any specific reason of giving such buffer size? and how much ram
> is
> required for it. I have given 4 GB to each region server and I can see that
> used heap value for region server going increasing and increasing and
> region
> servers are crashing then.
>
> On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <sh...@gmail.com>
> wrote:
>
> > Ok
> > Well...i am getting hundred of files daily which all need to process
> thats
> > why i am using hadoop so it manage distribution of processing itself.
> > Yes, one record has millions of fields
> >
> > Thanks for comments.
> >
> >
> > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <michael_segel@hotmail.com
> >wrote:
> >
> >>
> >> Switch out the JDOM for a Stax parser.
> >>
> >> Ok, having said that...
> >> You said you have a single record per file. Ok that means you have a lot
> >> of fields.
> >> Because you have 1 record, this isn't a map/reduce problem. You're
> better
> >> off writing a single threaded app
> >> to read the file, parse the file using Stax, and then write the fields
> to
> >> HBase.
> >>
> >> I'm not sure why you have millions of put()s.
> >> Do you have millions of fields in this one record?
> >>
> >> Writing a good stax parser and then mapping the fields to your hbase
> >> column(s) will help.
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. A good stax implementation would be a recursive/re-entrant piece of
> >> code.
> >> While the code may look simple, it takes a skilled developer to write
> and
> >> maintain.
> >>
> >>
> >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > From: shujamughal@gmail.com
> >> > To: user@hbase.apache.org
> >> >
> >> > HI
> >> >
> >> > I have used JDOM library to parse the xml in mapper and in my case,
> one
> >> > single file consist of 1 record so i give one complete file to map
> >> process
> >> > and extract the information from it which i need. I have only 2 column
> >> > families in my schema and bottleneck was the put statements which run
> >> > millions of time for each file. when i comment this put statement then
> >> job
> >> > complete within minutes but with put statement, it was taking about 7
> >> hours
> >> > to complete the same job. Anyhow I have changed the code according to
> >> > suggestion given by Michael  and now using java api to dump data
> instead
> >> of
> >> > table output format and used the list of puts and then flush them at
> >> each
> >> > 1000 records and it reduces the time significantly. Now the whole job
> >> > process by 1 hour and 45 min approx but still not in minutes. So is
> >> there
> >> > anything left which i might apply and performance increase?
> >> >
> >> > Thanks
> >> >
> >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
> >> wrote:
> >> >
> >> > > Good points.
> >> > > Before we can make any rational suggestion, we need to know where
> the
> >> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> >> > > personally favor Michael's suggestion to split the ingest and the
> >> parsing
> >> > > parts of your job, and to switch to a parser that is faster than a
> DOM
> >> > > parser (SAX or Stax). But, without knowing what the bottleneck
> >> actually is,
> >> > > all of these suggestions are shots in the dark.
> >> > >
> >> > > What is the network load, the CPU load, the disk load?  Have you at
> >> least
> >> > > installed Ganglia or some equivalent so that you can see what the
> load
> >> is
> >> > > across the cluster?
> >> > >
> >> > > Dave
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> >> > > Sent: Friday, November 05, 2010 9:49 AM
> >> > > To: user@hbase.apache.org
> >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > >
> >> > >
> >> > > I don't think using the buffered client is going to help a lot w
> >> > > performance.
> >> > >
> >> > > I'm a little confused because it doesn't sound like Shuja is using a
> >> > > map/reduce job to parse the file.
> >> > > That is... he says he parses the file in to a dom tree. Usually your
> >> map
> >> > > job parses each record and then in the mapper you parse out the
> >> record.
> >> > > Within the m/r job we don't parse out the fields in the records
> >> because we
> >> > > do additional processing which 'dedupes' the data so we don't have
> to
> >> > > further process the data.
> >> > > The second job only has to parse a portion of the original records.
> >> > >
> >> > > So assuming that Shuja is actually using a map reduce job, and each
> >> xml
> >> > > record is being parsed within the mapper() there are a couple of
> >> things...
> >> > > 1) Reduce the number of column families that you are using. (Each
> >> column
> >> > > family is written to a separate file)
> >> > > 2) Set up the HTable instance in Mapper.setup()
> >> > > 3) Switch to a different dom class (not all java classes are equal)
> or
> >> > > switch to Stax.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > > From: buttler1@llnl.gov
> >> > > > To: user@hbase.apache.org
> >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Have you tried turning off auto flush, and managing the flush in
> >> your own
> >> > > code (say every 1000 puts?)
> >> > > > Dave
> >> > > >
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> >> > > > Sent: Friday, November 05, 2010 8:04 AM
> >> > > > To: user@hbase.apache.org
> >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Michael
> >> > > >
> >> > > > hum....so u are storing xml record in the hbase and in second job,
> u
> >> r
> >> > > > parsing. but in my case i am parsing it also in first phase. what
> i
> >> do, i
> >> > > > get xml file and i parse it using jdom and then putting data in
> >> hbase. so
> >> > > > parsing+putting both operations are in 1 phase and in mapper code.
> >> > > >
> >> > > > My actual problem is that after parsing file, i need to use put
> >> statement
> >> > > > millions of times and i think for each statement it connects to
> >> hbase and
> >> > > > then insert it and this might be the reason of slow processing. So
> i
> >> am
> >> > > > trying to figure out some way we i can first buffer data and then
> >> insert
> >> > > in
> >> > > > batch fashion. it means in one put statement, i can insert many
> >> records
> >> > > and
> >> > > > i think if i do in this way then the process will be very fast.
> >> > > >
> >> > > > secondly what does it means? "we write the raw record in via a
> >> single
> >> > > put()
> >> > > > so the map() method is a null writable."
> >> > > >
> >> > > > can u explain it more?
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > >
> >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
> >> michael_segel@hotmail.com
> >> > > >wrote:
> >> > > >
> >> > > > >
> >> > > > > Suja,
> >> > > > >
> >> > > > > Just did a quick glance.
> >> > > > >
> >> > > > > What is it that you want to do exactly?
> >> > > > >
> >> > > > > Here's how we do it... (at a high level.)
> >> > > > >
> >> > > > > Input is an XML file where we want to store the raw XML records
> in
> >> > > hbase,
> >> > > > > one record per row.
> >> > > > >
> >> > > > > Instead of using the output of the map() method, we write the
> raw
> >> > > record in
> >> > > > > via a single put() so the map() method is a null writable.
> >> > > > >
> >> > > > > Its pretty fast. However fast is relative.
> >> > > > >
> >> > > > > Another thing... we store the xml record as a string (converted
> to
> >> > > > > bytecode) rather than a serialized object.
> >> > > > >
> >> > > > > Then you can break it down in to individual fields in a second
> >> batch
> >> > > job.
> >> > > > > (You can start with a DOM parser, and later move to a Stax
> parser.
> >> > > > > Depending on which DOM parser you have and the size of the
> record,
> >> it
> >> > > should
> >> > > > > be 'fast enough'. A good implementation of Stax tends to be
> >> > > > > recursive/re-entrant code which is harder to maintain.)
> >> > > > >
> >> > > > > HTH
> >> > > > >
> >> > > > > -Mike
> >> > > > >
> >> > > > >
> >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> >> > > > > > From: shujamughal@gmail.com
> >> > > > > > To: user@hbase.apache.org
> >> > > > > >
> >> > > > > > Hi
> >> > > > > >
> >> > > > > > I am reading data from raw xml files and inserting data into
> >> hbase
> >> > > using
> >> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
> >> > > statements,
> >> > > > > it
> >> > > > > > takes many hours to process the data. here is my sample code.
> >> > > > > >
> >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >> > > > > >     conf.set("xmlinput.start", "<adc>");
> >> > > > > >     conf.set("xmlinput.end", "</adc>");
> >> > > > > >     conf
> >> > > > > >         .set(
> >> > > > > >           "io.serializations",
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > >
> >>
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> >> > > > > >
> >> > > > > >       Job job = new Job(conf, "Populate Table with Data");
> >> > > > > >
> >> > > > > >     FileInputFormat.setInputPaths(job, input);
> >> > > > > >     job.setJarByClass(ParserDriver.class);
> >> > > > > >     job.setMapperClass(MyParserMapper.class);
> >> > > > > >     job.setNumReduceTasks(0);
> >> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
> >> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> >> > > > > >
> >> > > > > >
> >> > > > > > *and mapper code*
> >> > > > > >
> >> > > > > > public class MyParserMapper   extends
> >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> >> > > > > >
> >> > > > > >     @Override
> >> > > > > >     public void map(LongWritable key, Text value1,Context
> >> context)
> >> > > > > >
> >> > > > > > throws IOException, InterruptedException {
> >> > > > > > *//doing some processing*
> >> > > > > >  while(rItr.hasNext())
> >> > > > > >                     {
> >> > > > > > *                   //and this put statement runs for
> >> 132,622,560
> >> > > times
> >> > > > > to
> >> > > > > > insert the data.*
> >> > > > > >                     context.write(NullWritable.get(), new
> >> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> >> > > > > > Bytes.toBytes(counter.toString()),
> >> > > > > Bytes.toBytes(rElement.getTextTrim())));
> >> > > > > >
> >> > > > > >                     }
> >> > > > > >
> >> > > > > > }}
> >> > > > > >
> >> > > > > > Is there any other way of doing this task so i can improve the
> >> > > > > performance?
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Regards
> >> > > > > > Shuja-ur-Rehman Baig
> >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Regards
> >> > > > Shuja-ur-Rehman Baig
> >> > > > <http://pk.linkedin.com/in/shujamughal>
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Regards
> >> > Shuja-ur-Rehman Baig
> >> > <http://pk.linkedin.com/in/shujamughal>
> >>
> >>
> >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
> >
> >
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
>

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
One more thing I want to ask: I have found that people use the following
buffer settings.

  table.setWriteBufferSize(1024*1024*24);
  table.setAutoFlush(false);

Is there any specific reason for choosing that buffer size, and how much RAM
does it require? I have given 4 GB to each region server, and I can see the
used heap of the region servers climbing steadily until the region servers
crash.

On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <sh...@gmail.com> wrote:

> Ok
> Well...i am getting hundred of files daily which all need to process thats
> why i am using hadoop so it manage distribution of processing itself.
> Yes, one record has millions of fields
>
> Thanks for comments.
>
>
> On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>>
>> Switch out the JDOM for a Stax parser.
>>
>> Ok, having said that...
>> You said you have a single record per file. Ok that means you have a lot
>> of fields.
>> Because you have 1 record, this isn't a map/reduce problem. You're better
>> off writing a single threaded app
>> to read the file, parse the file using Stax, and then write the fields to
>> HBase.
>>
>> I'm not sure why you have millions of put()s.
>> Do you have millions of fields in this one record?
>>
>> Writing a good stax parser and then mapping the fields to your hbase
>> column(s) will help.
>>
>> HTH
>>
>> -Mike
>> PS. A good stax implementation would be a recursive/re-entrant piece of
>> code.
>> While the code may look simple, it takes a skilled developer to write and
>> maintain.
>>
>>
>> > Date: Mon, 8 Nov 2010 14:36:34 +0500
>> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>> > From: shujamughal@gmail.com
>> > To: user@hbase.apache.org
>> >
>> > HI
>> >
>> > I have used JDOM library to parse the xml in mapper and in my case, one
>> > single file consist of 1 record so i give one complete file to map
>> process
>> > and extract the information from it which i need. I have only 2 column
>> > families in my schema and bottleneck was the put statements which run
>> > millions of time for each file. when i comment this put statement then
>> job
>> > complete within minutes but with put statement, it was taking about 7
>> hours
>> > to complete the same job. Anyhow I have changed the code according to
>> > suggestion given by Michael  and now using java api to dump data instead
>> of
>> > table output format and used the list of puts and then flush them at
>> each
>> > 1000 records and it reduces the time significantly. Now the whole job
>> > process by 1 hour and 45 min approx but still not in minutes. So is
>> there
>> > anything left which i might apply and performance increase?
>> >
>> > Thanks
>> >
>> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
>> wrote:
>> >
>> > > Good points.
>> > > Before we can make any rational suggestion, we need to know where the
>> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
>> > > personally favor Michael's suggestion to split the ingest and the
>> parsing
>> > > parts of your job, and to switch to a parser that is faster than a DOM
>> > > parser (SAX or Stax). But, without knowing what the bottleneck
>> actually is,
>> > > all of these suggestions are shots in the dark.
>> > >
>> > > What is the network load, the CPU load, the disk load?  Have you at
>> least
>> > > installed Ganglia or some equivalent so that you can see what the load
>> is
>> > > across the cluster?
>> > >
>> > > Dave
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
>> > > Sent: Friday, November 05, 2010 9:49 AM
>> > > To: user@hbase.apache.org
>> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>> > >
>> > >
>> > > I don't think using the buffered client is going to help a lot w
>> > > performance.
>> > >
>> > > I'm a little confused because it doesn't sound like Shuja is using a
>> > > map/reduce job to parse the file.
>> > > That is... he says he parses the file in to a dom tree. Usually your
>> map
>> > > job parses each record and then in the mapper you parse out the
>> record.
>> > > Within the m/r job we don't parse out the fields in the records
>> because we
>> > > do additional processing which 'dedupes' the data so we don't have to
>> > > further process the data.
>> > > The second job only has to parse a portion of the original records.
>> > >
>> > > So assuming that Shuja is actually using a map reduce job, and each
>> xml
>> > > record is being parsed within the mapper() there are a couple of
>> things...
>> > > 1) Reduce the number of column families that you are using. (Each
>> column
>> > > family is written to a separate file)
>> > > 2) Set up the HTable instance in Mapper.setup()
>> > > 3) Switch to a different dom class (not all java classes are equal) or
>> > > switch to Stax.
>> > >
>> > >
>> > >
>> > >
>> > > > From: buttler1@llnl.gov
>> > > > To: user@hbase.apache.org
>> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
>> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>> > > >
>> > > > Have you tried turning off auto flush, and managing the flush in
>> your own
>> > > code (say every 1000 puts?)
>> > > > Dave
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
>> > > > Sent: Friday, November 05, 2010 8:04 AM
>> > > > To: user@hbase.apache.org
>> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>> > > >
>> > > > Michael
>> > > >
>> > > > hum....so u are storing xml record in the hbase and in second job, u
>> r
>> > > > parsing. but in my case i am parsing it also in first phase. what i
>> do, i
>> > > > get xml file and i parse it using jdom and then putting data in
>> hbase. so
>> > > > parsing+putting both operations are in 1 phase and in mapper code.
>> > > >
>> > > > My actual problem is that after parsing file, i need to use put
>> statement
>> > > > millions of times and i think for each statement it connects to
>> hbase and
>> > > > then insert it and this might be the reason of slow processing. So i
>> am
>> > > > trying to figure out some way we i can first buffer data and then
>> insert
>> > > in
>> > > > batch fashion. it means in one put statement, i can insert many
>> records
>> > > and
>> > > > i think if i do in this way then the process will be very fast.
>> > > >
>> > > > secondly what does it means? "we write the raw record in via a
>> single
>> > > put()
>> > > > so the map() method is a null writable."
>> > > >
>> > > > can u explain it more?
>> > > >
>> > > > Thanks
>> > > >
>> > > >
>> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
>> michael_segel@hotmail.com
>> > > >wrote:
>> > > >
>> > > > >
>> > > > > Suja,
>> > > > >
>> > > > > Just did a quick glance.
>> > > > >
>> > > > > What is it that you want to do exactly?
>> > > > >
>> > > > > Here's how we do it... (at a high level.)
>> > > > >
>> > > > > Input is an XML file where we want to store the raw XML records in
>> > > hbase,
>> > > > > one record per row.
>> > > > >
>> > > > > Instead of using the output of the map() method, we write the raw
>> > > record in
>> > > > > via a single put() so the map() method is a null writable.
>> > > > >
>> > > > > Its pretty fast. However fast is relative.
>> > > > >
>> > > > > Another thing... we store the xml record as a string (converted to
>> > > > > bytecode) rather than a serialized object.
>> > > > >
>> > > > > Then you can break it down in to individual fields in a second
>> batch
>> > > job.
>> > > > > (You can start with a DOM parser, and later move to a Stax parser.
>> > > > > Depending on which DOM parser you have and the size of the record,
>> it
>> > > should
>> > > > > be 'fast enough'. A good implementation of Stax tends to be
>> > > > > recursive/re-entrant code which is harder to maintain.)
>> > > > >
>> > > > > HTH
>> > > > >
>> > > > > -Mike
>> > > > >
>> > > > >
>> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
>> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
>> > > > > > From: shujamughal@gmail.com
>> > > > > > To: user@hbase.apache.org
>> > > > > >
>> > > > > > Hi
>> > > > > >
>> > > > > > I am reading data from raw xml files and inserting data into
>> hbase
>> > > using
>> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
>> > > statements,
>> > > > > it
>> > > > > > takes many hours to process the data. here is my sample code.
>> > > > > >
>> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
>> > > > > >     conf.set("xmlinput.start", "<adc>");
>> > > > > >     conf.set("xmlinput.end", "</adc>");
>> > > > > >     conf
>> > > > > >         .set(
>> > > > > >           "io.serializations",
>> > > > > >
>> > > > > >
>> > > > >
>> > >
>> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
>> > > > > >
>> > > > > >       Job job = new Job(conf, "Populate Table with Data");
>> > > > > >
>> > > > > >     FileInputFormat.setInputPaths(job, input);
>> > > > > >     job.setJarByClass(ParserDriver.class);
>> > > > > >     job.setMapperClass(MyParserMapper.class);
>> > > > > >     job.setNumReduceTasks(0);
>> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
>> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
>> > > > > >
>> > > > > >
>> > > > > > *and mapper code*
>> > > > > >
>> > > > > > public class MyParserMapper   extends
>> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
>> > > > > >
>> > > > > >     @Override
>> > > > > >     public void map(LongWritable key, Text value1,Context
>> context)
>> > > > > >
>> > > > > > throws IOException, InterruptedException {
>> > > > > > *//doing some processing*
>> > > > > >  while(rItr.hasNext())
>> > > > > >                     {
>> > > > > > *                   //and this put statement runs for
>> 132,622,560
>> > > times
>> > > > > to
>> > > > > > insert the data.*
>> > > > > >                     context.write(NullWritable.get(), new
>> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
>> > > > > > Bytes.toBytes(counter.toString()),
>> > > > > Bytes.toBytes(rElement.getTextTrim())));
>> > > > > >
>> > > > > >                     }
>> > > > > >
>> > > > > > }}
>> > > > > >
>> > > > > > Is there any other way of doing this task so i can improve the
>> > > > > performance?
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Regards
>> > > > > > Shuja-ur-Rehman Baig
>> > > > > > <http://pk.linkedin.com/in/shujamughal>
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Regards
>> > > > Shuja-ur-Rehman Baig
>> > > > <http://pk.linkedin.com/in/shujamughal>
>> > >
>> > >
>> >
>> >
>> > --
>> > Regards
>> > Shuja-ur-Rehman Baig
>> > <http://pk.linkedin.com/in/shujamughal>
>>
>>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
>
>


-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
Ok
Well... I am getting hundreds of files daily which all need to be processed;
that is why I am using Hadoop, so it manages the distribution of the
processing itself.
Yes, one record has millions of fields.

Thanks for the comments.

On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <mi...@hotmail.com>wrote:

>
> Switch out the JDOM for a Stax parser.
>
> Ok, having said that...
> You said you have a single record per file. Ok that means you have a lot of
> fields.
> Because you have 1 record, this isn't a map/reduce problem. You're better
> off writing a single threaded app
> to read the file, parse the file using Stax, and then write the fields to
> HBase.
>
> I'm not sure why you have millions of put()s.
> Do you have millions of fields in this one record?
>
> Writing a good stax parser and then mapping the fields to your hbase
> column(s) will help.
>
> HTH
>
> -Mike
> PS. A good stax implementation would be a recursive/re-entrant piece of
> code.
> While the code may look simple, it takes a skilled developer to write and
> maintain.
>
>
> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > From: shujamughal@gmail.com
> > To: user@hbase.apache.org
> >
> > HI
> >
> > I have used JDOM library to parse the xml in mapper and in my case, one
> > single file consist of 1 record so i give one complete file to map
> process
> > and extract the information from it which i need. I have only 2 column
> > families in my schema and bottleneck was the put statements which run
> > millions of time for each file. when i comment this put statement then
> job
> > complete within minutes but with put statement, it was taking about 7
> hours
> > to complete the same job. Anyhow I have changed the code according to
> > suggestion given by Michael  and now using java api to dump data instead
> of
> > table output format and used the list of puts and then flush them at each
> > 1000 records and it reduces the time significantly. Now the whole job
> > process by 1 hour and 45 min approx but still not in minutes. So is there
> > anything left which i might apply and performance increase?
> >
> > Thanks
> >
> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov>
> wrote:
> >
> > > Good points.
> > > Before we can make any rational suggestion, we need to know where the
> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> > > personally favor Michael's suggestion to split the ingest and the
> parsing
> > > parts of your job, and to switch to a parser that is faster than a DOM
> > > parser (SAX or Stax). But, without knowing what the bottleneck actually
> is,
> > > all of these suggestions are shots in the dark.
> > >
> > > What is the network load, the CPU load, the disk load?  Have you at
> least
> > > installed Ganglia or some equivalent so that you can see what the load
> is
> > > across the cluster?
> > >
> > > Dave
> > >
> > >
> > > -----Original Message-----
> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > > Sent: Friday, November 05, 2010 9:49 AM
> > > To: user@hbase.apache.org
> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >
> > >
> > > I don't think using the buffered client is going to help a lot w
> > > performance.
> > >
> > > I'm a little confused because it doesn't sound like Shuja is using a
> > > map/reduce job to parse the file.
> > > That is... he says he parses the file in to a dom tree. Usually your
> map
> > > job parses each record and then in the mapper you parse out the record.
> > > Within the m/r job we don't parse out the fields in the records because
> we
> > > do additional processing which 'dedupes' the data so we don't have to
> > > further process the data.
> > > The second job only has to parse a portion of the original records.
> > >
> > > So assuming that Shuja is actually using a map reduce job, and each xml
> > > record is being parsed within the mapper() there are a couple of
> things...
> > > 1) Reduce the number of column families that you are using. (Each
> column
> > > family is written to a separate file)
> > > 2) Set up the HTable instance in Mapper.setup()
> > > 3) Switch to a different dom class (not all java classes are equal) or
> > > switch to Stax.
> > >
> > >
> > >
> > >
> > > > From: buttler1@llnl.gov
> > > > To: user@hbase.apache.org
> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > > >
> > > > Have you tried turning off auto flush, and managing the flush in your
> own
> > > code (say every 1000 puts?)
> > > > Dave
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > > > Sent: Friday, November 05, 2010 8:04 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > > >
> > > > Michael
> > > >
> > > > hum....so u are storing xml record in the hbase and in second job, u
> r
> > > > parsing. but in my case i am parsing it also in first phase. what i
> do, i
> > > > get xml file and i parse it using jdom and then putting data in
> hbase. so
> > > > parsing+putting both operations are in 1 phase and in mapper code.
> > > >
> > > > My actual problem is that after parsing file, i need to use put
> statement
> > > > millions of times and i think for each statement it connects to hbase
> and
> > > > then insert it and this might be the reason of slow processing. So i
> am
> > > > trying to figure out some way we i can first buffer data and then
> insert
> > > in
> > > > batch fashion. it means in one put statement, i can insert many
> records
> > > and
> > > > i think if i do in this way then the process will be very fast.
> > > >
> > > > secondly what does it means? "we write the raw record in via a single
> > > put()
> > > > so the map() method is a null writable."
> > > >
> > > > can u explain it more?
> > > >
> > > > Thanks
> > > >
> > > >
> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <
> michael_segel@hotmail.com
> > > >wrote:
> > > >
> > > > >
> > > > > Suja,
> > > > >
> > > > > Just did a quick glance.
> > > > >
> > > > > What is it that you want to do exactly?
> > > > >
> > > > > Here's how we do it... (at a high level.)
> > > > >
> > > > > Input is an XML file where we want to store the raw XML records in
> > > hbase,
> > > > > one record per row.
> > > > >
> > > > > Instead of using the output of the map() method, we write the raw
> > > record in
> > > > > via a single put() so the map() method is a null writable.
> > > > >
> > > > > Its pretty fast. However fast is relative.
> > > > >
> > > > > Another thing... we store the xml record as a string (converted to
> > > > > bytecode) rather than a serialized object.
> > > > >
> > > > > Then you can break it down in to individual fields in a second
> batch
> > > job.
> > > > > (You can start with a DOM parser, and later move to a Stax parser.
> > > > > Depending on which DOM parser you have and the size of the record,
> it
> > > should
> > > > > be 'fast enough'. A good implementation of Stax tends to be
> > > > > recursive/re-entrant code which is harder to maintain.)
> > > > >
> > > > > HTH
> > > > >
> > > > > -Mike
> > > > >
> > > > >
> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > > > > From: shujamughal@gmail.com
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > Hi
> > > > > >
> > > > > > I am reading data from raw xml files and inserting data into
> hbase
> > > using
> > > > > > TableOutputFormat in a map reduce job. but due to heavy put
> > > statements,
> > > > > it
> > > > > > takes many hours to process the data. here is my sample code.
> > > > > >
> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > > > >     conf.set("xmlinput.start", "<adc>");
> > > > > >     conf.set("xmlinput.end", "</adc>");
> > > > > >     conf
> > > > > >         .set(
> > > > > >           "io.serializations",
> > > > > >
> > > > > >
> > > > >
> > >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > > > >
> > > > > >       Job job = new Job(conf, "Populate Table with Data");
> > > > > >
> > > > > >     FileInputFormat.setInputPaths(job, input);
> > > > > >     job.setJarByClass(ParserDriver.class);
> > > > > >     job.setMapperClass(MyParserMapper.class);
> > > > > >     job.setNumReduceTasks(0);
> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > > > > >
> > > > > >
> > > > > > *and mapper code*
> > > > > >
> > > > > > public class MyParserMapper   extends
> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > > > >
> > > > > >     @Override
> > > > > >     public void map(LongWritable key, Text value1,Context
> context)
> > > > > >
> > > > > > throws IOException, InterruptedException {
> > > > > > *//doing some processing*
> > > > > >  while(rItr.hasNext())
> > > > > >                     {
> > > > > > *                   //and this put statement runs for 132,622,560
> > > times
> > > > > to
> > > > > > insert the data.*
> > > > > >                     context.write(NullWritable.get(), new
> > > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > > > > Bytes.toBytes(counter.toString()),
> > > > > Bytes.toBytes(rElement.getTextTrim())));
> > > > > >
> > > > > >                     }
> > > > > >
> > > > > > }}
> > > > > >
> > > > > > Is there any other way of doing this task so i can improve the
> > > > > performance?
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards
> > > > > > Shuja-ur-Rehman Baig
> > > > > > <http://pk.linkedin.com/in/shujamughal>
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards
> > > > Shuja-ur-Rehman Baig
> > > > <http://pk.linkedin.com/in/shujamughal>
> > >
> > >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
>
>



-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by Michael Segel <mi...@hotmail.com>.
Switch out the JDOM for a Stax parser.

Ok, having said that... 
You said you have a single record per file. Ok that means you have a lot of fields.
Because you have 1 record, this isn't a map/reduce problem. You're better off writing a single threaded app 
to read the file, parse the file using Stax, and then write the fields to HBase.

I'm not sure why you have millions of put()s.
Do you have millions of fields in this one record?

Writing a good stax parser and then mapping the fields to your hbase column(s) will help.
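
To give you an idea, a bare-bones (and untested) flat-loop sketch of such a
single-threaded StAX loader is below. The element name, table name and row
key are invented for illustration (the column family is the one from your
mapper code), so map them to whatever your schema actually has; it also
assumes the fields you care about are simple leaf elements.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StaxLoader {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable"); // placeholder table name
    table.setAutoFlush(false);                       // buffer puts on the client side

    XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream(args[0]));

    byte[] rowId = Bytes.toBytes(args[0]);           // placeholder: one row per input file
    long counter = 0;
    while (reader.hasNext()) {
      int event = reader.next();
      // "value" is a made-up leaf element name; match the real field names here
      if (event == XMLStreamConstants.START_ELEMENT
          && "value".equals(reader.getLocalName())) {
        String text = reader.getElementText();       // streams the text, no tree in memory
        Put put = new Put(rowId);
        put.add(Bytes.toBytes("CounterValues"),
                Bytes.toBytes(Long.toString(counter++)),
                Bytes.toBytes(text.trim()));
        table.put(put);                              // buffered because autoflush is off
      }
    }
    reader.close();
    table.flushCommits();                            // flush the remaining buffered puts
  }
}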

HTH

-Mike
PS. A good stax implementation would be a recursive/re-entrant piece of code.
While the code may look simple, it takes a skilled developer to write and maintain.


> Date: Mon, 8 Nov 2010 14:36:34 +0500
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> HI
> 
> I have used JDOM library to parse the xml in mapper and in my case, one
> single file consist of 1 record so i give one complete file to map process
> and extract the information from it which i need. I have only 2 column
> families in my schema and bottleneck was the put statements which run
> millions of time for each file. when i comment this put statement then job
> complete within minutes but with put statement, it was taking about 7 hours
> to complete the same job. Anyhow I have changed the code according to
> suggestion given by Michael  and now using java api to dump data instead of
> table output format and used the list of puts and then flush them at each
> 1000 records and it reduces the time significantly. Now the whole job
> process by 1 hour and 45 min approx but still not in minutes. So is there
> anything left which i might apply and performance increase?
> 
> Thanks
> 
> On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov> wrote:
> 
> > Good points.
> > Before we can make any rational suggestion, we need to know where the
> > bottleneck is, so we can make suggestions to move it elsewhere.  I
> > personally favor Michael's suggestion to split the ingest and the parsing
> > parts of your job, and to switch to a parser that is faster than a DOM
> > parser (SAX or Stax). But, without knowing what the bottleneck actually is,
> > all of these suggestions are shots in the dark.
> >
> > What is the network load, the CPU load, the disk load?  Have you at least
> > installed Ganglia or some equivalent so that you can see what the load is
> > across the cluster?
> >
> > Dave
> >
> >
> > -----Original Message-----
> > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > Sent: Friday, November 05, 2010 9:49 AM
> > To: user@hbase.apache.org
> > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >
> >
> > I don't think using the buffered client is going to help a lot w
> > performance.
> >
> > I'm a little confused because it doesn't sound like Shuja is using a
> > map/reduce job to parse the file.
> > That is... he says he parses the file in to a dom tree. Usually your map
> > job parses each record and then in the mapper you parse out the record.
> > Within the m/r job we don't parse out the fields in the records because we
> > do additional processing which 'dedupes' the data so we don't have to
> > further process the data.
> > The second job only has to parse a portion of the original records.
> >
> > So assuming that Shuja is actually using a map reduce job, and each xml
> > record is being parsed within the mapper() there are a couple of things...
> > 1) Reduce the number of column families that you are using. (Each column
> > family is written to a separate file)
> > 2) Set up the HTable instance in Mapper.setup()
> > 3) Switch to a different dom class (not all java classes are equal) or
> > switch to Stax.
> >
> >
> >
> >
> > > From: buttler1@llnl.gov
> > > To: user@hbase.apache.org
> > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >
> > > Have you tried turning off auto flush, and managing the flush in your own
> > code (say every 1000 puts?)
> > > Dave
> > >
> > >
> > > -----Original Message-----
> > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > > Sent: Friday, November 05, 2010 8:04 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > >
> > > Michael
> > >
> > > hum....so u are storing xml record in the hbase and in second job, u r
> > > parsing. but in my case i am parsing it also in first phase. what i do, i
> > > get xml file and i parse it using jdom and then putting data in hbase. so
> > > parsing+putting both operations are in 1 phase and in mapper code.
> > >
> > > My actual problem is that after parsing file, i need to use put statement
> > > millions of times and i think for each statement it connects to hbase and
> > > then insert it and this might be the reason of slow processing. So i am
> > > trying to figure out some way we i can first buffer data and then insert
> > in
> > > batch fashion. it means in one put statement, i can insert many records
> > and
> > > i think if i do in this way then the process will be very fast.
> > >
> > > secondly what does it means? "we write the raw record in via a single
> > put()
> > > so the map() method is a null writable."
> > >
> > > can u explain it more?
> > >
> > > Thanks
> > >
> > >
> > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com
> > >wrote:
> > >
> > > >
> > > > Suja,
> > > >
> > > > Just did a quick glance.
> > > >
> > > > What is it that you want to do exactly?
> > > >
> > > > Here's how we do it... (at a high level.)
> > > >
> > > > Input is an XML file where we want to store the raw XML records in
> > hbase,
> > > > one record per row.
> > > >
> > > > Instead of using the output of the map() method, we write the raw
> > record in
> > > > via a single put() so the map() method is a null writable.
> > > >
> > > > Its pretty fast. However fast is relative.
> > > >
> > > > Another thing... we store the xml record as a string (converted to
> > > > bytecode) rather than a serialized object.
> > > >
> > > > Then you can break it down in to individual fields in a second batch
> > job.
> > > > (You can start with a DOM parser, and later move to a Stax parser.
> > > > Depending on which DOM parser you have and the size of the record, it
> > should
> > > > be 'fast enough'. A good implementation of Stax tends to be
> > > > recursive/re-entrant code which is harder to maintain.)
> > > >
> > > > HTH
> > > >
> > > > -Mike
> > > >
> > > >
> > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > > > From: shujamughal@gmail.com
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > Hi
> > > > >
> > > > > I am reading data from raw xml files and inserting data into hbase
> > using
> > > > > TableOutputFormat in a map reduce job. but due to heavy put
> > statements,
> > > > it
> > > > > takes many hours to process the data. here is my sample code.
> > > > >
> > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > > >     conf.set("xmlinput.start", "<adc>");
> > > > >     conf.set("xmlinput.end", "</adc>");
> > > > >     conf
> > > > >         .set(
> > > > >           "io.serializations",
> > > > >
> > > > >
> > > >
> > "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > > >
> > > > >       Job job = new Job(conf, "Populate Table with Data");
> > > > >
> > > > >     FileInputFormat.setInputPaths(job, input);
> > > > >     job.setJarByClass(ParserDriver.class);
> > > > >     job.setMapperClass(MyParserMapper.class);
> > > > >     job.setNumReduceTasks(0);
> > > > >     job.setInputFormatClass(XmlInputFormat.class);
> > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > > > >
> > > > >
> > > > > *and mapper code*
> > > > >
> > > > > public class MyParserMapper   extends
> > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > > >
> > > > >     @Override
> > > > >     public void map(LongWritable key, Text value1,Context context)
> > > > >
> > > > > throws IOException, InterruptedException {
> > > > > *//doing some processing*
> > > > >  while(rItr.hasNext())
> > > > >                     {
> > > > > *                   //and this put statement runs for 132,622,560
> > times
> > > > to
> > > > > insert the data.*
> > > > >                     context.write(NullWritable.get(), new
> > > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > > > Bytes.toBytes(counter.toString()),
> > > > Bytes.toBytes(rElement.getTextTrim())));
> > > > >
> > > > >                     }
> > > > >
> > > > > }}
> > > > >
> > > > > Is there any other way of doing this task so i can improve the
> > > > performance?
> > > > >
> > > > >
> > > > > --
> > > > > Regards
> > > > > Shuja-ur-Rehman Baig
> > > > > <http://pk.linkedin.com/in/shujamughal>
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
> >
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
 		 	   		  

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
Hi

I have used the JDOM library to parse the XML in the mapper. In my case, one
single file consists of one record, so I give one complete file to each map
task and extract the information I need from it. I have only 2 column
families in my schema, and the bottleneck was the put statements, which run
millions of times for each file. When I comment out the put statement, the
job completes within minutes, but with the put statement it was taking about
7 hours to complete the same job. Anyhow, I have changed the code according
to the suggestion given by Michael: I am now using the Java API to write the
data instead of TableOutputFormat, and I collect the puts in a list and flush
them every 1000 records, which reduces the time significantly. Now the whole
job finishes in about 1 hour and 45 minutes, but still not in minutes. So is
there anything left I might apply to increase performance?
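
For reference, the batching part of my code now looks roughly like the helper
below. It is a simplified sketch only: the class and method names are
invented, and only the column family matches my real schema.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchWriter {
  private final HTable table;
  private final List<Put> batch = new ArrayList<Put>();

  public BatchWriter(String tableName) throws IOException {
    table = new HTable(new HBaseConfiguration(), tableName);
    table.setAutoFlush(false);              // use the client-side write buffer, not one RPC per put
  }

  public void add(byte[] rowId, String qualifier, String value) throws IOException {
    Put put = new Put(rowId);
    put.add(Bytes.toBytes("CounterValues"), Bytes.toBytes(qualifier), Bytes.toBytes(value));
    batch.add(put);
    if (batch.size() >= 1000) {             // ship the puts in batches of 1000
      table.put(batch);
      batch.clear();
    }
  }

  public void close() throws IOException {
    if (!batch.isEmpty()) {                 // send the last partial batch
      table.put(batch);
      batch.clear();
    }
    table.flushCommits();                   // push anything still sitting in the write buffer
  }
}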

Thanks

On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <bu...@llnl.gov> wrote:

> Good points.
> Before we can make any rational suggestion, we need to know where the
> bottleneck is, so we can make suggestions to move it elsewhere.  I
> personally favor Michael's suggestion to split the ingest and the parsing
> parts of your job, and to switch to a parser that is faster than a DOM
> parser (SAX or Stax). But, without knowing what the bottleneck actually is,
> all of these suggestions are shots in the dark.
>
> What is the network load, the CPU load, the disk load?  Have you at least
> installed Ganglia or some equivalent so that you can see what the load is
> across the cluster?
>
> Dave
>
>
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Friday, November 05, 2010 9:49 AM
> To: user@hbase.apache.org
> Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>
>
> I don't think using the buffered client is going to help a lot w
> performance.
>
> I'm a little confused because it doesn't sound like Shuja is using a
> map/reduce job to parse the file.
> That is... he says he parses the file in to a dom tree. Usually your map
> job parses each record and then in the mapper you parse out the record.
> Within the m/r job we don't parse out the fields in the records because we
> do additional processing which 'dedupes' the data so we don't have to
> further process the data.
> The second job only has to parse a portion of the original records.
>
> So assuming that Shuja is actually using a map reduce job, and each xml
> record is being parsed within the mapper() there are a couple of things...
> 1) Reduce the number of column families that you are using. (Each column
> family is written to a separate file)
> 2) Set up the HTable instance in Mapper.setup()
> 3) Switch to a different dom class (not all java classes are equal) or
> switch to Stax.
>
>
>
>
> > From: buttler1@llnl.gov
> > To: user@hbase.apache.org
> > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >
> > Have you tried turning off auto flush, and managing the flush in your own
> code (say every 1000 puts?)
> > Dave
> >
> >
> > -----Original Message-----
> > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > Sent: Friday, November 05, 2010 8:04 AM
> > To: user@hbase.apache.org
> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >
> > Michael
> >
> > hum....so u are storing xml record in the hbase and in second job, u r
> > parsing. but in my case i am parsing it also in first phase. what i do, i
> > get xml file and i parse it using jdom and then putting data in hbase. so
> > parsing+putting both operations are in 1 phase and in mapper code.
> >
> > My actual problem is that after parsing file, i need to use put statement
> > millions of times and i think for each statement it connects to hbase and
> > then insert it and this might be the reason of slow processing. So i am
> > trying to figure out some way we i can first buffer data and then insert
> in
> > batch fashion. it means in one put statement, i can insert many records
> and
> > i think if i do in this way then the process will be very fast.
> >
> > secondly what does it means? "we write the raw record in via a single
> put()
> > so the map() method is a null writable."
> >
> > can u explain it more?
> >
> > Thanks
> >
> >
> > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com
> >wrote:
> >
> > >
> > > Suja,
> > >
> > > Just did a quick glance.
> > >
> > > What is it that you want to do exactly?
> > >
> > > Here's how we do it... (at a high level.)
> > >
> > > Input is an XML file where we want to store the raw XML records in
> hbase,
> > > one record per row.
> > >
> > > Instead of using the output of the map() method, we write the raw
> record in
> > > via a single put() so the map() method is a null writable.
> > >
> > > Its pretty fast. However fast is relative.
> > >
> > > Another thing... we store the xml record as a string (converted to
> > > bytecode) rather than a serialized object.
> > >
> > > Then you can break it down in to individual fields in a second batch
> job.
> > > (You can start with a DOM parser, and later move to a Stax parser.
> > > Depending on which DOM parser you have and the size of the record, it
> should
> > > be 'fast enough'. A good implementation of Stax tends to be
> > > recursive/re-entrant code which is harder to maintain.)
> > >
> > > HTH
> > >
> > > -Mike
> > >
> > >
> > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > > From: shujamughal@gmail.com
> > > > To: user@hbase.apache.org
> > > >
> > > > Hi
> > > >
> > > > I am reading data from raw xml files and inserting data into hbase
> using
> > > > TableOutputFormat in a map reduce job. but due to heavy put
> statements,
> > > it
> > > > takes many hours to process the data. here is my sample code.
> > > >
> > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > >     conf.set("xmlinput.start", "<adc>");
> > > >     conf.set("xmlinput.end", "</adc>");
> > > >     conf
> > > >         .set(
> > > >           "io.serializations",
> > > >
> > > >
> > >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > >
> > > >       Job job = new Job(conf, "Populate Table with Data");
> > > >
> > > >     FileInputFormat.setInputPaths(job, input);
> > > >     job.setJarByClass(ParserDriver.class);
> > > >     job.setMapperClass(MyParserMapper.class);
> > > >     job.setNumReduceTasks(0);
> > > >     job.setInputFormatClass(XmlInputFormat.class);
> > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > > >
> > > >
> > > > *and mapper code*
> > > >
> > > > public class MyParserMapper   extends
> > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > >
> > > >     @Override
> > > >     public void map(LongWritable key, Text value1,Context context)
> > > >
> > > > throws IOException, InterruptedException {
> > > > *//doing some processing*
> > > >  while(rItr.hasNext())
> > > >                     {
> > > > *                   //and this put statement runs for 132,622,560
> times
> > > to
> > > > insert the data.*
> > > >                     context.write(NullWritable.get(), new
> > > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > > Bytes.toBytes(counter.toString()),
> > > Bytes.toBytes(rElement.getTextTrim())));
> > > >
> > > >                     }
> > > >
> > > > }}
> > > >
> > > > Is there any other way of doing this task so i can improve the
> > > performance?
> > > >
> > > >
> > > > --
> > > > Regards
> > > > Shuja-ur-Rehman Baig
> > > > <http://pk.linkedin.com/in/shujamughal>
> > >
> >
> >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
>
>


-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by "Buttler, David" <bu...@llnl.gov>.
Good points.
Before we can make any rational suggestion, we need to know where the bottleneck is, so we can make suggestions to move it elsewhere.  I personally favor Michael's suggestion to split the ingest and the parsing parts of your job, and to switch to a parser that is faster than a DOM parser (SAX or Stax). But, without knowing what the bottleneck actually is, all of these suggestions are shots in the dark.  

What is the network load, the CPU load, the disk load?  Have you at least installed Ganglia or some equivalent so that you can see what the load is across the cluster?

Dave


-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Friday, November 05, 2010 9:49 AM
To: user@hbase.apache.org
Subject: RE: Best Way to Insert data into Hbase using Map Reduce


I don't think using the buffered client is going to help a lot w performance.

I'm a little confused because it doesn't sound like Shuja is using a map/reduce job to parse the file. 
That is... he says he parses the file in to a dom tree. Usually your map job parses each record and then in the mapper you parse out the record.
Within the m/r job we don't parse out the fields in the records because we do additional processing which 'dedupes' the data so we don't have to further process the data.
The second job only has to parse a portion of the original records.

So assuming that Shuja is actually using a map reduce job, and each xml record is being parsed within the mapper() there are a couple of things...
1) Reduce the number of column families that you are using. (Each column family is written to a separate file)
2) Set up the HTable instance in Mapper.setup() 
3) Switch to a different dom class (not all java classes are equal) or switch to Stax.




> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Fri, 5 Nov 2010 08:28:07 -0700
> Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> 
> Have you tried turning off auto flush, and managing the flush in your own code (say every 1000 puts?)
> Dave
> 
> 
> -----Original Message-----
> From: Shuja Rehman [mailto:shujamughal@gmail.com] 
> Sent: Friday, November 05, 2010 8:04 AM
> To: user@hbase.apache.org
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> 
> Michael
> 
> hum....so u are storing xml record in the hbase and in second job, u r
> parsing. but in my case i am parsing it also in first phase. what i do, i
> get xml file and i parse it using jdom and then putting data in hbase. so
> parsing+putting both operations are in 1 phase and in mapper code.
> 
> My actual problem is that after parsing file, i need to use put statement
> millions of times and i think for each statement it connects to hbase and
> then insert it and this might be the reason of slow processing. So i am
> trying to figure out some way we i can first buffer data and then insert in
> batch fashion. it means in one put statement, i can insert many records and
> i think if i do in this way then the process will be very fast.
> 
> secondly what does it means? "we write the raw record in via a single put()
> so the map() method is a null writable."
> 
> can u explain it more?
> 
> Thanks
> 
> 
> On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <mi...@hotmail.com>wrote:
> 
> >
> > Suja,
> >
> > Just did a quick glance.
> >
> > What is it that you want to do exactly?
> >
> > Here's how we do it... (at a high level.)
> >
> > Input is an XML file where we want to store the raw XML records in hbase,
> > one record per row.
> >
> > Instead of using the output of the map() method, we write the raw record in
> > via a single put() so the map() method is a null writable.
> >
> > Its pretty fast. However fast is relative.
> >
> > Another thing... we store the xml record as a string (converted to
> > bytecode) rather than a serialized object.
> >
> > Then you can break it down in to individual fields in a second batch job.
> > (You can start with a DOM parser, and later move to a Stax parser.
> > Depending on which DOM parser you have and the size of the record, it should
> > be 'fast enough'. A good implementation of Stax tends to be
> > recursive/re-entrant code which is harder to maintain.)
> >
> > HTH
> >
> > -Mike
> >
> >
> > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > From: shujamughal@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi
> > >
> > > I am reading data from raw xml files and inserting data into hbase using
> > > TableOutputFormat in a map reduce job. but due to heavy put statements,
> > it
> > > takes many hours to process the data. here is my sample code.
> > >
> > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > >     conf.set("xmlinput.start", "<adc>");
> > >     conf.set("xmlinput.end", "</adc>");
> > >     conf
> > >         .set(
> > >           "io.serializations",
> > >
> > >
> > "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > >
> > >       Job job = new Job(conf, "Populate Table with Data");
> > >
> > >     FileInputFormat.setInputPaths(job, input);
> > >     job.setJarByClass(ParserDriver.class);
> > >     job.setMapperClass(MyParserMapper.class);
> > >     job.setNumReduceTasks(0);
> > >     job.setInputFormatClass(XmlInputFormat.class);
> > >     job.setOutputFormatClass(TableOutputFormat.class);
> > >
> > >
> > > *and mapper code*
> > >
> > > public class MyParserMapper   extends
> > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >
> > >     @Override
> > >     public void map(LongWritable key, Text value1,Context context)
> > >
> > > throws IOException, InterruptedException {
> > > *//doing some processing*
> > >  while(rItr.hasNext())
> > >                     {
> > > *                   //and this put statement runs for 132,622,560 times
> > to
> > > insert the data.*
> > >                     context.write(NullWritable.get(), new
> > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > Bytes.toBytes(counter.toString()),
> > Bytes.toBytes(rElement.getTextTrim())));
> > >
> > >                     }
> > >
> > > }}
> > >
> > > Is there any other way of doing this task so i can improve the
> > performance?
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
> 
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
 		 	   		  

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by Michael Segel <mi...@hotmail.com>.
I don't think using the buffered client is going to help a lot with performance.

I'm a little confused because it doesn't sound like Shuja is using a map/reduce job to parse the file. 
That is... he says he parses the file into a DOM tree. Usually your map job receives one record at a time, and the mapper parses out just that record.
Within the m/r job we don't parse out the fields in the records because we do additional processing which 'dedupes' the data so we don't have to further process the data.
The second job only has to parse a portion of the original records.

So assuming that Shuja is actually using a map/reduce job, and each XML record is being parsed within the mapper(), there are a couple of things...
1) Reduce the number of column families that you are using. (Each column family is written to a separate file.)
2) Set up the HTable instance in Mapper.setup() so it is created once per task (see the sketch below).
3) Switch to a different DOM implementation (not all Java parsers are equal) or switch to Stax.
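
A minimal sketch of point 2 (the table name is a placeholder and the parsing inside map() is elided; the point is just that the HTable is created once per task, and each record's Put goes through table.put() instead of context.write()):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParseAndPutMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table; // created once per map task, reused for every record

    @Override
    protected void setup(Context context) throws IOException {
        table = new HTable(context.getConfiguration(), "mytable");
    }

    @Override
    public void map(LongWritable key, Text record, Context context) throws IOException {
        // ... parse `record` as before and call table.put(put) for each Put,
        // instead of context.write(NullWritable.get(), put) ...
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close(); // close() also flushes anything still buffered client-side
    }
}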




> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Fri, 5 Nov 2010 08:28:07 -0700
> Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> 
> Have you tried turning off auto flush, and managing the flush in your own code (say every 1000 puts?)
> Dave
> 
> 
> -----Original Message-----
> From: Shuja Rehman [mailto:shujamughal@gmail.com] 
> Sent: Friday, November 05, 2010 8:04 AM
> To: user@hbase.apache.org
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> 
> Michael
> 
> hum....so u are storing xml record in the hbase and in second job, u r
> parsing. but in my case i am parsing it also in first phase. what i do, i
> get xml file and i parse it using jdom and then putting data in hbase. so
> parsing+putting both operations are in 1 phase and in mapper code.
> 
> My actual problem is that after parsing file, i need to use put statement
> millions of times and i think for each statement it connects to hbase and
> then insert it and this might be the reason of slow processing. So i am
> trying to figure out some way we i can first buffer data and then insert in
> batch fashion. it means in one put statement, i can insert many records and
> i think if i do in this way then the process will be very fast.
> 
> secondly what does it means? "we write the raw record in via a single put()
> so the map() method is a null writable."
> 
> can u explain it more?
> 
> Thanks
> 
> 
> On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <mi...@hotmail.com>wrote:
> 
> >
> > Suja,
> >
> > Just did a quick glance.
> >
> > What is it that you want to do exactly?
> >
> > Here's how we do it... (at a high level.)
> >
> > Input is an XML file where we want to store the raw XML records in hbase,
> > one record per row.
> >
> > Instead of using the output of the map() method, we write the raw record in
> > via a single put() so the map() method is a null writable.
> >
> > Its pretty fast. However fast is relative.
> >
> > Another thing... we store the xml record as a string (converted to
> > bytecode) rather than a serialized object.
> >
> > Then you can break it down in to individual fields in a second batch job.
> > (You can start with a DOM parser, and later move to a Stax parser.
> > Depending on which DOM parser you have and the size of the record, it should
> > be 'fast enough'. A good implementation of Stax tends to be
> > recursive/re-entrant code which is harder to maintain.)
> >
> > HTH
> >
> > -Mike
> >
> >
> > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > From: shujamughal@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi
> > >
> > > I am reading data from raw xml files and inserting data into hbase using
> > > TableOutputFormat in a map reduce job. but due to heavy put statements,
> > it
> > > takes many hours to process the data. here is my sample code.
> > >
> > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > >     conf.set("xmlinput.start", "<adc>");
> > >     conf.set("xmlinput.end", "</adc>");
> > >     conf
> > >         .set(
> > >           "io.serializations",
> > >
> > >
> > "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > >
> > >       Job job = new Job(conf, "Populate Table with Data");
> > >
> > >     FileInputFormat.setInputPaths(job, input);
> > >     job.setJarByClass(ParserDriver.class);
> > >     job.setMapperClass(MyParserMapper.class);
> > >     job.setNumReduceTasks(0);
> > >     job.setInputFormatClass(XmlInputFormat.class);
> > >     job.setOutputFormatClass(TableOutputFormat.class);
> > >
> > >
> > > *and mapper code*
> > >
> > > public class MyParserMapper   extends
> > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >
> > >     @Override
> > >     public void map(LongWritable key, Text value1,Context context)
> > >
> > > throws IOException, InterruptedException {
> > > *//doing some processing*
> > >  while(rItr.hasNext())
> > >                     {
> > > *                   //and this put statement runs for 132,622,560 times
> > to
> > > insert the data.*
> > >                     context.write(NullWritable.get(), new
> > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > Bytes.toBytes(counter.toString()),
> > Bytes.toBytes(rElement.getTextTrim())));
> > >
> > >                     }
> > >
> > > }}
> > >
> > > Is there any other way of doing this task so i can improve the
> > performance?
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
> 
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
 		 	   		  

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
Not yet. Can you explain in more detail how to do it?
Thanks

On Fri, Nov 5, 2010 at 8:28 PM, Buttler, David <bu...@llnl.gov> wrote:

> Have you tried turning off auto flush, and managing the flush in your own
> code (say every 1000 puts?)
> Dave
>
>
> -----Original Message-----
> From: Shuja Rehman [mailto:shujamughal@gmail.com]
> Sent: Friday, November 05, 2010 8:04 AM
> To: user@hbase.apache.org
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>
> Michael
>
> hum....so u are storing xml record in the hbase and in second job, u r
> parsing. but in my case i am parsing it also in first phase. what i do, i
> get xml file and i parse it using jdom and then putting data in hbase. so
> parsing+putting both operations are in 1 phase and in mapper code.
>
> My actual problem is that after parsing file, i need to use put statement
> millions of times and i think for each statement it connects to hbase and
> then insert it and this might be the reason of slow processing. So i am
> trying to figure out some way we i can first buffer data and then insert in
> batch fashion. it means in one put statement, i can insert many records and
> i think if i do in this way then the process will be very fast.
>
> secondly what does it means? "we write the raw record in via a single put()
> so the map() method is a null writable."
>
> can u explain it more?
>
> Thanks
>
>
> On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com
> >wrote:
>
> >
> > Suja,
> >
> > Just did a quick glance.
> >
> > What is it that you want to do exactly?
> >
> > Here's how we do it... (at a high level.)
> >
> > Input is an XML file where we want to store the raw XML records in hbase,
> > one record per row.
> >
> > Instead of using the output of the map() method, we write the raw record
> in
> > via a single put() so the map() method is a null writable.
> >
> > Its pretty fast. However fast is relative.
> >
> > Another thing... we store the xml record as a string (converted to
> > bytecode) rather than a serialized object.
> >
> > Then you can break it down in to individual fields in a second batch job.
> > (You can start with a DOM parser, and later move to a Stax parser.
> > Depending on which DOM parser you have and the size of the record, it
> should
> > be 'fast enough'. A good implementation of Stax tends to be
> > recursive/re-entrant code which is harder to maintain.)
> >
> > HTH
> >
> > -Mike
> >
> >
> > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > From: shujamughal@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi
> > >
> > > I am reading data from raw xml files and inserting data into hbase
> using
> > > TableOutputFormat in a map reduce job. but due to heavy put statements,
> > it
> > > takes many hours to process the data. here is my sample code.
> > >
> > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > >     conf.set("xmlinput.start", "<adc>");
> > >     conf.set("xmlinput.end", "</adc>");
> > >     conf
> > >         .set(
> > >           "io.serializations",
> > >
> > >
> >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > >
> > >       Job job = new Job(conf, "Populate Table with Data");
> > >
> > >     FileInputFormat.setInputPaths(job, input);
> > >     job.setJarByClass(ParserDriver.class);
> > >     job.setMapperClass(MyParserMapper.class);
> > >     job.setNumReduceTasks(0);
> > >     job.setInputFormatClass(XmlInputFormat.class);
> > >     job.setOutputFormatClass(TableOutputFormat.class);
> > >
> > >
> > > *and mapper code*
> > >
> > > public class MyParserMapper   extends
> > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >
> > >     @Override
> > >     public void map(LongWritable key, Text value1,Context context)
> > >
> > > throws IOException, InterruptedException {
> > > *//doing some processing*
> > >  while(rItr.hasNext())
> > >                     {
> > > *                   //and this put statement runs for 132,622,560 times
> > to
> > > insert the data.*
> > >                     context.write(NullWritable.get(), new
> > > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > Bytes.toBytes(counter.toString()),
> > Bytes.toBytes(rElement.getTextTrim())));
> > >
> > >                     }
> > >
> > > }}
> > >
> > > Is there any other way of doing this task so i can improve the
> > performance?
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
>



-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by "Buttler, David" <bu...@llnl.gov>.
Have you tried turning off auto flush and managing the flush in your own code (say, every 1000 puts)?
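A sketch of one way to do that (the class, the table name and the flush threshold are illustrations, not a tuned recommendation):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Turn off auto flush and flush by hand every N puts, so many puts share one round trip.
public class BufferedPutter {

    private final HTable table;
    private final int flushEvery;
    private int pending = 0;

    public BufferedPutter(Configuration conf, String tableName, int flushEvery) throws IOException {
        this.table = new HTable(conf, tableName);
        this.table.setAutoFlush(false); // puts accumulate in the client-side write buffer
        this.flushEvery = flushEvery;
    }

    public void put(Put put) throws IOException {
        table.put(put);
        if (++pending >= flushEvery) {
            table.flushCommits(); // one call pushes the whole buffered batch
            pending = 0;
        }
    }

    public void close() throws IOException {
        table.flushCommits(); // don't lose the tail of the buffer
        table.close();
    }
}

In the mapper you would create one of these in setup() (for example new BufferedPutter(context.getConfiguration(), "mytable", 1000)), call put() inside the parsing loop, and call close() in cleanup().
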
Dave


-----Original Message-----
From: Shuja Rehman [mailto:shujamughal@gmail.com] 
Sent: Friday, November 05, 2010 8:04 AM
To: user@hbase.apache.org
Subject: Re: Best Way to Insert data into Hbase using Map Reduce

Michael

hum....so u are storing xml record in the hbase and in second job, u r
parsing. but in my case i am parsing it also in first phase. what i do, i
get xml file and i parse it using jdom and then putting data in hbase. so
parsing+putting both operations are in 1 phase and in mapper code.

My actual problem is that after parsing file, i need to use put statement
millions of times and i think for each statement it connects to hbase and
then insert it and this might be the reason of slow processing. So i am
trying to figure out some way we i can first buffer data and then insert in
batch fashion. it means in one put statement, i can insert many records and
i think if i do in this way then the process will be very fast.

secondly what does it means? "we write the raw record in via a single put()
so the map() method is a null writable."

can u explain it more?

Thanks


On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <mi...@hotmail.com>wrote:

>
> Suja,
>
> Just did a quick glance.
>
> What is it that you want to do exactly?
>
> Here's how we do it... (at a high level.)
>
> Input is an XML file where we want to store the raw XML records in hbase,
> one record per row.
>
> Instead of using the output of the map() method, we write the raw record in
> via a single put() so the map() method is a null writable.
>
> Its pretty fast. However fast is relative.
>
> Another thing... we store the xml record as a string (converted to
> bytecode) rather than a serialized object.
>
> Then you can break it down in to individual fields in a second batch job.
> (You can start with a DOM parser, and later move to a Stax parser.
> Depending on which DOM parser you have and the size of the record, it should
> be 'fast enough'. A good implementation of Stax tends to be
> recursive/re-entrant code which is harder to maintain.)
>
> HTH
>
> -Mike
>
>
> > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > Subject: Best Way to Insert data into Hbase using Map Reduce
> > From: shujamughal@gmail.com
> > To: user@hbase.apache.org
> >
> > Hi
> >
> > I am reading data from raw xml files and inserting data into hbase using
> > TableOutputFormat in a map reduce job. but due to heavy put statements,
> it
> > takes many hours to process the data. here is my sample code.
> >
> > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >     conf.set("xmlinput.start", "<adc>");
> >     conf.set("xmlinput.end", "</adc>");
> >     conf
> >         .set(
> >           "io.serializations",
> >
> >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> >
> >       Job job = new Job(conf, "Populate Table with Data");
> >
> >     FileInputFormat.setInputPaths(job, input);
> >     job.setJarByClass(ParserDriver.class);
> >     job.setMapperClass(MyParserMapper.class);
> >     job.setNumReduceTasks(0);
> >     job.setInputFormatClass(XmlInputFormat.class);
> >     job.setOutputFormatClass(TableOutputFormat.class);
> >
> >
> > *and mapper code*
> >
> > public class MyParserMapper   extends
> >     Mapper<LongWritable, Text, NullWritable, Writable> {
> >
> >     @Override
> >     public void map(LongWritable key, Text value1,Context context)
> >
> > throws IOException, InterruptedException {
> > *//doing some processing*
> >  while(rItr.hasNext())
> >                     {
> > *                   //and this put statement runs for 132,622,560 times
> to
> > insert the data.*
> >                     context.write(NullWritable.get(), new
> > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > Bytes.toBytes(counter.toString()),
> Bytes.toBytes(rElement.getTextTrim())));
> >
> >                     }
> >
> > }}
> >
> > Is there any other way of doing this task so i can improve the
> performance?
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
>




-- 
Regards
Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>

Re: Best Way to Insert data into Hbase using Map Reduce

Posted by Shuja Rehman <sh...@gmail.com>.
Michael

Hmm... so you are storing the raw XML record in HBase, and parsing it in a second job. In my
case I am parsing it in the first phase as well: I take the XML file, parse it with JDOM, and
then put the data into HBase, so both parsing and putting happen in one phase, inside the
mapper code.

My actual problem is that after parsing the file I need to issue the put statement millions of
times, and I think each statement connects to HBase and then inserts, which might be the reason
for the slow processing. So I am trying to figure out some way to buffer the data first and
then insert it in batch fashion; that way one put call could insert many records, and I think
the process would be much faster.
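For reference, one way to get that batch effect (a sketch only; the table name, batch size and helper class are placeholders) is to collect the Puts in a java.util.List and hand the whole list to HTable.put(List<Put>):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Writes the given puts to the table in chunks: one client call per chunk
// instead of one call per Put.
public class BatchPutHelper {

    public static void putInBatches(Configuration conf, String tableName,
                                    List<Put> puts, int batchSize) throws IOException {
        HTable table = new HTable(conf, tableName);
        try {
            List<Put> batch = new ArrayList<Put>(batchSize);
            for (Put p : puts) {
                batch.add(p);
                if (batch.size() >= batchSize) {
                    table.put(batch); // e.g. 1000 puts in a single call
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                table.put(batch); // send the remainder
            }
        } finally {
            table.close();
        }
    }
}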

Secondly, what does this mean: "we write the raw record in via a single put() so the map()
method is a null writable"?

Can you explain it in more detail?

Thanks


On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <mi...@hotmail.com>wrote:

>
> Suja,
>
> Just did a quick glance.
>
> What is it that you want to do exactly?
>
> Here's how we do it... (at a high level.)
>
> Input is an XML file where we want to store the raw XML records in hbase,
> one record per row.
>
> Instead of using the output of the map() method, we write the raw record in
> via a single put() so the map() method is a null writable.
>
> Its pretty fast. However fast is relative.
>
> Another thing... we store the xml record as a string (converted to
> bytecode) rather than a serialized object.
>
> Then you can break it down in to individual fields in a second batch job.
> (You can start with a DOM parser, and later move to a Stax parser.
> Depending on which DOM parser you have and the size of the record, it should
> be 'fast enough'. A good implementation of Stax tends to be
> recursive/re-entrant code which is harder to maintain.)
>
> HTH
>
> -Mike
>
>
> > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > Subject: Best Way to Insert data into Hbase using Map Reduce
> > From: shujamughal@gmail.com
> > To: user@hbase.apache.org
> >
> > Hi
> >
> > I am reading data from raw xml files and inserting data into hbase using
> > TableOutputFormat in a map reduce job. but due to heavy put statements,
> it
> > takes many hours to process the data. here is my sample code.
> >
> > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >     conf.set("xmlinput.start", "<adc>");
> >     conf.set("xmlinput.end", "</adc>");
> >     conf
> >         .set(
> >           "io.serializations",
> >
> >
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> >
> >       Job job = new Job(conf, "Populate Table with Data");
> >
> >     FileInputFormat.setInputPaths(job, input);
> >     job.setJarByClass(ParserDriver.class);
> >     job.setMapperClass(MyParserMapper.class);
> >     job.setNumReduceTasks(0);
> >     job.setInputFormatClass(XmlInputFormat.class);
> >     job.setOutputFormatClass(TableOutputFormat.class);
> >
> >
> > *and mapper code*
> >
> > public class MyParserMapper   extends
> >     Mapper<LongWritable, Text, NullWritable, Writable> {
> >
> >     @Override
> >     public void map(LongWritable key, Text value1,Context context)
> >
> > throws IOException, InterruptedException {
> > *//doing some processing*
> >  while(rItr.hasNext())
> >                     {
> > *                   //and this put statement runs for 132,622,560 times
> to
> > insert the data.*
> >                     context.write(NullWritable.get(), new
> > Put(rowId).add(Bytes.toBytes("CounterValues"),
> > Bytes.toBytes(counter.toString()),
> Bytes.toBytes(rElement.getTextTrim())));
> >
> >                     }
> >
> > }}
> >
> > Is there any other way of doing this task so i can improve the
> performance?
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
>




-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>

RE: Best Way to Insert data into Hbase using Map Reduce

Posted by Michael Segel <mi...@hotmail.com>.
Shuja,

Just did a quick glance.

What is it that you want to do exactly?

Here's how we do it... (at a high level.)

Input is an XML file where we want to store the raw XML records in hbase, one record per row.

Instead of using the output of the map() method, we write the raw record in via a single put() so the map() method is a null writable.

It's pretty fast. However, fast is relative.

Another thing... we store the XML record as a string (converted to bytes) rather than as a serialized object.

Then you can break it down into individual fields in a second batch job.
(You can start with a DOM parser, and later move to a Stax parser. Depending on which DOM parser you have and the size of the record, it should be 'fast enough'. A good implementation of Stax tends to be recursive/re-entrant code, which is harder to maintain.)
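
A minimal sketch of that first pass, assuming the same XmlInputFormat hands one raw record to each map() call (the table name, family/qualifier and row-key scheme below are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One row per XML record: the whole record is stored as bytes in a single cell,
// written straight to the table, and map() itself emits nothing.
public class RawXmlIngestMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        table = new HTable(context.getConfiguration(), "raw_xml");
    }

    @Override
    public void map(LongWritable offset, Text record, Context context) throws IOException {
        byte[] rowKey = Bytes.toBytes("rec-" + offset.get()); // any unique key scheme will do
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"), Bytes.toBytes(record.toString()));
        table.put(put); // written directly; nothing goes through context.write()
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
    }
}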

HTH

-Mike


> Date: Fri, 5 Nov 2010 16:13:02 +0500
> Subject: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> Hi
> 
> I am reading data from raw xml files and inserting data into hbase using
> TableOutputFormat in a map reduce job. but due to heavy put statements, it
> takes many hours to process the data. here is my sample code.
> 
> conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
>     conf.set("xmlinput.start", "<adc>");
>     conf.set("xmlinput.end", "</adc>");
>     conf
>         .set(
>           "io.serializations",
> 
> "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> 
>       Job job = new Job(conf, "Populate Table with Data");
> 
>     FileInputFormat.setInputPaths(job, input);
>     job.setJarByClass(ParserDriver.class);
>     job.setMapperClass(MyParserMapper.class);
>     job.setNumReduceTasks(0);
>     job.setInputFormatClass(XmlInputFormat.class);
>     job.setOutputFormatClass(TableOutputFormat.class);
> 
> 
> *and mapper code*
> 
> public class MyParserMapper   extends
>     Mapper<LongWritable, Text, NullWritable, Writable> {
> 
>     @Override
>     public void map(LongWritable key, Text value1,Context context)
> 
> throws IOException, InterruptedException {
> *//doing some processing*
>  while(rItr.hasNext())
>                     {
> *                   //and this put statement runs for 132,622,560 times to
> insert the data.*
>                     context.write(NullWritable.get(), new
> Put(rowId).add(Bytes.toBytes("CounterValues"),
> Bytes.toBytes(counter.toString()), Bytes.toBytes(rElement.getTextTrim())));
> 
>                     }
> 
> }}
> 
> Is there any other way of doing this task so i can improve the performance?
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>