Posted to user@hbase.apache.org by "Gan, Xiyun" <ga...@gmail.com> on 2011/04/07 04:54:20 UTC

ImportTsv usage

Hi,
   I need to use the bulk load functionality in HBase. I have read the
documentation on the HBase wiki page, but the ImportTsv tool does not meet my
needs, so I added some code to the map() function in ImportTsv.java.
Originally, that map() function writes only one key/value pair to the
context. In my modified code, the function writes two key/value pairs to the
context; the rest of the code remains the same as the original.
   I compiled my code and ran it with hadoop jar. But the time to run the job
is not twice that of the original; it is nearly ten times that of the version
that emits only one key/value pair. I checked my code and did not find any
problem. If the map() function emits only one of the two key/value pairs I
wrote (either one), the time cost is normal.
  What is the cause? Am I missing any tips for bulk load?

-- 
Best wishes
Gan, Xiyun
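
For reference, a minimal sketch of a mapper that emits two key/value pairs per
input line, in the style of ImportTsv's TsvImporterMapper. The class name, row
key, column family "c1", and qualifiers "q1"/"q2" are illustrative assumptions,
not the poster's actual code:

    import java.io.IOException;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: emits two Puts per input line instead of one.
    public class TwoEmitMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Illustrative row key; ImportTsv normally parses it from the line.
        byte[] row = Bytes.toBytes("row-" + offset.get());
        ImmutableBytesWritable rowKey = new ImmutableBytesWritable(row);

        // First key/value pair, as in the unmodified ImportTsv mapper.
        Put first = new Put(row);
        first.add(new KeyValue(row, Bytes.toBytes("c1"), Bytes.toBytes("q1"),
            Bytes.toBytes(line.toString())));
        context.write(rowKey, first);

        // Second key/value pair: the extra emit described above, which
        // doubles the size of the map output.
        Put second = new Put(row);
        second.add(new KeyValue(row, Bytes.toBytes("c1"), Bytes.toBytes("q2"),
            Bytes.toBytes("v2")));
        context.write(rowKey, second);
      }
    }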

Re: ImportTsv usage

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xiyun,

My guess is that with the small output, you are fitting each map output in
one spill. When you double the output size, it doesn't fit in one spill, and
you incur an extra penalty to re-read and merge the output.

If you can spare the memory, bump mapred.child.java.opts so that each map
task has twice as much heap as it used to, and increase io.sort.mb to about
twice as much. For example, mapred.child.java.opts=-Xmx512m and
io.sort.mb=300.
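
A rough sketch of applying those settings to the job configuration before the
ImportTsv job is created (the wrapper class is hypothetical; the property names
are the pre-YARN Hadoop names used above, and the values are just examples):

    import org.apache.hadoop.conf.Configuration;

    public class ImportTsvTuning {
      public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Roughly double the map task heap so the larger map output
        // still fits in memory.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        // Enlarge the map-side sort buffer so the doubled output fits
        // in a single spill instead of forcing a re-read and merge.
        conf.setInt("io.sort.mb", 300);
        return conf;
      }
    }

Since ImportTsv accepts generic Hadoop options, the same values can usually
also be passed as -D arguments when launching the job with hadoop jar.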

Thanks
-Todd

On Wed, Apr 6, 2011 at 9:16 PM, Stack <st...@duboce.net> wrote:

> On Wed, Apr 6, 2011 at 9:10 PM, Gan, Xiyun <ga...@gmail.com> wrote:
> > A 12-node cluster; the HBase version is 0.89.20100924.
>
> Please upgrade to 0.90.1 at least.
>
>
> > The inputs are the same, about 15 million lines of text. I'm sure the
> > time cost of parsing a line is low.
>
> How much difference is there in the size of the outputs?  Is the number of
> hfiles doubled, or ten times as many?
>
> How many mappers/reducers are running?
>
> > The added k/v pair in the map() function is very simple; the added code is
> > just
> >
> >         String strKey = "key";
> >         ImmutableBytesWritable rowKey =
> >             new ImmutableBytesWritable(strKey.getBytes());
> >
> >         Put put = new Put(rowKey.copyBytes());
> >         KeyValue kv = new KeyValue(strKey.getBytes(), "c1".getBytes(),
> >             "q1".getBytes(), "v1".getBytes());
> >         put.add(kv);
> >         context.write(rowKey, put);
> >
> > The time cost is nearly 10 times as much as the original one.
> >
>
> The above looks like it would add but a light drag.
>
> When you say ten times, is it ten times one minute or ten times ten
> minutes?
>
> Where do you see the time being spent?  In maps, in longer running
> reducers?
>
> St.Ack
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: ImportTsv usage

Posted by Stack <st...@duboce.net>.
On Wed, Apr 6, 2011 at 9:10 PM, Gan, Xiyun <ga...@gmail.com> wrote:
> A 12-node cluster; the HBase version is 0.89.20100924.

Please upgrade to 0.90.1 at least.


> The inputs are the same, about 15 million lines of text. I'm sure the time
> cost of parsing a line is low.

How much difference is there in the size of the outputs?  Is the number of
hfiles doubled, or ten times as many?

How many mappers/reducers are running?

> The added k/v pair in the map() function is very simple; the added code is
> just
>
>         String strKey = "key";
>         ImmutableBytesWritable rowKey =
>             new ImmutableBytesWritable(strKey.getBytes());
>
>         Put put = new Put(rowKey.copyBytes());
>         KeyValue kv = new KeyValue(strKey.getBytes(), "c1".getBytes(),
>             "q1".getBytes(), "v1".getBytes());
>         put.add(kv);
>         context.write(rowKey, put);
>
> The time cost is nearly 10 times as much as the original one.
>

The above looks like it would add but a light drag.

When you say ten times, is it ten times one minute or ten times ten minutes?

Where do you see the time being spent?  In maps, in longer running reducers?

St.Ack


Re: ImportTsv usage

Posted by "Gan, Xiyun" <ga...@gmail.com>.
A 12-node cluster; the HBase version is 0.89.20100924.

The inputs are the same, about 15 million lines of text. I'm sure the time
cost of parsing a line is low.
The added k/v pair in the map() function is very simple; the added code is
just

        String strKey = "key";
        ImmutableBytesWritable rowKey =
            new ImmutableBytesWritable(strKey.getBytes());

        Put put = new Put(rowKey.copyBytes());
        KeyValue kv = new KeyValue(strKey.getBytes(), "c1".getBytes(),
            "q1".getBytes(), "v1".getBytes());
        put.add(kv);

        context.write(rowKey, put);

The time cost is nearly 10 times as much as the original one.


Thanks so much.

On Thu, Apr 7, 2011 at 11:49 AM, Stack <st...@duboce.net> wrote:

> Tell us more about how you are doing the measurement.  Are you
> profiling with ten inputs or one million?  Is this on a single node or
> a thousand node cluster?  What version of HBase?
>
> Thank you,
> St.Ack



-- 
Best wishes
Gan, Xiyun

Re: ImportTsv usage

Posted by Stack <st...@duboce.net>.
Tell us more about how you are doing the measurement.  Are you
profiling with ten inputs or one million?  Is this on a single node or
a thousand node cluster?  What version of HBase?

Thank you,
St.Ack

On Wed, Apr 6, 2011 at 7:54 PM, Gan, Xiyun <ga...@gmail.com> wrote:
> Hi,
>   I need to use the bulk load functionality in HBase. I have read the
> documentation on the HBase wiki page, but the ImportTsv tool does not meet
> my needs, so I added some code to the map() function in ImportTsv.java.
> Originally, that map() function writes only one key/value pair to the
> context. In my modified code, the function writes two key/value pairs to the
> context; the rest of the code remains the same as the original.
>   I compiled my code and ran it with hadoop jar. But the time to run the job
> is not twice that of the original; it is nearly ten times that of the
> version that emits only one key/value pair. I checked my code and did not
> find any problem. If the map() function emits only one of the two key/value
> pairs I wrote (either one), the time cost is normal.
>  What is the cause? Am I missing any tips for bulk load?
>
> --
> Best wishes
> Gan, Xiyun
>