Posted to user@hbase.apache.org by Vidhyashankar Venkataraman <vi...@yahoo-inc.com> on 2010/06/11 23:54:52 UTC

Low throughputs while writing Hfiles using Hfile.writer

The last couple of days I have been running into a bottleneck while writing HFiles that I am unable to figure out. I am using HFile.Writer to prepare a set of HFiles (an HFile is similar to a TFile) for bulk loading, and I have been getting suspiciously low throughput.

I am not using MR to create my files; I prepare the data on the fly and write out HFiles almost exactly the way HFileOutputFormat does.
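In outline, each client does something like this (a simplified sketch, not the actual code; the constructor and append signatures follow the 0.20-era HFile API and may differ in other versions, and the row/value generation is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.hfile.HFile;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HFileWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 1 MB HFile block size, no compression, keys ordered by the KeyValue comparator.
        HFile.Writer writer = new HFile.Writer(fs, new Path("/bulk/part-00000"),
            1024 * 1024, "none", KeyValue.KEY_COMPARATOR);
        try {
          // ~140k rows of ~15 KB each comes to roughly 2 GB per file.
          byte[] value = new byte[15 * 1024];
          for (int i = 0; i < 140000; i++) {
            byte[] row = Bytes.toBytes(String.format("row%012d", i));
            // Keys must be appended in sorted order.
            writer.append(new KeyValue(row, Bytes.toBytes("cf"),
                Bytes.toBytes("q"), System.currentTimeMillis(), value));
          }
        } finally {
          writer.close();
        }
      }
    }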

This is my current setup (almost the same as what I described in my previous emails):
Individual output file size is 2 GB, with a block size of 1 MB. I am writing multiple such files to build the entire db, and each client program writes its files one after another.
Each key-value pair is around 15 KB.
5 datanodes.
Each datanode also runs 5 instances of my client program (25 processes in all).
I get a throughput of around 100 rows per second per node (which comes to around 1.5 MBps per node).
As expected, neither the disk nor the network is the bottleneck.

Are there any config values that I need to take care of?


With Hadoop's copyFromLocal command I can get much better throughput: 50 MBps with just one process (of course, the block size is much larger in that case).

Thanks in advance :)
Vidhya

On 6/11/10 12:44 PM, "Pei Lin Ong" <pe...@yahoo-inc.com> wrote:

Hi Milind and Koji,

Vidhya is one of the Search devs working on Web Crawl Cache cluster (ingesting crawled content from Bing).

He is currently looking at different technology choices, such as HBase, for the cluster configuration. Vidhya has run into a Hadoop HDFS issue and is looking for help.

I have suggested he pose the question via this thread as Vidhya indicates it is urgent due to the WCC timetable.

Please accommodate this request and see if you can answer Vidhya's question (after he poses it). Should the question require further discussion, then Vidhya or I will file a ticket.

Thank you!
Pei

Re: Low throughputs while writing Hfiles using Hfile.writer

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
> That was the HFile block size. How is this 'block' different from an HDFS block?
  Never mind, got the answer.

Thank you
Vidhya


Re: Low throughputs while writing Hfiles using Hfile.writer

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
>> I wasn't clear, is the 1MB block size your HDFS block size or your HFile
>> block size?
That was the HFile block size. How is this 'block' different from an HDFS block?

Thank you
Vidhya


Re: Low throughputs while writing Hfiles using Hfile.writer

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Jun 11, 2010 at 3:07 PM, Vidhyashankar Venkataraman <vidhyash@yahoo-inc.com> wrote:

> >> Do you have profiling output from your HFile writers?
> Do you mean the debug output in the logs?
>
>
I was suggesting running a Java profiler (e.g. YourKit or the JVM's built-in hprof agent) to see
where the time is going. I recall you saying you're new-ish to Java, but I know some of the Grid
Solutions guys over there are pretty expert users of the profiler.
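For example, hprof ships with the JVM and CPU sampling is usually cheap enough for a run like this (the class name below is just a placeholder for your client):

    java -agentlib:hprof=cpu=samples,interval=10,depth=10,file=hfile-writer.hprof.txt \
         -cp <your-classpath> your.package.HFileWriterClient

The top of the resulting file shows where the samples landed, which should make it clear whether the CPU is going into serialization/compression or the program is just waiting on the DFS client.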


> Can it also be due to the numerous per-block queries to the namenode? (Now
> that the block size is so low)
>

I wasn't clear: is the 1 MB block size your HDFS block size or your HFile block size? I wouldn't
recommend such a tiny HDFS block size - we usually go with 128 MB or 256 MB. It could definitely
slow you down.
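To make the distinction concrete (a sketch; dfs.block.size is the 0.20-era property name):

    Configuration conf = new Configuration();
    // HDFS block size: how the file is physically chunked and placed across datanodes.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);   // 128 MB
    FileSystem fs = FileSystem.get(conf);

    // The HFile block size is separate: it is the blocksize argument passed to
    // HFile.Writer and only controls the granularity of the in-file index and reads.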

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low throughputs while writing Hfiles using Hfile.writer

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
>> Do you have profiling output from your HFile writers?
Do you mean the debug output in the logs?

Can it also be due to the numerous per-block queries to the namenode? (Now that the block size is so low)
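(For scale: at a 1 MB HDFS block size, a 2 GB file would map to roughly 2048 blocks, each needing its own allocation from the namenode, versus about 16 blocks at 128 MB.)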

Thank you
V


Re: Low throughputs while writing Hfiles using Hfile.writer

Posted by Ryan Rawson <ry...@gmail.com>.
Why are you using a 1 MB HDFS block size?  Stick with the default of 64 MB; there is no reason
to add 64 times the overhead.

As for HFile writing, you will want to enable compression (there is no reason not to) and also
do that profiling.  YourKit has a version that runs reasonably in semi-production without killing
performance too much.
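For example, compression is a one-argument change on the writer (a sketch against the 0.20-era constructor; "gz" works out of the box, while "lzo" is faster but needs the native libraries installed on every node):

    // Same writer as before, but with gzip compression instead of "none".
    HFile.Writer writer = new HFile.Writer(fs, new Path("/bulk/part-00000"),
        64 * 1024, "gz", KeyValue.KEY_COMPARATOR);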


Re: Low throughputs while writing Hfiles using Hfile.writer

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Vidhya,

Do you have profiling output from your HFile writers?

Since you have a standalone program that should be doing little except
writing, I imagine the profiler output would be pretty useful in seeing
where the bottleneck lies.

My guess is that you're CPU bound on serialization - serialization is often
slow slow slow.
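One cheap sanity check along those lines (a rough sketch, not your actual client; it reuses the classes from the writer sketch in the first message and writes to the local filesystem so HDFS and the network are out of the picture):

    Configuration conf = new Configuration();
    FileSystem localFs = FileSystem.getLocal(conf);   // local FS: no datanodes, no network
    HFile.Writer w = new HFile.Writer(localFs, new Path("/tmp/bench.hfile"),
        1024 * 1024, "none", KeyValue.KEY_COMPARATOR);

    byte[] value = new byte[15 * 1024];
    long start = System.currentTimeMillis();
    int rows = 10000;
    for (int i = 0; i < rows; i++) {
      w.append(new KeyValue(Bytes.toBytes(String.format("row%09d", i)),
          Bytes.toBytes("cf"), Bytes.toBytes("q"), 0L, value));
    }
    w.close();
    long ms = Math.max(1, System.currentTimeMillis() - start);
    System.out.println("rows/sec = " + (rows * 1000L / ms));

If that number is also around 100 rows/sec, the cost is in building and serializing the KeyValues; if it is much higher, the time is going into the DFS write path.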

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera