Posted to hdfs-user@hadoop.apache.org by Steven Wong <sw...@netflix.com> on 2013/01/23 22:07:28 UTC

Need help with cluster setup for performance [Impala]

My apologies for sending this message to this group, but I'm having trouble sending to the right group.


________________________________
From: Steven Wong
Sent: Wednesday, January 23, 2013 11:15 AM
To: impala-user@cloudera.org
Subject: RE: Need help with cluster setup for performance

Thanks for the suggestions. The /metrics output looks good now, and the SELECT COUNT(*) runs much faster than before.

But I still have the "Unknown disk id" error message. My CDH version is:

 hadoop-client        x86_64 2.0.0+552-1.cdh4.1.2.p0.27.el5 cloudera-cdh4  18 k
 hadoop-mapreduce     x86_64 2.0.0+552-1.cdh4.1.2.p0.27.el5 cloudera-cdh4 9.8 M
 hadoop-yarn          x86_64 2.0.0+552-1.cdh4.1.2.p0.27.el5 cloudera-cdh4 8.9 M



On Tuesday, January 22, 2013 5:37:30 PM UTC-8, Henry wrote:
On 22 January 2013 11:40, Steven Wong <sw...@netflix.com> wrote:
Hi,

I followed http://zenfractal.com/2012/11/15/from-zero-to-impala-in-minutes/ to set up a cluster on EC2. After seeing disappointing performance numbers from a SELECT COUNT(*), I am following https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance#ConfiguringImpalaforPerformance-TestingImpalaforHighPerformanceConfiguration to check my cluster setup. Questions:

1. My cluster has 3 data nodes. Is the following http://<hostname>:<port>/metrics output good?

statestore.backend.state.map:
{
  127.0.0.1:23000 : OK
}
statestore.live.backends:3
statestore.live.backends.list:[127.0.0.1:22000]


Hi Steven -

This looks like your problem. Your machines are registering themselves with 'localhost' as their hostname, and this means that they all look the same to the statestore.

I looked at Matt's zero-to-impala link - it's awesome, but now a little out of date. You should modify where you run impalad to also have --ipaddress and --hostname correctly set for each node. Then check the statestore metrics; things should look a lot better and your performance should improve.
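As a rough way to spot this symptom automatically, the sketch below parses statestore /metrics text of the shape quoted above and flags loopback backends, or a live-backend count that exceeds the number of distinct addresses listed. The function names (parse_backends, diagnose) and the sample strings are illustrative assumptions, not part of Impala, and the parsing assumes the metrics keys appear exactly as in the output quoted in this thread.

```python
import ipaddress
import re

# Sample text modelled on the /metrics output quoted in this thread.
BAD_SAMPLE = """statestore.live.backends:3
statestore.live.backends.list:[127.0.0.1:22000]
"""

# Hypothetical healthy output: three distinct, non-loopback addresses.
GOOD_SAMPLE = """statestore.live.backends:3
statestore.live.backends.list:[10.0.0.1:22000, 10.0.0.2:22000, 10.0.0.3:22000]
"""


def parse_backends(metrics_text):
    """Return (reported_count, backend_addresses) from statestore metrics text."""
    count_match = re.search(r"statestore\.live\.backends:(\d+)", metrics_text)
    list_match = re.search(r"statestore\.live\.backends\.list:\[(.*?)\]", metrics_text)
    count = int(count_match.group(1)) if count_match else 0
    raw = list_match.group(1) if list_match else ""
    backends = [b.strip() for b in raw.split(",") if b.strip()]
    return count, backends


def diagnose(metrics_text):
    """Return a list of human-readable problems found in the metrics text."""
    count, backends = parse_backends(metrics_text)
    problems = []
    for backend in backends:
        host = backend.rsplit(":", 1)[0]
        try:
            if ipaddress.ip_address(host).is_loopback:
                problems.append(f"{backend} is a loopback address")
        except ValueError:
            # Not an IP literal; treat the literal name 'localhost' as loopback.
            if host == "localhost":
                problems.append(f"{backend} uses 'localhost' as its hostname")
    distinct = len(set(backends))
    if distinct < count:
        problems.append(
            f"{count} live backends reported but only {distinct} distinct address(es) listed"
        )
    return problems


if __name__ == "__main__":
    for problem in diagnose(BAD_SAMPLE):
        print(problem)
```

Run against the output from the original message, this reports both the loopback address and the mismatch between three live backends and one listed address, which matches the diagnosis above.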


2. My impalad logs contain "Unknown disk id.  This will negatively affect performance.  Check your hdfs settings to enable block location metadata." and my http://<hostname>:<port>/varz doesn't contain the string "dfs.datanode.hdfs-blocks-metadata.enabled". But my hdfs-site.xml sets dfs.datanode.hdfs-blocks-metadata.enabled to true. Why?

What version of CDH are you using?
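For double-checking the file itself, here is a minimal sketch that reads a property out of Hadoop-style configuration XML. It assumes the standard <configuration>/<property>/<name>/<value> layout; the helper name read_hadoop_property and the sample XML are illustrative. Note that this only confirms what hdfs-site.xml says, not that impalad actually loaded that file.

```python
import xml.etree.ElementTree as ET

# Illustrative hdfs-site.xml content; a real file will have more properties.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
  </property>
</configuration>"""


def read_hadoop_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


if __name__ == "__main__":
    key = "dfs.datanode.hdfs-blocks-metadata.enabled"
    print(key, "=", read_hadoop_property(SAMPLE_HDFS_SITE, key))
```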


3. My impalad.out doesn't contain "Unable to load native-hadoop library". This is good, I believe.

4. My impalad logs contain the following lines matching the word "scheduler", but none contains "locality percentage". Why?


The locality percentage is printed only for GLOG_v=1 - and I note that the setup-impala.sh script has a typo where it has GVLOG_v=1. If you fix this, you should see the locality percentage.
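A misspelled variable name like this fails silently, so a quick check of the environment can help. The sketch below flags variable names that are close to, but not exactly, GLOG_v. The helper name find_glog_typos and the 0.8 similarity cutoff are assumptions chosen for illustration; it inspects only the environment mapping passed to it, not the setup-impala.sh script itself.

```python
import difflib
import os

EXPECTED = "GLOG_v"


def find_glog_typos(env):
    """Return variable names that resemble GLOG_v but are not spelled exactly that way."""
    return [
        name
        for name in env
        if name != EXPECTED
        and difflib.get_close_matches(name, [EXPECTED], n=1, cutoff=0.8)
    ]


if __name__ == "__main__":
    suspects = find_glog_typos(os.environ)
    if suspects:
        print("Possible misspellings of GLOG_v:", ", ".join(suspects))
    elif EXPECTED not in os.environ:
        print("GLOG_v is not set in this environment")
```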

Hope this helps - let us know if things improve.

Henry


/tmp/impalad.INFO:I0122 00:19:09.137197  5121 simple-scheduler.cc:82] Starting simple scheduler
/tmp/impalad.ip-10-170-17-154.impala.log.INFO.20130122-001901.5121:I0122 00:19:09.137197  5121 simple-scheduler.cc:82] Starting simple scheduler

Thanks.
Steven


--
Henry Robinson
Software Engineer
Cloudera
415-994-6679