Posted to user@spark.apache.org by Samay <sm...@gmail.com> on 2014/09/22 19:25:00 UTC

SparkSQL: Key not valid while running TPC-H

Hi,

I am trying to run TPC-H queries with the SparkSQL 1.1.0 CLI on EC2, using
1 r3.4xlarge master and 20 r3.4xlarge slave machines (each machine has
16 vCPUs and 122 GB of memory). The TPC-H scale factor I am using is 1000
(i.e. 1000 GB of total data).

When I run TPC-H query 3, i.e.
select
l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate,
o_shippriority
from
customer c join orders o
on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey
join lineitem l
on l.l_orderkey = o.o_orderkey
where
o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;

I get the following output on the master node after a very long pause:

14/09/22 16:55:57 INFO scheduler.TaskSetManager: Finished task 197.0 in stage 17.0 (TID 23821) in 346 ms on ip-10-45-25-51.ec2.internal (239/320)
14/09/22 16:55:57 INFO scheduler.TaskSetManager: Finished task 235.0 in stage 17.0 (TID 23859) in 343 ms on ip-10-45-25-51.ec2.internal (240/320)
14/09/22 16:59:28 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-10-35-182-185.ec2.internal,35198)
14/09/22 16:59:28 INFO network.ConnectionManager: Key not valid ? sun.nio.KEY
14/09/22 16:59:28 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-10-35-182-185.ec2.internal,35198)
14/09/22 16:59:28 INFO network.ConnectionManager: key already cancelled ? sun.nio.KEY
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)

After that, the executors start getting removed. Any ideas as to why this
might be occurring? Any help would be appreciated.

*Notes that might be helpful:*

I noticed that there is always a very long pause (250-300 seconds) after 240
reduce tasks have finished. Sometimes the error appears only after 245 or 250
reduce tasks, but the pause always comes after 240 reduce tasks.

I could not see any relevant information in the worker node logs. These were
the last lines:
INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 4 ms
INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 4 ms
INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 5 ms
INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 19 remote fetches in 5 ms

Relevant configuration information:
I am using cached, compressed tables, i.e. I have set
spark.sql.inMemoryColumnarStorage.compressed=true and then run the cache
table command for all the tables.

The other configuration parameters I have set are as follows:
spark.executor.memory 117760m
spark.executor.extraLibraryPath /root/ephemeral-hdfs/lib/native/
spark.executor.extraClassPath /root/ephemeral-hdfs/conf
spark.worker.timeout 600
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.storage.memoryFraction 0.6
spark.storage.blockManagerSlaveTimeoutMs 100000
spark.shuffle.memoryFraction 0.3
spark.shuffle.consolidateFiles true
spark.shuffle.file.buffer.kb 512
spark.akka.timeout 600
spark.akka.framesize 512
spark.akka.threads 8
spark.core.connection.ack.wait.timeout 600
spark.spark.sql.shuffle.partitions 320
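
For reference, here is a rough sketch of the memory and core budget these
settings imply. It assumes one executor per slave and applies each fraction
directly to spark.executor.memory, ignoring Spark's internal safety margins
and JVM overhead, so the figures are upper bounds rather than measurements.
The function name memory_budget is only illustrative.

```python
# Rough budget implied by the settings above (a sketch, not measured values).
# Assumptions: one executor per slave; fractions applied directly to
# spark.executor.memory; safety margins and JVM overhead ignored.

EXECUTOR_MEMORY_MB = 117760   # spark.executor.memory
STORAGE_FRACTION = 0.6        # spark.storage.memoryFraction
SHUFFLE_FRACTION = 0.3        # spark.shuffle.memoryFraction
NUM_SLAVES = 20               # r3.4xlarge slave machines
VCPUS_PER_SLAVE = 16          # vCPUs per machine
SHUFFLE_PARTITIONS = 320      # number of reduce tasks seen in the log
DATASET_GB = 1000             # TPC-H scale factor 1000


def memory_budget():
    """Return per-executor and cluster-wide upper bounds in GB, plus cores."""
    storage_mb = EXECUTOR_MEMORY_MB * STORAGE_FRACTION
    shuffle_mb = EXECUTOR_MEMORY_MB * SHUFFLE_FRACTION
    return {
        "storage_gb_per_executor": storage_mb / 1024,
        "shuffle_gb_per_executor": shuffle_mb / 1024,
        "cluster_storage_gb": storage_mb * NUM_SLAVES / 1024,
        "total_cores": NUM_SLAVES * VCPUS_PER_SLAVE,
    }


if __name__ == "__main__":
    budget = memory_budget()
    for name, value in budget.items():
        print(f"{name}: {value}")
    print(f"raw dataset size (GB): {DATASET_GB}")
    # With as many cores as shuffle partitions, the reduce stage could in
    # principle run as a single wave of tasks.
    print("single wave possible:", budget["total_cores"] >= SHUFFLE_PARTITIONS)
```

Under these assumptions the cache budget is an upper bound to compare with
the 1000 GB raw data size; the actual size of the compressed cached tables
is not something I have measured.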

Regards,
Samay


