
Posted to user@spark.apache.org by Samay <sm...@gmail.com> on 2014/09/23 18:33:14 UTC

SparkSQL: Freezing while running TPC-H query 5

Hi,

I am trying to run TPC-H queries with SparkSQL 1.1.0 CLI with 1 r3.4xlarge
master + 20 r3.4xlarge slave machines on EC2 (each machine has 16vCPUs,
122GB memory). The TPC-H scale factor I am using is 1000 (i.e. 1000GB of
total data). 

When I try to run TPC-H query 5, the query hangs for a long time mid-query.
I've increased several timeouts to large values such as 600 seconds in order
to prevent block manager and connection ACK timeouts. The CPU stays busy even
during the long pauses (several cores, not just one).

Query:
select
  n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from
  customer c join
  ( select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey from
    orders o join
    ( select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey from
      lineitem l join
      ( select n_name, s_suppkey, s_nationkey from supplier s join
        ( select n_name, n_nationkey
          from nation n join region r
          on n.n_regionkey = r.r_regionkey and r.r_name = 'ASIA'
        ) n1 on s.s_nationkey = n1.n_nationkey
      ) s1 on l.l_suppkey = s1.s_suppkey
    ) l1 on l1.l_orderkey = o.o_orderkey and o.o_orderdate >= '1994-01-01'
      and o.o_orderdate < '1995-01-01'
  ) o1
  on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;

Below is an excerpt of the errors in a worker node's log after the timeout.

14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 5 non-empty blocks out of 320 blocks
14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 5 remote fetches in 1 ms
14/09/23 14:32:12 WARN executor.Executor: Told to re-register on heartbeat
14/09/23 14:32:50 INFO storage.BlockManager: BlockManager re-registering with master
14/09/23 14:32:50 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/09/23 14:32:50 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/23 14:32:50 WARN network.ConnectionManager: Could not find reference for received ack Message 338974
14/09/23 14:32:50 INFO storage.BlockManager: Reporting 507 blocks to the master.
14/09/23 14:32:50 ERROR storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(ip-10-45-47-24.ec2.internal,49905)
java.io.IOException: sendMessageReliably failed because ack was not received within 600 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
14/09/23 14:33:06 ERROR storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(ip-10-239-184-234.ec2.internal,50538)
java.io.IOException: sendMessageReliably failed because ack was not received within 600 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
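
The pause itself shows up in the excerpt as a jump in the timestamps, from
14:21:25 to 14:32:12. Below is a minimal Python sketch for flagging such gaps
in a worker log. It assumes every log entry starts with a "yy/MM/dd HH:mm:ss"
timestamp, as in the excerpt; the name find_gaps and the 60-second threshold
are illustrative choices, not part of Spark. The sample lines are taken from
the excerpt with the mail-wrapped lines joined.

```python
import re
from datetime import datetime

# Assumption: each log entry begins with a "yy/MM/dd HH:mm:ss" timestamp,
# matching the format seen in the worker log excerpt.
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def find_gaps(lines, min_gap_seconds=60):
    """Return (gap_seconds, previous_entry, next_entry) for each silent stretch."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            # Stack-trace and continuation lines carry no timestamp; skip them.
            continue
        current = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
        if prev_time is not None:
            delta = (current - prev_time).total_seconds()
            if delta >= min_gap_seconds:
                gaps.append((int(delta), prev_line, line.rstrip()))
        prev_time, prev_line = current, line.rstrip()
    return gaps


sample = [
    "14/09/23 14:21:25 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 5 remote fetches in 1 ms",
    "14/09/23 14:32:12 WARN executor.Executor: Told to re-register on heartbeat",
    "14/09/23 14:32:50 INFO storage.BlockManager: BlockManager re-registering with master",
]

for seconds, before, after in find_gaps(sample):
    print(f"{seconds}s of silence before: {after}")
```

On these three lines it reports one gap of 647 seconds, ending at the
"Told to re-register on heartbeat" warning. Running it over the full logs of
several workers would show whether they all stall over the same interval.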

I have also attached a file listing the configuration parameters I am using.

Does anybody have any idea why there is such a long pause? Also, are there any
parameters I can tune to reduce it?

I am seeing similar behaviour on several other queries where there are long
pauses of 200-300s before the query starts making progress on the master.
Some of the queries complete while the others do not. Any help would be
appreciated.

Regards,
Samay

spark-defaults.conf
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n14902/spark-defaults.conf>  



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Freezing-while-running-TPC-H-query-5-tp14902.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SparkSQL: Freezing while running TPC-H query 5

Posted by Samay <sm...@gmail.com>.

Hey Dan,

Thanks for your reply. I have a couple of questions.

1) Were you able to verify that this is caused by GC? If so, could you let me
know how?

2) If this is GC, then do you know of any tuning I can do to reduce this GC
pause?

Regards,
Samay

On Tue, Sep 23, 2014 at 11:15 PM, Dan Dietterich [via Apache Spark User
List] <ml...@n3.nabble.com> wrote:

> I have been seeing the same behavior when running large queries. My
> current theory is that the pauses are related to Java garbage collection.
