Posted to user@spark.apache.org by Jim Carroll <ji...@gmail.com> on 2014/09/08 23:58:30 UTC

Querying a parquet file in s3 with an ec2 install

Hello all,

I've been wrestling with this problem all day and any suggestions would be
greatly appreciated.

I'm trying to test reading a Parquet file stored in S3 using a Spark
cluster deployed on EC2. The following works in the spark-shell when run
completely locally on my own machine (i.e., no --master option passed to the
spark-shell command):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings parquetFile, sql, and the implicit conversions into scope
val p = parquetFile("s3n://[bucket]/path-to-parquet-dir/")  // a SchemaRDD over the parquet data
p.registerAsTable("s")  // register it so it can be referenced from SQL
sql("select count(*) from s").collect

I have an EC2 deployment of Spark (tried versions 1.0.2 and 1.1.0-rc4) using
the standalone cluster manager, deployed with the spark-ec2 script.

Running the same code in a spark-shell connected to the cluster, it basically
hangs on the select statement. The workers/slaves simply time out and
restart every 30 seconds when they hit what appears to be an activity
timeout, as if there were no activity from the spark-shell. (Based on what I
see in the stderr logs for the job, I assume this is the expected behavior
when connected from a spark-shell that's sitting idle.)

I see these messages about every 30 seconds:

14/09/08 17:43:08 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated:
app-20140908213842-0002/7 is now EXITED (Command exited with code 1)
14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Executor
app-20140908213842-0002/7 removed: Command exited with code 1
14/09/08 17:43:09 INFO AppClient$ClientActor: Executor added:
app-20140908213842-0002/8 on
worker-20140908183422-ip-10-60-107-194.ec2.internal-53445
(ip-10-60-107-194.ec2.internal:53445) with 2 cores
14/09/08 17:43:09 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140908213842-0002/8 on hostPort ip-10-60-107-194.ec2.internal:53445
with 2 cores, 4.0 GB RAM
14/09/08 17:43:09 INFO AppClient$ClientActor: Executor updated:
app-20140908213842-0002/8 is now RUNNING

Eventually it fails with:

14/09/08 17:44:16 INFO AppClient$ClientActor: Executor updated:
app-20140908213842-0002/9 is now EXITED (Command exited with code 1)
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Executor
app-20140908213842-0002/9 removed: Command exited with code 1
14/09/08 17:44:16 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: Master removed our application: FAILED
14/09/08 17:44:16 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool 
14/09/08 17:44:16 INFO TaskSchedulerImpl: Cancelling stage 1
14/09/08 17:44:16 INFO DAGScheduler: Failed to run collect at
SparkPlan.scala:85
14/09/08 17:44:16 INFO SparkUI: Stopped Spark web UI at
http://192.168.10.198:4040
14/09/08 17:44:16 INFO DAGScheduler: Stopping DAGScheduler
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Shutting down all
executors
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
14/09/08 17:44:16 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

As for the "Initial job has not accepted any resources" warning, I'm running
the spark-shell command with:

SPARK_MEM=2g ./spark-shell --master spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077

According to the master web page each node has 6 GB, so I'm not sure why I'm
seeing that message either. If I run with less than 2g I get the following
in my spark-shell:

14/09/08 17:47:38 INFO Remoting: Remoting shut down
14/09/08 17:47:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
java.io.IOException: Error reading summaries
	at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:128)
        ....
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)

I'm not sure if this exception is from the spark-shell JVM or transferred
over from the master or from a worker through the master.
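
If that summary read is happening in the spark-shell's driver JVM (which is
what the stack trace suggests), presumably the fix is to give the driver
itself more heap rather than the executors. A sketch, assuming spark-shell
accepts the same memory flag as spark-submit:

./spark-shell --master spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077 --driver-memory 4g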

Any help would be greatly appreciated.

Thanks
Jim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Querying-a-parquet-file-in-s3-with-an-ec2-install-tp13737.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Querying a parquet file in s3 with an ec2 install

Posted by Ian O'Connell <ia...@ianoconnell.com>.
Mmm, how many days' worth of data / how deep is your data nesting?

I suspect you're running into a current issue with Parquet (a fix is in
master but I don't believe it's been released yet). It reads all the metadata
on the submitter node as part of scheduling the job. This can cause long
start times (timeouts too), and it also requires a lot of memory, hence the
OOM with lower memory. The newer version reads the metadata per file, on the
task reading that file; at least that's how the Hadoop stack is designed to
work on the mappers, and with how Spark works I expect the same improvement
there.
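
If you want to sanity-check how much metadata that is, here's a rough sketch
you can paste into the shell (plain Hadoop FileSystem API; the path is
whatever you pass to parquetFile):

import org.apache.hadoop.fs.Path
val dir = new Path("s3n://[bucket]/path-to-parquet-dir/")
val fs = dir.getFileSystem(sc.hadoopConfiguration)
// each part file has a footer that the submitter node currently fetches from S3
val numParts = fs.listStatus(dir).count(_.getPath.getName.startsWith("part-"))
println("footers to read on the submitter: " + numParts)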



On Mon, Sep 8, 2014 at 3:33 PM, Manu Mukerji <ma...@gmail.com> wrote:

> How big is the data set? Does it work when you copy it to hdfs?
>
> -Manu
>
>
> On Mon, Sep 8, 2014 at 2:58 PM, Jim Carroll <ji...@gmail.com> wrote:
>
>> [...]
>

Re: Querying a parquet file in s3 with an ec2 install

Posted by Jim Carroll <ji...@gmail.com>.
Okay,

This seems to be either a code version issue or a communication issue. It
works if I execute the spark-shell from the master node. It doesn't work if
I run it from my laptop and connect to the master node.

I had already opened the ports for the web UI (8080) and the cluster manager
(7077) on the master node; without that it fails much sooner. Do I need to
open up the ports for the workers as well?
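
My working assumption is that the executors also need to connect back to the
driver running on my laptop, and that those driver-side ports are chosen at
random unless pinned. If that's right, here's a sketch of what I'd pin in
conf/spark-defaults.conf so the firewall can be opened for them (property
names are from the 1.1 configuration docs, so treat them as an assumption
for 1.0.2):

spark.driver.port           7078
spark.fileserver.port       7079
spark.broadcast.port        7080
spark.replClassServer.port  7081
spark.blockManager.port     7082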

I used the spark-ec2 install script with --spark-version, using both 1.0.2
and then again the git hash that corresponds to 1.1.0-rc4
(2f9b2bd7844ee8393dc9c319f4fefedf95f5e460). In both cases I rebuilt from
source using the same codebase on my machine and moved the entire project
into /root/spark (since to run the spark-shell it needs to match the same
path as the install on EC2). Could I have missed something here?

Thanks.
Jim


Re: Querying a parquet file in s3 with an ec2 install

Posted by Jim Carroll <ji...@gmail.com>.
> Why I think it's the number of files: I believe all, or a large part, of
> those files are read when you run sqlContext.parquetFile(), and the time
> it would take in S3 for that to happen is a lot, so something internally
> is timing out.

I'll create the parquet files with Drill instead of Spark, which will give me
(somewhat) better control over the slice sizes, and see what happens.
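
Alternatively, I may first try compacting the existing files from the shell.
A sketch, assuming SchemaRDD's coalesce preserves the schema (as the 1.1 API
suggests) and a hypothetical target of 64 parts:

val p = sqlContext.parquetFile("s3n://[bucket]/path-to-parquet-dir/")
// rewrite ~4600 small part files as ~64 larger ones (output path is hypothetical)
p.coalesce(64, shuffle = true).saveAsParquetFile("s3n://[bucket]/path-to-parquet-dir-compacted/")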

That said, this behavior seems wrong to me. First, exiting due to inactivity
on a job seems like (perhaps?) the wrong fix to a former problem. Second,
there IS activity if it's reading the slice headers, but the job is exiting
anyway. So if this fixes the problem, the measure of "activity" seems wrong.

Ian and Manu, thanks for your help. I'll post back and let you know if that
fixes it.

Jim


Re: Querying a parquet file in s3 with an ec2 install

Posted by Jim Carroll <ji...@gmail.com>.
My apologies to the list. I replied to Manu's question and it went directly
to him rather than to the list.

In case anyone else has this issue, here is my reply along with Manu's reply
to me. This also answers Ian's question.

---------------------------------------

Hi Manu,

The dataset is 7.5 million rows and 500 columns. In parquet form it's about
1.1 GB. It was created with Spark and copied up to S3. It has about 4600
parts (which I'd also like to gain some control over). I can try a smaller
dataset; however, it works when I run it locally, even with the file out on
S3. It just takes a while.

I can try copying it to HDFS first, but that won't help longer term.

Thanks
Jim

-----------------------------------------
Manu's response:
-----------------------------------------

I am pretty sure it is due to the number of parts you have. I have a parquet
data set that is 250M rows and 924 columns, and it is ~2500 files.

I recommend creating a table in Hive with that data set and doing an insert
overwrite so you can get a data set with more manageable files.
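
Something like the following, as a sketch only (the table names are
hypothetical, and the storage/SerDe details depend on how the parquet table
is defined in your Hive setup):

-- "events" is assumed to already be a Hive table over the parquet files
CREATE TABLE events_compacted LIKE events;
INSERT OVERWRITE TABLE events_compacted SELECT * FROM events;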

Why I think it's the number of files: I believe all, or a large part, of
those files are read when you run sqlContext.parquetFile(), and the time it
would take in S3 for that to happen is a lot, so something internally is
timing out.

-Manu


Re: Querying a parquet file in s3 with an ec2 install

Posted by Manu Mukerji <ma...@gmail.com>.
How big is the data set? Does it work when you copy it to hdfs?

-Manu


On Mon, Sep 8, 2014 at 2:58 PM, Jim Carroll <ji...@gmail.com> wrote:

> [...]