Posted to user@spark.apache.org by Hassan Syed <h....@gmail.com> on 2014/02/02 16:22:48 UTC

Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Sorry if this is a repeat post to the list; as far as I can tell my previous
message never made it through.

I can't seem to get a local job to parallelise. As far as I can tell I am
doing everything correctly. I have a Stack Overflow question up concerning
the issue with my ETL functions. I hope someone can help; I have been
trying various things for hours now.

Regards

Hassan

SO Question
<http://stackoverflow.com/questions/21510508/spark-job-not-parallelising-locally-using-parquet-avro>  



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Hassan Syed <h....@gmail.com>.
OK, I've got a much larger file now: 2.5 GB uncompressed, 1.2 GB compressed
(46 blocks, 1 MB page size, using Snappy). I updated Parquet and moved to the
released Spark 0.9.0 distribution, and it is still not parallelising without
an explicit repartition.

Repartitioning the data before every processing cycle is becoming really
annoying, as it takes forever to load the file.

Any ideas of what I could try? Is it because of some incompatibility with
Parquet or Avro?

Regards

Hassan



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1306.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Hassan Syed <h....@gmail.com>.
Hmm, it seems you have to use the Hadoop API to create the Parquet file in
order to get it to parallelise locally, which is quite odd considering the
blocks are visible to Spark. Anyhow, some of my derived files are created
using that API, and all is hunky-dory now.

Lesson of the day: generate your files using a Spark context :D
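
In case it helps anyone searching later, the write path that works for me
looks roughly like the sketch below. It is typed from memory rather than
pasted from my code, and it assumes Spark 0.9, the parquet-mr 1.x package
layout (parquet.avro / parquet.hadoop) and the Avro-generated Topic class
from my forumavroschema.Topic schema, so treat the names as approximate:

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 0.9
import org.apache.spark.rdd.RDD
import parquet.avro.AvroParquetOutputFormat
import parquet.hadoop.ParquetOutputFormat
import parquet.hadoop.metadata.CompressionCodecName

// Writes an RDD of Avro-generated Topic records out as a Parquet data set.
def writeTopics(sc: SparkContext, topics: RDD[Topic], path: String) {
  val job = new Job(sc.hadoopConfiguration)

  // Embed the Avro schema in the Parquet footer so the file reads back as Topic.
  AvroParquetOutputFormat.setSchema(job, Topic.SCHEMA$)
  ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY)
  // Smaller row groups give more input splits, hence more partitions on read.
  ParquetOutputFormat.setBlockSize(job, 64 * 1024 * 1024)

  topics
    .map(t => (null.asInstanceOf[Void], t))  // the new Hadoop API wants key/value pairs
    .saveAsNewAPIHadoopFile(
      path,
      classOf[Void],
      classOf[Topic],
      classOf[AvroParquetOutputFormat],
      job.getConfiguration)
}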



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1338.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Hassan Syed <h....@gmail.com>.
I'm surprised that no one seems concerned or interested in this issue. Should
I be posting this on the dev mailing list?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1335.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Hassan Syed <h....@gmail.com>.
Wahey! That serialiser snippet did the trick. It is working now after the
repartition.

It still won't parallelise directly from the input, though. I can continue my
work from here, but if anyone wants me to try something to get to the bottom
of this, I am up for it.

Could this be the reason?

14/02/02 18:29:44 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

Many thanks, Frank.

Hassan





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1138.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Hassan,

I’m not sure why only a single core is being used to process 4 partitions. It shouldn’t have anything to do with not using HDFS, but that’s pure conjecture on my part.

Re: Kryo serialization, Matt Massie has a good blog post on using Parquet and Avro with Spark. Additionally, here is a link to the source a project uses to register Avro with Kryo; a rough sketch of the general idea is below.
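
The general shape is a custom Kryo serializer plus a registrator. This is a
from-memory, untested sketch (not the code from that link), written against
Spark 0.9, Kryo 2.x and Avro 1.7, with Topic standing in for the class
generated from your forumavroschema.Topic schema:

import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter}
import org.apache.spark.serializer.KryoRegistrator

// Round-trips the Avro-generated Topic record through Avro's own binary
// encoding, so Kryo never has to reflect over Avro's internal fields.
class AvroTopicSerializer extends Serializer[Topic] {
  override def write(kryo: Kryo, output: Output, record: Topic) {
    val writer = new SpecificDatumWriter[Topic](Topic.SCHEMA$)
    val bytes = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(bytes, null)
    writer.write(record, encoder)
    encoder.flush()
    output.writeInt(bytes.size())
    output.writeBytes(bytes.toByteArray)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[Topic]): Topic = {
    val length = input.readInt()
    val decoder = DecoderFactory.get().binaryDecoder(input.readBytes(length), null)
    new SpecificDatumReader[Topic](Topic.SCHEMA$).read(null.asInstanceOf[Topic], decoder)
  }
}

class TopicKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Topic], new AvroTopicSerializer)
  }
}

// Then point Spark at it (Spark 0.9 property names):
//   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   conf.set("spark.kryo.registrator", "TopicKryoRegistrator")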

Regards,
 
Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Feb 2, 2014, at 10:01 AM, Hassan Syed <h....@gmail.com> wrote:

> Many thanks for replying.
> 
> Note I am not running HDFS on my laptop, but I am using the local
> filesystem.
> 
> I am seeing this in the console output:
> 
> 14/02/02 17:47:41 INFO rdd.NewHadoopRDD: Input split:
> ParquetInputSplit{part:
> file:///Users/hassan/code/scala/avro/forum_dataset.parq start: 0 length:
> 1023817737 hosts: [localhost] blocks: 4 requestedSchema: same as file
> fileSchema: message forumavroschema.Topic 
> 
> So I guess there are indeed 4 partitions. I am seeing only a single core
> being used, and only the driver shows up in the web console as an executor.
> Is this because I am not using HDFS?
> 
> I had a hunch that the block size was not being picked up for some reason,
> so I tried repartition(16) on the input RDD, and from the spew on the console
> it seems that after the repartition the work is at least being delegated.
> However, I do not think Kryo can serialise Avro objects without me writing
> some serialization methods :( as the job now produces no output.
> 
> How do you advise I proceed? Should I continue using Avro/Parquet or switch
> to something else? And do I need to set up HDFS on my laptop?
> 
> Kind Regards
> 
> Hassan
> 
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1135.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Hassan Syed <h....@gmail.com>.
Many thanks for replying.

Note I am not running HDFS on my laptop, but I am using the local
filesystem.

I am seeing this in the console output:

14/02/02 17:47:41 INFO rdd.NewHadoopRDD: Input split:
ParquetInputSplit{part:
file:///Users/hassan/code/scala/avro/forum_dataset.parq start: 0 length:
1023817737 hosts: [localhost] blocks: 4 requestedSchema: same as file
fileSchema: message forumavroschema.Topic 

So I guess there are indeed 4 partitions. I am seeing only a single core
being used, and only the driver shows up in the web console as an executor.
Is this because I am not using HDFS?

I had a hunch that the block size was not being picked up for some reason,
so I tried repartition(16) on the input RDD, and from the spew on the console
it seems that after the repartition the work is at least being delegated.
However, I do not think Kryo can serialise Avro objects without me writing
some serialization methods :( as the job now produces no output.

How do you advise I proceed? Should I continue using Avro/Parquet or switch
to something else? And do I need to set up HDFS on my laptop?
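
For reference, the load side is roughly the following, give or take exact
class names (typed from memory; it assumes the AvroReadSupport route from
parquet-avro, with Topic being the class generated from forumavroschema.Topic):

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

def loadTopics(sc: SparkContext, path: String) = {
  val job = new Job(sc.hadoopConfiguration)
  // Materialise Parquet rows as Avro specific records.
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])

  val pairs = sc.newAPIHadoopFile(
    path,
    classOf[ParquetInputFormat[Topic]],
    classOf[Void],
    classOf[Topic],
    job.getConfiguration)

  // Keys are Void, so only the records matter. The repartition is the
  // workaround that spreads the work over more than one core, but it shuffles
  // Avro objects, which is where serialization becomes an issue.
  pairs.map(_._2).repartition(16)
}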

Kind Regards

Hassan





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1135.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Hassan,

How many of the cores is it using? IIRC, at default settings, Parquet partitions a 700 MB file into 3 chunks, so we would expect a 1 GB Parquet file to be split into 4 partitions and therefore to use 4 cores. You can check how many partitions Parquet wrote your file in by doing an ls inside the Parquet file.

If the processing is a bottleneck and you don’t mind incurring a shuffle, then after you load the data from disk you could do a coalesce on the RDD. In your case you would want numPartitions to be at least the number of cores in your system (I think the usual recommendation is 2-3 partitions per core for load balancing?), and shuffle=true. Otherwise, I’d change the Parquet settings you create the file with so the file is written in more partitions.
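
Concretely (untested, and with topics standing in for whatever RDD you get
back after loading the Parquet file), something like:

// Spread the few input partitions across all local cores; shuffle = true
// forces a full redistribution rather than only merging existing partitions.
val cores = Runtime.getRuntime.availableProcessors
val balanced = topics.coalesce(numPartitions = 3 * cores, shuffle = true)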

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Feb 2, 2014, at 8:47 AM, Hassan Syed <h....@gmail.com> wrote:

> I know it is Sunday, but I would be eternally grateful if someone could
> help me sort out this issue. If I can't get Spark working soon I am going to
> have to do this processing on my laptop, and I'd have to write a resumable
> batch operation using a database to maintain state.
> 
> Even a quick list of the top things to try would help.
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1132.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Can't get a local job to parallelise (using 0.9.0 from git with parquet and avro)

Posted by Hassan Syed <h....@gmail.com>.
I know it is Sunday, but I would be eternally grateful if someone could
help me sort out this issue. If I can't get Spark working soon I am going to
have to do this processing on my laptop, and I'd have to write a resumable
batch operation using a database to maintain state.

Even a quick list of the top things to try would help.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1132.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.