Posted to dev@spark.apache.org by Karlson <ks...@siberie.de> on 2015/03/20 20:09:07 UTC

Storage of RDDs created via sc.parallelize

Hi all,

Where is the data that is passed to sc.parallelize stored? Or, put 
differently, when the DAG is executed, where is the data for the base 
RDD fetched from if the base RDD was constructed via sc.parallelize?

I am reading a CSV file via the Python csv module and feeding the 
parsed data chunkwise to sc.parallelize, because the whole file would 
not fit into memory on the driver. Reading the file with sc.textFile 
first is not an option, as there might be line breaks inside the CSV 
fields, which prevents me from parsing the file line by line.

The problem I am facing right now is that even though I am feeding only 
one chunk at a time to Spark, I will eventually run out of memory on the 
driver.
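
For reference, what I am doing currently looks roughly like this (a 
simplified sketch; the real chunk size and parsing are different):

    import csv

    CHUNK_SIZE = 10000  # rows per chunk, illustrative value only

    def read_chunks(path, size=CHUNK_SIZE):
        """Yield lists of parsed CSV rows, 'size' rows at a time."""
        with open(path) as f:
            chunk = []
            for row in csv.reader(f):
                chunk.append(row)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:
                yield chunk

    # Each chunk becomes its own RDD; sc.union combines them into one
    # base RDD over all chunks.
    rdds = [sc.parallelize(chunk) for chunk in read_chunks("data.csv")]
    data = sc.union(rdds)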

Thanks in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Storage of RDDs created via sc.parallelize

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You can use sc.newAPIHadoopFile
<http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.SparkContext>
with CSVInputFormat <https://github.com/mvallebr/CSVInputFormat> so that
the CSV file is read properly, even with line breaks inside fields.
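
From PySpark it would look something like this (a rough sketch only; I 
have not verified the exact InputFormat and key/value class names that 
project ships, so treat them as placeholders, and the project's jar has 
to be on the classpath, e.g. via --jars):

    # Sketch: the InputFormat/key/value class names are assumptions based
    # on the CSVInputFormat project; check its README for the real ones.
    rdd = sc.newAPIHadoopFile(
        "hdfs:///path/to/file.csv",
        inputFormatClass="org.apache.hadoop.mapreduce.lib.input.CSVTextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
    )
    # Records arrive as (key, value) pairs; keep only the values.
    rows = rdd.map(lambda kv: kv[1])

The point is that the InputFormat decides the record boundaries instead 
of a plain line reader, so line breaks inside quoted fields stay within 
a single record.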

Thanks
Best Regards
