Posted to dev@spark.apache.org by Karlson <ks...@siberie.de> on 2015/03/20 20:09:07 UTC
Storage of RDDs created via sc.parallelize
Hi all,
where is the data stored that is passed to sc.parallelize? Or, put
differently, when the DAG is executed, where is the data for the base
RDD fetched from if the base RDD was constructed via sc.parallelize?
I am reading a CSV file via the Python csv module and feeding the
parsed data chunkwise to sc.parallelize, because the whole file would
not fit into memory on the driver. Reading the file with sc.textFile
first is not an option, as there might be line breaks inside the CSV
fields, preventing me from parsing the file line by line.
The problem I am facing right now is that even though I am feeding only
one chunk at a time to Spark, I will eventually run out of memory on the
driver.
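For reference, the chunked-feeding pattern described above can be sketched as
follows (the file name, chunk size, and the commented Spark calls are
placeholders; the Spark part assumes an existing SparkContext `sc`):

```python
import csv
from itertools import islice

def read_csv_in_chunks(path, chunk_size=10000):
    """Yield lists of parsed CSV rows, chunk_size rows at a time.

    The csv module handles line breaks inside quoted fields, which
    naive line-by-line splitting (as sc.textFile does) would not.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

# Feeding each chunk to Spark (sketch; requires a running SparkContext):
# rdds = [sc.parallelize(chunk) for chunk in read_csv_in_chunks("data.csv")]
# rdd = sc.union(rdds)
```

Note that each sc.parallelize call still materializes its chunk on the driver
before shipping it to the executors, which is consistent with the memory
growth described above.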
Thanks in advance!
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: Storage of RDDs created via sc.parallelize
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You can use sc.newAPIHadoopFile
<http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.SparkContext>
with CSVInputFormat <https://github.com/mvallebr/CSVInputFormat>, which
will read the CSV file properly.
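A minimal PySpark sketch of this suggestion (the InputFormat and Writable
class names below are assumptions, not verified against the linked project;
check its README for the exact names, and make sure the project's jar is on
the Spark classpath, e.g. via --jars):

```python
def load_csv_rdd(sc, path):
    """Load a CSV file via a CSV-aware Hadoop InputFormat, so that
    records are split on CSV row boundaries rather than raw newlines.

    Sketch only: class names are assumptions about the linked
    CSVInputFormat project and must be checked against its README.
    """
    return sc.newAPIHadoopFile(
        path,
        inputFormatClass="org.apache.hadoop.mapreduce.lib.input.CSVTextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
    )
```

Unlike the parallelize approach, the file is then read by the executors
directly, so the driver never has to hold the data.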
Thanks
Best Regards
On Sat, Mar 21, 2015 at 12:39 AM, Karlson <ks...@siberie.de> wrote: