Posted to user@spark.apache.org by ayan guha <gu...@gmail.com> on 2018/10/09 06:20:45 UTC

SparkR issue

Hi

We are seeing some weird behaviour in SparkR.

We created an R data frame with 600K records and 29 columns. Then we tried to
convert the R DF to a Spark DF using

df <- SparkR::createDataFrame(rdf)

from RStudio. It hung, and we had to kill the process after 1-2 hours.

We also tried the following:
df <- SparkR::createDataFrame(rdf, numPartition=4000)
df <- SparkR::createDataFrame(rdf, numPartition=300)
df <- SparkR::createDataFrame(rdf, numPartition=10)

Same result. In both scenarios RStudio appears to be working, but there is no
trace of any job in the Spark Application Master view.

Finally, we used this:

df <- SparkR::createDataFrame(rdf, schema=schema)

where schema is a StructType.

This took 25 mins to create the Spark DF. However, the job did show up in the
Application Master view, and it shows only 20-30 secs. So where did the rest
of the time go?

Questions:
1. Is this expected behaviour? (I hope not.) How can we speed up this step?
2. We understand that a better option would be to read the data from an
external source, but we need this data to be generated for simulation
purposes. What could be going wrong?


Best
Ayan



-- 
Best Regards,
Ayan Guha

Re: SparkR issue

Posted by Felix Cheung <fe...@hotmail.com>.
1. It seems like it's spending a lot of time in R (slicing the data, I guess?) and not in Spark.
2. Could you write it to a CSV file locally and then read it from Spark?
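
A rough sketch of the CSV route in point 2 might look like the following. The helper names write_local_csv and read_csv_into_spark and the /tmp/rdf.csv path are made up for illustration, and it assumes a SparkR session is already running and that the file lives somewhere Spark can read it (on a cluster, a driver-local file may first need to be copied to HDFS or other shared storage).

```r
# Hypothetical helper: dump an R data.frame to a CSV file on local disk.
write_local_csv <- function(rdf, path) {
  # row.names = FALSE keeps R's row index out of the file
  utils::write.csv(rdf, file = path, row.names = FALSE)
  invisible(path)
}

# Hypothetical helper: load that CSV as a SparkDataFrame.
# Assumes an active SparkR session and a path that Spark can read.
read_csv_into_spark <- function(path) {
  SparkR::read.df(path, source = "csv", header = "true", inferSchema = "true")
}

# Example usage (needs a running Spark session; the path is a placeholder):
# SparkR::sparkR.session()
# path <- write_local_csv(rdf, "/tmp/rdf.csv")
# df <- read_csv_into_spark(path)
```

Wrapping each call in system.time() should also show whether the time is going into the R side or the Spark side. Note that inferSchema makes Spark scan the file an extra time, so passing an explicit schema to read.df may be preferable for larger files.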

