Posted to user@spark.apache.org by ramnavan <hi...@gmail.com> on 2017/05/25 05:28:43 UTC

Questions regarding Jobs, Stages and Caching

Hi,
 
I’m new to Spark and trying to understand its inner workings in the
scenarios described below. I’m using PySpark with Spark 2.1.1.
 
spark.read.json():
 
I am executing this line
“spark.read.json(‘s3a://<bucket-name>/*.json’)” on a cluster with three
worker nodes (AWS M4.xlarge instances). The bucket has about 19949 JSON
files with a total size of about 4.4 GB. The line created three Spark jobs:
the first with 10000 tasks, the second with 19949 tasks, and the third with
10000 tasks. Each job has one stage. Please refer to the attached images
job0.jpg, job1.jpg and job2.jpg:
job0.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg>
job1.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg>
job2.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg>
I was expecting it to create one job with 19949 tasks. I’d like to understand
why there are three jobs instead of just one, and why reading JSON files
calls for a map operation.
 
Caching and Count():
 
Once Spark reads the 19949 JSON files into a DataFrame (let’s call it
files_df), I call files_df.createOrReplaceTempView("files") and
files_df.cache(). I expect files_df.cache() to cache the entire DataFrame in
memory so that any subsequent operation will be faster. My next statement is
files_df.count(). This operation took 8.8 minutes, and it looks like it read
the files again from S3 to calculate the count. Please refer to the attached
count.jpg for reference:
count.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>
Why is this happening? If I call files_df.count() a second time, it comes
back within a few seconds. Can someone explain this?
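
For reference, here is a minimal sketch of the sequence I am running (the
bucket path is a placeholder):

    # `spark` is the SparkSession from the PySpark shell
    files_df = spark.read.json('s3a://<bucket-name>/*.json')  # ~19949 files, ~4.4 GB
    files_df.createOrReplaceTempView("files")
    files_df.cache()   # expecting this to cache the whole DataFrame in memory
    files_df.count()   # takes ~8.8 minutes and appears to re-read the files from S3
    files_df.count()   # comes back within a few seconds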
 
In general, I am looking for a good source to learn about Spark internals
and understand what’s happening under the hood.
 
Thanks in advance!
 
Ram





Re: Questions regarding Jobs, Stages and Caching

Posted by Ram Navan <hi...@gmail.com>.
Thank you, Steffen and Nicholas.


I passed an explicit schema to spark.read.json(), and the time to execute
this instruction dropped from the original 8 minutes to 4 minutes! I also
see only two jobs created (instead of three when no schema is given).
Please refer to attachments job0 and job2 from the first message in the
thread.
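
For reference, the change looks roughly like this (the field names below are
illustrative; my real schema matches the JSON files):

    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Illustrative fields only -- replace with the actual ones from the JSON files
    schema = StructType([
        StructField("id", StringType()),
        StructField("timestamp", LongType()),
        # ... remaining fields ...
    ])

    files_df = spark.read.json('s3a://<bucket-name>/*.json', schema=schema)
    # equivalently: spark.read.schema(schema).json('s3a://<bucket-name>/*.json')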

Now why are there two jobs to execute spark.read.json()? Shouldn't there be
only one? Each job has one stage with 10000 tasks. The DAG visualization of
the stage looks like Parallelize --> mapPartitions --> map.

Thanks
Ram



-- 
Ram

Re: Questions regarding Jobs, Stages and Caching

Posted by Nicholas Hakobian <ni...@rallyhealth.com>.
If you do not specify a schema, then the json() function will attempt to
determine the schema, which requires a full scan of the file. Any
subsequent actions will again have to read in the data. See the
documentation at:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

"If the schema parameter is not specified, this function goes through the
input once to determine the input schema."
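
One way to avoid the extra pass over the data is to infer the schema once
from a single representative file and reuse it for the full read (a sketch;
paths are placeholders):

    # Infer the schema from one small, representative file
    sample_schema = spark.read.json('s3a://<bucket-name>/one-file.json').schema

    # Reuse it for the full read, which then skips schema inference
    files_df = spark.read.schema(sample_schema).json('s3a://<bucket-name>/*.json')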


Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com



Re: Questions regarding Jobs, Stages and Caching

Posted by Steffen Schmitz <st...@hotmail.de>.
Hi Ram,

spark.read.json() should be evaluated on the first call of .count(). The data should then be read into memory once and the rows counted. After this operation it will be in memory and access will be faster.
If you add println statements between your function calls, you should see that Spark starts to work only after the call to count().
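
For example, in PySpark (a rough sketch; the path is a placeholder):

    print("before read")
    files_df = spark.read.json('s3a://<bucket-name>/*.json')
    print("after read")
    files_df.cache()
    print("after cache")
    files_df.count()
    print("after count")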

Regards,
Steffen



Re: Questions regarding Jobs, Stages and Caching

Posted by Ram Navan <hi...@gmail.com>.
Hi Steffen,

Thanks for your response.

Isn't spark.read.json() an action? It reads the files from the source
directory, infers the schema and creates a DataFrame, right?
dataframe.cache() prints out this schema as well. I am not sure why
dataframe.count() would do the same thing again (reading the files from the
source). spark.read.json() and count() each took about 8 minutes in my
scenario. I'd expect only one of these actions to incur the expense of
reading 19949 files from S3. Am I missing anything?

Thank you!

Ram




-- 
Ram

Re: Questions regarding Jobs, Stages and Caching

Posted by Steffen Schmitz <st...@hotmail.de>.
Hi Ram,

Regarding your caching question:
The DataFrame is evaluated lazily. That means it isn’t cached directly when .cache() is invoked, but when the first action is called on it (in your case count()).
It is loaded into memory and the rows are counted then, not on the call to .cache().
On the second call to count() it is already cached in memory, and that’s why it’s faster.
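
A minimal way to see this, assuming the files_df DataFrame from your mail
(timings are illustrative):

    import time

    files_df.cache()   # lazy: nothing is read or cached yet

    start = time.time()
    files_df.count()   # triggers the actual read and fills the cache
    print("first count took", time.time() - start, "seconds")

    start = time.time()
    files_df.count()   # served from the in-memory cache
    print("second count took", time.time() - start, "seconds")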

I do not know if it’s allowed to recommend resources here, but I really liked the Big Data Analysis with Spark Course by Heather Miller on Coursera.
And the Spark documentation is also a good place to start.

Regards,
Steffen
