Posted to dev@spark.apache.org by Jerry Lam <ch...@gmail.com> on 2015/12/20 23:59:48 UTC

[Spark SQL] SQLContext getOrCreate incorrect behaviour

Hi Spark developers,

I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
correctly when a different SparkContext is provided.

```
val sc = new SparkContext
val sqlContext = SQLContext.getOrCreate(sc)   // creates and caches a SQLContext bound to sc
sc.stop()
...

val sc2 = new SparkContext
val sqlContext2 = SQLContext.getOrCreate(sc2) // returns the cached instance, still bound to the stopped sc
sc2.stop()
```

As a result, sqlContext2 references sc instead of sc2, and the program
fails because sc has been stopped.
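
To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of singleton caching that produces it. The Fake* classes and method names are illustrative stand-ins, not Spark's actual implementation:

```scala
import java.util.concurrent.atomic.AtomicReference

// Illustrative stand-ins for SparkContext/SQLContext; not the real classes.
class FakeSparkContext {
  @volatile var stopped: Boolean = false
  def stop(): Unit = { stopped = true }
}

class FakeSQLContext(val sc: FakeSparkContext)

object FakeSQLContext {
  private val instance = new AtomicReference[FakeSQLContext]()

  // Pre-fix behaviour (sketch): the first cached instance is returned
  // forever, even after its underlying context has been stopped.
  def getOrCreateBuggy(sc: FakeSparkContext): FakeSQLContext = {
    instance.compareAndSet(null, new FakeSQLContext(sc))
    instance.get()
  }

  // Post-fix behaviour (sketch): discard the cached instance once its
  // context is stopped and cache a fresh one bound to the context passed in.
  def getOrCreateFixed(sc: FakeSparkContext): FakeSQLContext = {
    val cached = instance.get()
    if (cached == null || cached.sc.stopped) {
      instance.set(new FakeSQLContext(sc))
    }
    instance.get()
  }
}
```

With getOrCreateBuggy, a second call after stopping the first context returns the instance still bound to the dead context, which reproduces the behaviour above; getOrCreateFixed returns one bound to the new context.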

Best Regards,

Jerry

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Jerry Lam <ch...@gmail.com>.
Hi Kostas,

Thank you for the references to the two tickets. They help me understand
some of the odd behaviour I have been seeing lately.

Best Regards,

Jerry

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by kostas papageorgopoylos <p0...@gmail.com>.
Hi

FYI: the following two tickets currently block (in releases up to 1.5.2)
the pattern of starting and stopping a SparkContext inside the same driver
program:

https://issues.apache.org/jira/browse/SPARK-11700 -> memory leak in
SQLContext
https://issues.apache.org/jira/browse/SPARK-11739

In an application we have built, we initially wanted to use this same
pattern (start-stop-start, etc.) in order to make better use of the Spark
cluster's resources.

I believe the fixes in the above tickets will make it safe to stop and
restart the SparkContext in the driver program as of release 1.6.0.

Kind Regards



Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Sean Owen <so...@cloudera.com>.
I think the original idea is that the life of the driver is the life
of the SparkContext: the context is stopped when the driver finishes.
Or: if for some reason the "context" dies or there's an unrecoverable
error, that's it for the driver.

(There's nothing wrong with stop(), right? You have to call that when
the driver ends to shut down Spark cleanly. It's the restarting of
another context that's at issue.)

This makes most sense in the context of a resource manager, which can
conceivably restart a driver if you like, but can't reach into your
program.

That's probably still the best way to think of it. Still, it would be
nice if SparkContext were friendlier to a restart just as a matter of
design. AFAIK it is; I'm not sure about SQLContext though. If it's not a
priority, it's just because this isn't a usual usage pattern; that
doesn't mean it's crazy, just that it's not the primary pattern.

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Jerry Lam <ch...@gmail.com>.
Hi Sean,

What if the SparkContext stops for involuntary reasons (for example, misbehaving connections)? Then we need to handle the failure programmatically by recreating the SparkContext. Is there something I don't understand or don't know about the assumptions on how a SparkContext is meant to be used? I tend to think of it as a resource manager/scheduler for Spark jobs. Are you guys planning to deprecate the stop method in Spark?
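
For what it's worth, a recovery pattern along those lines might look like the sketch below. The helper name is hypothetical, and the assumption that a dead context surfaces as an IllegalStateException (e.g. "Cannot call methods on a stopped SparkContext") is mine; restarting in the same JVM is only expected to be reliable once the fixes discussed in this thread land.

```scala
import org.apache.spark.SparkContext

// Hypothetical recovery helper (sketch): run a job, and if it fails
// because the context has died, stop it, build a fresh one, retry once.
def runWithRecovery[T](mkContext: () => SparkContext)(job: SparkContext => T): T = {
  val sc = mkContext()
  try {
    job(sc)
  } catch {
    // Assumption: a dead/stopped context surfaces as IllegalStateException.
    case _: IllegalStateException =>
      sc.stop() // stop() is safe to call on an already-stopped context
      job(mkContext())
  }
}
```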

Best Regards,

Jerry 

Sent from my iPhone

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Jerry Lam <ch...@gmail.com>.
Hi Zhan,

I'm illustrating the issue with a simple example, but it is not difficult
to imagine use cases that need this behaviour. For example, in a job
server, much like a web service, you may want to release all of Spark's
resources when it has been idle for longer than an hour. Unless you can
prevent people from stopping the SparkContext, it is reasonable to assume
that they can stop it and start it again at a later time.
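
A minimal sketch of that idle-release pattern follows; the class name and the reaping policy are illustrative, not any real Spark API, and it depends on stop-then-restart in the same JVM actually being reliable, which per this thread is only expected as of the 1.6.0 fixes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical job-server helper (sketch): create the SparkContext lazily
// and stop it after an idle window to hand resources back to the cluster.
class IdleManagedContext(conf: SparkConf, idleMillis: Long = 60L * 60 * 1000) {
  private var sc: Option[SparkContext] = None
  private var lastUsed = System.currentTimeMillis()

  // Returns a live context, recreating one if it was reaped earlier.
  def get(): SparkContext = synchronized {
    lastUsed = System.currentTimeMillis()
    sc.getOrElse {
      val ctx = new SparkContext(conf)
      sc = Some(ctx)
      ctx
    }
  }

  // Called periodically (e.g. from a scheduled task): stop the context if
  // it has been idle longer than the window, releasing its resources.
  def reapIfIdle(): Unit = synchronized {
    if (sc.isDefined && System.currentTimeMillis() - lastUsed > idleMillis) {
      sc.foreach(_.stop())
      sc = None
    }
  }
}
```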

Best Regards,

Jerry


Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Zhan Zhang <zz...@hortonworks.com>.
This looks to me like a very unusual use case: you stop the SparkContext and start another one. I don't think that is well supported. Once the SparkContext is stopped, all of its resources are supposed to be released.

Is there any compelling reason why you have to stop and restart the SparkContext?

Thanks.

Zhan Zhang

Note that when sc is stopped, all resources are released (for example, in YARN).

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Ted Yu <yu...@gmail.com>.
In Jerry's example, the first SparkContext, sc, has been stopped.

So there would be only one SparkContext running at any given moment.
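
For reference, Spark enforces this one-active-context rule itself. A sketch of what the restriction looks like in the 1.x line follows; the allowMultipleContexts setting below existed in 1.x as a discouraged escape hatch, and sequential stop-then-start (as in Jerry's example) is a separate question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Constructing a second *concurrent* SparkContext in one JVM fails fast by
// default; the setting below relaxes that check and is discouraged.
val conf = new SparkConf()
  .setAppName("demo")
  .setMaster("local[*]")
  .set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(conf)
```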

Cheers


Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by "Chester @work" <ch...@alpinenow.com>.
Jerry
    I thought you should not create more than one SparkContext within one JVM, ...
Chester

Sent from my iPhone

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

Posted by Yin Huai <yh...@databricks.com>.
Hi Jerry,

Looks like https://issues.apache.org/jira/browse/SPARK-11739 is for the
issue you described. It has been fixed in 1.6. With this change, when you
call SQLContext.getOrCreate(sc2), we will first check if sc has been
stopped. If so, we will create a new SQLContext using sc2.
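
Concretely, under the fixed behaviour described above, a sequence like the following should hold (the SparkConf setup here is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("restart-demo").setMaster("local[*]")

val sc = new SparkContext(conf)
val sqlContext = SQLContext.getOrCreate(sc) // cached, bound to sc
sc.stop()

val sc2 = new SparkContext(conf)
val sqlContext2 = SQLContext.getOrCreate(sc2)
// With the SPARK-11739 fix, the stale instance is replaced by one bound to sc2.
assert(sqlContext2.sparkContext eq sc2)
sc2.stop()
```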

Thanks,

Yin
