You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by "Mendelson, Assaf" <As...@rsa.com> on 2017/02/12 08:06:28 UTC

is dataframe thread safe?

Hi,
I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can't find any documentation as to what operations (if any) are thread safe.

Thanks,
                Assaf.

Re: is dataframe thread safe?

Posted by 任弘迪 <ry...@gmail.com>.

for my understanding, all transformations are thread-safe cause dataframe
is just a description of the calculation and it's immutable, so the case
above is all right. just be careful with the actions.

On Sun, Feb 12, 2017 at 4:06 PM, Mendelson, Assaf <As...@rsa.com>
wrote:

> Hi,
>
> I was wondering if dataframe is considered thread safe. I know the spark
> session and spark context are thread safe (and actually have tools to
> manage jobs from different threads) but the question is, can I use the same
> dataframe in both threads.
>
> The idea would be to create a dataframe in the main thread and then in two
> sub threads do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other. Checkpointing would
> cause similar issues), however, I can’t find any documentation as to what
> operations (if any) are thread safe.
>
>
>
> Thanks,
>
>                 Assaf.
>

Re: is dataframe thread safe?

Posted by Mark Hamstra <ma...@clearstorydata.com>.

If you update the data, then you don't have the same DataFrame anymore. If
you don't do like Assaf did, caching and forcing evaluation of the
DataFrame before using that DataFrame concurrently, then you'll still get
consistent and correct results, but not necessarily efficient results. If
the fully materialized, cached are not yet available when multiple
concurrent Jobs try to use the DataFrame, then you can end up with more
than one Job doing the same work to generate what needs to go in the cache.
To avoid that kind of work duplication you need some mechanism to ensure
that only one action/Job is run to populate the cache before multiple
actions/Jobs can then use the cached results efficiently.

On Mon, Feb 13, 2017 at 9:15 AM, vincent gromakowski <
vincent.gromakowski@gmail.com> wrote:

> How about having a thread that update and cache a dataframe in-memory next
> to other threads requesting this dataframe, is it thread safe ?
>
> 2017-02-13 9:02 GMT+01:00 Reynold Xin <rx...@databricks.com>:
>
>> Yes your use case should be fine. Multiple threads can transform the same
>> data frame in parallel since they create different data frames.
>>
>>
>> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf <As...@rsa.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if dataframe is considered thread safe. I know the spark
>>> session and spark context are thread safe (and actually have tools to
>>> manage jobs from different threads) but the question is, can I use the same
>>> dataframe in both threads.
>>>
>>> The idea would be to create a dataframe in the main thread and then in
>>> two sub threads do different transformations and actions on it.
>>>
>>> I understand that some things might not be thread safe (e.g. if I
>>> unpersist in one thread it would affect the other. Checkpointing would
>>> cause similar issues), however, I can’t find any documentation as to what
>>> operations (if any) are thread safe.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>                 Assaf.
>>>
>>
>

Re: is dataframe thread safe?

Posted by vincent gromakowski <vi...@gmail.com>.

How about having a thread that update and cache a dataframe in-memory next
to other threads requesting this dataframe, is it thread safe ?

2017-02-13 9:02 GMT+01:00 Reynold Xin <rx...@databricks.com>:

> Yes your use case should be fine. Multiple threads can transform the same
> data frame in parallel since they create different data frames.
>
>
> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf <As...@rsa.com>
> wrote:
>
>> Hi,
>>
>> I was wondering if dataframe is considered thread safe. I know the spark
>> session and spark context are thread safe (and actually have tools to
>> manage jobs from different threads) but the question is, can I use the same
>> dataframe in both threads.
>>
>> The idea would be to create a dataframe in the main thread and then in
>> two sub threads do different transformations and actions on it.
>>
>> I understand that some things might not be thread safe (e.g. if I
>> unpersist in one thread it would affect the other. Checkpointing would
>> cause similar issues), however, I can’t find any documentation as to what
>> operations (if any) are thread safe.
>>
>>
>>
>> Thanks,
>>
>>                 Assaf.
>>
>

Re: is dataframe thread safe?

Posted by Reynold Xin <rx...@databricks.com>.

Yes your use case should be fine. Multiple threads can transform the same
data frame in parallel since they create different data frames.


On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf <As...@rsa.com>
wrote:

> Hi,
>
> I was wondering if dataframe is considered thread safe. I know the spark
> session and spark context are thread safe (and actually have tools to
> manage jobs from different threads) but the question is, can I use the same
> dataframe in both threads.
>
> The idea would be to create a dataframe in the main thread and then in two
> sub threads do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other. Checkpointing would
> cause similar issues), however, I can’t find any documentation as to what
> operations (if any) are thread safe.
>
>
>
> Thanks,
>
>                 Assaf.
>

Re: is dataframe thread safe?

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.

DataFrame is immutable, so it should be thread safe, right?

On Sun, Feb 12, 2017 at 6:45 PM, Sean Owen <so...@cloudera.com> wrote:

> No this use case is perfectly sensible. Yes it is thread safe.
>
>
> On Sun, Feb 12, 2017, 10:30 Jörn Franke <jo...@gmail.com> wrote:
>
>> I think you should have a look at the spark documentation. It has
>> something called scheduler who does exactly this. In more sophisticated
>> environments yarn or mesos do this for you.
>>
>> Using threads for transformations does not make sense.
>>
>> On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com>
>> wrote:
>>
>> I know spark takes care of executing everything in a distributed manner,
>> however, spark also supports having multiple threads on the same spark
>> session/context and knows (Through fair scheduler) to distribute the tasks
>> from them in a round robin.
>>
>>
>>
>> The question is, can those two actions (with a different set of
>> transformations) be applied to the SAME dataframe.
>>
>>
>>
>> Let’s say I want to do something like:
>>
>>
>>
>>
>>
>>
>>
>> Val df = ???
>>
>> df.cache()
>>
>> df.count()
>>
>>
>>
>> def f1(df: DataFrame): Unit = {
>>
>>   val df1 = df.groupby(something).agg(some aggs)
>>
>>   df1.write.parquet(“some path”)
>>
>> }
>>
>>
>>
>> def f2(df: DataFrame): Unit = {
>>
>>   val df2 = df.groupby(something else).agg(some different aggs)
>>
>>   df2.write.parquet(“some path 2”)
>>
>> }
>>
>>
>>
>> f1(df)
>>
>> f2(df)
>>
>>
>>
>> df.unpersist()
>>
>>
>>
>> if the aggregations do not use the full cluster (e.g. because of data
>> skewness, because there aren’t enough partitions or any other reason) then
>> this would leave the cluster under utilized.
>>
>>
>>
>> However, if I would call f1 and f2 on different threads, then df2 can use
>> free resources f1 has not consumed and the overall utilization would
>> improve.
>>
>>
>>
>> Of course, I can do this only if the operations on the dataframe are
>> thread safe. For example, if I would do a cache in f1 and an unpersist in
>> f2 I would get an inconsistent result. So my question is, what, if any are
>> the legal operations to use on a dataframe so I could do the above.
>>
>>
>>
>> Thanks,
>>
>>                 Assaf.
>>
>>
>>
>> *From:* Jörn Franke [mailto:jornfranke@gmail.com <jo...@gmail.com>]
>> *Sent:* Sunday, February 12, 2017 10:39 AM
>> *To:* Mendelson, Assaf
>> *Cc:* user
>> *Subject:* Re: is dataframe thread safe?
>>
>>
>>
>> I am not sure what you are trying to achieve here. Spark is taking care
>> of executing the transformations in a distributed fashion. This means you
>> must not use threads - it does not make sense. Hence, you do not find
>> documentation about it.
>>
>>
>> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com>
>> wrote:
>>
>> Hi,
>>
>> I was wondering if dataframe is considered thread safe. I know the spark
>> session and spark context are thread safe (and actually have tools to
>> manage jobs from different threads) but the question is, can I use the same
>> dataframe in both threads.
>>
>> The idea would be to create a dataframe in the main thread and then in
>> two sub threads do different transformations and actions on it.
>>
>> I understand that some things might not be thread safe (e.g. if I
>> unpersist in one thread it would affect the other. Checkpointing would
>> cause similar issues), however, I can’t find any documentation as to what
>> operations (if any) are thread safe.
>>
>>
>>
>> Thanks,
>>
>>                 Assaf.
>>
>>
>>
>>

Re: is dataframe thread safe?

Posted by Timur Shenkao <ts...@timshenkao.su>.

Hello,

I suspect that your need isn't parallel execution but parallel data access.
In that case, use Alluxio or Ignite.

Or, more exotic, one Spark job writes to Kafka and the other ones read from
Kafka.

Sincerely yours, Timur

On Sun, Feb 12, 2017 at 2:30 PM, Mendelson, Assaf <As...@rsa.com>
wrote:

> There is no threads within maps here. The idea is to have two jobs on two
> different threads which use the same dataframe (which is cached btw).
>
> This does not override spark’s parallel execution of transformation or any
> such. The documentation (job scheduling) actually hints at this option but
> doesn’t say specifically if it is supported when the same dataframe is used.
>
> As for configuring the scheduler, this would not work. First it would mean
> that the same cached dataframe cannot be used, I would have to add some
> additional configuration such as alluxio (and it would still have to
> serialize/deserialize) as opposed to using the cached data. Furthermore,
> multi-tenancy between applications is limited to either dividing the
> cluster between the applications or using dynamic allocation (which has its
> own overheads).
>
>
>
> Therefore Sean’s answer is what I was looking for (and hoping for…)
>
> Assaf
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com]
> *Sent:* Sunday, February 12, 2017 2:46 PM
> *To:* Sean Owen
> *Cc:* Mendelson, Assaf; user
>
> *Subject:* Re: is dataframe thread safe?
>
>
>
> I did not doubt that the submission of several jobs of one application
> makes sense. However, he want to create threads within maps etc., which
> looks like calling for issues (not only for running the application itself,
> but also for operating it in production within a shared cluster). I would
> rely for parallel execution of the transformations on the out-of-the-box
> functionality within Spark.
>
>
>
> For me he looks for a solution that can be achieved by a simple
> configuration of the scheduler in Spark, yarn or mesos. In this way the
> application would be more maintainable in production.
>
>
> On 12 Feb 2017, at 11:45, Sean Owen <so...@cloudera.com> wrote:
>
> No this use case is perfectly sensible. Yes it is thread safe.
>
> On Sun, Feb 12, 2017, 10:30 Jörn Franke <jo...@gmail.com> wrote:
>
> I think you should have a look at the spark documentation. It has
> something called scheduler who does exactly this. In more sophisticated
> environments yarn or mesos do this for you.
>
>
>
> Using threads for transformations does not make sense.
>
>
> On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com>
> wrote:
>
> I know spark takes care of executing everything in a distributed manner,
> however, spark also supports having multiple threads on the same spark
> session/context and knows (Through fair scheduler) to distribute the tasks
> from them in a round robin.
>
>
>
> The question is, can those two actions (with a different set of
> transformations) be applied to the SAME dataframe.
>
>
>
> Let’s say I want to do something like:
>
>
>
>
>
>
>
> Val df = ???
>
> df.cache()
>
> df.count()
>
>
>
> def f1(df: DataFrame): Unit = {
>
>   val df1 = df.groupby(something).agg(some aggs)
>
>   df1.write.parquet(“some path”)
>
> }
>
>
>
> def f2(df: DataFrame): Unit = {
>
>   val df2 = df.groupby(something else).agg(some different aggs)
>
>   df2.write.parquet(“some path 2”)
>
> }
>
>
>
> f1(df)
>
> f2(df)
>
>
>
> df.unpersist()
>
>
>
> if the aggregations do not use the full cluster (e.g. because of data
> skewness, because there aren’t enough partitions or any other reason) then
> this would leave the cluster under utilized.
>
>
>
> However, if I would call f1 and f2 on different threads, then df2 can use
> free resources f1 has not consumed and the overall utilization would
> improve.
>
>
>
> Of course, I can do this only if the operations on the dataframe are
> thread safe. For example, if I would do a cache in f1 and an unpersist in
> f2 I would get an inconsistent result. So my question is, what, if any are
> the legal operations to use on a dataframe so I could do the above.
>
>
>
> Thanks,
>
>                 Assaf.
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com <jo...@gmail.com>]
> *Sent:* Sunday, February 12, 2017 10:39 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: is dataframe thread safe?
>
>
>
> I am not sure what you are trying to achieve here. Spark is taking care of
> executing the transformations in a distributed fashion. This means you must
> not use threads - it does not make sense. Hence, you do not find
> documentation about it.
>
>
> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com>
> wrote:
>
> Hi,
>
> I was wondering if dataframe is considered thread safe. I know the spark
> session and spark context are thread safe (and actually have tools to
> manage jobs from different threads) but the question is, can I use the same
> dataframe in both threads.
>
> The idea would be to create a dataframe in the main thread and then in two
> sub threads do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other. Checkpointing would
> cause similar issues), however, I can’t find any documentation as to what
> operations (if any) are thread safe.
>
>
>
> Thanks,
>
>                 Assaf.
>
>
>
>

RE: is dataframe thread safe?

Posted by "Mendelson, Assaf" <As...@rsa.com>.

There is no threads within maps here. The idea is to have two jobs on two different threads which use the same dataframe (which is cached btw).
This does not override spark’s parallel execution of transformation or any such. The documentation (job scheduling) actually hints at this option but doesn’t say specifically if it is supported when the same dataframe is used.
As for configuring the scheduler, this would not work. First it would mean that the same cached dataframe cannot be used, I would have to add some additional configuration such as alluxio (and it would still have to serialize/deserialize) as opposed to using the cached data. Furthermore, multi-tenancy between applications is limited to either dividing the cluster between the applications or using dynamic allocation (which has its own overheads).

Therefore Sean’s answer is what I was looking for (and hoping for…)
Assaf

From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: Sunday, February 12, 2017 2:46 PM
To: Sean Owen
Cc: Mendelson, Assaf; user
Subject: Re: is dataframe thread safe?

I did not doubt that the submission of several jobs of one application makes sense. However, he want to create threads within maps etc., which looks like calling for issues (not only for running the application itself, but also for operating it in production within a shared cluster). I would rely for parallel execution of the transformations on the out-of-the-box functionality within Spark.

For me he looks for a solution that can be achieved by a simple configuration of the scheduler in Spark, yarn or mesos. In this way the application would be more maintainable in production.

On 12 Feb 2017, at 11:45, Sean Owen <so...@cloudera.com>> wrote:
No this use case is perfectly sensible. Yes it is thread safe.
On Sun, Feb 12, 2017, 10:30 Jörn Franke <jo...@gmail.com>> wrote:
I think you should have a look at the spark documentation. It has something called scheduler who does exactly this. In more sophisticated environments yarn or mesos do this for you.

Using threads for transformations does not make sense.

On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com>> wrote:
I know spark takes care of executing everything in a distributed manner, however, spark also supports having multiple threads on the same spark session/context and knows (Through fair scheduler) to distribute the tasks from them in a round robin.

The question is, can those two actions (with a different set of transformations) be applied to the SAME dataframe.

Let’s say I want to do something like:

Val df = ???
df.cache()
df.count()

def f1(df: DataFrame): Unit = {
  val df1 = df.groupby(something).agg(some aggs)
  df1.write.parquet(“some path”)
}

def f2(df: DataFrame): Unit = {
  val df2 = df.groupby(something else).agg(some different aggs)
  df2.write.parquet(“some path 2”)
}

f1(df)
f2(df)

df.unpersist()

if the aggregations do not use the full cluster (e.g. because of data skewness, because there aren’t enough partitions or any other reason) then this would leave the cluster under utilized.

However, if I would call f1 and f2 on different threads, then df2 can use free resources f1 has not consumed and the overall utilization would improve.

Of course, I can do this only if the operations on the dataframe are thread safe. For example, if I would do a cache in f1 and an unpersist in f2 I would get an inconsistent result. So my question is, what, if any are the legal operations to use on a dataframe so I could do the above.

Thanks,
                Assaf.

From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: Sunday, February 12, 2017 10:39 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: is dataframe thread safe?

I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it.

On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com>> wrote:
Hi,
I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can’t find any documentation as to what operations (if any) are thread safe.

Thanks,
                Assaf.

Re: is dataframe thread safe?

Posted by Jörn Franke <jo...@gmail.com>.

I did not doubt that the submission of several jobs of one application makes sense. However, he want to create threads within maps etc., which looks like calling for issues (not only for running the application itself, but also for operating it in production within a shared cluster). I would rely for parallel execution of the transformations on the out-of-the-box functionality within Spark.

For me he looks for a solution that can be achieved by a simple configuration of the scheduler in Spark, yarn or mesos. In this way the application would be more maintainable in production.

> On 12 Feb 2017, at 11:45, Sean Owen <so...@cloudera.com> wrote:
> 
> No this use case is perfectly sensible. Yes it is thread safe. 
> 
>> On Sun, Feb 12, 2017, 10:30 Jörn Franke <jo...@gmail.com> wrote:
>> I think you should have a look at the spark documentation. It has something called scheduler who does exactly this. In more sophisticated environments yarn or mesos do this for you.
>> 
>> Using threads for transformations does not make sense. 
>> 
>>> On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com> wrote:
>>> 
>>> I know spark takes care of executing everything in a distributed manner, however, spark also supports having multiple threads on the same spark session/context and knows (Through fair scheduler) to distribute the tasks from them in a round robin.
>>> 
>>>  
>>> 
>>> The question is, can those two actions (with a different set of transformations) be applied to the SAME dataframe.
>>> 
>>>  
>>> 
>>> Let’s say I want to do something like:
>>> 
>>>  
>>> 
>>>  
>>> 
>>>  
>>> 
>>> Val df = ???
>>> 
>>> df.cache()
>>> 
>>> df.count()
>>> 
>>>  
>>> 
>>> def f1(df: DataFrame): Unit = {
>>> 
>>>   val df1 = df.groupby(something).agg(some aggs)
>>> 
>>>   df1.write.parquet(“some path”)
>>> 
>>> }
>>> 
>>>  
>>> 
>>> def f2(df: DataFrame): Unit = {
>>> 
>>>   val df2 = df.groupby(something else).agg(some different aggs)
>>> 
>>>   df2.write.parquet(“some path 2”)
>>> 
>>> }
>>> 
>>>  
>>> 
>>> f1(df)
>>> 
>>> f2(df)
>>> 
>>>  
>>> 
>>> df.unpersist()
>>> 
>>>  
>>> 
>>> if the aggregations do not use the full cluster (e.g. because of data skewness, because there aren’t enough partitions or any other reason) then this would leave the cluster under utilized.
>>> 
>>>  
>>> 
>>> However, if I would call f1 and f2 on different threads, then df2 can use free resources f1 has not consumed and the overall utilization would improve.
>>> 
>>>  
>>> 
>>> Of course, I can do this only if the operations on the dataframe are thread safe. For example, if I would do a cache in f1 and an unpersist in f2 I would get an inconsistent result. So my question is, what, if any are the legal operations to use on a dataframe so I could do the above.
>>> 
>>>  
>>> 
>>> Thanks,
>>> 
>>>                 Assaf.
>>> 
>>>  
>>> 
>>> From: Jörn Franke [mailto:jornfranke@gmail.com] 
>>> Sent: Sunday, February 12, 2017 10:39 AM
>>> To: Mendelson, Assaf
>>> Cc: user
>>> Subject: Re: is dataframe thread safe?
>>> 
>>>  
>>> 
>>> I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it.
>>> 
>>> 
>>> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
>>> 
>>> The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
>>> 
>>> I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can’t find any documentation as to what operations (if any) are thread safe.
>>> 
>>>  
>>> 
>>> Thanks,
>>> 
>>>                 Assaf.
>>> 
>>>

Re: is dataframe thread safe?

Posted by Sean Owen <so...@cloudera.com>.

No this use case is perfectly sensible. Yes it is thread safe.

On Sun, Feb 12, 2017, 10:30 Jörn Franke <jo...@gmail.com> wrote:

> I think you should have a look at the spark documentation. It has
> something called scheduler who does exactly this. In more sophisticated
> environments yarn or mesos do this for you.
>
> Using threads for transformations does not make sense.
>
> On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com>
> wrote:
>
> I know spark takes care of executing everything in a distributed manner,
> however, spark also supports having multiple threads on the same spark
> session/context and knows (Through fair scheduler) to distribute the tasks
> from them in a round robin.
>
>
>
> The question is, can those two actions (with a different set of
> transformations) be applied to the SAME dataframe.
>
>
>
> Let’s say I want to do something like:
>
>
>
>
>
>
>
> Val df = ???
>
> df.cache()
>
> df.count()
>
>
>
> def f1(df: DataFrame): Unit = {
>
>   val df1 = df.groupby(something).agg(some aggs)
>
>   df1.write.parquet(“some path”)
>
> }
>
>
>
> def f2(df: DataFrame): Unit = {
>
>   val df2 = df.groupby(something else).agg(some different aggs)
>
>   df2.write.parquet(“some path 2”)
>
> }
>
>
>
> f1(df)
>
> f2(df)
>
>
>
> df.unpersist()
>
>
>
> if the aggregations do not use the full cluster (e.g. because of data
> skewness, because there aren’t enough partitions or any other reason) then
> this would leave the cluster under utilized.
>
>
>
> However, if I would call f1 and f2 on different threads, then df2 can use
> free resources f1 has not consumed and the overall utilization would
> improve.
>
>
>
> Of course, I can do this only if the operations on the dataframe are
> thread safe. For example, if I would do a cache in f1 and an unpersist in
> f2 I would get an inconsistent result. So my question is, what, if any are
> the legal operations to use on a dataframe so I could do the above.
>
>
>
> Thanks,
>
>                 Assaf.
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com <jo...@gmail.com>]
> *Sent:* Sunday, February 12, 2017 10:39 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: is dataframe thread safe?
>
>
>
> I am not sure what you are trying to achieve here. Spark is taking care of
> executing the transformations in a distributed fashion. This means you must
> not use threads - it does not make sense. Hence, you do not find
> documentation about it.
>
>
> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com>
> wrote:
>
> Hi,
>
> I was wondering if dataframe is considered thread safe. I know the spark
> session and spark context are thread safe (and actually have tools to
> manage jobs from different threads) but the question is, can I use the same
> dataframe in both threads.
>
> The idea would be to create a dataframe in the main thread and then in two
> sub threads do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other. Checkpointing would
> cause similar issues), however, I can’t find any documentation as to what
> operations (if any) are thread safe.
>
>
>
> Thanks,
>
>                 Assaf.
>
>
>
>

Re: is dataframe thread safe?

Posted by Jörn Franke <jo...@gmail.com>.

Cf. also https://spark.apache.org/docs/latest/job-scheduling.html

> On 12 Feb 2017, at 11:30, Jörn Franke <jo...@gmail.com> wrote:
> 
> I think you should have a look at the spark documentation. It has something called scheduler who does exactly this. In more sophisticated environments yarn or mesos do this for you.
> 
> Using threads for transformations does not make sense. 
> 
>> On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com> wrote:
>> 
>> I know spark takes care of executing everything in a distributed manner, however, spark also supports having multiple threads on the same spark session/context and knows (Through fair scheduler) to distribute the tasks from them in a round robin.
>>  
>> The question is, can those two actions (with a different set of transformations) be applied to the SAME dataframe.
>>  
>> Let’s say I want to do something like:
>>  
>>  
>>  
>> Val df = ???
>> df.cache()
>> df.count()
>>  
>> def f1(df: DataFrame): Unit = {
>>   val df1 = df.groupby(something).agg(some aggs)
>>   df1.write.parquet(“some path”)
>> }
>>  
>> def f2(df: DataFrame): Unit = {
>>   val df2 = df.groupby(something else).agg(some different aggs)
>>   df2.write.parquet(“some path 2”)
>> }
>>  
>> f1(df)
>> f2(df)
>>  
>> df.unpersist()
>>  
>> if the aggregations do not use the full cluster (e.g. because of data skewness, because there aren’t enough partitions or any other reason) then this would leave the cluster under utilized.
>>  
>> However, if I would call f1 and f2 on different threads, then df2 can use free resources f1 has not consumed and the overall utilization would improve.
>>  
>> Of course, I can do this only if the operations on the dataframe are thread safe. For example, if I would do a cache in f1 and an unpersist in f2 I would get an inconsistent result. So my question is, what, if any are the legal operations to use on a dataframe so I could do the above.
>>  
>> Thanks,
>>                 Assaf.
>>  
>> From: Jörn Franke [mailto:jornfranke@gmail.com] 
>> Sent: Sunday, February 12, 2017 10:39 AM
>> To: Mendelson, Assaf
>> Cc: user
>> Subject: Re: is dataframe thread safe?
>>  
>> I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it.
>> 
>> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com> wrote:
>> 
>> Hi,
>> I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
>> The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
>> I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can’t find any documentation as to what operations (if any) are thread safe.
>>  
>> Thanks,
>>                 Assaf.
>>

Re: is dataframe thread safe?

Posted by Jörn Franke <jo...@gmail.com>.

I think you should have a look at the spark documentation. It has something called scheduler who does exactly this. In more sophisticated environments yarn or mesos do this for you.

Using threads for transformations does not make sense. 

> On 12 Feb 2017, at 09:50, Mendelson, Assaf <As...@rsa.com> wrote:
> 
> I know spark takes care of executing everything in a distributed manner, however, spark also supports having multiple threads on the same spark session/context and knows (Through fair scheduler) to distribute the tasks from them in a round robin.
>  
> The question is, can those two actions (with a different set of transformations) be applied to the SAME dataframe.
>  
> Let’s say I want to do something like:
>  
>  
>  
> Val df = ???
> df.cache()
> df.count()
>  
> def f1(df: DataFrame): Unit = {
>   val df1 = df.groupby(something).agg(some aggs)
>   df1.write.parquet(“some path”)
> }
>  
> def f2(df: DataFrame): Unit = {
>   val df2 = df.groupby(something else).agg(some different aggs)
>   df2.write.parquet(“some path 2”)
> }
>  
> f1(df)
> f2(df)
>  
> df.unpersist()
>  
> if the aggregations do not use the full cluster (e.g. because of data skewness, because there aren’t enough partitions or any other reason) then this would leave the cluster under utilized.
>  
> However, if I would call f1 and f2 on different threads, then df2 can use free resources f1 has not consumed and the overall utilization would improve.
>  
> Of course, I can do this only if the operations on the dataframe are thread safe. For example, if I would do a cache in f1 and an unpersist in f2 I would get an inconsistent result. So my question is, what, if any are the legal operations to use on a dataframe so I could do the above.
>  
> Thanks,
>                 Assaf.
>  
> From: Jörn Franke [mailto:jornfranke@gmail.com] 
> Sent: Sunday, February 12, 2017 10:39 AM
> To: Mendelson, Assaf
> Cc: user
> Subject: Re: is dataframe thread safe?
>  
> I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it.
> 
> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com> wrote:
> 
> Hi,
> I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
> The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
> I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can’t find any documentation as to what operations (if any) are thread safe.
>  
> Thanks,
>                 Assaf.
>

RE: is dataframe thread safe?

Posted by "Mendelson, Assaf" <As...@rsa.com>.

I know spark takes care of executing everything in a distributed manner, however, spark also supports having multiple threads on the same spark session/context and knows (Through fair scheduler) to distribute the tasks from them in a round robin.

The question is, can those two actions (with a different set of transformations) be applied to the SAME dataframe.

Let’s say I want to do something like:



Val df = ???
df.cache()
df.count()

def f1(df: DataFrame): Unit = {
  val df1 = df.groupby(something).agg(some aggs)
  df1.write.parquet(“some path”)
}

def f2(df: DataFrame): Unit = {
  val df2 = df.groupby(something else).agg(some different aggs)
  df2.write.parquet(“some path 2”)
}

f1(df)
f2(df)

df.unpersist()

if the aggregations do not use the full cluster (e.g. because of data skewness, because there aren’t enough partitions or any other reason) then this would leave the cluster under utilized.

However, if I would call f1 and f2 on different threads, then df2 can use free resources f1 has not consumed and the overall utilization would improve.

Of course, I can do this only if the operations on the dataframe are thread safe. For example, if I would do a cache in f1 and an unpersist in f2 I would get an inconsistent result. So my question is, what, if any are the legal operations to use on a dataframe so I could do the above.

Thanks,
                Assaf.

From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: Sunday, February 12, 2017 10:39 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: is dataframe thread safe?

I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it.

On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com>> wrote:
Hi,
I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can’t find any documentation as to what operations (if any) are thread safe.

Thanks,
                Assaf.

Re: is dataframe thread safe?

Posted by Jörn Franke <jo...@gmail.com>.

I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it.

> On 12 Feb 2017, at 09:06, Mendelson, Assaf <As...@rsa.com> wrote:
> 
> Hi,
> I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads.
> The idea would be to create a dataframe in the main thread and then in two sub threads do different transformations and actions on it.
> I understand that some things might not be thread safe (e.g. if I unpersist in one thread it would affect the other. Checkpointing would cause similar issues), however, I can’t find any documentation as to what operations (if any) are thread safe.
>  
> Thanks,
>                 Assaf.