Posted to issues@spark.apache.org by "Leif Walsh (JIRA)" <ji...@apache.org> on 2017/10/24 00:24:00 UTC

[jira] [Created] (SPARK-22340) pyspark setJobGroup doesn't match java threads

Leif Walsh created SPARK-22340:
----------------------------------

             Summary: pyspark setJobGroup doesn't match java threads
                 Key: SPARK-22340
                 URL: https://issues.apache.org/jira/browse/SPARK-22340
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.2
            Reporter: Leif Walsh


In pyspark, the documentation for {{sc.setJobGroup}} says

{quote}
Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
{quote}

However, the group ID appears to be associated only with Java threads, not with Python threads.  As such, a Python thread that calls this and then submits multiple jobs won't necessarily have those jobs associated with any particular Spark job group.  For example:

{code}
import concurrent.futures

# `sc` is an existing SparkContext.
def run_jobs():
    # Tag subsequent jobs from "this thread" with the group 'hello'.
    sc.setJobGroup('hello', 'hello jobs')
    x = sc.range(100).sum()
    y = sc.range(1000).sum()
    return x, y

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(run_jobs)
    # Attempt to cancel the group from the main thread.
    sc.cancelJobGroup('hello')
    future.result()
{code}

In this example, depending on how the action calls on the Python side are allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be assigned to the job group {{hello}}.
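
As a minimal sketch of how to observe this (assuming the example above has already run against a live {{SparkContext}} named {{sc}}), the status tracker can report which jobs were actually tagged with the group:

{code}
# Ask the status tracker which jobs actually landed in the 'hello' group.
# If setJobGroup only tagged whichever JVM thread happened to serve that
# particular call, this list can be empty or missing some of the jobs
# submitted by run_jobs().
tracker = sc.statusTracker()
print(tracker.getJobIdsForGroup('hello'))
{code}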

First, if this is indeed the behavior, the docs should be clarified to say so.

Second, it would be really helpful if {{setJobGroup}} could be made to apply to the calling Python thread.  As it stands, job groups are of little use from the pyspark side if this guarantee doesn't hold.


