Posted to dev@spark.apache.org by Weiqing Yang <ya...@gmail.com> on 2016/06/09 23:37:44 UTC

Add Caller Context in Spark

Hi,

Hadoop has implemented a log tracing feature called caller context (JIRA:
HDFS-9184 <https://issues.apache.org/jira/browse/HDFS-9184> and YARN-4349
<https://issues.apache.org/jira/browse/YARN-4349>). The motivation is to
better diagnose and understand how specific applications impact parts of
the Hadoop system and what problems they may be creating (e.g.
overloading the NameNode). As noted in HDFS-9184
<https://issues.apache.org/jira/browse/HDFS-9184>, for a given HDFS
operation it is very helpful to track which upper-level job issued it. The
upper-level callers may be specific Oozie tasks, MR jobs, Hive queries, or
Spark jobs.

Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851
<https://issues.apache.org/jira/browse/TEZ-2851>), Hive (HIVE-12249
<https://issues.apache.org/jira/browse/HIVE-12249>, HIVE-12254
<https://issues.apache.org/jira/browse/HIVE-12254>) and Pig (PIG-4714
<https://issues.apache.org/jira/browse/PIG-4714>) have implemented their own
caller contexts. These systems invoke the HDFS client API and the YARN client
API to set up the caller context, and also expose an API so that their own
upstream callers can pass a caller context in.
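
For reference, here is a minimal Scala sketch of how the Hadoop caller context
API (org.apache.hadoop.ipc.CallerContext, available since Hadoop 2.8 per
HDFS-9184) is invoked; the context string is just the example taken from the
audit log further below:

import org.apache.hadoop.ipc.CallerContext

// Build a caller context and attach it to the current thread; subsequent
// HDFS RPCs issued from this thread carry it, and the NameNode writes it
// into the audit log as callerContext=...
val context = new CallerContext.Builder(
  "SparkKMeans application_1464728991691_0009").build()
CallerContext.setCurrent(context)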

Lots of Spark applications run on YARN/HDFS. Spark can likewise implement its
caller context by invoking the HDFS/YARN APIs, and also expose an API so that
its own upstream applications can set their caller contexts. In the end, the
Spark caller context written into the YARN log / HDFS audit log can be
associated with the task id, stage id, job id and application id. That also
makes it much easier for Spark users to identify tasks, especially if Spark
supports multi-tenant environments in the future.
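
As an illustration, here is a hedged Scala sketch of how Spark could compose a
per-task caller context string; the helper name and parameters are
hypothetical, and the format simply mirrors the sample audit log entries below:

// Hypothetical helper: builds a per-task context string in the same shape as
// the audit log example (JobID_..._stageID_..._taskID_..._attemptNumber_...).
def taskCallerContext(
    jobId: Int,
    stageId: Int,
    stageAttemptId: Int,
    taskId: Long,
    attemptNumber: Int): String = {
  s"JobID_${jobId}_stageID_${stageId}_stageAttemptId_${stageAttemptId}" +
    s"_taskID_${taskId}_attemptNumber_${attemptNumber}"
}

// e.g. taskCallerContext(0, 0, 0, 0L, 0) ==
//   "JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0"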

e.g. run SparkKMeans on Spark.

In HDFS log:
…
2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
ugi=yang(auth:SIMPLE)        ip=/127.0.0.1        cmd=getfileinfo
src=/data/mllib/kmeans_data.txt/_spark_metadata       dst=null
perm=null      proto=rpc callerContext=SparkKMeans
application_1464728991691_0009 running on Spark

 2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
ugi=yang (auth:SIMPLE)        ip=/127.0.0.1        cmd=open
src=/data/mllib/kmeans_data.txt       dst=null       perm=null
proto=rpc
callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0 on
Spark
…

“application_1464728991691_0009” is the application id.

I already have code that works with the Spark master branch, and I am going to
create a JIRA. Please feel free to let me know if you have any concerns or
comments.

Thanks,
Qing

Re: Add Caller Context in Spark

Posted by Weiqing Yang <ya...@gmail.com>.
Yes, it is just a string. JIRA SPARK-15857
<https://issues.apache.org/jira/browse/SPARK-15857> has been created.

Thanks,
WQ

On Thu, Jun 9, 2016 at 4:40 PM, Reynold Xin <rx...@databricks.com> wrote:

> Is this just to set some string? That makes sense. One thing you would
> need to make sure of is that Spark still works outside of Hadoop, and
> also with older versions of Hadoop.

Re: Add Caller Context in Spark

Posted by Reynold Xin <rx...@databricks.com>.
Is this just to set some string? That makes sense. One thing you would need
to make sure of is that Spark still works outside of Hadoop, and also with
older versions of Hadoop.
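
One possible way to meet that requirement (a hedged sketch, not necessarily the
eventual Spark implementation) is to reach the Hadoop API through reflection,
so the call degrades to a no-op when org.apache.hadoop.ipc.CallerContext is not
on the classpath (no Hadoop, or Hadoop older than 2.8):

import scala.util.control.NonFatal

// Best effort: set the Hadoop caller context only if the API is available.
// org.apache.hadoop.ipc.CallerContext exists only in Hadoop 2.8.0+ (HDFS-9184).
def trySetCallerContext(context: String): Unit = {
  try {
    val callerContextClass = Class.forName("org.apache.hadoop.ipc.CallerContext")
    val builderClass = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
    val builder = builderClass.getConstructor(classOf[String]).newInstance(context)
    val callerContext = builderClass.getMethod("build").invoke(builder)
    callerContextClass
      .getMethod("setCurrent", callerContextClass)
      .invoke(null, callerContext)
  } catch {
    // Class not found or signature mismatch: skip silently, the context is optional.
    case NonFatal(_) =>
  }
}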
