Posted to issues@spark.apache.org by "Abhishek Modi (JIRA)" <ji...@apache.org> on 2015/07/12 16:41:04 UTC

[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics

    [ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623834#comment-14623834 ] 

Abhishek Modi commented on SPARK-9004:
--------------------------------------

Hadoop separates HDFS bytes, local filesystem bytes and S3 bytes in its counters, while Spark combines all of them into a single value in its metrics. Keeping them separate would give a much better picture of a job's I/O distribution.
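
For context, Hadoop already keeps these apart at the FileSystem layer: FileSystem.getAllStatistics() returns one Statistics entry per URI scheme ("hdfs", "file", "s3n", ...). A rough Scala sketch of reading them (the "s3n" scheme is just an example; s3/s3a would be analogous):

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// One Statistics entry per filesystem scheme ("hdfs", "file", "s3n", ...)
val bytesByScheme = FileSystem.getAllStatistics.asScala.map { stats =>
  stats.getScheme -> (stats.getBytesRead, stats.getBytesWritten)
}.toMap

// Example: pull out only the S3 numbers (the scheme name is an assumption)
bytesByScheme.get("s3n").foreach { case (read, written) =>
  println(s"s3n: bytesRead=$read bytesWritten=$written")
}
{code}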

Here's how it works in MR: 

1. The client creates a Job object (org.apache.hadoop.mapreduce.Job) and submits it to the ResourceManager, which then launches the ApplicationMaster, etc.
2. After submission, the client continuously polls the job to see whether it has finished.
3. Once the job is finished, the client retrieves the job's counters via the getCounters() method.
4. The counters are then logged on the client side in the "Counters=" format (see the sketch after this list).
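
A minimal sketch of that client-side flow, assuming the job's mapper/reducer and input/output paths are configured elsewhere (the "s3n" scheme name is again just an example):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{FileSystemCounter, Job}

val conf = new Configuration()
val job = Job.getInstance(conf, "example-job")
// ... set mapper/reducer, input and output paths here ...

// Step 2: block until the job has finished
job.waitForCompletion(true)

// Step 3: counters are grouped per filesystem scheme, so S3 traffic is
// reported separately from HDFS and local-disk traffic
val counters = job.getCounters
val s3Read    = counters.findCounter("s3n", FileSystemCounter.BYTES_READ).getValue
val s3Written = counters.findCounter("s3n", FileSystemCounter.BYTES_WRITTEN).getValue

// Step 4: log them on the client
println(s"Counters= s3n bytesRead=$s3Read bytesWritten=$s3Written")
{code}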

I don't really know how to implement the Spark side of this. Could it be done by modifying NewHadoopRDD, since I guess that's where the Hadoop Job object is being used? One possible approach is sketched below.
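
One possible (untested) approach: snapshot the per-scheme FileSystem statistics before and after iterating a split's record reader, and report the delta through the task's input metrics. The helper below is hypothetical, not existing Spark code:

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Hypothetical helper (not existing Spark code): per-scheme bytes read so far
def bytesReadByScheme(): Map[String, Long] =
  FileSystem.getAllStatistics.asScala
    .map(stats => stats.getScheme -> stats.getBytesRead)
    .toMap

val before = bytesReadByScheme()
// ... iterate the Hadoop RecordReader for the split here ...
val after = bytesReadByScheme()

// Delta attributable to S3 for this split ("s3n" is just an example scheme)
val s3Delta = after.getOrElse("s3n", 0L) - before.getOrElse("s3n", 0L)
// s3Delta could then be reported through the task's input metrics
{code}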


> Add s3 bytes read/written metrics
> ---------------------------------
>
>                 Key: SPARK-9004
>                 URL: https://issues.apache.org/jira/browse/SPARK-9004
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Abhishek Modi
>            Priority: Minor
>
> S3 read/write metrics can be pretty useful for finding the total aggregate data processed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org