Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2018/04/06 17:09:00 UTC

[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

    [ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428612#comment-16428612 ] 

Joseph K. Bradley commented on SPARK-23686:
-------------------------------------------

I wanted to ping some other active MLlib committers since this will change logging in MLlib. The main change will be to prefix logged messages with a string that includes a unique identifier for the algorithm. That will make it easier to associate log messages with Pipeline stages; this is hard right now, e.g., if there are multiple StringIndexers in the same Pipeline.fit() call.
CC [~mlnick], [~holdenk], [~dbtsai], [~yanboliang], [~sethah]
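
To illustrate the prefixing idea, here is a minimal standalone sketch in plain Python. It is not Spark's Instrumentation API; the class name PrefixedLogger, the logger name, and the uid format are all hypothetical and chosen only for illustration.

```python
import logging
import uuid


class PrefixedLogger:
    """Hypothetical helper: prepends a per-stage identifier to each message."""

    def __init__(self, stage_name, logger=None):
        # A random suffix gives two stages of the same type distinct ids.
        self.uid = f"{stage_name}_{uuid.uuid4().hex[:12]}"
        if logger is None:
            logger = logging.getLogger("ml.instrumentation")  # assumed name
        self._logger = logger

    def format(self, message):
        # Build the prefixed message; kept separate so it is easy to test.
        return f"[{self.uid}] {message}"

    def info(self, message):
        self._logger.info(self.format(message))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    # Two indexers in the same pipeline now produce distinguishable lines.
    first = PrefixedLogger("StringIndexer")
    second = PrefixedLogger("StringIndexer")
    first.info("training started")
    second.info("training started")
```

With a prefix like this, the two "training started" lines can be attributed to the stage that emitted them, which is the situation the multiple-StringIndexer example describes.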

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -------------------------------------------------------------
>
>                 Key: SPARK-23686
>                 URL: https://issues.apache.org/jira/browse/SPARK-23686
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. Specifically sometimes we bypass the instrumentation class and use the debugger instead. For example, [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation class that we currently don't. For example:
> * number of training examples
> * mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially when this data is already available we can log it for free. For example, Logistic Regression Summarizer computes some useful data including numRows that we don't log.
>  
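
As a rough illustration of the statistics suggested in the issue description, the sketch below computes the number of examples and the label mean and variance in a single pass (Welford's algorithm) and logs them. It is plain Python rather than Spark code; the function names, the logger name, and the choice of sample variance are assumptions made for the example.

```python
import logging

logger = logging.getLogger("ml.instrumentation")  # assumed logger name


def summarize_labels(labels):
    """Return (count, mean, sample variance) in one pass over the labels."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for y in labels:
        count += 1
        delta = y - mean
        mean += delta / count
        m2 += delta * (y - mean)
    # Sample variance is undefined for fewer than two examples.
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def log_label_summary(prefix, labels):
    """Hypothetical helper: log the summary with a caller-supplied prefix."""
    count, mean, variance = summarize_labels(labels)
    logger.info("%s numExamples=%d labelMean=%g labelVariance=%g",
                prefix, count, mean, variance)
    return count, mean, variance


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log_label_summary("[LinearRegression_example]", [1.0, 2.0, 3.0, 4.0])
```

When a summarizer has already computed these values during training, they can be logged directly, so a separate pass like this would only be needed where no such summary exists.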



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)