Posted to issues@spark.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2016/04/21 19:41:25 UTC

[jira] [Commented] (SPARK-12141) Use Jackson to serialize all events when writing event log

    [ https://issues.apache.org/jira/browse/SPARK-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252301#comment-15252301 ] 

Marcelo Vanzin commented on SPARK-12141:
----------------------------------------

I took a quick look at how much work this would be: I quickly changed JsonProtocol to write everything using Jackson and looked at the generated events. Here's the list of tasks I collected from that experiment:
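
For reference, here's a minimal sketch of what the experiment looked like (the JacksonEventLogger object is a made-up name for illustration; the actual change was inside JsonProtocol), assuming the usual jackson-module-scala setup:

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.scheduler.SparkListenerEvent

// Hypothetical helper: serialize any listener event with Jackson
// instead of JsonProtocol's hand-written writers.
object JacksonEventLogger {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  def eventToJson(event: SparkListenerEvent): String =
    mapper.writeValueAsString(event)
}
{code}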

{quote}
These are API-breaking changes needed to implement SPARK-12141 properly. It might
be possible to hack things to avoid these changes (e.g. by installing Jackson
modules to handle these types), but we really should avoid going that route.

SparkListenerEnvironmentUpdate:
  - "environmentDetails" is better broken up into multiple, explicit maps
  - this would simplify the code that generates the event a little
    (SparkEnv.environmentDetails)
  - the generated json would be cleaner and easier to parse
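
  For illustration only, the event could carry one explicit map per section of
  the current environmentDetails (the case class name and exact shape below
  are assumptions, not a settled design):

{code:scala}
// Sketch: explicit maps instead of one nested Map[String, Seq[(String, String)]].
case class SparkListenerEnvironmentUpdateV2(
    jvmInformation: Map[String, String],
    sparkProperties: Map[String, String],
    systemProperties: Map[String, String],
    classpathEntries: Map[String, String])
{code}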

BlockManagerId:
  - the Jackson-based mapping only writes the "isDriver" value; we might
    have to change some property / method names (or maybe use annotations,
    as sketched below).
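
  A rough sketch of the annotation option, using a stand-in class rather than
  the real BlockManagerId (the accessor names mirror the real ones; the rest
  is illustrative):

{code:scala}
import com.fasterxml.jackson.annotation.JsonProperty

// Stand-in for BlockManagerId: annotating the accessor-style methods
// makes Jackson pick them up as properties instead of only "isDriver".
class BlockManagerIdLike(execId: String, hostName: String, portNum: Int) {
  @JsonProperty def executorId: String = execId
  @JsonProperty def host: String = hostName
  @JsonProperty def port: Int = portNum
  def isDriver: Boolean = executorId == "driver"
}
{code}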

TaskMetrics:
  - there's a lot of manual processing of the TaskMetrics object when manually
    generating the event logs. Need to investigate what to do; my initial
    hunch is that we should use the TaskMetrics object from the public API,
    killing two birds with one stone. Might need some updates to the public
    API (in case it's missing information) and to the code that generates
    the UI. Since TaskMetrics are generally embedded in other objects (such
    as TaskInfo), this change might have a domino effect.

StorageLevel:
  - similar to BlockManagerId, the Jackson version is missing fields. Might be
    the case that the public API (RDDStorageInfo) could help here.

StageInfo:
  - seems to have too much information (such as locality data). Perhaps use the
    public StageData structure, or manually hide things that should not be
    shown (one possible mix-in approach is sketched below).
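
  If we go the "manually hide" route, a Jackson mix-in could do it without
  touching StageInfo itself (the member name below is hypothetical):

{code:scala}
import com.fasterxml.jackson.annotation.JsonIgnore
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.scheduler.StageInfo

// Mix-in sketch: suppress a property during serialization without
// modifying the original class. The member name is hypothetical.
abstract class StageInfoMixIn {
  @JsonIgnore def taskLocalityPreferences: Any
}

object HideLocalityExample {
  val mapper = new ObjectMapper().addMixIn(classOf[StageInfo], classOf[StageInfoMixIn])
}
{code}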

TaskInfo:
  - the locality property is rendered oddly: instead of a raw String, it shows
    up as an object referencing the enum class. Might need a special module for
    this (sketched below), if we don't decide to just use the public API.
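
  The special module could be as small as a serializer that writes any Scala
  Enumeration value (TaskLocality included) as its plain name; a sketch:

{code:scala}
import com.fasterxml.jackson.core.JsonGenerator
import com.fasterxml.jackson.databind.{JsonSerializer, SerializerProvider}
import com.fasterxml.jackson.databind.module.SimpleModule

// Render Enumeration values (e.g. TaskLocality.PROCESS_LOCAL) as plain
// strings instead of objects referencing the enum class.
class EnumNameSerializer extends JsonSerializer[Enumeration#Value] {
  override def serialize(value: Enumeration#Value, gen: JsonGenerator,
      provider: SerializerProvider): Unit =
    gen.writeString(value.toString)
}

object EnumModuleSketch {
  val enumModule: SimpleModule = new SimpleModule()
    .addSerializer(classOf[Enumeration#Value], new EnumNameSerializer)
}
{code}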

Accumulables:
  - for some reason the Jackson version is rendering a lot of accumulables
    I don't see in the logs generated by Spark 1.6.

JobResult:
  - Jackson is not rendering any info for this type.

Fields with default values:
  - Need to make sure Jackson properly deserializes them, since at least some
    versions have a bug with default values in case classes (a quick check is
    sketched below).
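
  A quick way to check (the case class and values here are made up; a buggy
  setup yields 0 instead of the declared default):

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// "retries" is absent from the input JSON, so a correct setup should
// fall back to the declared default of 3 rather than 0.
case class Sample(name: String, retries: Int = 3)

object DefaultValueCheck extends App {
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  val parsed = mapper.readValue("""{"name":"a"}""", classOf[Sample])
  assert(parsed.retries == 3, s"default lost: got ${parsed.retries}")
}
{code}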
{quote}

Since this might turn into a lot of work, I think it's better to track individual events in separate sub-tasks.

> Use Jackson to serialize all events when writing event log
> ----------------------------------------------------------
>
>                 Key: SPARK-12141
>                 URL: https://issues.apache.org/jira/browse/SPARK-12141
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>            Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure to serialize events using Jackson, so that manual serialization code is not needed anymore.
> We should write all events using that support, and remove all the manual serialization code in {{JsonProtocol}}.
> Since the event log format is a semi-public API, I'm targeting this at 2.0. Also, we can't remove the manual deserialization code, since we need to be able to read old event logs.


