Posted to reviews@spark.apache.org by adrian-ionescu <gi...@git.apache.org> on 2017/08/08 14:56:10 UTC

[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

GitHub user adrian-ionescu opened a pull request:

    https://github.com/apache/spark/pull/18884

    [SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs

    ## What changes were proposed in this pull request?
    
    This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job, and partially reimplements SPARK-20703 in terms of it.
    
    The interface consists of three traits:
    - `WriteTaskStats`: simply a tag for classes that represent statistics collected during a `WriteTask`.
      The only constraint it adds is that the class must be `Serializable`, as instances of it will be collected on the driver from all executors at the end of the `WriteJob`.
    - `WriteTaskStatsTracker`: a trait for classes that can actually compute statistics based on the tuples processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance.
    - `WriteJobStatsTracker`: a trait for classes that act as containers of the `Serializable` state needed to instantiate a `WriteTaskStatsTracker` on executors and to finally process the resulting collection of `WriteTaskStats`, once they're gathered back on the driver.
    
    A potential future use of this interface is CBO stats maintenance during `INSERT INTO table ...` operations; a minimal usage sketch follows.
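    
    As a rough illustration only (the class names below are hypothetical, not part of this patch), a trivial tracker built on these traits, assuming the method signatures from `WriteStatsTracker.scala`, might look like this:
    
        import org.apache.spark.sql.catalyst.InternalRow
        
        // Per-task result: the number of files and rows written by one task.
        case class FileCountStats(numFiles: Int, numRows: Long) extends WriteTaskStats
        
        // Lives on an executor; driven by FileFormatWriter as data is written out.
        class FileCountTaskTracker extends WriteTaskStatsTracker {
          private var numFiles = 0
          private var numRows = 0L
          override def newPartition(partitionValues: InternalRow): Unit = ()
          override def newBucket(bucketId: Int): Unit = ()
          override def newFile(filePath: String): Unit = numFiles += 1
          override def newRow(row: InternalRow): Unit = numRows += 1
          override def getFinalStats(): WriteTaskStats = FileCountStats(numFiles, numRows)
        }
        
        // Serializable factory/aggregator: shipped to executors to create task trackers,
        // then processes the collected stats back on the driver.
        class FileCountJobTracker extends WriteJobStatsTracker {
          override def newTaskInstance(): WriteTaskStatsTracker = new FileCountTaskTracker
          override def processStats(stats: Seq[WriteTaskStats]): Unit = {
            val all = stats.map(_.asInstanceOf[FileCountStats])
            // e.g. log the aggregates or feed them into SQLMetrics
            println(s"wrote ${all.map(_.numRows).sum} rows across ${all.map(_.numFiles).sum} files")
          }
        }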
    
    ## How was this patch tested?
    Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adrian-ionescu/apache-spark write-stats-tracker-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18884.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18884
    
----
commit 67e333e7abfd96b8f80bb0a128088d70f995d864
Author: Adrian Ionescu <ad...@databricks.com>
Date:   2017-08-07T12:57:02Z

    initial

commit 176726e7139121d0ffc9d0817b256b831a8c4fc8
Author: Adrian Ionescu <ad...@databricks.com>
Date:   2017-08-07T14:22:24Z

    tests pass; missing docs

commit 6f402468f72fcbdacc680dcae0fafb9fd340ad9f
Author: Adrian Ionescu <ad...@databricks.com>
Date:   2017-08-07T19:14:49Z

    newPartition() takes InternalRow instead of String

commit e6ab459501d70180d53a41dff69bdc13157df5a5
Author: Adrian Ionescu <ad...@databricks.com>
Date:   2017-08-08T12:56:54Z

    bug fix + docs

commit 3665f2fb4331012a022e9ae70cbe3d480ab8dcd3
Author: Adrian Ionescu <ad...@databricks.com>
Date:   2017-08-08T14:51:36Z

    minor

----




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    this looks good to me, but I didn't review super carefully.
    
    cc @cloud-fan 




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    LGTM except some minor comments, great clean up!




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132086218
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -269,50 +278,57 @@ object FileFormatWriter extends Logging {
       }
     
       /**
    +   * For every registered [[WriteJobStatsTracker]], call `processStats()` on it, passing it
    +   * the corresponding [[WriteTaskStats]] from all executors.
    +   */
    +  private def processStats(
    +      statsTrackers: Seq[WriteJobStatsTracker],
    +      statsPerTask: Seq[Seq[WriteTaskStats]])
    --- End diff --
    
    In the current framework, it looks like trackers can't share a collection of state or the same metrics. Isn't that a likely use case? When two trackers need the same metrics, we will need to collect them in two separate copies of stats.



[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132006674
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -0,0 +1,133 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.execution.SQLExecution
    +import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
    +import org.apache.spark.util.SerializableConfiguration
    +
    +
    +/**
    + * Simple metrics collected during an instance of [[FileFormatWriter.ExecuteWriteTask]].
    + * These were first introduced in https://github.com/apache/spark/pull/18159 (SPARK-20703).
    + */
    +case class BasicWriteTaskStats(
    +    numPartitions: Int,
    +    numFiles: Int,
    +    numBytes: Long,
    +    numRows: Long)
    +  extends WriteTaskStats
    +
    +
    +/**
    + * Simple [[WriteTaskStatsTracker]] implementation that produces [[BasicWriteTaskStats]].
    + * @param hadoopConf
    + */
    +class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
    +  extends WriteTaskStatsTracker {
    +
    +  var numPartitions: Int = 0
    --- End diff --
    
    private[this] ?
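    
    A sketch of the suggested change (keeping the counter instance-private, so no accessor is generated):
    
        private[this] var numPartitions: Int = 0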




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    **[Test build #3885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3885/testReport)** for PR 18884 at commit [`3665f2f`](https://github.com/apache/spark/commit/3665f2fb4331012a022e9ae70cbe3d480ab8dcd3).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132440406
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -269,50 +278,60 @@ object FileFormatWriter extends Logging {
       }
     
       /**
    +   * For every registered [[WriteJobStatsTracker]], call `processStats()` on it, passing it
    +   * the corresponding [[WriteTaskStats]] from all executors.
    +   */
    +  private def processStats(
    +      statsTrackers: Seq[WriteJobStatsTracker],
    +      statsPerTask: Seq[Seq[WriteTaskStats]])
    +    : Unit = {
    +
    +    val numStatsTrackers = statsTrackers.length
    +    assert(statsPerTask.forall(_.length == numStatsTrackers),
    +      s"""Every WriteTask should have produced one `WriteTaskStats` object for every tracker.
    +         |There are ${numStatsTrackers} statsTrackers, but some task returned
    --- End diff --
    
    nit: `$numStatsTrackers`
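    
    i.e., braces aren't needed for a simple identifier in the interpolated string:
    
        s"There are $numStatsTrackers statsTrackers, but some task returned"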




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    Jenkins, add to white list.





[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132438536
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -104,8 +109,10 @@ object FileFormatWriter extends Logging {
           hadoopConf: Configuration,
           partitionColumns: Seq[Attribute],
           bucketSpec: Option[BucketSpec],
    +      statsTrackers: Seq[WriteJobStatsTracker],
           refreshFunction: (Seq[ExecutedWriteSummary]) => Unit,
    --- End diff --
    
    Instead of having the `refreshFunction`, can we just let this `write` method return `Seq[ExecutedWriteSummary]`?




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132178188
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -269,50 +278,57 @@ object FileFormatWriter extends Logging {
       }
     
       /**
    +   * For every registered [[WriteJobStatsTracker]], call `processStats()` on it, passing it
    +   * the corresponding [[WriteTaskStats]] from all executors.
    +   */
    +  private def processStats(
    +      statsTrackers: Seq[WriteJobStatsTracker],
    +      statsPerTask: Seq[Seq[WriteTaskStats]])
    --- End diff --
    
    Because some metrics might be costly to compute, I was just thinking of a case where more than one tracker needs a few overlapping metrics. Instead of measuring the metrics into two different collections of state, measuring once and keeping just one copy of the metrics seems more reasonable. For now this may be overdesign; I'm just curious how we could deal with it easily. We can revisit it once we actually hit that case.




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    **[Test build #80493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80493/testReport)** for PR 18884 at commit [`7ec545b`](https://github.com/apache/spark/commit/7ec545b43b994d668a0c16d788521c4ceedb17e7).




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132183030
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala ---
    @@ -128,6 +128,7 @@ class FileStreamSink(
             hadoopConf = hadoopConf,
             partitionColumns = partitionColumns,
             bucketSpec = None,
    +        statsTrackers = Nil,
    --- End diff --
    
    I think `FileStreamSink` is not a `RunnableCommand`? #18159 adds data writing metrics only for certain `RunnableCommand`s.
    
    Do we have a physical node representing the data writing operation in a `FileStreamSink`, so we can bind its `SQLMetric`s and update the metrics after insertion? It looks like `addBatch` accepts an arbitrary `DataFrame` and writes the data from it directly.
    
    I'd think it might be another issue.





[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132142337
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import org.apache.spark.sql.catalyst.InternalRow
    +
    +
    +/**
    + * To be implemented by classes that represent data statistics collected during a Write Task.
    + * It is important that instances of this type are [[Serializable]], as they will be gathered
    + * on the driver from all executors.
    + */
    +trait WriteTaskStats
    +  extends Serializable
    +
    +/**
    + * A trait for classes that are capable of collecting statistics on data that's being processed by
    + * a single write task in [[FileFormatWriter]] - i.e. there should be one instance per executor.
    + *
    + * This trait is coupled with the way [[FileFormatWriter]] works, in the sense that its methods
    + * will be called according to how tuples are being written out to disk, namely in sorted order
    + * according to partitionValue(s), then bucketId.
    + *
    + * As such, a typical call scenario is:
    + *
    + * newPartition -> newBucket -> newFile -> newRow -.
    + *    ^        |______^___________^ ^         ^____|
    + *    |               |             |______________|
    + *    |               |____________________________|
    + *    |____________________________________________|
    + *
    + * newPartition and newBucket events are only triggered if the relation to be written out is
    + * partitioned and/or bucketed, respectively.
    + */
    +trait WriteTaskStatsTracker {
    +
    +  /**
    +   * Process the fact that a new partition is about to be written.
    +   * Only triggered when the relation is partitioned by a (non-empty) sequence of columns.
    +   * @param partitionValues The values that define this new partition.
    +   */
    +  def newPartition(partitionValues: InternalRow): Unit
    +
    +  /**
     +   * Process the fact that a new bucket is about to be written.
    +   * Only triggered when the relation is bucketed by a (non-empty) sequence of columns.
    +   * @param bucketId The bucket number.
    +   */
    +  def newBucket(bucketId: Int): Unit
    +
    +  /**
    +   * Process the fact that a new file is about to be written.
    +   * @param filePath Path of the file into which future rows will be written.
    +   */
    +  def newFile(filePath: String): Unit
    +
    +  /**
     +   * Process a new row to update the tracked statistics accordingly.
    +   * The row will be written to the most recently witnessed file (via `newFile`).
    +   * @note Keep in mind that any overhead here is per-row, obviously,
    +   *       so implementations should be as lightweight as possible.
    +   * @param row Current data row to be processed.
    +   */
    +  def newRow(row: InternalRow): Unit
    +
    +  /**
    +   * Returns the final statistics computed so far.
    +   * @note This may only be called once. Further use of the object may lead to undefined behavior.
    +   * @return An object of subtype of [[WriteTaskStats]], to be sent to the driver.
    +   */
    +  def getFinalStats(): WriteTaskStats
    +}
    +
    +/**
    + * A class implementing this trait is basically a collection of parameters that are necessary
    + * for instantiating a (derived type of) [[WriteTaskStatsTracker]] on all executors and then
    + * process the statistics produced by them (e.g. save them to memory/disk, issue warnings, etc).
     + * It is therefore important that such an object is [[Serializable]], as it will be sent
    + * from the driver to all executors.
    + */
    +trait WriteJobStatsTracker
    --- End diff --
    
    No strong preference, just curious: why is that preferable?
    The way I see it, this way you're sure to get it; otherwise you might forget to mix it in and then you'll only realize it at runtime when faced with a "task not serializable" exception.
    Is there some disadvantage to mixing it into the trait?




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    **[Test build #80493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80493/testReport)** for PR 18884 at commit [`7ec545b`](https://github.com/apache/spark/commit/7ec545b43b994d668a0c16d788521c4ceedb17e7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait WriteJobStatsTracker extends Serializable `




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132006851
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import org.apache.spark.sql.catalyst.InternalRow
    +
    +
    +/**
    + * To be implemented by classes that represent data statistics collected during a Write Task.
    + * It is important that instances of this type are [[Serializable]], as they will be gathered
    + * on the driver from all executors.
    + */
    +trait WriteTaskStats
    +  extends Serializable
    +
    +/**
    + * A trait for classes that are capable of collecting statistics on data that's being processed by
    + * a single write task in [[FileFormatWriter]] - i.e. there should be one instance per executor.
    + *
    + * This trait is coupled with the way [[FileFormatWriter]] works, in the sense that its methods
    + * will be called according to how tuples are being written out to disk, namely in sorted order
    + * according to partitionValue(s), then bucketId.
    + *
    + * As such, a typical call scenario is:
    + *
    + * newPartition -> newBucket -> newFile -> newRow -.
    + *    ^        |______^___________^ ^         ^____|
    + *    |               |             |______________|
    + *    |               |____________________________|
    + *    |____________________________________________|
    + *
    + * newPartition and newBucket events are only triggered if the relation to be written out is
    + * partitioned and/or bucketed, respectively.
    + */
    +trait WriteTaskStatsTracker {
    +
    +  /**
    +   * Process the fact that a new partition is about to be written.
    +   * Only triggered when the relation is partitioned by a (non-empty) sequence of columns.
    +   * @param partitionValues The values that define this new partition.
    +   */
    +  def newPartition(partitionValues: InternalRow): Unit
    +
    +  /**
     +   * Process the fact that a new bucket is about to be written.
    +   * Only triggered when the relation is bucketed by a (non-empty) sequence of columns.
    +   * @param bucketId The bucket number.
    +   */
    +  def newBucket(bucketId: Int): Unit
    +
    +  /**
    +   * Process the fact that a new file is about to be written.
    +   * @param filePath Path of the file into which future rows will be written.
    +   */
    +  def newFile(filePath: String): Unit
    +
    +  /**
     +   * Process a new row to update the tracked statistics accordingly.
    +   * The row will be written to the most recently witnessed file (via `newFile`).
    +   * @note Keep in mind that any overhead here is per-row, obviously,
    +   *       so implementations should be as lightweight as possible.
    +   * @param row Current data row to be processed.
    +   */
    +  def newRow(row: InternalRow): Unit
    +
    +  /**
    +   * Returns the final statistics computed so far.
    +   * @note This may only be called once. Further use of the object may lead to undefined behavior.
    +   * @return An object of subtype of [[WriteTaskStats]], to be sent to the driver.
    +   */
    +  def getFinalStats(): WriteTaskStats
    +}
    +
    +/**
    + * A class implementing this trait is basically a collection of parameters that are necessary
    + * for instantiating a (derived type of) [[WriteTaskStatsTracker]] on all executors and then
    + * process the statistics produced by them (e.g. save them to memory/disk, issue warnings, etc).
     + * It is therefore important that such an object is [[Serializable]], as it will be sent
    + * from the driver to all executors.
    + */
    +trait WriteJobStatsTracker
    --- End diff --
    
    So I think the general approach is that the final implementation should add `Serializable`, and the trait shouldn't...
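    
    For illustration, the two alternatives look roughly like this (the patch currently does option A; the class name in option B is hypothetical):
    
        // Option A: the trait itself requires it
        trait WriteJobStatsTracker extends Serializable { /* ... */ }
        
        // Option B: keep the trait minimal; each concrete implementation opts in explicitly
        trait WriteJobStatsTracker { /* ... */ }
        class SomeJobStatsTracker extends WriteJobStatsTracker with Serializable { /* ... */ }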




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132440680
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -269,50 +278,57 @@ object FileFormatWriter extends Logging {
       }
     
       /**
    +   * For every registered [[WriteJobStatsTracker]], call `processStats()` on it, passing it
    +   * the corresponding [[WriteTaskStats]] from all executors.
    +   */
    +  private def processStats(
    +      statsTrackers: Seq[WriteJobStatsTracker],
    +      statsPerTask: Seq[Seq[WriteTaskStats]])
    --- End diff --
    
    If two trackers have overlapping metrics, I think we probably need to combine them into one (via inheritance or composition).




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132022172
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala ---
    @@ -128,6 +128,7 @@ class FileStreamSink(
             hadoopConf = hadoopConf,
             partitionColumns = partitionColumns,
             bucketSpec = None,
    +        statsTrackers = Nil,
    --- End diff --
    
    don't we want that as well?




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    retest this please




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132439040
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -104,8 +109,10 @@ object FileFormatWriter extends Logging {
           hadoopConf: Configuration,
           partitionColumns: Seq[Attribute],
           bucketSpec: Option[BucketSpec],
    +      statsTrackers: Seq[WriteJobStatsTracker],
           refreshFunction: (Seq[ExecutedWriteSummary]) => Unit,
    --- End diff --
    
    Do we still need `ExecutedWriteSummary`? After we have the `statsTrackers`, we only need to return `updatedPartitions: Set[String]` to the caller.




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132439273
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -104,8 +109,10 @@ object FileFormatWriter extends Logging {
           hadoopConf: Configuration,
           partitionColumns: Seq[Attribute],
           bucketSpec: Option[BucketSpec],
    +      statsTrackers: Seq[WriteJobStatsTracker],
           refreshFunction: (Seq[ExecutedWriteSummary]) => Unit,
    --- End diff --
    
    Or return `Set[String]` as the updated partitions.




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    Can one of the admins verify this patch?




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    **[Test build #3885 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3885/testReport)** for PR 18884 at commit [`3665f2f`](https://github.com/apache/spark/commit/3665f2fb4331012a022e9ae70cbe3d480ab8dcd3).




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132447263
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala ---
    @@ -0,0 +1,123 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import org.apache.spark.sql.catalyst.InternalRow
    +
    +
    +/**
    + * To be implemented by classes that represent data statistics collected during a Write Task.
    + * It is important that instances of this type are [[Serializable]], as they will be gathered
    + * on the driver from all executors.
    + */
    +trait WriteTaskStats
    +  extends Serializable
    +
    +
    +/**
    + * A trait for classes that are capable of collecting statistics on data that's being processed by
    + * a single write task in [[FileFormatWriter]] - i.e. there should be one instance per executor.
    + *
    + * This trait is coupled with the way [[FileFormatWriter]] works, in the sense that its methods
    + * will be called according to how tuples are being written out to disk, namely in sorted order
    + * according to partitionValue(s), then bucketId.
    + *
    + * As such, a typical call scenario is:
    + *
    + * newPartition -> newBucket -> newFile -> newRow -.
    + *    ^        |______^___________^ ^         ^____|
    + *    |               |             |______________|
    + *    |               |____________________________|
    + *    |____________________________________________|
    + *
    + * newPartition and newBucket events are only triggered if the relation to be written out is
    + * partitioned and/or bucketed, respectively.
    + */
    +trait WriteTaskStatsTracker {
    +
    +  /**
    +   * Process the fact that a new partition is about to be written.
    +   * Only triggered when the relation is partitioned by a (non-empty) sequence of columns.
    +   * @param partitionValues The values that define this new partition.
    +   */
    +  def newPartition(partitionValues: InternalRow): Unit
    +
    +  /**
     +   * Process the fact that a new bucket is about to be written.
    +   * Only triggered when the relation is bucketed by a (non-empty) sequence of columns.
    +   * @param bucketId The bucket number.
    +   */
    +  def newBucket(bucketId: Int): Unit
    +
    +  /**
    +   * Process the fact that a new file is about to be written.
    +   * @param filePath Path of the file into which future rows will be written.
    +   */
    +  def newFile(filePath: String): Unit
    +
    +  /**
    +   * Process the fact that a new row to update the tracked statistics accordingly.
    +   * The row will be written to the most recently witnessed file (via `newFile`).
    +   * @note Keep in mind that any overhead here is per-row, obviously,
    +   *       so implementations should be as lightweight as possible.
    +   * @param row Current data row to be processed.
    +   */
    +  def newRow(row: InternalRow): Unit
    +
    +  /**
    +   * Returns the final statistics computed so far.
    +   * @note This may only be called once. Further use of the object may lead to undefined behavior.
    +   * @return An object of subtype of [[WriteTaskStats]], to be sent to the driver.
    +   */
    +  def getFinalStats(): WriteTaskStats
    +}
    +
    +
    +/**
    + * A class implementing this trait is basically a collection of parameters that are necessary
    + * for instantiating a (derived type of) [[WriteTaskStatsTracker]] on all executors and then
    + * process the statistics produced by them (e.g. save them to memory/disk, issue warnings, etc).
     + * It is therefore important that such an object is [[Serializable]], as it will be sent
    + * from the driver to all executors.
    + */
    +trait WriteJobStatsTracker
    +  extends Serializable {
    +
    +  /**
    +   * Instantiates a [[WriteTaskStatsTracker]], based on (non-transient) members of this class.
    +   * To be called by executors.
    +   * @return A [[WriteTaskStatsTracker]] instance to be used for computing stats during a write task
    +   */
    +  def newTaskInstance(): WriteTaskStatsTracker
    +
    +  /**
    +   * Process the given collection of stats computed during this job.
    +   * E.g. aggregate them, write them to memory / disk, issue warnings, whatever.
    +   * @param stats One [[WriteTaskStats]] object from each successful write task.
    +   *              @note The type here is too generic. These classes should probably be parametrized:
    --- End diff --
    
    nit: no spaces before `@note`




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80493/
    Test PASSed.




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132130615
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala ---
    @@ -128,6 +128,7 @@ class FileStreamSink(
             hadoopConf = hadoopConf,
             partitionColumns = partitionColumns,
             bucketSpec = None,
    +        statsTrackers = Nil,
    --- End diff --
    
    We might, but it wasn't originally handled in https://github.com/apache/spark/pull/18159 and so `FileStreamSink` doesn't extend `DataWritingCommand`.
    @viirya, do you remember if there was any particular reason for this, or was it just overlooked / deemed out of scope?
    Anyway, I could try to add handling for it in this PR, but I'd say it's rather orthogonal.




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18884




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    Merging in master.





[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132147389
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -269,50 +278,57 @@ object FileFormatWriter extends Logging {
       }
     
       /**
    +   * For every registered [[WriteJobStatsTracker]], call `processStats()` on it, passing it
    +   * the corresponding [[WriteTaskStats]] from all executors.
    +   */
    +  private def processStats(
    +      statsTrackers: Seq[WriteJobStatsTracker],
    +      statsPerTask: Seq[Seq[WriteTaskStats]])
    --- End diff --
    
    Not sure if that's a common use case...
    But if you do need to share some stats between two trackers, I can think of two solutions within the current framework:
    1. in the `processStats` of the first tracker, store the stats somewhere (e.g. the catalog) and then retrieve them during `processStats` of the second tracker
    2. replace the two trackers with a single, combined one (inheritance / composition); a sketch follows below
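    
    A rough sketch of option 2 (all names below are hypothetical): the shared, potentially costly per-row metric is computed once per task, and `processStats` then serves every interested consumer from that single aggregate:
    
        import org.apache.spark.sql.catalyst.InternalRow
        
        case class SharedStats(numRows: Long) extends WriteTaskStats
        
        class SharedTaskTracker extends WriteTaskStatsTracker {
          private var numRows = 0L
          override def newPartition(partitionValues: InternalRow): Unit = ()
          override def newBucket(bucketId: Int): Unit = ()
          override def newFile(filePath: String): Unit = ()
          // the shared metric, measured only once per row
          override def newRow(row: InternalRow): Unit = numRows += 1
          override def getFinalStats(): WriteTaskStats = SharedStats(numRows)
        }
        
        // Stands in for the two original trackers: one collection pass, one copy of the stats.
        class CombinedJobStatsTracker extends WriteJobStatsTracker {
          override def newTaskInstance(): WriteTaskStatsTracker = new SharedTaskTracker
          override def processStats(stats: Seq[WriteTaskStats]): Unit = {
            val total = stats.map(_.asInstanceOf[SharedStats].numRows).sum
            updateUiMetric(total)     // stand-in for what the first tracker did with the value
            updateCatalogStats(total) // stand-in for what the second tracker did with the value
          }
          private def updateUiMetric(rows: Long): Unit = println(s"UI metric: $rows rows")
          private def updateCatalogStats(rows: Long): Unit = println(s"catalog stats: $rows rows")
        }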




[GitHub] spark issue #18884: [SPARK-21669] Internal API for collecting metrics/stats ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18884
  
    LGTM, pending jenkins




[GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18884#discussion_r132085558
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -269,50 +278,57 @@ object FileFormatWriter extends Logging {
       }
     
       /**
    +   * For every registered [[WriteJobStatsTracker]], call `processStats()` on it, passing it
    +   * the corresponding [[WriteTaskStats]] from all executors.
    +   */
    +  private def processStats(
    +      statsTrackers: Seq[WriteJobStatsTracker],
    +      statsPerTask: Seq[Seq[WriteTaskStats]])
    +    : Unit = {
    +
    +    val statsPerTracker = if (statsPerTask.nonEmpty) {
    +      statsPerTask.transpose
    +    } else {
    +      statsTrackers.map(_ => Seq.empty)
    +    }
    +    assert(statsTrackers.length == statsPerTracker.length,
    +      s"""Every WriteTask should have produced one `WriteTaskStats` object for every tracker.
    +         |statsTrackers = ${statsTrackers}
    +         |statsPerTracker = ${statsPerTracker}
    +     """.stripMargin)
    --- End diff --
    
    In case of many stats, this might result in a big error message. Would just printing the lengths be enough?
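    
    e.g. something along these lines (sketch only):
    
        assert(statsTrackers.length == statsPerTracker.length,
          s"Expected one stats collection per tracker: ${statsTrackers.length} trackers " +
            s"vs ${statsPerTracker.length} collections.")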

