You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by npoberezkin <gi...@git.apache.org> on 2018/08/28 12:38:38 UTC

[GitHub] spark pull request #22255: [SPARK-25102][Spark Core] Write Spark version inf...

GitHub user npoberezkin opened a pull request:

    https://github.com/apache/spark/pull/22255

    [SPARK-25102][Spark Core] Write Spark version information to Parquet …

    ## What changes were proposed in this pull request?
    Overrided method getName from org.apache.parquet.hadoop.api.WriteSupport in org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport that returns version of Spark (to write it to Parquet file footer later).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/npoberezkin/spark spark-version-info-to-parquet-footer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22255.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22255
    
----
commit ec9c130a30cb9d5c0ac8d4a11b31b99ab17c7e6c
Author: npoberezkin <np...@...>
Date:   2018-08-28T12:32:52Z

    [SPARK-25102][Spark Core] Write Spark version information to Parquet file footers

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22255: [SPARK-25102][Spark Core] Write Spark version inf...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22255


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Just to confirm it. `created_by` is set to `parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Hi, @npoberezkin . Thank you for your first contribution. Could you update your PR to use custom key-value metadata according to the above advice of @rdblue ? Also, please use tag `[SQL]` instead of `[Spark Core]` in the PR title.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by npoberezkin <gi...@git.apache.org>.
Github user npoberezkin commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    BTW, @rdblue recommended [key_value_metadata](https://github.com/apache/spark/pull/22255#issuecomment-418169189). Are we going to `created_by` instead of `key_value_metadata`? Could you give me some advice, @gatorsmile and @rdblue ?
    
    ```
      /** Optional key/value metadata **/
      5: optional list<KeyValue> key_value_metadata
    
      /** String for application that wrote this file.  This should be in the format
       * <Application> version <App Version> (build <App Build Hash>).
       * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
       **/
      6: optional string created_by
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    That is the value used by Parquet-MR library. We had better not to touch it. Parquet MR reader can work differently based on that versions to handle some older Parquet writer bugs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L902
    
    @rdblue  Can we use created_by?
    
    ```
      /** String for application that wrote this file.  This should be in the format
       * <Application> version <App Version> (build <App Build Hash>).
       * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
       **/
      6: optional string created_by
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    I would also rather write the justification for this change, for instance, linking the usage of this name in Parquet side, potential usage, etc.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    That will go like the following.
    ```
    file:        file:/tmp/p/part-00007-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
    extra:       org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    @dongjoon-hyun Do you want to take this over?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by npoberezkin <gi...@git.apache.org>.
Github user npoberezkin commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    @dbtsai Hello, I'm sorry for asking you directly, but for some reason jenkins did not generate message: "Can one of the admins verify this patch?".  I just saw that you've reviewed some other PR. That's my first PR, so maybe I've done something wrong, while creating it. I will be grateful for review or any other advice.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Sure, @gatorsmile .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Hi, All.
    
    New PR is made. Please move to https://github.com/apache/spark/pull/22932 for further discussion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    @npoberezkin, Parquet already supports custom key-value metadata in the file footer. The Spark version would go there.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Hi @rdblue, is it roughly good to do here in Spark?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Also cc @hvanhovell 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    I don't think this fits the intent of the model name. The model name is intended to encode what the data model was that was written to Parquet. I can write Avro records to a Parquet file, for example, and we identify that using "avro" (and this could be done in Spark). That could be used if we need to interpret the data differently from a model, but it probably shouldn't include a version of that data model. The data model doesn't change with a version bump, so I think these should be logically separate.
    
    It would be reasonable to add a "spark.version" property with this information. Other data models add properties to the file's key-value metadata for their own use. Avro adds its schema, for example.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by npoberezkin <gi...@git.apache.org>.
Github user npoberezkin commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    I got your idea now. Apparently I was a little confused because of the description of tickets.
    I can try to implement these (writing info about writer.model like "avro" etc in Spark), if you give me some directions on how can i do it and where should i make changes.
    Also I can add "spark.version" property, but if I got everything right, we'll need to open new issue in parquet to do this, am I right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by npoberezkin <gi...@git.apache.org>.
Github user npoberezkin commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Hello, @dbtsai, @HyukjinKwon . I added test on reading writer.model.name to PR. Justification for this change is below.
    This is original jira:
    https://issues.apache.org/jira/browse/SPARK-25102
    and it was referring to this one:
    https://issues.apache.org/jira/browse/PARQUET-352
    where the justification was given (it will be possible to identify files written by object models incorrectly). Also here is the link to Parquet repository with corresponding code changes (justification is also provided there):
    https://github.com/apache/parquet-mr/commit/dcd1c33f0dba247b43418b922c1c3a2fc432dc11
    And i found another case in which possibly this change can be useful:
    https://github.com/dask/fastparquet/issues/352


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    It seems to cause some inconsistency if we choose one of `org.apache.spark.sql.create.version` or `spark.sql.create.version` as a key?
    
    1) If we choose `spark.sql.create.version` as a key, in Parquet, it will look like the following.
    ```
    extra:       spark.sql.create.version = 3.0.0-SNAPSHOT
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
    ```
    
    2) If we choose `org.apache.spark.sql.create.version`, it's different from Hive table property.
    
    I'll ignore the consistency of (2) for backward compatibility.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22255: [SPARK-25102][Spark Core] Write Spark version inf...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22255#discussion_r213881049
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala ---
    @@ -29,6 +29,7 @@ import org.apache.parquet.hadoop.api.WriteSupport
     import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
     import org.apache.parquet.io.api.{Binary, RecordConsumer}
     
    +import org.apache.spark.{SPARK_VERSION}
    --- End diff --
    
    `org.apache.spark.SPARK_VERSION`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Is there any other project writing this into the footer? Tests on reading this back? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22255: [SPARK-25102][Spark Core] Write Spark version informatio...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22255
  
    Currently, we put the metadata like the following.
    ```
    file:        file:/tmp/p/part-00005-dbb9a9ab-0d6a-49df-9f39-397c8505f22b-c000.snappy.parquet
    creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
    ```
    
    For the hive table, it looks like the following. So, I prefer to add `spark.sql.create.version=2.4.0` to `key_value_metadata`. I'll make a PR in this way.
    ```
    parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}, transient_lastDdlTime=1541142761, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.4.0}
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org