You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2018/01/16 19:38:12 UTC

[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/20280

    [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include __from_dict__ flag

    ## What changes were proposed in this pull request?
    
    When a `Row` object is created using kwargs, the order of the keywords can not be relied upon  (except for Python 3.5 that uses an OrderedDict).  The fields are sorted in the constructor and a flag `__from_dict__` is set to indicate that this object was created from kwargs so that other areas in Spark can access row data using field names instead of by position.  This change includes the `__from_dict__` flag only when pickling a Row that was made from kwargs so that the behavior is preserved if the Row becomes pickled.
    
    ## How was this patch tested?
    
    Fixed existing tests that relied on fields and schema being in the same alphabetical order.  Added new test to create `Row` from positional arguments where order matters.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark pyspark-Row-serialize-SPARK-22232

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20280
    
----

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Thanks @HyukjinKwon and @felixcheung , I'm a bit worried too that this might break someones code, but it doesn't affect `createDataFrame` from `Row`s, it's only when the Row is serialized like going from an RDD of Rows `toDF`.  Even then the schema gets alphabetized, which I'm sure the users would agree that it is strange.
    
    I'm not sure about adding a config switch, it might be a little hard to add and could be confusing to the user to explain that its only when serialized and the schema would need to be sorted by the original Row keywords.
    
    I'll go ahead and update the migration guide, and expand on the PR description to hopefully make the change as clear as possible.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86205/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #89530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89530/testReport)** for PR 20280 at commit [`10bf2d0`](https://github.com/apache/spark/commit/10bf2d094b29b4e8ef7a38693f3956f96c0e9f7e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    closing now, will revisit for Spark 3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Probably that'd work but also it'd be trickier to add / remove that configuration. Another similar option maybe just close this for now and target this for 3.0.0 since we already started to talk about it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    holding off is fine; however, I am less sure about the configuration if that's not something you guys feel strongly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    @HyukjinKwon , I was also thinking about holding off on this until 3.0.0 and then make a clean switch.  What do you think about that @holdenk ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #89530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89530/testReport)** for PR 20280 at commit [`10bf2d0`](https://github.com/apache/spark/commit/10bf2d094b29b4e8ef7a38693f3956f96c0e9f7e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #86193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86193/testReport)** for PR 20280 at commit [`315b8de`](https://github.com/apache/spark/commit/315b8de0fb3e7277b895b98769e52da7aaae32d6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86204/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    I'm kinda worry the example you give above is actually fairly common - construct with kwargs, and then (re-)name the columns.
    
    perhaps worthwhile to consider a config switch?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20280#discussion_r161964785
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
             self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
             self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])
     
    -        # field names can differ.
    -        df = rdd.toDF(" a: int, b: string ")
    --- End diff --
    
    Yeah, I think you're right. The schema should be able to override names no matter what.  I was thinking it was flawed because it relied on the row field to be sorted internally, so just changing the kwarg names (not the order) could cause it to fail.  That seems a little strange, but maybe it is the intent?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2452/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Thanks @HyukjinKwon and @MrBago for reviewing.  After thinking about this some more, I don't think this is the right solution.  Like @HyukjinKwon pointed out, the supplied schema names should always override any specified from `Row` even if it was made from kwargs.  So that means `toDF` must go by position, and for `Row` with kwargs that is with field names sorted.  That seems a little strange but I believe it is mostly due to a Python limitation of kwargs not being in a specific order and I don't know if there is much we can do about it.
    
    Something still seems wrong though because the example @MrBago has in the JIRA is inconsistent.  I'll go over it again tomorrow, when I'm more awake..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #86205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86205/testReport)** for PR 20280 at commit [`a7d3396`](https://github.com/apache/spark/commit/a7d339624d3ddf80af63fd3710fdc1e0742ecc6c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Also, this will cause a breaking change if `Row`s are defined with kwargs and schema changes field names, like this:
    
    ```
    data = [Row(key=i, value=str(i)) for i in range(100)]
    rdd = self.sc.parallelize(data, 5)
    df = rdd.toDF(" a: int, b: string ")
    ```
    
    and this would work but might be slower, depending on how complicated the schema is, because now the field names are searched for instead of just going by position
    ```
    df = rdd.toDF(" key: int, value: string ")
    ```
    
    So if we go forward with this fix, I should probably add something in the migration guide


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #86204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86204/testReport)** for PR 20280 at commit [`2192e49`](https://github.com/apache/spark/commit/2192e494eb8a8f8d3d42b20fdb2b3c681f6bdcb5).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #86205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86205/testReport)** for PR 20280 at commit [`a7d3396`](https://github.com/apache/spark/commit/a7d339624d3ddf80af63fd3710fdc1e0742ecc6c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    If the renaming scenario works in most of cases as expected, I think it'd be worthwhile to have a configuration; however, the previous behaviour looks actually odd because it's going to work only in certain weird conditions when fields in `Row` and fields in the given schema are in the same alphabetical order (https://github.com/apache/spark/pull/20280#discussion_r182569705). Otherwise this case fails already as well.
    
    The test case modified in https://github.com/apache/spark/pull/20280#discussion_r182569705 actually works only because `key` and `value` in `Row` and `a` and `b` in the schema are in the same order. I think the test case should be invalid .. 
    
    I thought about this for a while and failed to describe what the configuration does .. It sounded describing a bug like it was a proper behaviour that can be controlled by a configuration ..
    
    I think this one sounds more like a bug fix to me so far. Workaround should be relatively easy. Maybe, would it be good enough to describe workaround in the guide instead? I think it should be fine if we just use a map to convert `Row` to things like a tuple.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86193/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Let me restate what I think the intended behavior of Row is:
    
    If a `Row` is made from kwargs, then the order of the fields can not be relied upon and whenever accessing data, it must be done like a dict with the field name.  Because of this, when applying a schema to the data, the schema fields must also be fields in the `Row` objects.  Field position can change as long as the name matches.
    
    If a `Row` is made from generating a custom class, like `TestRow = Row("key", "value")` then `row = TestRow('a', 1)`, the the schema will be applied base on position and the elements in the `Row` objects are accessed by index.  The name of each field in the schema can differ as long as the element at that index can be converted to the specified schema type.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #86193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86193/testReport)** for PR 20280 at commit [`315b8de`](https://github.com/apache/spark/commit/315b8de0fb3e7277b895b98769e52da7aaae32d6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    > I think we should raise an error if __from_dict__ is set and the user tries to index using a position or a slice.
    
    I'd also like to follow up with another PR to address some of the usability issues with `Row`. I found a couple other unfriendly behaviors and we can address this ^ there as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Looking at this again, I'm back to thinking this is the right fix.  Based on #14469, if the `Row` objects were made with named arguments, then the intent is for elements to be looked up by field name since the schema could be in a different order.  This shouldn't change depending on if the `Row` objects were serialized.  Let me restate what I think the intended behavior of `Row` is:


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    oops, I missed this. will take a look shortly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    BTW, I believe it's not so easy to pass a configuration from a very quick look because the exception usually would be thrown in a Python worker process.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    BTW the performance issue is orthogonal to the serialization issue raised in this jira/PR. Maybe we should avoid scope creep in this thread.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20280#discussion_r161865195
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
             self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
             self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])
     
    -        # field names can differ.
    -        df = rdd.toDF(" a: int, b: string ")
    --- End diff --
    
    This test was flawed because it only worked because ("a", "b") is in the same alphabetical order as ("key", "value").  If it was ("key", "aaa") then it would fail. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    After looking into this, it seems like the behavior of the `Row` class is as follows:
    
    If a `Row` is made from kwargs, then the order of the fields can not be relied upon and whenever accessing data, it must be done like a dict with the field name.  When this is the case, the order of the supplied schema doesn't matter but the field name must be a subset of what is in each row.
    
    If a `Row` is made from generating a custom class, like `TestRow = Row("key", "value")` then `row = TestRow('a', 1)`, then the position of each element is what is important and data is accessed by position in the tuple.  The supplied schema for this must match the types of the rows exactly, however field names are not important and can be changed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20280#discussion_r182569705
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
             self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
             self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])
     
    -        # field names can differ.
    -        df = rdd.toDF(" a: int, b: string ")
    --- End diff --
    
    I still think this test is invalid because it only works as the `Row`s get serialized and lose the `__from_dict__` flag.  This should be the same think, but would fail:
    
    ```
    df = spark.createDataFrame([Row(key=i, value=str(i)) for i in range(100)], schema=" a: int, b: string ")
    ```
    And it would also fail if the named args of the `Row` objects were not in the same alphabetical order as the schema:
    
    ```
    data = [Row(z=i, y=str(i)) for i in range(100)]
    rdd = self.sc.parallelize(data, 5)
    
    df = rdd.toDF(" a: int, b: string ")
    ```
    This fails because the Row fields would be sorted in a different order, switching the type order.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    @HyukjinKwon @holdenk and @MrBago have any thoughts on moving forward with this change?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    ^ Yup, let's leave the performance issue out. I think we might have to raise an error too but it's kind of a radical change.
    
    As a note, sorted fields are documented:
    
    https://github.com/apache/spark/blob/3e40eb3f1ffac3d2f49459a801e3ce171ed34091/python/pyspark/sql/types.py#L1451-L1452
    
    My only main concern is:
    
    >...  the field name must be a subset of what is in each row.
    
    >... field names are not important and can be changed.
    
    I think this is kind of a breaking change because we will basically now disallow the names given by user explicitly IIUC?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89530/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    @BryanCutler, mind if I ask to clarify what happens for end-to-end cases in the PR description (like before & after with explaining the reasons)? the change looks small but possibly a breaking change about end-to-end cases although I think for now we are restoring the correct behaviour as expected.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Right. Will triple check for sure but I am with you for now. Yup, something in the migration guide makes much more sense to me too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    **[Test build #86204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86204/testReport)** for PR 20280 at commit [`2192e49`](https://github.com/apache/spark/commit/2192e494eb8a8f8d3d42b20fdb2b3c681f6bdcb5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Hey @holdenk , yeah I've been meaning to circle back to this and get some kind of resolution.  I'll try to take another look later this week.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20280#discussion_r161929011
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
             self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
             self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])
     
    -        # field names can differ.
    -        df = rdd.toDF(" a: int, b: string ")
    --- End diff --
    
    But wasn't this testing `field names can differ.` by user's schema input?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler closed the pull request at:

    https://github.com/apache/spark/pull/20280


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    @MrBago @HyukjinKwon I think the above behavior of the `Row` class is a little screwy, but at least this fixes it to be more consistent.  I'm not sure if there is a way to rectify the two different uses without breaking one way or the other.  Also to note, using kwargs the performance will likely be really poor because it must find the index for each field and this should maybe be discouraged.  cc @holdenk 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Hey @BryanCutler is this still on your radar?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by MrBago <gi...@git.apache.org>.
Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    I think we should raise an error if `__from_dict__` is set and the user tries to index using a position or a slice.
    
    Indexing by field name takes the same code path for Rows that are and are not `__from_dict__` and I don't think we can "discourage" this because `row.fieldName` is commonly used in our docs and other pyspark learning materials. If performance is an issue, maybe we should replace `__fields__` with a dict that maps fileldName -> position.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Awesome, looking forward to the update.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    I'm worried that people might have two rows with different meanings but the same type and their application will start producing garbage #s. I think a lot of people go from RDDs of Rows to DFs in PySpark these days, so I'm a little nervous about this change even though I think its a step in the right direction.
    
    How about a config flag defaulting to off and we switch over @ 3.0?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20280
  
    Ahh. @zero323 too if you are available because I think we had a talk about this somewhere long time ago. Yea, I am aware of the issue itself. Will take a look soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org