Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2015/04/10 02:57:51 UTC

[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/5445

    [SPARK-6677] [SQL] [PySpark] fix cached classes

    It's possible for two DataType objects to have the same id (memory address) at different times, so we should check the cached classes to verify that they were generated from the given datatype.
    
    This PR also changes __FIELDS__ and __DATATYPE__ to lower case to match Python code style.
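    The failure mode described above can be sketched in plain Python. `DataType` here is an illustrative stand-in class, not the real pyspark type:

```python
# Illustrative stand-in; not the real pyspark.sql.types.DataType.
class DataType:
    pass

d1 = DataType()
addr = id(d1)
del d1           # d1 becomes garbage...
d2 = DataType()  # ...and CPython may reuse its memory, so id(d2)
                 # can equal addr even though d2 is a brand-new,
                 # unrelated object.

# id() only distinguishes objects that are alive at the same time:
live = [DataType() for _ in range(3)]
assert len({id(x) for x in live}) == 3
```

    This is why a cache keyed on `id(dataType)` can return a class built for a different, already-collected DataType.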

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_type_cache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5445
    
----
commit 583f5abf5f4658f762ef471feebfa971a4ed17e5
Author: Davies Liu <da...@databricks.com>
Date:   2015-04-10T00:54:34Z

    fix cached classes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200455
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1119,8 +1120,8 @@ def Dict(d):
         class Row(tuple):
     
             """ Row in DataFrame """
    -        __DATATYPE__ = dataType
    -        __FIELDS__ = tuple(f.name for f in dataType.fields)
    +        __datatype = dataType
    --- End diff --
    
    _restore_object is only used for Row, so the dataType should be a StructType. The reason it crashed even though Row holds a reference to its dataType is that there can be multiple ids for a Row class in `_cached_cls`; all of them refer to the same StructType, but from different batches (see the other comment).
    
    So we should not add `__dataType__` to other datatypes.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91646034
  
      [Test build #30039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30039/consoleFull) for   PR 5445 at commit [`63b3238`](https://github.com/apache/spark/commit/63b32384670d8777ed6eafb888880d9d53229762).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91994012
  
    @JoshRosen I think it's fine to go without renaming. The comments are very valuable, thank you!




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91401116
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29993/
    Test FAILed.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91994840
  
    Great!  I'm going to merge this into `master` (1.4.0) and `branch-1.3` (1.3.2).  Thanks!




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91679204
  
    cc @JoshRosen 




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5445




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28199349
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1119,8 +1120,8 @@ def Dict(d):
         class Row(tuple):
     
             """ Row in DataFrame """
    -        __DATATYPE__ = dataType
    -        __FIELDS__ = tuple(f.name for f in dataType.fields)
    +        __datatype = dataType
    --- End diff --
    
    Due to name-mangling, this field will no longer be accessible outside of the `Row` class itself, but I don't think that's a problem based on how we use it: it looks like the only place where we read the old `__DATATYPE__` field was in `__reduce__`.
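    The mangling behavior being discussed can be shown with a toy class (illustrative attribute value, not the real generated Row):

```python
class Row(tuple):
    """Toy version of the generated Row class."""
    __datatype = "struct<a:int>"  # stored as Row._Row__datatype

    def datatype(self):
        # Inside the class body, the mangled name resolves normally.
        return self.__datatype

r = Row()
assert r.datatype() == "struct<a:int>"
assert Row._Row__datatype == "struct<a:int>"  # mangled name works

# Plain double-underscore access from outside the class fails:
try:
    Row.__datatype
    raise AssertionError("unreachable")
except AttributeError:
    pass
```

    So code outside `Row` (like `__reduce__` callers) must either use the mangled `_Row__datatype` name or a dunder name like `__datatype__`, which is exempt from mangling.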




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200522
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1119,8 +1120,8 @@ def Dict(d):
         class Row(tuple):
     
             """ Row in DataFrame """
    -        __DATATYPE__ = dataType
    -        __FIELDS__ = tuple(f.name for f in dataType.fields)
    +        __datatype = dataType
    --- End diff --
    
    Ah, gotcha.  Do you want to go ahead with the renaming here to eliminate the double-reference?




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91946593
  
    A bit of time on GitHub code search suggests that renaming `__FIELDS__` to `__fields__` shouldn't cause any major problems: several other libraries also use this field name, and it doesn't appear to be a reserved Python field / method.
    
    @davies, could you take a look at my summary above and let me know if it's accurate?  Just want to double-check my understanding before merging.  Thanks!




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91985687
  
    This is fairly complicated, but the solution here makes sense to me: we should be guaranteed safety because we now always check that the cache returns a row class for the expected data type.  This looks good to me, but I'll wait to see if @davies wants to do a field renaming proposed upthread.  If not, I'll merge this.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200072
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1119,8 +1120,8 @@ def Dict(d):
         class Row(tuple):
     
             """ Row in DataFrame """
    -        __DATATYPE__ = dataType
    -        __FIELDS__ = tuple(f.name for f in dataType.fields)
    +        __datatype = dataType
    --- End diff --
    
    It looks like `Row` held a reference to its `DataType` even in the old code, so I guess the only new references here are the ones that we're adding to the functions / arrays / etc. returned from the other branches of `_create_cls`.  I suppose that we could set those objects' `__dataType__` fields inside of the `_create_cls` function instead of setting them in `_restore_object` if you think that would be easier to understand.  Not a huge deal to set it outside, though.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91392222
  
      [Test build #29993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29993/consoleFull) for   PR 5445 at commit [`583f5ab`](https://github.com/apache/spark/commit/583f5abf5f4658f762ef471feebfa971a4ed17e5).




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28199718
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
         # same object in most cases.
         k = id(dataType)
         cls = _cached_cls.get(k)
    -    if cls is None:
    +    if cls is None or cls.__datatype is not dataType:
    --- End diff --
    
    Yes, there is a side effect: the id of dataType will not be recycled once we keep a reference to it in `cls`.
    
    At first I thought the check was helping; actually, it's not. We can change it to an assert, or remove it.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91646050
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30039/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28199394
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
         # same object in most cases.
         k = id(dataType)
         cls = _cached_cls.get(k)
    -    if cls is None:
    +    if cls is None or cls.__datatype is not dataType:
    --- End diff --
    
    Do we need this check for correctness?  If I understand the rest of this patch correctly, won't the write to `cls.__datatype` down on line 1006 mean that we'll retain a strong reference to the DataType as long as the Row class is still referenced, preventing the re-use of the DataType's object id?
    
    If we expect this check to never return true, should we turn it into an assertion or log a warning if it's true?




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200422
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
         # same object in most cases.
         k = id(dataType)
         cls = _cached_cls.get(k)
    -    if cls is None:
    +    if cls is None or cls.__datatype is not dataType:
    --- End diff --
    
    After some debugging, I found that this check is necessary. For example:
    
    At first, we create DataType `d1` in the first batch and generate a class Row; `Row.__datatype__` will point to d1, and the class will be cached under both id(d1) and d1:
    
    ```
    id(d1) -> Row
    d1 -> Row
    Row.__datatype__ -> d1
    ```
    
    In the second batch, we create d2, which is equal to d1. A lookup by id(d2) will miss, but we will find Row using d2 as the key, so the cache becomes:
    ```
    id(d1) -> Row
    d1 -> Row
    id(d2) -> Row
    Row.__datatype__ -> d2
    ```
    At this point, d1 can be collected by GC, and id(d1) can be recycled for another object.
    
    Later, there could be an object dd with the same id as d1. We would then find Row via id(dd), but dd is different from d1 (a different struct type), and the assert would fail.
    
    Finally, I think the current approach is the correct one.





[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91622458
  
      [Test build #30039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30039/consoleFull) for   PR 5445 at commit [`63b3238`](https://github.com/apache/spark/commit/63b32384670d8777ed6eafb888880d9d53229762).




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91457641
  
      [Test build #30010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30010/consoleFull) for   PR 5445 at commit [`47bdede`](https://github.com/apache/spark/commit/47bdededefaf8696c9956a7091cc83b10a0a857e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200511
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
         # same object in most cases.
         k = id(dataType)
         cls = _cached_cls.get(k)
    -    if cls is None:
    +    if cls is None or cls.__datatype is not dataType:
    --- End diff --
    
    This is subtle, but your explanation makes sense.  Thanks!




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91945517
  
    To recap my understanding of this patch:
    
    - The `_cached_cls` dictionary maps from either DataType object ids or DataType objects to the generated Row classes for those data types.
    - Using an object id as a dictionary key is safe only as long as that id refers to the same object for the lifetime of the dictionary.  As long as a DataType instance isn't garbage-collected, its object id will not be re-used by a different DataType object.
    - The problem here seems to be that we weren't guaranteed to retain a strong reference to the DataType instance.  Although the DataType itself was used as a dictionary key for the `_cached_cls` dictionary, that dictionary is a [WeakValueDictionary](https://docs.python.org/2/library/weakref.html#weakref.WeakValueDictionary), so its reference to the DataType key would be removed if that type's Row class was garbage-collected.
    - The solution implemented in this patch addresses the issue via two mechanisms:
      - When storing a Row class in the `_cached_cls` map, store a reference on the Row class that points to the DataType.  This avoids the current problem where the Row class can remain in the map even though its DataType has been garbage-collected.
      - Add a check that tests whether the Row class returned from the `_cached_cls` map has the expected DataType.  As far as I understand it, this acts as an assertion / sanity check / error handler, so we expect it to succeed most (all?) of the time.
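    The WeakValueDictionary behavior in the third bullet can be demonstrated directly with a toy cache (not the real `_cached_cls`):

```python
import gc
import weakref

cache = weakref.WeakValueDictionary()

RowCls = type("Row", (tuple,), {})  # stand-in for a generated class
cache["some-datatype-key"] = RowCls
assert cache.get("some-datatype-key") is RowCls

del RowCls     # drop the only strong reference to the value
gc.collect()   # classes sit in reference cycles; force a collection
assert cache.get("some-datatype-key") is None  # entry vanished
```

    When the entry vanishes, the key goes with it, so a DataType used only as a `_cached_cls` key loses its last strong reference once its Row class is collected.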




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28199732
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1119,8 +1120,8 @@ def Dict(d):
         class Row(tuple):
     
             """ Row in DataFrame """
    -        __DATATYPE__ = dataType
    -        __FIELDS__ = tuple(f.name for f in dataType.fields)
    +        __datatype = dataType
    --- End diff --
    
    It seems that Row would have two references to dataType, via `_Row__dataType` and `__dataType`, so I'd like to change it to `__dataType__` to avoid the double reference.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91457651
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30010/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91436387
  
      [Test build #30010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30010/consoleFull) for   PR 5445 at commit [`47bdede`](https://github.com/apache/spark/commit/47bdededefaf8696c9956a7091cc83b10a0a857e).




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200080
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
         # same object in most cases.
         k = id(dataType)
         cls = _cached_cls.get(k)
    -    if cls is None:
    +    if cls is None or cls.__datatype is not dataType:
    --- End diff --
    
    I think we should change it into an assert, since that will make it easier to debug this issue if it ever crops up again (if we accidentally removed the reference keeping the `DataType` alive, for example).




[GitHub] spark pull request: [SPARK-6677] [SQL] [PySpark] fix cached classe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5445#issuecomment-91401108
  
      [Test build #29993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29993/consoleFull) for   PR 5445 at commit [`583f5ab`](https://github.com/apache/spark/commit/583f5abf5f4658f762ef471feebfa971a4ed17e5).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.

