Posted to reviews@spark.apache.org by RaghavendraS <gi...@git.apache.org> on 2016/06/20 07:00:20 UTC

[GitHub] spark pull request #13779: Replace NullType to StringType in a DataFrame Sch...

GitHub user RaghavendraS opened a pull request:

    https://github.com/apache/spark/pull/13779

    Replace NullType to StringType in a DataFrame Schema, then we can abl…

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    
    …e to write DataFrame in a Parquet Format

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/RaghavendraS/spark raghavendra-spark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13779
    
----
commit 915ebe88294b256334b73aed9726812b23ab25c7
Author: rJabong <ra...@jabong.com>
Date:   2016-06-20T06:59:13Z

    Replace NullType to StringType in a DataFrame Schema, then we can able to write DataFrame in a Parquet Format

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13779: Replace NullType to StringType in a DataFrame Schema, th...

Posted by RaghavendraS <gi...@git.apache.org>.

Github user RaghavendraS commented on the issue:

https://github.com/apache/spark/pull/13779

Thanks @AmplabJenkins, @hvanhovell, @marmbrus, @akatz

**In my case:** When we fetch incremental data from MongoDB and store it as Parquet files, the write fails with a NullType error, because Parquet has no NullType data type. So I came up with the two solutions below.

**Case-1: If we convert NullType to StringType.**
This lets us union the last n days of incremental Parquet data without any error. We only need to compare the schemas from bottom to top, change the data types accordingly, apply the resulting schema to the DataFrames, and union them from bottom to top.

**Case-2: If we drop NullType field.**
In this case we need to transform each RDD to match the final schema. In Case-1 we only transform the schema and leave the RDD untouched, so Case-1 is better than Case-2.
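
To make the difference between the two cases concrete, here is a minimal sketch in plain Python. It is not the code from this pull request. It assumes the schema is available in Spark's JSON form (what PySpark's `df.schema.jsonValue()` returns, where NullType appears as the type name `"null"`), and the field names `id` and `coupon` are hypothetical. Applying the rewritten schema back to a DataFrame is left out.

```python
def replace_null_type(dtype, replacement="string"):
    """Return a copy of a Spark schema (JSON form) with "null" types replaced.

    Walks structs, arrays and maps recursively; the input is not modified.
    """
    if dtype == "null":
        return replacement
    if isinstance(dtype, dict):
        kind = dtype.get("type")
        if kind == "struct":
            fields = [
                {**f, "type": replace_null_type(f["type"], replacement)}
                for f in dtype["fields"]
            ]
            return {**dtype, "fields": fields}
        if kind == "array":
            return {
                **dtype,
                "elementType": replace_null_type(dtype["elementType"], replacement),
            }
        if kind == "map":
            return {
                **dtype,
                "valueType": replace_null_type(dtype["valueType"], replacement),
            }
    return dtype


def drop_null_fields(schema):
    """Case-2 variant: drop top-level fields whose type is "null"."""
    kept = [f for f in schema["fields"] if f["type"] != "null"]
    return {**schema, "fields": kept}


# Hypothetical schema for one day's increment: `coupon` held only nulls,
# so its type was inferred as NullType.
day1 = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "coupon", "type": "null", "nullable": True, "metadata": {}},
    ],
}

case1 = replace_null_type(day1)   # same columns, `coupon` becomes a string
case2 = drop_null_fields(day1)    # `coupon` is gone, so the rows must change too

print([(f["name"], f["type"]) for f in case1["fields"]])
print([(f["name"], f["type"]) for f in case2["fields"]])
```

In Case-1 the column set is unchanged, so every day's data keeps the same shape. In Case-2 the column disappears, which is why the data itself has to be transformed to match.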

@hvanhovell Please let me know if you know of any other solution.


[GitHub] spark issue #13779: Replace NullType to StringType in a DataFrame Schema, th...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/13779
  
    @RaghavendraS I don't think we should do this. Having a variable with a `null` type indicates that something should be fixed in the application code. Changing the meaning of the column can lead to surprises. Can you give an example of when this is useful?

