Posted to issues@spark.apache.org by "Xiu (JIRA)" <ji...@apache.org> on 2015/10/24 02:34:27 UTC

[jira] [Commented] (SPARK-5737) Scanning duplicate columns from parquet table

    [ https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972195#comment-14972195 ] 

Xiu commented on SPARK-5737:
----------------------------

I tried this scenario and could not reproduce it on 1.5.1. The newer read.parquet() method also works fine in this scenario.

{code}
scala> val rdd = sqlContext.parquetFile("./examples/src/main/resources/users.parquet")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
rdd: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string, favorite_numbers: array<int>]

scala> rdd.select("name", "name", "favorite_color", "favorite_color", "favorite_numbers", "favorite_numbers").foreach(println)
[Alyssa,Alyssa,null,null,WrappedArray(3, 9, 15, 20),WrappedArray(3, 9, 15, 20)]
[Ben,Ben,red,red,WrappedArray(),WrappedArray()]

scala> val rdd2 = sqlContext.read.parquet("./examples/src/main/resources/users.parquet")
rdd2: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string, favorite_numbers: array<int>]

scala>  rdd2.select("name", "name", "favorite_color", "favorite_color", "favorite_numbers", "favorite_numbers").foreach(println)
[Alyssa,Alyssa,null,null,WrappedArray(3, 9, 15, 20),WrappedArray(3, 9, 15, 20)]
[Ben,Ben,red,red,WrappedArray(),WrappedArray()]
{code}
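
For completeness, here's the same check against a freshly written file using the reporter's d1/d2 column names. The data below is made up (temp.parquet isn't attached to the issue); it just needs two double columns:

{code}
// Hypothetical stand-in for the reporter's temp.parquet (d1/d2 double columns);
// the real file is not attached to the issue.
import sqlContext.implicits._

val df = Seq((-5.7, 121.05), (-61.17, 108.91), (50.60, 72.15)).toDF("d1", "d2")
df.write.mode("overwrite").parquet("temp.parquet")

// Expected on 1.5.1: duplicated values, not nulls, in each pair of columns.
sqlContext.read.parquet("temp.parquet")
  .select($"d1", $"d1", $"d2", $"d2")
  .take(3).foreach(println)
{code}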

[~marmbrus], should we close this one?

> Scanning duplicate columns from parquet table
> ---------------------------------------------
>
>                 Key: SPARK-5737
>                 URL: https://issues.apache.org/jira/browse/SPARK-5737
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>            Reporter: Kevin Jung
>
> {code}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {code}
> The results of the above code contain null values in the first column of each duplicated pair.
> For example,
> {code}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {code}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows contain the duplicated values, like:
> {code}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {code}


