Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/01/20 14:15:00 UTC

[jira] [Resolved] (PARQUET-1746) Changed the data order after DataFrame reuse

     [ https://issues.apache.org/jira/browse/PARQUET-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1746.
---------------------------------------
    Resolution: Not A Problem

The related Spark test generates 22 Parquet files. The first 11 are empty, meaning they contain no data. (I am not sure whether they are even valid this way.)

The last 11 contain only one value each:
{noformat}
$> ls *.parquet| while read file; do echo "$file"; parquet-tools cat $file 2>/dev/null; done
part-00000-19f5b358-410b-4dd4-b167-4016984ac6ef-c000.snappy.parquet
part-00000-212d052b-d03a-413b-98f3-1348c2d06855-c000.snappy.parquet
part-00000-311f4442-4225-47f1-aaf1-c7a8e38a875f-c000.snappy.parquet
part-00000-459612f9-d564-43a9-bf31-2d174c996fa6-c000.snappy.parquet
part-00000-5e20cfa6-a5d0-4d5f-a382-741907a74874-c000.snappy.parquet
part-00000-62881d28-7226-4a78-9fe7-2ed41b895e1c-c000.snappy.parquet
part-00000-9aaa784f-080a-43ae-9296-20bd033aa300-c000.snappy.parquet
part-00000-a01e81ab-a987-4929-991d-60f01acab1ca-c000.snappy.parquet
part-00000-add0de8e-26eb-406b-bf02-702924f89f1a-c000.snappy.parquet
part-00000-e8dd315d-b97e-4257-917c-34696d0a866c-c000.snappy.parquet
part-00000-ed8be0d2-508f-4666-b66f-93182413472e-c000.snappy.parquet
part-00001-20b63b66-8f9a-4e3b-893c-4acb106ddac1-c000.snappy.parquet
a = 7

part-00001-227ff83d-5341-48be-97be-00cde92cb303-c000.snappy.parquet
a = 1

part-00001-38e186bb-ca67-4e3d-87fe-780585f25c84-c000.snappy.parquet
a = 0

part-00001-3b06880b-6d57-49d7-bb63-4220092ef1ae-c000.snappy.parquet
a = 4

part-00001-449026a6-f486-4fca-81fa-b7cdeaddfa3b-c000.snappy.parquet
a = 5

part-00001-567ed849-b1e9-494f-b33f-495592826b28-c000.snappy.parquet
a = 2

part-00001-70fa8c7e-9b45-4103-a99e-5b0f61b6062a-c000.snappy.parquet
a = 10

part-00001-7399d477-c393-481b-b76f-1289deb72bc0-c000.snappy.parquet
a = 3

part-00001-93678ef9-27d4-4a5d-aaa1-58492de248e7-c000.snappy.parquet
a = 6

part-00001-c1b934d8-0058-40e0-87f9-40ee7eca52ed-c000.snappy.parquet
a = 8

part-00001-c599dd4d-32c8-4032-935a-b1d45bc508e1-c000.snappy.parquet
a = 9
{noformat}

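Since every file holds at most one value, the order of the values in the result depends only on the order in which the files are read. The parquet-mr library therefore has nothing to do with the ordering of these values.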
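
To illustrate the point, here is a minimal sketch in plain Python (no Spark or Parquet involved). It models the 11 non-empty files from the listing as single-value lists, keyed by truncated file names. The helper read_all is hypothetical and only mimics a reader that concatenates files in whatever order it is handed them; it is not how Spark or parquet-mr is implemented.

```python
# Each non-empty part file from the listing holds exactly one value of column `a`.
# Keys are the file names from the listing, truncated after the first UUID group.
single_value_files = {
    "part-00001-20b63b66": [7],
    "part-00001-227ff83d": [1],
    "part-00001-38e186bb": [0],
    "part-00001-3b06880b": [4],
    "part-00001-449026a6": [5],
    "part-00001-567ed849": [2],
    "part-00001-70fa8c7e": [10],
    "part-00001-7399d477": [3],
    "part-00001-93678ef9": [6],
    "part-00001-c1b934d8": [8],
    "part-00001-c599dd4d": [9],
}


def read_all(files, file_order):
    """Hypothetical reader: concatenate file contents in the given file order."""
    rows = []
    for name in file_order:
        rows.extend(files[name])
    return rows


# Two different file orders over the same set of files.
by_name = read_all(single_value_files, sorted(single_value_files))
by_name_reversed = read_all(
    single_value_files, sorted(single_value_files, reverse=True)
)

print(by_name)           # order follows the file names, not the values
print(by_name_reversed)  # same values, different order
print(sorted(by_name))   # an explicit sort gives a deterministic result
```

With one value per file there is no row order inside a file for the file format to preserve or change. A test that needs a fixed order would have to sort the result explicitly or compare it without regard to order.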

> Changed the data order after DataFrame reuse
> --------------------------------------------
>
>                 Key: PARQUET-1746
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1746
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1746
> git checkout PARQUET-1746
> build/sbt "sql/test-only *StreamSuite"
> {code}
> output:
> {noformat}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Decoded objects do not match expected objects:
> expected: WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> actual:   WrappedArray(0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 2)
> assertnotnull(upcast(getcolumnbyordinal(0, LongType), LongType, - root class: "scala.Long"))
> +- upcast(getcolumnbyordinal(0, LongType), LongType, - root class: "scala.Long")
>    +- getcolumnbyordinal(0, LongType)
>          
> 	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> 	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> 	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> 	at org.scalatest.Assertions.fail(Assertions.scala:1091)
> 	at org.scalatest.Assertions.fail$(Assertions.scala:1087)
> 	at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
> 	at org.apache.spark.sql.QueryTest.checkDataset(QueryTest.scala:73)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22(StreamSuite.scala:215)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22$adapted(StreamSuite.scala:208)
> 	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
> 	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
> 	at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
> 	at org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
> 	at org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
> 	at org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
> 	at org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21(StreamSuite.scala:208)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21$adapted(StreamSuite.scala:207)
> 	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
> 	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
> 	at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
> 	at org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
> 	at org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
> 	at org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
> 	at org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
> 	at org.apache.spark.sql.streaming.StreamSuite.assertDF$1(StreamSuite.scala:207)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$25(StreamSuite.scala:226)
> 	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:52)
> 	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:36)
> 	at org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(StreamSuite.scala:51)
> 	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:231)
> 	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:229)
> 	at org.apache.spark.sql.streaming.StreamSuite.withSQLConf(StreamSuite.scala:51)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$24(StreamSuite.scala:225)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$24$adapted(StreamSuite.scala:224)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$20(StreamSuite.scala:224)
> 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 	at org.scalatest.Transformer.apply(Transformer.scala:22)
> 	at org.scalatest.Transformer.apply(Transformer.scala:20)
> 	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
> 	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> 	at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> 	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> 	at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> 	at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> 	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
> 	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> 	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> 	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
> 	at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> 	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
> 	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
> 	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
> 	at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
> 	at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
> 	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> 	at org.scalatest.Suite.run(Suite.scala:1124)
> 	at org.scalatest.Suite.run$(Suite.scala:1106)
> 	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
> 	at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
> 	at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
> 	at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
> 	at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
> 	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
> 	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> 	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> 	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> 	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
> 	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
> 	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
> 	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> 	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)