
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/11/11 15:14:37 UTC

[GitHub] [incubator-iceberg] asheeshgarg commented on issue #621: Broadcast Join Failure

URL: https://github.com/apache/incubator-iceberg/issues/621#issuecomment-552485021
 
 
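   import org.apache.iceberg.PartitionSpec
   import org.apache.iceberg.hadoop.HadoopTables
   import org.apache.iceberg.spark.SparkSchemaUtil
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Creates an Iceberg table at `path` and writes `dataFrame` into it.
   def writeData(dataFrame: DataFrame, path: String, format: String, mode: SaveMode): Unit = {
     val tables = new HadoopTables(spark.sparkContext.hadoopConfiguration)
     val schema = SparkSchemaUtil.convert(dataFrame.schema)
     // The spec is built without any partition fields, so it is an unpartitioned spec.
     val partitionSpec = PartitionSpec.builderFor(schema).build()
     tables.create(schema, partitionSpec, path)
     dataFrame
       .write
       .format(format)
       .mode(mode)
       .partitionBy("DATE")
       .save(path)
   }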
   val referenceTableLoc = s"${storeLocation}/iceberg/eqty/reference"
   writeData(refDf, referenceTableLoc, "iceberg", SaveMode.Append)
   
   val pricingTableLoc = s"${storeLocation}/iceberg/eqty/pricing"
   writeData(pricingDf, pricingTableLoc, "iceberg", SaveMode.Append)
   
   The method above is used to write 30 days of reference and pricing data.
   The data is loaded by date, so I see roughly equally sized Parquet files in the reference and pricing stores that Iceberg creates.
   
   The read is performed with:
     spark.read.format("iceberg").load("iceberg/eqty/reference")
       .join(spark.read.format("iceberg").load("iceberg/eqty/pricing"), Seq("ID_BB_GLOBAL", "DATE"))
       .count()
   
   As soon as you run this, you see the error I mentioned in the original request.
   If you repartition the data into more partitions, it works.
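   
   A minimal sketch of that repartition workaround, for reference. The partition count and the choice to repartition on the join keys are assumptions for illustration only; the exact values used are not stated here.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col

   val spark = SparkSession.builder().getOrCreate()

   // Hypothetical partition count, chosen only for illustration.
   val numPartitions = 200

   // Repartition both sides on the join keys before joining.
   val reference = spark.read.format("iceberg").load("iceberg/eqty/reference")
     .repartition(numPartitions, col("ID_BB_GLOBAL"), col("DATE"))
   val pricing = spark.read.format("iceberg").load("iceberg/eqty/pricing")
     .repartition(numPartitions, col("ID_BB_GLOBAL"), col("DATE"))

   reference.join(pricing, Seq("ID_BB_GLOBAL", "DATE")).count()
   ```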
   
   As mentioned, the same join works directly against raw Parquet, and a similar join using Delta Lake also works, so the size of the data is not the problem.
   We don't want to repartition the data arbitrarily.
   
   
   
   
