Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/13 20:59:38 UTC

[GitHub] [hudi] raj638111 opened a new issue #3473: Duplicate field: Partition field also available in parquet file

raj638111 opened a new issue #3473:
URL: https://github.com/apache/hudi/issues/3473


   **Describe the problem you faced**
   
   When I query the Hudi table from spark-shell, reading it as plain `parquet`, I get the following warning:
   ```
   spark.read.parquet("s3://bucket1/huditable1").where("date = '20210101' and hour = '01'  and field1 = 'somevalue' ")
   WARN DataSource: Found duplicate column(s) in the data schema and the
     partition schema: `date`, `hour`
   ```
   On closer inspection, I found that the parquet file itself also contains the same fields (i.e. `date` and `hour`):
   ```
   println(spark.read.parquet("s3://bucket1/huditable1/date=20210101/hour=01/file1.parquet").schema.treeString)
    |-- field1: string (nullable = true)
    |-- field2: string (nullable = true)
    |-- date: string (nullable = true)
    |-- hour: string (nullable = true)
   ```
   Is there a way to remove the duplicate fields `date` and `hour` from the parquet file?
   It seems that during ingestion Hudi also writes the partition fields into the parquet file.
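   
   To make the overlap concrete, here is a minimal sketch in plain Scala (no Spark needed, it can be pasted into spark-shell). It only imitates Hive-style partition discovery by treating every `key=value` directory segment as a partition column; the helper `partitionColumns` is a hypothetical name for illustration, not a Spark or Hudi API, and the column list is copied by hand from the schema printed above.
   
   ```scala
   // Simplified stand-in for partition discovery: each `key=value`
   // path segment contributes one partition column name.
   // (Hypothetical helper, not part of Spark or Hudi.)
   def partitionColumns(path: String): Seq[String] =
     path.split("/").toSeq.filter(_.contains("=")).map(_.takeWhile(_ != '='))
   
   // Column names stored inside the parquet file, per the schema above.
   val dataColumns = Seq("field1", "field2", "date", "hour")
   
   // Column names implied by the directory layout.
   val fromPath = partitionColumns(
     "s3://bucket1/huditable1/date=20210101/hour=01/file1.parquet")
   
   // Names present in both places are the ones the warning lists.
   val duplicates = dataColumns.intersect(fromPath)
   println(duplicates.mkString(", "))
   ```
   
   The two names that come out of the intersection are the same two that the `WARN DataSource` message reports.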
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Spark version : 3.1.1
   
   * Hive version : _
   
   * Hadoop version : _
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no
   
   * EMR: emr-6.3.0 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan closed issue #3473: [Support] Duplicate field: Partition field also available in parquet file

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3473:
URL: https://github.com/apache/hudi/issues/3473


   





[GitHub] [hudi] nsivabalan commented on issue #3473: [Support] Duplicate field: Partition field also available in parquet file

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3473:
URL: https://github.com/apache/hudi/issues/3473#issuecomment-905178885


   Closing the issue. Feel free to re-open if your requirement is not met. 





[GitHub] [hudi] nsivabalan commented on issue #3473: [Support] Duplicate field: Partition field also available in parquet file

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3473:
URL: https://github.com/apache/hudi/issues/3473#issuecomment-901496447


   We recently added support for dropping, from the incoming DataFrame, the columns that were used to generate the partition path in Hudi: https://github.com/apache/hudi/pull/3465
   Can you check whether this solves your need?
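   
   As a rough sketch of what trying this might look like, the write options below assume the feature is exposed through the `hoodie.datasource.write.drop.partition.columns` option and that the Hudi build in use already includes it (0.8.0, the version in the report, predates the pull request). The record key `field1`, the use of `ComplexKeyGenerator` for the two-field partition path, and the DataFrame `df` in the trailing comment are assumptions made for illustration; please verify each key against the configuration reference for your Hudi version.
   
   ```scala
   // Hypothetical write options for the table in this issue.
   val hudiWriteOptions = Map(
     "hoodie.table.name" -> "huditable1",
     // Assumption: field1 serves as the record key.
     "hoodie.datasource.write.recordkey.field" -> "field1",
     "hoodie.datasource.write.partitionpath.field" -> "date,hour",
     // Assumption: a multi-field partition path needs a complex key generator.
     "hoodie.datasource.write.keygenerator.class" ->
       "org.apache.hudi.keygen.ComplexKeyGenerator",
     // Produces date=.../hour=... directories as in the report.
     "hoodie.datasource.write.hive_style_partitioning" -> "true",
     // Assumed name of the option for dropping partition columns on write.
     "hoodie.datasource.write.drop.partition.columns" -> "true"
   )
   
   // Usage, given an existing DataFrame `df` (hypothetical):
   // df.write.format("hudi").options(hudiWriteOptions).mode("append")
   //   .save("s3://bucket1/huditable1")
   ```
   
   After such a write, printing the schema of a single data file as in the original report should show whether `date` and `hour` are still stored inside it.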

