Posted to dev@parquet.apache.org by "Richard Grossman (Jira)" <ji...@apache.org> on 2020/11/30 11:23:00 UTC

[jira] [Comment Edited] (PARQUET-1946) Parquet File not readable by Google big query (works with Spark)

    [ https://issues.apache.org/jira/browse/PARQUET-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17240684#comment-17240684 ] 

Richard Grossman edited comment on PARQUET-1946 at 11/30/20, 11:22 AM:
-----------------------------------------------------------------------

Hi,

Maybe you can help me.

I would like to provide a sample file to Google so they can check why they cannot read the Parquet file. Unfortunately, the file contains PII and cannot be shared as is.

Is there any way to strip the PII fields from the Parquet file so that it can be shared with them?
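
In case it helps, here is a minimal sketch of one possible approach with parquet-avro: read the file with a requested projection that leaves out the PII columns, and rewrite the remaining records to a new file. The file names and the non_pii.avsc projection schema are placeholders; note also that rewriting regenerates the footer and page metadata, so the stripped copy may no longer reproduce the original problem.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.avro.AvroReadSupport;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class StripPiiColumns {
      public static void main(String[] args) throws Exception {
        // Placeholder projection schema listing only the non-PII columns.
        Schema projection = new Schema.Parser().parse(new java.io.File("non_pii.avsc"));

        Configuration conf = new Configuration();
        // Ask parquet-avro to materialize only the projected columns on read.
        AvroReadSupport.setRequestedProjection(conf, projection);

        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path("input.parquet"))
                     .withConf(conf)
                     .build();
             ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("stripped.parquet"))
                     .withSchema(projection)
                     .withConf(conf)
                     .build()) {
          GenericRecord record;
          while ((record = reader.read()) != null) {
            writer.write(record);
          }
        }
      }
    }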

Thanks




> Parquet File not readable by Google big query (works with Spark)
> ----------------------------------------------------------------
>
>                 Key: PARQUET-1946
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1946
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.11.0
>         Environment: [secor|https://github.com/pinterest/secor]
> GCP
> Google Cloud BigQuery
> Parquet writer 1.11
>            Reporter: Richard Grossman
>            Priority: Blocker
>
> Hi
> I'm trying to write Avro messages to Parquet on GCS. These Parquet files should be queryable by the BigQuery engine, which now supports Parquet.
> To do this I'm using Secor, a Kafka log persister tool from Pinterest.
> At first I didn't notice any problem: with Spark, the same file can be read without any issue, and everything works perfectly.
> Using BigQuery, however, brings up an error like this:
> Error while reading table: , error message: Read less values than expected: Actual: 29333, Expected: 33827. Row group: 0, Column: , File:
> After investigating with parquet-tools, I figured out that the Parquet metadata records the total number of unique values for each column, e.g. from parquet-tools:
> page 0: DLE:BIT_PACKED RLE:BIT_PACKED [more]... CRC:[PAGE CORRUPT] VC:547
> So the VC value indicates that the total number of unique values in the file is 547.
> Now when I run a Spark SQL query like SELECT COUNT(DISTINCT column) FROM ... I get 421, which means this number in the metadata is incorrect.
> So what is not a problem for Spark to read is a blocking problem for BigQuery, because it relies on these values and finds them incorrect.
> Is there any configuration of the writer that can prevent these errors in the metadata? Or is this normal behavior that should not be a problem?
> Thanks
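
For reference, the per-column value counts that BigQuery validates live in the file footer and can be inspected programmatically with parquet-mr. A minimal sketch (the file name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class PrintValueCounts {
      public static void main(String[] args) throws Exception {
        Path path = new Path("input.parquet"); // placeholder
        Configuration conf = new Configuration();
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
          int rowGroup = 0;
          for (BlockMetaData block : reader.getFooter().getBlocks()) {
            for (ColumnChunkMetaData column : block.getColumns()) {
              // The footer-declared count that readers compare against
              // the number of values actually decoded from the pages.
              System.out.printf("row group %d, column %s: %d values%n",
                  rowGroup, column.getPath(), column.getValueCount());
            }
            rowGroup++;
          }
        }
      }
    }

If I read the parquet-tools output correctly, VC is the per-page value count rather than a count of distinct values, so it would not be expected to match COUNT(DISTINCT column).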



--
This message was sent by Atlassian Jira
(v8.3.4#803005)