Posted to dev@parquet.apache.org by "Ferdinand Xu (JIRA)" <ji...@apache.org> on 2016/07/06 07:42:10 UTC
[jira] [Comment Edited] (PARQUET-400) Error reading some files after PARQUET-77 bytebuffer read path

    [ https://issues.apache.org/jira/browse/PARQUET-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363915#comment-15363915 ] 

Ferdinand Xu edited comment on PARQUET-400 at 7/6/16 7:41 AM:
--------------------------------------------------------------

[~dweeks] Is there a JIRA tracking the fix for this issue? I can easily reproduce it by running the following Hive SQL after switching to depend on the upstream code. Is there any workaround for it?
{noformat}
CREATE EXTERNAL TABLE customer_temporary
  ( c_customer_sk             bigint              --not null
  , c_customer_id             string              --not null
  , c_current_cdemo_sk        bigint
  , c_current_hdemo_sk        bigint
  , c_current_addr_sk         bigint
  , c_first_shipto_date_sk    bigint
  , c_first_sales_date_sk     bigint
  , c_salutation              string
  , c_first_name              string
  , c_last_name               string
  , c_preferred_cust_flag     string
  , c_birth_day               int
  , c_birth_month             int
  , c_birth_year              int
  , c_birth_country           string
  , c_login                   string
  , c_email_address           string
  , c_last_review_date        string
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION 'file:///tmp/data/';

SELECT count(*) FROM customer_temporary;

DROP TABLE IF EXISTS customer2;

CREATE TABLE customer2
STORED AS PARQUET
AS
SELECT * FROM customer_temporary;
SELECT count(*) FROM customer2;
{noformat}
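
The statements above expect pipe-delimited text files under file:///tmp/data/ with 18 fields per row, in the column order of customer_temporary. For anyone without such input at hand, here is a minimal sketch that writes synthetic rows in that layout. Everything in it is an assumption for illustration: the function name write_customer_rows, the value ranges, and the sample names are hypothetical and do not come from the original data set, and synthetic rows may well not trigger the PageHeader error, since the failure is reported only for some files.

```python
import os
import random


def write_customer_rows(path, num_rows, seed=0):
    """Write num_rows pipe-delimited rows with the 18 columns of customer_temporary.

    All values are synthetic placeholders (hypothetical, not the original data).
    Returns the path that was written.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    first_names = ["Ann", "Bob", "Chen", "Dita"]
    last_names = ["Lee", "Moss", "Nunez", "Okafor"]
    countries = ["CHINA", "FRANCE", "PERU", "UNITED STATES"]

    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    with open(path, "w") as out:
        for sk in range(1, num_rows + 1):
            first = rng.choice(first_names)
            last = rng.choice(last_names)
            row = [
                sk,                                  # c_customer_sk (bigint)
                "CUST%012d" % sk,                    # c_customer_id (string)
                rng.randint(1, 1000000),             # c_current_cdemo_sk (bigint)
                rng.randint(1, 7200),                # c_current_hdemo_sk (bigint)
                rng.randint(1, 50000),               # c_current_addr_sk (bigint)
                rng.randint(2449000, 2453000),       # c_first_shipto_date_sk (bigint)
                rng.randint(2449000, 2453000),       # c_first_sales_date_sk (bigint)
                rng.choice(["Mr.", "Ms.", "Dr."]),   # c_salutation (string)
                first,                               # c_first_name (string)
                last,                                # c_last_name (string)
                rng.choice(["Y", "N"]),              # c_preferred_cust_flag (string)
                rng.randint(1, 28),                  # c_birth_day (int)
                rng.randint(1, 12),                  # c_birth_month (int)
                rng.randint(1924, 1992),             # c_birth_year (int)
                rng.choice(countries),               # c_birth_country (string)
                "",                                  # c_login (string, left empty)
                "%s.%s@example.com" % (first, last), # c_email_address (string)
                str(rng.randint(2452000, 2453000)),  # c_last_review_date (string)
            ]
            # FIELDS TERMINATED BY '|' with one row per line
            out.write("|".join(str(v) for v in row) + "\n")
    return path
```

For example, write_customer_rows("/tmp/data/customer.dat", 100000) would place one file (the name customer.dat is arbitrary) in the directory that the external table points at, after which the SQL above can be run unchanged.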


was (Author: ferd):
[~dweeks-netflix] Is there any JIRA tracking the fix for this issue? I can easily reproduce it by executing the following Hive SQL after depending on the upstream code. Or any workaround for it?
(Hive SQL identical to the block in the edited comment above.)

> Error reading some files after PARQUET-77 bytebuffer read path
> --------------------------------------------------------------
>
>                 Key: PARQUET-400
>                 URL: https://issues.apache.org/jira/browse/PARQUET-400
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Jason Altekruse
>            Assignee: Jason Altekruse
>         Attachments: bytebyffer_read_fail.gz.parquet
>
>
> This issue is based on a discussion on the list started by [~dweeks]
> Full discussion:
> https://mail-archives.apache.org/mod_mbox/parquet-dev/201512.mbox/%3CCAMpYv7C_szTheua9N95bXvbd2ROmV63BFiJTK-K-aDNK6ZNBKA%40mail.gmail.com%3E
> From the thread (he later provided a small repro file that is attached here):
> Just wanted to see if you or anyone else has run into problems reading
> files after the ByteBuffer patch.  I've been running into issues and have
> narrowed it down to the ByteBuffer commit using a small repro file (written
> with 1.6.0, unfortunately can't share the data).
> It doesn't happen for every file, but those that fail give this error:
> can not read class org.apache.parquet.format.PageHeader: Required field
> 'uncompressed_page_size' was not found in serialized data! Struct:
> PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)


