You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2021/09/10 16:32:00 UTC

[jira] [Resolved] (IMPALA-10900) Some parquet files are missing from Iceberg avro metadata

     [ https://issues.apache.org/jira/browse/IMPALA-10900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltán Borók-Nagy resolved IMPALA-10900.
----------------------------------------
    Fix Version/s: Impala 4.1.0
       Resolution: Fixed

> Some parquet files are missing from Iceberg avro metadata
> ---------------------------------------------------------
>
>                 Key: IMPALA-10900
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10900
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0.0
>            Reporter: Riza Suminto
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.1.0
>
>         Attachments: 9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro, profile_ec48089182b28ba9_b2910c2d00000000_iceberg_ctas
>
>
> I tried to load tpcds_3000_iceberg database based on existing parquet database. The source database is tpcds_3000_parquet (3TB scale) and the cluster has 10 nodes.
> After loading table catalog_sales, I found that the row count of tpcds_3000_iceberg.catalog_sales is less than row count of tpcds_3000_parquet.catalog_sales. Further debugging reveals that the CTAS query actually finish writing parquet files, but only one parquet file per partition gets written into Iceberg avro metadata.
> For example inspecting partiton 2451120, it says that there are 2 parquet files
> {code:java}
>  ~  sudo -u hdfs hdfs dfs -ls /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/
>  Found 2 items
>  rw-rw---+ 3 impala hive 264837186 2021-08-28 20:05 /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq
>  rw-rw---+ 3 impala hive 80608775 2021-08-28 20:05 /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.1.parq{code}
>  However, the avro files only have ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq in it
> {code:java}
>  ~  avro-tools tojson /tmp/9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro| grep "cs_sold_date_sk=2451120"
> [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> {"status":1,"snapshot_id":{"long":403055188554782479},"data_file":{"file_path":"hdfs://ve1315.halxg.cloudera.com:8020/warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq","file_format":"PARQUET","partition":{"cs_sold_date_sk":{"long":2451120}},"record_count":3616116,"file_size_in_bytes":264837186,"block_size_in_bytes":67108864,"column_sizes":null,"value_counts":null,"null_value_counts":null,"nan_value_counts":null,"lower_bounds":null,"upper_bounds":null,"key_metadata":null,"split_offsets":null}}
> {code}
> The CTAS query that I use on debugging is the following:
> {code:java}
> create table catalog_sales
> partitioned by (cs_sold_date_sk)
> stored as iceberg
> as select
>   cs_sold_time_sk,
>   cs_ship_date_sk,
>   cs_bill_customer_sk,
>   cs_bill_cdemo_sk,
>   cs_bill_hdemo_sk,
>   cs_bill_addr_sk,
>   cs_ship_customer_sk,
>   cs_ship_cdemo_sk,
>   cs_ship_hdemo_sk,
>   cs_ship_addr_sk,
>   cs_call_center_sk,
>   cs_catalog_page_sk,
>   cs_ship_mode_sk,
>   cs_warehouse_sk,
>   cs_item_sk,
>   cs_promo_sk,
>   cs_order_number,
>   cs_quantity,
>   cs_wholesale_cost,
>   cs_list_price,
>   cs_sales_price,
>   cs_ext_discount_amt,
>   cs_ext_sales_price,
>   cs_ext_wholesale_cost,
>   cs_ext_list_price,
>   cs_ext_tax,
>   cs_coupon_amt,
>   cs_ext_ship_cost,
>   cs_net_paid,
>   cs_net_paid_inc_tax,
>   cs_net_paid_inc_ship,
>   cs_net_paid_inc_ship_tax,
>   cs_net_profit,
>   cs_sold_date_sk
> from tpcds_3000_parquet.catalog_sales where cs_sold_date_sk > 2451100 and cs_sold_date_sk <= 2451200;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org