Posted to issues-all@impala.apache.org by "Riza Suminto (Jira)" <ji...@apache.org> on 2021/08/30 23:48:00 UTC

[jira] [Updated] (IMPALA-10900) Some parquet files are missing from Iceberg avro metadata

     [ https://issues.apache.org/jira/browse/IMPALA-10900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Riza Suminto updated IMPALA-10900:
----------------------------------
    Description: 
I tried to load the tpcds_3000_iceberg database based on an existing Parquet database. The source database is tpcds_3000_parquet (3 TB scale) and the cluster has 10 nodes.

After loading the table catalog_sales, I found that the row count of tpcds_3000_iceberg.catalog_sales is less than the row count of tpcds_3000_parquet.catalog_sales. Further debugging reveals that the CTAS query actually finishes writing the Parquet files, but only one Parquet file per partition gets recorded in the Iceberg Avro metadata.

For example, inspecting partition 2451120 shows that there are 2 Parquet files:
{code:java}
 ~  sudo -u hdfs hdfs dfs -ls /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/
 Found 2 items
 -rw-rw----+ 3 impala hive 264837186 2021-08-28 20:05 /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq
 -rw-rw----+ 3 impala hive 80608775 2021-08-28 20:05 /warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.1.parq{code}

However, the Avro manifest file only has ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq in it:
{code:java}
 ~  avro-tools tojson /tmp/9e044ebf-f549-4c99-9952-3b8c0a70cafb-m0.avro| grep "cs_sold_date_sk=2451120"
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
{"status":1,"snapshot_id":{"long":403055188554782479},"data_file":{"file_path":"hdfs://ve1315.halxg.cloudera.com:8020/warehouse/tablespace/external/hive/tpcds_3000_iceberg.db/catalog_sales/data/cs_sold_date_sk=2451120/ec48089182b28ba9-b2910c2d00000011_1004901849_data.0.parq","file_format":"PARQUET","partition":{"cs_sold_date_sk":{"long":2451120}},"record_count":3616116,"file_size_in_bytes":264837186,"block_size_in_bytes":67108864,"column_sizes":null,"value_counts":null,"null_value_counts":null,"nan_value_counts":null,"lower_bounds":null,"upper_bounds":null,"key_metadata":null,"split_offsets":null}}
{code}
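To check every partition at once rather than grepping one at a time, the `avro-tools tojson` output (one JSON object per manifest entry) can be piped into a short script that tallies data files per partition. This is only a debugging sketch based on the manifest JSON shape shown above; the helper name is hypothetical:

```python
import json
import sys
from collections import defaultdict

def files_per_partition(lines):
    """Group manifest data-file paths by partition value.

    Expects one JSON object per line, as produced by
    `avro-tools tojson <manifest>.avro`.
    """
    by_partition = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip log noise such as the NativeCodeLoader warning
        entry = json.loads(line)
        data_file = entry["data_file"]
        # Partition values are Avro unions, e.g.
        # {"cs_sold_date_sk": {"long": 2451120}}; unwrap the union dict.
        partition = tuple(
            (col, next(iter(val.values())) if isinstance(val, dict) else val)
            for col, val in data_file["partition"].items()
        )
        by_partition[partition].append(data_file["file_path"])
    return by_partition

if __name__ == "__main__":
    for partition, paths in sorted(files_per_partition(sys.stdin).items()):
        print(partition, len(paths))
```

Comparing these per-partition counts against `hdfs dfs -ls` on each partition directory would show exactly which files the manifest is missing.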
The CTAS query that I used for debugging is the following:
{code:java}
create table catalog_sales
partitioned by (cs_sold_date_sk)
stored as iceberg
as select
  cs_sold_time_sk,
  cs_ship_date_sk,
  cs_bill_customer_sk,
  cs_bill_cdemo_sk,
  cs_bill_hdemo_sk,
  cs_bill_addr_sk,
  cs_ship_customer_sk,
  cs_ship_cdemo_sk,
  cs_ship_hdemo_sk,
  cs_ship_addr_sk,
  cs_call_center_sk,
  cs_catalog_page_sk,
  cs_ship_mode_sk,
  cs_warehouse_sk,
  cs_item_sk,
  cs_promo_sk,
  cs_order_number,
  cs_quantity,
  cs_wholesale_cost,
  cs_list_price,
  cs_sales_price,
  cs_ext_discount_amt,
  cs_ext_sales_price,
  cs_ext_wholesale_cost,
  cs_ext_list_price,
  cs_ext_tax,
  cs_coupon_amt,
  cs_ext_ship_cost,
  cs_net_paid,
  cs_net_paid_inc_tax,
  cs_net_paid_inc_ship,
  cs_net_paid_inc_ship_tax,
  cs_net_profit,
  cs_sold_date_sk
from tpcds_3000_parquet.catalog_sales where cs_sold_date_sk > 2451100 and cs_sold_date_sk <= 2451200;
{code}

> Some parquet files are missing from Iceberg avro metadata
> ---------------------------------------------------------
>
>                 Key: IMPALA-10900
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10900
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0.0
>            Reporter: Riza Suminto
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org