Posted to dev@hive.apache.org by "Hari Sankar Sivarama Subramaniyan (JIRA)" <ji...@apache.org> on 2014/08/16 05:24:18 UTC
[jira] [Created] (HIVE-7751) Mapjoin set in a non-conditional task can fail in MR mode because of memory overhead issues
Hari Sankar Sivarama Subramaniyan created HIVE-7751:
-------------------------------------------------------
Summary: Mapjoin set in a non-conditional task can fail in MR mode because of memory overhead issues
Key: HIVE-7751
URL: https://issues.apache.org/jira/browse/HIVE-7751
Project: Hive
Issue Type: Bug
Reporter: Hari Sankar Sivarama Subramaniyan
Assignee: Hari Sankar Sivarama Subramaniyan
select sum(ss_quantity)
from store_sales
join store on store.s_store_sk = store_sales.ss_store_sk
join customer_demographics on customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
join customer_address on store_sales.ss_addr_sk = customer_address.ca_address_sk
join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk
where d_year = 2000
  and ((cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and ss_sales_price between 100.00 and 150.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and ss_sales_price between 50.00 and 100.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and ss_sales_price between 150.00 and 200.00))
  and ((ca_country = 'United States' and ca_state in ('TX', 'OH', 'TX') and ss_net_profit between 0 and 2000)
    or (ca_country = 'United States' and ca_state in ('OR', 'MN', 'KY') and ss_net_profit between 150 and 3000)
    or (ca_country = 'United States' and ca_state in ('VA', 'TX', 'MS') and ss_net_profit between 50 and 25000));
The above query, with the data stored in ORC format, can fail because we convert the join to a non-conditional task on the assumption that the mapjoin will succeed at runtime. At runtime, however, the query can fail because of memory overhead. One improvement that would prevent such failures is to use table statistics instead of calling ql.exec.Utilities.getTotalInputFileSize() inside CommonJoinTaskDispatcher, so that we make better decisions in MR mode. Tez, on the other hand, handles such scenarios better because it actually relies on table stats to get the data size.
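To illustrate the proposal, here is a minimal sketch of the two ways of sizing the small tables of a mapjoin. It is not Hive code: the class MapJoinSizeSketch, its methods, and all byte counts are hypothetical, and the 10,000,000-byte threshold is only an assumed stand-in for whatever hive.auto.convert.join.noconditionaltask.size is set to. The presumed reason the two checks disagree is that compressed ORC files on disk are much smaller than the same rows once loaded into the in-memory hash table, whereas a statistics-based raw data size is closer to the uncompressed footprint.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; this is NOT Hive's CommonJoinTaskDispatcher. */
final class MapJoinSizeSketch {

    /** Illustrative size information for one small table, in bytes. */
    static final class TableSize {
        final String name;
        final long onDiskBytes;   // compressed file size (what a file-size check sees)
        final long rawDataBytes;  // uncompressed size from table stats; <= 0 if unknown

        TableSize(String name, long onDiskBytes, long rawDataBytes) {
            this.name = name;
            this.onDiskBytes = onDiskBytes;
            this.rawDataBytes = rawDataBytes;
        }

        /** Prefer the stats-based size; fall back to file size when stats are missing. */
        long estimatedInMemoryBytes() {
            return rawDataBytes > 0 ? rawDataBytes : onDiskBytes;
        }
    }

    /** Decision based only on the summed file sizes of the small tables. */
    static boolean fitsByFileSize(List<TableSize> smallTables, long thresholdBytes) {
        long total = 0;
        for (TableSize t : smallTables) {
            total += t.onDiskBytes;
        }
        return total <= thresholdBytes;
    }

    /** Decision based on the summed stats-derived sizes of the small tables. */
    static boolean fitsByStats(List<TableSize> smallTables, long thresholdBytes) {
        long total = 0;
        for (TableSize t : smallTables) {
            total += t.estimatedInMemoryBytes();
        }
        return total <= thresholdBytes;
    }

    public static void main(String[] args) {
        long threshold = 10_000_000L; // assumed threshold, for illustration only

        // Made-up sizes: small on disk, much larger once decompressed.
        List<TableSize> smallTables = Arrays.asList(
                new TableSize("customer_demographics", 4_000_000L, 40_000_000L),
                new TableSize("store", 1_000_000L, 3_000_000L));

        System.out.println("file-size check allows mapjoin: "
                + fitsByFileSize(smallTables, threshold));
        System.out.println("stats-based check allows mapjoin: "
                + fitsByStats(smallTables, threshold));
    }
}
```

With these made-up numbers the file-size check accepts the unconditional mapjoin while the stats-based check rejects it, which is the kind of mismatch that would surface as a runtime memory failure in MR mode.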
--
This message was sent by Atlassian JIRA
(v6.2#6252)