
Posted to dev@hive.apache.org by "Hari Sankar Sivarama Subramaniyan (JIRA)" <ji...@apache.org> on 2014/08/16 05:24:18 UTC

[jira] [Created] (HIVE-7751) Mapjoin set in a non-conditional task can fail in MR mode because of memory overhead issues

Hari Sankar Sivarama Subramaniyan created HIVE-7751:
-------------------------------------------------------

             Summary: Mapjoin set in a non-conditional task can fail in MR mode because of memory overhead issues
                 Key: HIVE-7751
                 URL: https://issues.apache.org/jira/browse/HIVE-7751
             Project: Hive
          Issue Type: Bug
            Reporter: Hari Sankar Sivarama Subramaniyan
            Assignee: Hari Sankar Sivarama Subramaniyan


select sum(ss_quantity)
from store_sales
join store on store.s_store_sk = store_sales.ss_store_sk
join customer_demographics on customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
join customer_address on store_sales.ss_addr_sk = customer_address.ca_address_sk
join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk
where d_year = 2000
  and ((cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and ss_sales_price between 100.00 and 150.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and ss_sales_price between 50.00 and 100.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and ss_sales_price between 150.00 and 200.00))
  and ((ca_country = 'United States' and ca_state in ('TX', 'OH', 'TX') and ss_net_profit between 0 and 2000)
    or (ca_country = 'United States' and ca_state in ('OR', 'MN', 'KY') and ss_net_profit between 150 and 3000)
    or (ca_country = 'United States' and ca_state in ('VA', 'TX', 'MS') and ss_net_profit between 50 and 25000));

The above query, with the data stored in ORC format, can fail because we convert the join to a non-conditional task on the assumption that the mapjoin will succeed at runtime. At runtime, however, the query can fail due to memory overhead issues. To prevent such failures, we could use table statistics instead of calling ql.exec.Utilities.getTotalInputFileSize() inside CommonJoinTaskDispatcher. This would ensure that we make better decisions in MR mode. Tez, on the other hand, handles such scenarios better because it actually relies on table stats to get the data size.
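
To illustrate why the two size sources can lead to different decisions, here is a minimal sketch. It is not Hive code: the class MapJoinSizeCheck, the method fitsInMemory, the threshold, and all byte counts are hypothetical and chosen only for illustration. It assumes the conversion check sums the sizes of the small tables and compares the total against a limit, and that on-disk file sizes for a compressed columnar format can be much smaller than the data size recorded in table statistics.

```java
// Hypothetical sketch; none of these names are Hive APIs.
final class MapJoinSizeCheck {

    // Returns true when the combined size of the small tables is within the limit.
    static boolean fitsInMemory(long[] smallTableBytes, long thresholdBytes) {
        long total = 0L;
        for (long size : smallTableBytes) {
            total += size;
        }
        return total <= thresholdBytes;
    }

    public static void main(String[] args) {
        final long mb = 1_000_000L;
        final long threshold = 10 * mb; // assumed limit, for illustration only

        // Made-up sizes for the four small tables in the query above
        // (store, customer_demographics, customer_address, date_dim).
        long[] onDiskFileSizes = {1 * mb, 3 * mb, 4 * mb, 1 * mb};    // compressed files
        long[] statsDataSizes  = {5 * mb, 40 * mb, 30 * mb, 6 * mb};  // data size from stats

        // A file-size-based check accepts the mapjoin ...
        System.out.println("file size check: " + fitsInMemory(onDiskFileSizes, threshold));
        // ... while a statistics-based check rejects it.
        System.out.println("stats check:     " + fitsInMemory(statsDataSizes, threshold));
    }
}
```

With these made-up numbers the file-size check passes while the statistics-based check does not, which is the kind of mismatch that could let a mapjoin be planned and then run out of memory at runtime.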


