Posted to dev@hive.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2015/03/03 23:56:04 UTC
[jira] [Created] (HIVE-9845) HCatSplit repeats information making input split data size huge
Rohini Palaniswamy created HIVE-9845:
----------------------------------------
Summary: HCatSplit repeats information making input split data size huge
Key: HIVE-9845
URL: https://issues.apache.org/jira/browse/HIVE-9845
Project: Hive
Issue Type: Bug
Components: HCatalog
Reporter: Rohini Palaniswamy
Pig on Tez jobs with larger tables hit PIG-4443. Jobs running on HDFS data do not hit that issue, even with triple the number of splits (100K+ splits and tasks).
{code}
HCatBaseInputFormat.java:
      //Call getSplit on the InputFormat, create an
      //HCatSplit for each underlying split
      //NumSplits is 0 for our purposes
      org.apache.hadoop.mapred.InputSplit[] baseSplits =
        inputFormat.getSplits(jobConf, 0);

      for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
        splits.add(new HCatSplit(
            partitionInfo,
            split,allCols));
      }
{code}
Each HCatSplit carries its own copy of the partition schema and the table schema, so the same information is repeated in every split.
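
To illustrate why this repetition matters, here is a minimal, self-contained sketch that compares the serialized size of N splits when each one writes both schemas against the size when the schemas are written once. Everything in it is hypothetical: the class SplitSizeSketch, the methods fakeSchema, duplicatedSize and sharedSize, and the use of writeUTF are stand-ins for illustration only. They are not the actual HCatSplit serialization format and not a fix adopted in Hive.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SplitSizeSketch {

    // Hypothetical stand-in for a serialized schema string (not the HCatSchema format).
    // Keep the column count small: writeUTF rejects strings longer than 65535 encoded bytes.
    static String fakeSchema(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            sb.append("col").append(i).append(":string,");
        }
        return sb.toString();
    }

    // Each split writes its own copy of both schemas, mirroring the
    // "new HCatSplit(partitionInfo, split, allCols)" loop in the issue.
    static int duplicatedSize(int numSplits, String tableSchema, String partitionSchema) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < numSplits; i++) {
                out.writeUTF("file-" + i);      // per-split payload (assumed)
                out.writeUTF(tableSchema);      // repeated in every split
                out.writeUTF(partitionSchema);  // repeated in every split
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // The schemas are written once, followed by the per-split payloads only.
    static int sharedSize(int numSplits, String tableSchema, String partitionSchema) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(tableSchema);
            out.writeUTF(partitionSchema);
            for (int i = 0; i < numSplits; i++) {
                out.writeUTF("file-" + i);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        String table = fakeSchema(200);
        String partition = fakeSchema(200);
        int numSplits = 100_000;
        System.out.println("duplicated bytes: " + duplicatedSize(numSplits, table, partition));
        System.out.println("shared bytes:     " + sharedSize(numSplits, table, partition));
    }
}
```

In this sketch the duplicated layout grows by the full size of both schemas for every additional split, while the shared layout grows only by the per-split payload.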
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)