Posted to dev@hbase.apache.org by ShaoFeng Shi <sh...@apache.org> on 2017/09/26 09:16:19 UTC

HFileOutputFormat2 hardcodes default FileOutputCommitter

Hello gentlemen,

This is Shaofeng Shi from the Apache Kylin community. We use HBase as the
storage engine, and we use an MR job to generate HFiles before bulk load. A
user reported that when S3 is configured as the output location for the
HFiles, the files are generated in the "_temporary" folder and are never
committed to the target path, so no data ends up being loaded. We can
reproduce this problem easily. The original report is in [1].

Kylin uses HBase's HFileOutputFormat2.java to configure the MR job. After
some investigation, I found that this class always uses the default
"FileOutputCommitter" (see [2]), regardless of the job's configuration, so
it always writes to the "_temporary" folder. Since AWS EMR is configured to
use DirectOutputCommitter for S3, the problem occurs: Hadoop expects to
see the files directly under the output path, while the RecordWriter
generates them in the "_temporary" folder.
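
To make the mismatch concrete, here is a minimal, self-contained sketch. It
does not use the Hadoop or HBase APIs: SketchCommitter, TemporaryDirCommitter,
DirectCommitter, and CommitterSketch are stand-in names invented for this
illustration, and the bucket path is made up. It only models the behaviour
described above (a writer that picks its work path from a hardcoded committer
versus one that asks the committer the job is configured with), and it is not
the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; these are NOT Hadoop or HBase classes.
interface SketchCommitter {
    String workPath();                // directory the writer should write into
    void commit(List<String> files);  // makes written files visible under the output path
}

// Models the default behaviour: write under "_temporary", move files on commit.
final class TemporaryDirCommitter implements SketchCommitter {
    private final String output;
    TemporaryDirCommitter(String output) { this.output = output; }
    public String workPath() { return output + "/_temporary"; }
    public void commit(List<String> files) {
        String prefix = workPath() + "/";
        for (int i = 0; i < files.size(); i++) {
            String f = files.get(i);
            if (f.startsWith(prefix)) {
                files.set(i, output + "/" + f.substring(prefix.length()));
            }
        }
    }
}

// Models a direct committer: write straight to the output path, commit does nothing.
final class DirectCommitter implements SketchCommitter {
    private final String output;
    DirectCommitter(String output) { this.output = output; }
    public String workPath() { return output; }
    public void commit(List<String> files) { /* nothing to move */ }
}

final class CommitterSketch {
    // Reported behaviour: the work path always comes from the temporary-dir
    // committer, no matter which committer the job is configured with.
    static List<String> writeHardcoded(String output, SketchCommitter configured) {
        List<String> files = new ArrayList<>();
        files.add(new TemporaryDirCommitter(output).workPath() + "/hfile-0");
        configured.commit(files);
        return files;
    }

    // Alternative: take the work path from the configured committer.
    static List<String> writeWithConfigured(String output, SketchCommitter configured) {
        List<String> files = new ArrayList<>();
        files.add(configured.workPath() + "/hfile-0");
        configured.commit(files);
        return files;
    }

    public static void main(String[] args) {
        String out = "s3://bucket/hfiles"; // hypothetical output location
        // File is left behind under _temporary because the direct commit is a no-op.
        System.out.println(writeHardcoded(out, new DirectCommitter(out)));
        // File lands directly under the output path.
        System.out.println(writeWithConfigured(out, new DirectCommitter(out)));
    }
}
```

In this model the hardcoded writer still works when the job uses the
temporary-dir committer, which may be why the problem only shows up when a
direct committer is configured.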

Have you received such reports before? I have a temporary fix in my fork now.
I am just wondering what you think about it; if it is okay, I will file a JIRA.
Thanks!

[1] https://issues.apache.org/jira/browse/KYLIN-2788
[2] https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L193

-- 
Best regards,

Shaofeng Shi 史少锋

Re: HFileOutputFormat2 hardcodes default FileOutputCommitter

Posted by ShaoFeng Shi <sh...@apache.org>.
A JIRA has been created, and a patch is attached:
https://issues.apache.org/jira/browse/HBASE-18885

Please review and merge; we need this in a future release. Thanks.

2017-09-26 19:00 GMT+08:00 ShaoFeng Shi <sh...@apache.org>:

> Here is the pull request:
>
> https://github.com/apache/hbase/pull/60


-- 
Best regards,

Shaofeng Shi 史少锋

Re: HFileOutputFormat2 hardcodes default FileOutputCommitter

Posted by ShaoFeng Shi <sh...@apache.org>.
Here is the pull request:

https://github.com/apache/hbase/pull/60



-- 
Best regards,

Shaofeng Shi 史少锋