Posted to user@hbase.apache.org by Wellington Chevreuil <we...@gmail.com> on 2019/11/01 12:04:18 UTC

Re: Completing a bulk load from HFiles stored in S3

Ah yeah, I didn't realise it would assume the same FS internally. Indeed,
there's no way to have rename work between different FSes.

On Thu, 31 Oct 2019 at 16:25, Josh Elser <el...@apache.org> wrote:

> Short answer: no, it will not work and you need to copy it to HDFS first.
>
> IIRC, the bulk load code ultimately calls a filesystem rename from
> the path you provided to the proper location in the hbase.rootdir's
> filesystem. I don't believe an `fs.rename` is going to work across
> filesystems, because it cannot be done atomically, which HDFS guarantees
> for the rename method [1].
>
> Additionally, for Kerberos-secured clusters, the server-side bulk load
> logic expects that the filesystem hosting your HFiles is HDFS (in order
> to read the files with the appropriate authentication). This fails right
> now, but is something our PeterS is looking at.
>
> [1]
>
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_rename.28Path_src.2C_Path_d.29
>
> On 10/31/19 6:55 AM, Wellington Chevreuil wrote:
> > I believe you can specify your S3 path for the HFiles directly, as the
> > Hadoop FileSystem API does support the s3a scheme, but you would need to
> > add your S3 access and secret keys to your completebulkload configuration.
> >
> > On Wed, 30 Oct 2019 at 19:43, Gautham Acharya <
> > gauthama@alleninstitute.org> wrote:
> >
> >> If I have HFiles stored in S3, can I run CompleteBulkLoad with an
> >> S3 endpoint in a single command, or do I need to copy the S3
> >> HFiles to HDFS first? The documentation is not very clear.
> >>
> >
>
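For reference, here is a minimal sketch of the "copy to HDFS first, then bulk
load" flow described above, with Wellington's s3a credential settings applied
to the same Configuration. The bucket, paths, table name, and keys are
hypothetical, and the load call assumes the HBase 2.2+ BulkLoadHFiles API (on
1.x the equivalent tool is org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.tool.BulkLoadHFiles;

    public class S3ToHdfsBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Credentials for the s3a connector (hypothetical values); omit if the
        // cluster already provides them, e.g. through instance roles.
        conf.set("fs.s3a.access.key", "MY_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "MY_SECRET_KEY");

        // Hypothetical locations: HFileOutputFormat2 output staged in S3, and a
        // staging directory on the same filesystem as hbase.rootdir.
        Path s3Dir = new Path("s3a://my-bucket/hfiles/mytable");
        Path hdfsDir = new Path("/tmp/bulkload/mytable");

        FileSystem s3Fs = FileSystem.get(s3Dir.toUri(), conf);
        FileSystem hdfsFs = FileSystem.get(hdfsDir.toUri(), conf);

        // Step 1: copy the HFiles into HDFS, since the bulk load ultimately
        // renames them into place and rename does not work across filesystems.
        FileUtil.copy(s3Fs, s3Dir, hdfsFs, hdfsDir, false, conf);

        // Step 2: bulk load from HDFS; the directory is expected to contain one
        // subdirectory per column family, each holding the HFiles.
        BulkLoadHFiles.create(conf).bulkLoad(TableName.valueOf("mytable"), hdfsDir);
      }
    }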

Re: Completing a bulk load from HFiles stored in S3

Posted by Austin Heyne <ah...@ccri.com>.
Yes, that's correct. I've never tried bulk loading from S3 on 2.x.

-Austin

On 11/12/19 1:32 PM, Josh Elser wrote:
> Thanks for the info, Austin. I'm guessing that's how 1.x works since 
> you mention EMR?
>
> I think this code has changed in 2.x, with the SecureBulkLoad stuff
> moving into "core" (instead of being an external coprocessor endpoint).

Re: Completing a bulk load from HFiles stored in S3

Posted by Josh Elser <el...@apache.org>.
Thanks for the info, Austin. I'm guessing that's how 1.x works since you 
mention EMR?

I think this code has changed in 2.x, with the SecureBulkLoad stuff
moving into "core" (instead of being an external coprocessor endpoint).

On 11/12/19 10:39 AM, Austin Heyne wrote:
> Sorry for the late reply. You should be able to bulk load files from S3 
> as it will detect that they're not the same filesystem and have the 
> regionservers copy the files locally and then up to HDFS. This is 
> related to a problem I reported a while ago when using HBase on S3 with 
> EMR.
> 
> https://issues.apache.org/jira/browse/HBASE-20774
> 
> -Austin

Re: Completing a bulk load from HFiles stored in S3

Posted by Austin Heyne <ah...@ccri.com>.
Sorry for the late reply. You should be able to bulk load files from S3 
as it will detect that they're not the same filesystem and have the 
regionservers copy the files locally and then up to HDFS. This is 
related to a problem I reported a while ago when using HBase on S3 with 
EMR.

https://issues.apache.org/jira/browse/HBASE-20774

-Austin

On 11/1/19 8:04 AM, Wellington Chevreuil wrote:
> Ah yeah, I didn't realise it would assume the same FS internally. Indeed,
> there's no way to have rename work between different FSes.
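For completeness, a minimal sketch of the direct-from-S3 invocation Austin
describes for 1.x on EMR, where the region servers copy the files themselves
when the source filesystem differs (HBASE-20774). The bucket, table name, and
keys are hypothetical; per the rest of the thread, this should not be expected
to work on 2.x or on Kerberos-secured clusters:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.util.ToolRunner;

    public class DirectS3BulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // s3a credentials (hypothetical values); on EMR these may instead come
        // from the instance profile, in which case they can be omitted.
        conf.set("fs.s3a.access.key", "MY_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "MY_SECRET_KEY");

        // Point completebulkload (LoadIncrementalHFiles) straight at the S3
        // staging directory; arguments are <hfile dir> <table name>.
        int exit = ToolRunner.run(conf, new LoadIncrementalHFiles(conf),
            new String[] { "s3a://my-bucket/hfiles/mytable", "mytable" });
        System.exit(exit);
      }
    }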