Posted to user@hbase.apache.org by Adam Phelps <am...@opendns.com> on 2011/04/28 19:59:51 UTC

LoadIncrementalHFiles now deleting the hfiles?

We were using a backup scheme for our system where we have map-reduce 
jobs generating HFiles, which we then loaded using LoadIncrementalHFiles 
before making a remote copy of them using distcp.

However we just upgraded hbase (we're using cloudera's package, so we 
went from CDH3B4 to CDH3U0, both of which are versions of 0.90.1), and 
discovered that the HFiles now get deleted by the load operation.  Is 
this a recent change?  Is there a configuration variable to revert this 
behavior?

We can work around it by doing the copy before the load, but that is 
less than optimal in our scenario as we'd prefer to have quicker access 
to the data in HBase.

- Adam
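
The workaround mentioned above (copy first, load second) can be sketched on a local filesystem as follows. This is only an illustration, not HBase or distcp code: the class CopyBeforeLoadSketch and its methods backup() and load() are hypothetical names, java.nio.file stands in for HDFS, and load() simply moves files to imitate the behavior observed in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CopyBeforeLoadSketch {

    /** Lists every regular file under dir, recursively. */
    static List<Path> regularFiles(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            return walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    /** Hypothetical stand-in for distcp: copies files, keeping relative paths. */
    static void backup(Path srcDir, Path backupDir) throws IOException {
        for (Path file : regularFiles(srcDir)) {
            Path target = backupDir.resolve(srcDir.relativize(file));
            Files.createDirectories(target.getParent());
            Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Hypothetical stand-in for the bulk load: moves files out of srcDir. */
    static void load(Path srcDir, Path storeDir) throws IOException {
        for (Path file : regularFiles(srcDir)) {
            Path target = storeDir.resolve(srcDir.relativize(file));
            Files.createDirectories(target.getParent());
            Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        Path backupDir = Files.createTempDirectory("backup");
        Path storeDir = Files.createTempDirectory("store");

        // One fake HFile under a column-family style subdirectory.
        Path hfile = staging.resolve("cf").resolve("hfile-0001");
        Files.createDirectories(hfile.getParent());
        Files.write(hfile, new byte[] {1, 2, 3});

        backup(staging, backupDir); // copy while the files are still in place
        load(staging, storeDir);    // the "load" then moves them away

        System.out.println("files in backup:  " + regularFiles(backupDir).size());
        System.out.println("files in staging: " + regularFiles(staging).size());
    }
}
```

Because load() moves the files, calling backup() after it would find nothing left to copy, which is the situation described in the post.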

Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Adam Phelps <am...@opendns.com>.
Hmmm, we didn't change anything with the config (at least that we know 
of) and we certainly didn't change any of the ordering of performing the 
load and the distcp off the cluster.

One interesting thing we were noticing after the upgrade is that distcp 
would copy HFiles to the backup cluster, but then the files there 
would be deleted.  That was actually how we first noticed the change, as 
we were tracking the total number of HFiles there and the count would 
increase as normal after the distcp but then mysteriously decrease.  I 
presume it was due to the HFile loader marking the files for deletion 
while distcp was running, and then the remote HDFS completing the deletion.

Between distcp and LoadIncrementalHFiles some bit of behavior definitely 
changed; I just don't know where.  Regardless, we now have a 
working solution/work-around.  If this is the expected behavior rather 
than a bug, then all is fine.

- Adam

On 4/30/11 10:50 PM, Todd Lipcon wrote:
> Hi Adam,
>
> It's always been this way.
>
> The only time you'll see them copied is if you run the load from a
> remote filesystem - ie if you specify a URL that doesn't match the URL
> used in hbase.rootdir.
>
> See the bulkLoadHFile() method in Store.java:
>      // Move the file if it's on another filesystem
>      FileSystem srcFs = srcPath.getFileSystem(conf);
>      if (!srcFs.equals(fs)) {
>        LOG.info("File " + srcPath + " on different filesystem than " +
> "destination store - moving to this filesystem.");
>        Path tmpPath = getTmpPath();
>        FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
>        LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
>        srcPath = tmpPath;
>      }
>
> Perhaps your config changed slightly during the upgrade?
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera


Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Adam,

It's always been this way.

The only time you'll see them copied is if you run the load from a remote
filesystem - ie if you specify a URL that doesn't match the URL used in
hbase.rootdir.

See the bulkLoadHFile() method in Store.java:
    // Move the file if it's on another filesystem
    FileSystem srcFs = srcPath.getFileSystem(conf);
    if (!srcFs.equals(fs)) {
      LOG.info("File " + srcPath + " on different filesystem than " +
          "destination store - moving to this filesystem.");
      Path tmpPath = getTmpPath();
      FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
      LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
      srcPath = tmpPath;
    }

Perhaps your config changed slightly during the upgrade?

-Todd

On Fri, Apr 29, 2011 at 1:11 PM, Adam Phelps <am...@opendns.com> wrote:

> I could believe that, although I was under the impression that these files
> are actually incorporated into the existing region files.  Still, it's
> definitely a different behavior than what we were seeing before our recent
> upgrade.
>
> - Adam


-- 
Todd Lipcon
Software Engineer, Cloudera
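
The excerpt from Store.java can be mirrored on a local filesystem to show why the source files vanish only in the same-filesystem case. A minimal sketch, with assumptions: BulkLoadMoveSketch, loadInto() and the sameFilesystem flag are hypothetical stand-ins (the flag replaces the srcFs.equals(fs) check), java.nio.file replaces the Hadoop FileSystem API, and the final rename into the store directory is assumed from Patrick Angeles's reply in this thread rather than taken from the excerpt.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class BulkLoadMoveSketch {

    /**
     * Imitates the load flow discussed in the thread on the local filesystem.
     * sameFilesystem is a hypothetical flag standing in for srcFs.equals(fs).
     * Returns the final path of the file inside storeDir.
     */
    static Path loadInto(Path src, Path storeDir, boolean sameFilesystem) throws IOException {
        Path toRename = src;
        if (!sameFilesystem) {
            // Different filesystem: copy to a temporary path first.
            // The original source file is left untouched.
            Path tmp = storeDir.resolve(src.getFileName().toString() + ".tmp");
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
            toRename = tmp;
        }
        // Assumed final step: rename into the store directory. In the
        // same-filesystem case this is what makes the source disappear.
        Path dst = storeDir.resolve(src.getFileName());
        return Files.move(toRename, dst, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        Path storeDir = Files.createTempDirectory("store");

        Path a = Files.write(staging.resolve("hfile-a"), new byte[] {1});
        loadInto(a, storeDir, true);
        System.out.println("same filesystem, source still present: " + Files.exists(a));

        Path b = Files.write(staging.resolve("hfile-b"), new byte[] {2});
        loadInto(b, storeDir, false);
        System.out.println("different filesystem, source still present: " + Files.exists(b));
    }
}
```

In this sketch the source survives only when the copy branch runs, which matches the observation that files are copied only when the load is run from a remote filesystem.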

Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Adam Phelps <am...@opendns.com>.
I could believe that, although I was under the impression that these 
files are actually incorporated into the existing region files.  Still, 
it's definitely a different behavior than what we were seeing before our 
recent upgrade.

- Adam

On 4/29/11 10:41 AM, Patrick Angeles wrote:
> Adam,
>
> They are probably not deleted, but moved to the appropriate region
> subdirectory under /hbase.


Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Patrick Angeles <pa...@cloudera.com>.
Adam,

They are probably not deleted, but moved to the appropriate region
subdirectory under /hbase.

On Fri, Apr 29, 2011 at 1:15 PM, Adam Phelps <am...@opendns.com> wrote:

> I just verified this, and the hfiles seem to be deleted one at a time as
> the bulk load runs.
>
> - Adam
>
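
One way to check whether files were moved rather than deleted is to compare recursive listings of the destination tree taken before and after the load. A minimal sketch on a local directory, assuming nothing about HBase's on-disk layout: ListingDiffSketch, snapshot() and added() are hypothetical names, and java.nio.file stands in for HDFS. It compares listings rather than searching by file name, since a loaded file may not keep its original name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingDiffSketch {

    /** Maps each regular file under root (as a relative path) to its size in bytes. */
    static Map<String, Long> snapshot(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, Long> sizes = new TreeMap<>();
        for (Path file : files) {
            sizes.put(root.relativize(file).toString(), Files.size(file));
        }
        return sizes;
    }

    /** Returns the relative paths present in 'after' but not in 'before'. */
    static Set<String> added(Map<String, Long> before, Map<String, Long> after) {
        Set<String> result = new TreeSet<>(after.keySet());
        result.removeAll(before.keySet());
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("store-root");
        Map<String, Long> before = snapshot(root);

        // Simulate a file arriving in a region/family style subdirectory.
        Path arrived = root.resolve("region").resolve("cf").resolve("renamed-file");
        Files.createDirectories(arrived.getParent());
        Files.write(arrived, new byte[] {1, 2, 3});

        Map<String, Long> after = snapshot(root);
        for (String path : added(before, after)) {
            System.out.println("new file: " + path + " (" + after.get(path) + " bytes)");
        }
    }
}
```

Comparing the sizes of the newly listed files with those of the files that vanished from the staging directory gives a rough confirmation that they are the same data.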

Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Adam Phelps <am...@opendns.com>.
I just verified this, and the hfiles seem to be deleted one at a time as 
the bulk load runs.

- Adam

On 4/28/11 4:28 PM, Stack wrote:
> I took a look through the code and don't see any explicit removes and
> looking through history of changes to the file, I don't see any change
> of substance.
>
> Can you figure what is doing the delete? At what stage?  Is it as
> completebulkload runs?
>
> St.Ack


Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Stack <st...@duboce.net>.
The only delete I see is in InputSampler where it does a cleanup of
any existing partition sample files first before writing a new one.
Even then it's doing an explicit file delete rather than a dir delete.

St.Ack

On Thu, Apr 28, 2011 at 4:28 PM, Stack <st...@duboce.net> wrote:
> I took a look through the code and don't see any explicit removes and
> looking through history of changes to the file, I don't see any change
> of substance.
>
> Can you figure what is doing the delete? At what stage?  Is it as
> completebulkload runs?
>
> St.Ack

Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Stack <st...@duboce.net>.
I took a look through the code and don't see any explicit removes and
looking through history of changes to the file, I don't see any change
of substance.

Can you figure what is doing the delete? At what stage?  Is it as
completebulkload runs?

St.Ack

On Thu, Apr 28, 2011 at 10:59 AM, Adam Phelps <am...@opendns.com> wrote:
> We were using a backup scheme for our system where we have map-reduce jobs
> generating HFiles, which we then loaded using LoadIncrementalHFiles before
> making a remote copy of them using distcp.
>
> However we just upgraded hbase (we're using cloudera's package, so we went
> from CDH3B4 to CDH3U0, both of which are versions of 0.90.1), and discovered
> that the HFiles now get deleted by the load operation.  Is this a recent
> change?  Is there a configuration variable to revert this behavior?
>
> We can work around it by doing the copy before the load, but that is less
> than optimal in our scenario as we'd prefer to have quicker access to the
> data in HBase.
>
> - Adam
>

Re: LoadIncrementalHFiles now deleting the hfiles?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Taking a quick peek at their change lists I can't really see what
would have changed :|

J-D

On Thu, Apr 28, 2011 at 10:59 AM, Adam Phelps <am...@opendns.com> wrote:
> We were using a backup scheme for our system where we have map-reduce jobs
> generating HFiles, which we then loaded using LoadIncrementalHFiles before
> making a remote copy of them using distcp.
>
> However we just upgraded hbase (we're using cloudera's package, so we went
> from CDH3B4 to CDH3U0, both of which are versions of 0.90.1), and discovered
> that the HFiles now get deleted by the load operation.  Is this a recent
> change?  Is there a configuration variable to revert this behavior?
>
> We can work around it by doing the copy before the load, but that is less
> than optimal in our scenario as we'd prefer to have quicker access to the
> data in HBase.
>
> - Adam
>