Posted to issues@hbase.apache.org by "Tobi Vollebregt (JIRA)" <ji...@apache.org> on 2015/04/08 19:00:20 UTC
[jira] [Updated] (HBASE-13430) HFiles that are in use by a table
cloned from a snapshot may be deleted when that snapshot is deleted
[ https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tobi Vollebregt updated HBASE-13430:
------------------------------------
Description:
We recently had a production issue in which HFiles that were still in use by a table were deleted. This appears to be caused by a race between the order in which HFileLinks and their back references are created and the fact that only files younger than {{hbase.master.hfilecleaner.ttl}} are kept alive by the cleaner.
This is how to reproduce:
* Clone a large snapshot into a new table. The clone operation must take longer than {{hbase.master.hfilecleaner.ttl}} to guarantee data loss.
* Ensure that no other table or snapshot is referencing the HFiles used by the new table.
* Delete the snapshot. This breaks the table.
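The steps above roughly correspond to this hbase shell session (the table and snapshot names are illustrative, and the exact read failure may vary):

{code}
hbase> clone_snapshot 'big_snapshot', 'cloned_table'
  # ... must take longer than hbase.master.hfilecleaner.ttl to hit the race
hbase> delete_snapshot 'big_snapshot'
  # with no other referents, the archived HFiles backing 'cloned_table'
  # become eligible for deletion
hbase> count 'cloned_table'
  # reads can now fail because the underlying HFiles are gone
{code}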
The main cause is this:
* Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
* However, it immediately creates, in the archive directory, back references for the HFileLinks it creates for the new table.
* HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it considers all of those back references deletable.
* The only thing that then keeps the referenced HFiles alive is TimeToLiveHFileCleaner, and only for {{hbase.master.hfilecleaner.ttl}} (5 minutes by default).
* So if cloning the snapshot takes longer than 5 minutes, and the HFiles aren't referenced by any other table or snapshot, data loss is guaranteed.
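To make the timing window concrete, here is a minimal sketch (not actual HBase code; class and method names are made up) of the TTL check that TimeToLiveHFileCleaner effectively applies: a back reference whose age exceeds {{hbase.master.hfilecleaner.ttl}} is no longer protected, so if HFileLinkCleaner has already judged it deletable, nothing keeps the HFile alive.

{code}
// Sketch of the TTL-based deletability window; names are illustrative.
public class TtlCleanerSketch {
    // Default hbase.master.hfilecleaner.ttl: 5 minutes, in milliseconds.
    static final long TTL_MS = 5 * 60 * 1000;

    // A back reference created at createdAtMs is only protected by the
    // TTL cleaner while its age stays within the TTL window.
    static boolean isDeletable(long createdAtMs, long nowMs) {
        return nowMs - createdAtMs > TTL_MS;
    }

    public static void main(String[] args) {
        long created = 0;
        // 4 minutes into the clone: still inside the TTL window.
        System.out.println(isDeletable(created, 4 * 60 * 1000)); // false
        // 6 minutes in: past the window; if HFileLinkCleaner already
        // considered the back reference deletable, the HFile can be removed
        // even though the half-created clone still needs it.
        System.out.println(isDeletable(created, 6 * 60 * 1000)); // true
    }
}
{code}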
I have a unit test reproducing the issue and I tried to fix this, but didn't completely succeed. I will attach the patch shortly.
> HFiles that are in use by a table cloned from a snapshot may be deleted when that snapshot is deleted
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-13430
> URL: https://issues.apache.org/jira/browse/HBASE-13430
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Reporter: Tobi Vollebregt
> Labels: data-integrity, master
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)