Posted to common-user@hadoop.apache.org by Per Steffensen <st...@designware.dk> on 2011/12/08 14:13:19 UTC

Routing and region deletes

Hi

The system we are going to work on will receive 50+ million new data records 
every day. We need to keep a history of 2 years of data (that's 35+ 
billion data records in storage all in all), and that basically means 
that we also need to delete 50+ million data records every day, or e.g. 1.5 
billion every month. We plan to store the data records in HBase.
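
The volumes above can be checked with a few lines of arithmetic. This is only a sketch using the round figures from the post; the 30-day month and the function names are assumptions made for illustration.

```python
# Back-of-the-envelope retention arithmetic using the round numbers from the post.
RECORDS_PER_DAY = 50_000_000
RETENTION_DAYS = 2 * 365  # two years of history, ignoring leap days


def steady_state_records(per_day, days):
    """Records held once the retention window is full."""
    return per_day * days


def monthly_deletes(per_day, days_in_month=30):
    """Records that age out in one month (a 30-day month is assumed)."""
    return per_day * days_in_month


total = steady_state_records(RECORDS_PER_DAY, RETENTION_DAYS)
per_month = monthly_deletes(RECORDS_PER_DAY)
print(f"{total:,} records retained")       # 36,500,000,000 records retained
print(f"{per_month:,} deletes per month")  # 1,500,000,000 deletes per month
```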

Is it somehow possible to tell HBase to put (route) all data records 
belonging to a specific date or month to a designated set of regions 
(and route nothing else there), so that deleting all data belonging to 
that day/month is basically deleting those regions entirely? And is 
explicit deletion of entire regions possible at all?

The reason I want to do this is that I expect it to be much faster than 
doing explicit deletion, record by record, of 50+ million records every day.

Regards, Per Steffensen



Re: Routing and region deletes

Posted by Per Steffensen <st...@designware.dk>.
Ahhh stupid me. I probably just want to use different tables for 
different days/months. I believe tables can be deleted fairly quickly in 
HBase?
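
A table-per-month layout along these lines could be sketched as follows. The sketch only computes table names and decides which tables have aged out of a 24-month window; it makes no HBase calls. The "datarecords_" prefix, the function names and the month-based retention rule are all hypothetical. In HBase itself a table has to be disabled before it can be dropped (disable, then drop, in the shell).

```python
from datetime import date

TABLE_PREFIX = "datarecords_"  # hypothetical naming scheme, not from the thread
RETENTION_MONTHS = 24          # two years of history


def month_index(year, month):
    """Number of months since year 0, so months can be compared as integers."""
    return year * 12 + (month - 1)


def table_for(day):
    """Name of the monthly table a record with this date would be written to."""
    return f"{TABLE_PREFIX}{day.year:04d}{day.month:02d}"


def expired_tables(existing, today, retention_months=RETENTION_MONTHS):
    """Tables whose month lies entirely outside the retention window."""
    cutoff = month_index(today.year, today.month) - retention_months
    expired = []
    for name in existing:
        if not name.startswith(TABLE_PREFIX):
            continue  # not one of our monthly tables
        stamp = name[len(TABLE_PREFIX):]
        if month_index(int(stamp[:4]), int(stamp[4:6])) < cutoff:
            expired.append(name)
    return sorted(expired)


tables = ["datarecords_200911", "datarecords_200912", "datarecords_201112"]
print(table_for(date(2011, 12, 8)))               # datarecords_201112
print(expired_tables(tables, date(2011, 12, 8)))  # ['datarecords_200911']
```

With this rule the oldest month is kept until it is fully older than 24 months, so the store briefly holds 25 monthly tables, including the current partial month.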

Regards, Per Steffensen



Re: Routing and region deletes

Posted by Per Steffensen <st...@designware.dk>.
Thanks for your reply!

Michel Segel wrote:
> Per Steffensen,
>
> I would urge you to step away from the keyboard and rethink your design.
>   
Will do :-) But I would actually still like to receive answers to my 
questions - just pretend that my ideas are not so stupid and let me know 
if it can be done.
> It sounds like you want to replicate a date partition model similar to what you would do if you were attempting this with a relational database.
>
> HBase is not a relational database and you have a different way of doing things.
>   
I know
> You could put the date/time stamp in the key such that your data is sorted by date.
>   
But I guess that would not guarantee that records with timestamps from a 
specific day or month all exist in the same set of regions and that 
records with timestamps from other days or months all exist outside 
those regions, so that I can delete records from that day or month, just 
by deleting the regions.
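
One way to reason about this concern is a sketch like the one below. It assumes row keys that start with the date as YYYYMMDD and a table pre-split at month boundaries; HBase does accept explicit split keys at table creation (HBaseAdmin.createTable(desc, splitKeys)), and later region splits only subdivide an existing key range. The helper names are hypothetical, and the sketch only shows how keys map to ranges. It says nothing about whether removing a region by hand is a supported operation.

```python
import bisect
from datetime import date


def month_starts(first, last):
    """Yield the first day of every month from `first` through `last`."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def split_keys(first, last):
    """Split points at month boundaries for row keys starting with YYYYMMDD.

    The first month needs no split point: the first region starts at the
    empty key.
    """
    starts = list(month_starts(first, last))
    return [d.strftime("%Y%m%d").encode("ascii") for d in starts[1:]]


def region_for(row_key, splits):
    """Index of the key range a row falls into (start keys are inclusive)."""
    return bisect.bisect_right(splits, row_key)


splits = split_keys(date(2011, 11, 1), date(2012, 2, 1))
print(splits)                              # [b'20111201', b'20120101', b'20120201']
print(region_for(b"20111130-zzz", splits)) # 0
print(region_for(b"20111208-abc", splits)) # 1
```

Note that a layout like this sends all writes for the current month to the same key range, which is the hot-spot problem raised in the quoted reply.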
> However, this would cause hot spots.  Think about how you access the data. It sounds like you access the more recent data more frequently than historical data.
Not necessarily wrt reading, but certainly I (almost) only write new 
records with timestamps from the current day/month.
>   This is a bad idea in HBase.
> (note: it may still make sense to do this ... You have to think more about the data and consider alternatives.)
>
> I personally would hash the key for even distribution, again depending on the data access pattern.  (hashed data means you can't do range queries but again, it depends on what you are doing...)
>
> You also have to think about how you purge the data. You don't just drop a region.
I know that this is not the "default" way of deleting data, but is it 
possible? I believe a region is basically just a folder with a set of 
files, and deleting those would be a matter of a few ms. So if I can 
route all records with timestamps from a certain day or month to a 
designated set of regions, deleting all those records will be a matter 
of deleting #regions-in-that-set folders on disk - very quick. The 
alternative is to do 50+ million single delete operations every day (or 1.5 
billion operations every month), and that will not even free up space 
immediately, since the records will actually just be marked as deleted (in a 
new file) - space will not be freed before the next compaction of the 
involved regions (see e.g. http://outerthought.org/blog/465-ot.html).
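
The point about delete markers can be illustrated with a toy model. This is not HBase code: the ToyStore class and its methods are invented for illustration, and it ignores timestamps, versions and column families. It only shows that a delete first adds entries on disk and that space comes back at compaction.

```python
class ToyStore:
    """Toy model of append-only store files with delete markers.

    Simplification: a delete marker hides every put of the same key,
    regardless of write order (real HBase compares timestamps).
    """

    def __init__(self):
        self.files = []  # each "file" is a list of (key, kind) entries

    def put_batch(self, keys):
        self.files.append([(k, "put") for k in keys])

    def delete_batch(self, keys):
        # A delete does not touch old files; it writes markers to a new one.
        self.files.append([(k, "delete") for k in keys])

    def entries_on_disk(self):
        return sum(len(f) for f in self.files)

    def major_compact(self):
        # Rewrite everything into one file, dropping deleted puts and markers.
        deleted = {k for f in self.files for k, kind in f if kind == "delete"}
        live = [(k, "put") for f in self.files for k, kind in f
                if kind == "put" and k not in deleted]
        self.files = [live]


store = ToyStore()
store.put_batch(["r1", "r2", "r3", "r4", "r5"])
store.delete_batch(["r1", "r2", "r3"])
print(store.entries_on_disk())  # 8: five puts plus three markers
store.major_compact()
print(store.entries_on_disk())  # 2: only r4 and r5 remain
```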
>  Doing a full table scan once a month to delete may not be a bad thing.
But I don't believe one full table scan will be enough. For that to be 
possible, at least I would have to be able to provide HBase with all 1.5 
billion records to delete in one "delete" call - that's probably not 
possible :-)
>  Again it depends on what you are doing...
>
> Just my opinion. Others will have their own... Now I'm stepping away from the keyboard to get my morning coffee...
>   
Enjoy. Then I will consider leaving work (it's late afternoon in Europe)
> :-)
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel


Re: Routing and region deletes

Posted by Michel Segel <mi...@hotmail.com>.
Per Steffensen,

I would urge you to step away from the keyboard and rethink your design.
It sounds like you want to replicate a date partition model similar to what you would do if you were attempting this with a relational database.

HBase is not a relational database and you have a different way of doing things.

You could put the date/time stamp in the key such that your data is sorted by date.
However, this would cause hot spots.  Think about how you access the data. It sounds like you access the more recent data more frequently than historical data.  This is a bad idea in HBase.
(note: it may still make sense to do this ... You have to think more about the data and consider alternatives.)

I personally would hash the key for even distribution, again depending on the data access pattern.  (hashed data means you can't do range queries but again, it depends on what you are doing...)
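
The trade-off described here can be sketched with a salted key prefix. This is one common variant of key hashing and not necessarily the scheme meant above; the bucket count, the key layout and the function names are assumptions for illustration. Writes for a single day spread over all buckets, and reading one day back takes one prefix scan per bucket instead of a single range scan.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count


def salted_key(record_id, day):
    """Prefix the natural key with a hash-derived bucket to spread writes."""
    natural = f"{day}-{record_id}"
    digest = hashlib.md5(natural.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}-{natural}"


def scan_prefixes(day):
    """Prefixes to scan to read one day back: one per bucket."""
    return [f"{bucket:02d}-{day}" for bucket in range(NUM_BUCKETS)]


keys = [salted_key(f"rec{i}", "20111208") for i in range(1000)]
buckets_used = {k[:2] for k in keys}
print(keys[0])                         # bucket prefix, then 20111208-rec0
print(len(buckets_used))               # number of distinct buckets hit
print(len(scan_prefixes("20111208")))  # 16
```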

You also have to think about how you purge the data. You don't just drop a region. Doing a full table scan once a month to delete may not be a bad thing. Again it depends on what you are doing...

Just my opinion. Others will have their own... Now I'm stepping away from the keyboard to get my morning coffee...
:-)


Sent from a remote device. Please excuse any typos...

Mike Segel
