Posted to user@hbase.apache.org by Alex Grund <st...@googlemail.com> on 2013/02/06 21:24:05 UTC

How would you model this in Hbase?

Hi,

I am new to NoSQL databases and I am wondering how to model a
specific case with HBase.

What I want to model are economic time series, such as the
unemployment rate in a given country.

The complication is this: values of an economic time series can be
revised, but do not have to be.

An example:

Imagine the time series is published monthly, on the first day of each
month, with the value for the previous month, like this:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

(where "release" is the date of release and "reporting" is the
month the "value" refers to. Read: "On Dec 1, 2011, the
unemployment rate for Nov 2011 was reported to be 1.")

Now imagine that each release also revises the value for the month
before the one it newly reports, like this:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Read: On Oct 1, 2011, the unemployment rate for Sep was reported to be
"3", and the revised value for Aug was reported to be "4.5".

The most recent observation (release) ex-post is:  [1]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Since the data is not revised more than one month back, the whole
series ex-post would look like this: [3]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Whereas the "known-to-market" series would look like this: [2]

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

Those are the series I want to get from the DB.
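
To make the three target series concrete, here is a minimal sketch in plain Python (no database involved) that derives [1], [2] and [3] from the eight example observations. The names `Obs`, `latest_release`, `known_to_market` and `ex_post` are made up for this illustration, and the rule "an initial value is the one released in the month right after the reporting month" is an assumption taken from the publication schedule described above.

```python
from collections import namedtuple

# One published number: when it was released, which month it describes, its value.
Obs = namedtuple("Obs", "release reporting value")

# The eight observations from the example (ISO dates sort correctly as strings).
OBSERVATIONS = [
    Obs("2011-12-01", "2011-11-01", 1.0),
    Obs("2011-12-01", "2011-10-01", 2.5),
    Obs("2011-11-01", "2011-10-01", 2.0),
    Obs("2011-11-01", "2011-09-01", 3.5),
    Obs("2011-10-01", "2011-09-01", 3.0),
    Obs("2011-10-01", "2011-08-01", 4.5),
    Obs("2011-09-01", "2011-08-01", 4.0),
    Obs("2011-09-01", "2011-07-01", 5.5),
]


def next_month(date):
    """Return the same day in the following month, e.g. 2011-12-01 -> 2012-01-01."""
    year, month, day = (int(part) for part in date.split("-"))
    month += 1
    if month > 12:
        year, month = year + 1, 1
    return f"{year:04d}-{month:02d}-{day:02d}"


def latest_release(observations):
    """[1] Everything published in the most recent release."""
    newest = max(o.release for o in observations)
    picked = [o for o in observations if o.release == newest]
    return sorted(picked, key=lambda o: o.reporting, reverse=True)


def known_to_market(observations):
    """[2] Initial values only: released in the month after the reporting month."""
    picked = [o for o in observations if o.release == next_month(o.reporting)]
    return sorted(picked, key=lambda o: o.reporting, reverse=True)


def ex_post(observations):
    """[3] For each reporting month, the value from its most recent release."""
    last = {}
    for o in sorted(observations, key=lambda o: o.release):
        last[o.reporting] = o  # later releases overwrite earlier ones
    return sorted(last.values(), key=lambda o: o.reporting, reverse=True)
```

Whatever storage model is chosen, it has to be able to answer these three questions; the functions above only pin down what each series means.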


How would you model this with HBase? Is HBase suitable for that
application? Or, more generally, a column-oriented DB?

Or is a relational approach a better fit?


Thanks!

Re: How would you model this in Hbase?

Posted by Ian Varley <iv...@salesforce.com>.

Point well taken, Ulrich - I'm not very familiar with the domain here, but what you're saying makes sense. These aren't "mistakes that are being corrected", they're really two different pieces of information, and the difference between them is interesting in and of itself. In that case, explicitly modeling it is definitely better. :)

Ian

Re: How would you model this in Hbase?

Posted by Ulrich Staudinger <us...@activequant.com>.

Hi there,

No offence meant, Ian. I might also be thinking too much like a trader.

You definitely want to have those numbers readily available, not stored as
a version. In retrospect, you will want to know by how much the actuals
were off. Or you will want to run a trading strategy against the actuals
...

It is the same with any of those macro figures.

Revised and initially reported figures are two separate types of
information, and there is (almost) always a revised figure.

And when doing research, I wouldn't dare start with versioning unless it is
absolutely clear that the original value is wrong, void and worthless.

Cheers

P.s. pardon for double posting an hour ago.

Re: How would you model this in Hbase?

Posted by Ian Varley <iv...@salesforce.com>.

> Overloading the time stamp aka the versions of the cell is really not a
> good idea.

I agree in general, guys (and noted the dangers in my original post). I'd note, however, that this may be one of the rare cases where this actually *isn't* overloading the timestamp. If you look at the OP's question, this really is two versions of a single value. The data originally came in as X, then a month later it's revised to Y. If the majority of queries are going to just ask "what's the latest value", then this will make it easy in HBase, because that's the default behavior. And if you want to do a time travel query, that too is easy (you just set the max date you'd like to use). Doing either of those things with the reporting_month explicitly factored into the model (in the key, say) is harder. (Not impossible, just more complicated.)

In a relational database, you might model this as a simple "UPDATE econ SET value = '2.5' WHERE figure='unemployment' AND month_reporting = '2011-10-01'". The downside there is that you'd lose the old value and wouldn't be able to time travel. In HBase you can.

Overloading the timestamp is a terrible idea if you make it mean something other than "the date at which this data was valid". But that's not what's happening here, that's exactly what he's looking for.
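
As a rough illustration of the "latest by default, time travel on request" behaviour described above, here is a toy model of a single versioned cell in plain Python. It is not the HBase client API: the class `VersionedCell`, its methods, and the inclusive `as_of` bound are all assumptions made for this sketch, and release dates stand in for cell timestamps.

```python
import bisect


class VersionedCell:
    """Toy stand-in for one cell that keeps every (timestamp, value) version."""

    def __init__(self):
        self._timestamps = []  # kept sorted ascending
        self._values = []      # _values[i] belongs to _timestamps[i]

    def put(self, timestamp, value):
        i = bisect.bisect_left(self._timestamps, timestamp)
        if i < len(self._timestamps) and self._timestamps[i] == timestamp:
            self._values[i] = value  # same timestamp: overwrite that version
        else:
            self._timestamps.insert(i, timestamp)
            self._values.insert(i, value)

    def get(self, as_of=None):
        """Return the newest value, or the newest one with timestamp <= as_of."""
        if as_of is None:
            i = len(self._timestamps)
        else:
            i = bisect.bisect_right(self._timestamps, as_of)
        return self._values[i - 1] if i else None


# Hypothetical cell for the October 2011 unemployment figure;
# the timestamp of each version is the release date.
cell = VersionedCell()
cell.put("2011-11-01", 2.0)  # initial release
cell.put("2011-12-01", 2.5)  # revision one month later
```

With this toy, `cell.get()` answers "what is the latest value" and `cell.get(as_of="2011-11-15")` answers "what did we know in mid-November". Note that it only works as described if every version is retained.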

Ian

Re: How would you model this in Hbase?

Posted by Ulrich Staudinger <us...@gmail.com>.

> On 02/06/2013 01:49 PM, Michael Segel wrote:
>
>> Overloading the time stamp aka the versions of the cell is really not a
>> good idea.

Fully agree.

>> Yeah, I know opinions are like A.... everyone has one. ;-)

Yeah, but some people share one.

>> But you have to be aware that if someone decides to delete some data...
>> well one tombstone marker for the column, goodbye all of the versions of
>> the cell.
>> (Any ideas on a clean easy way to remove that tombstone? ;-)
>>
>> You're better off using other methods of adding dimension to your cell.
>> One that works well is using Avro.

>>> All the usual caveats apply: don't bother with HBase unless you've got
>>> some serious size in your data (e.g. TB) and need to support a heavy load
>>> of real-time updates and queries. Otherwise, go with something simpler to
>>> operate like a relational database, couchdb, etc.

While this is a valid point for just storing it and working on your own
with data, there are reasons why you want to choose a data integration
platform (more on this later).

Back to the root discussion.

Why don't you simply identify the six different types of information per
number:

- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure

the key would be:
<figure name>_<month>

et voilà.

I strongly advise against "overloading" the timestamping/versioning feature
of HBase.

You would still have to load the entire series and sort it by what you
like, but that's not a problem with HBase.
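
A minimal sketch of this layout, using a plain Python dict as a hypothetical stand-in for the table (no HBase client involved). The row key is `<figure name>_<month>` and the remaining four fields become explicit columns; the function names and the column names are illustrative assumptions.

```python
def row_key(figure_name, month):
    """Build the key <figure name>_<month>, e.g. unemployment_2011-10-01."""
    return f"{figure_name}_{month}"


table = {}  # stand-in for the table: row key -> {column name: value}


def store_initial(figure_name, month, release_date, figure):
    row = table.setdefault(row_key(figure_name, month), {})
    row["release_date"] = release_date
    row["figure"] = figure


def store_revision(figure_name, month, revision_date, revised_figure):
    row = table.setdefault(row_key(figure_name, month), {})
    row["revision_date"] = revision_date
    row["revised_figure"] = revised_figure


def load_series(figure_name, column):
    """Load every row of one figure and sort by month, newest first."""
    prefix = figure_name + "_"
    rows = [(key[len(prefix):], columns[column])
            for key, columns in table.items()
            if key.startswith(prefix) and column in columns]
    return sorted(rows, reverse=True)


# The example data from the original question.
store_initial("unemployment", "2011-11-01", "2011-12-01", 1.0)
store_initial("unemployment", "2011-10-01", "2011-11-01", 2.0)
store_initial("unemployment", "2011-09-01", "2011-10-01", 3.0)
store_initial("unemployment", "2011-08-01", "2011-09-01", 4.0)
store_revision("unemployment", "2011-10-01", "2011-12-01", 2.5)
store_revision("unemployment", "2011-09-01", "2011-11-01", 3.5)
store_revision("unemployment", "2011-08-01", "2011-10-01", 4.5)
store_revision("unemployment", "2011-07-01", "2011-09-01", 5.5)
```

Here `load_series("unemployment", "figure")` yields the initially reported values and `load_series("unemployment", "revised_figure")` the revised ones. The sketch assumes at most one revision per month, as in the example.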

<snip>
Thinking in ActiveQuant, you would store each of the columns above through
its IArchiveWriter. Then you can seamlessly view/chart it in the
ActiveQuant Master Server, making it available over CSV and SOAP to your
corporate intranet or to Excel through the AQ plugin.
</snip>

-- 
Ulrich Staudinger

http://www.activequant.org
Connect online: https://www.xing.com/profile/Ulrich_Staudinger

Re: How would you model this in Hbase?

Posted by Ulrich Staudinger <us...@activequant.com>.

Why don't you simply identify the six different types of information per
number:

- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure

the key would be:
<figure name>_<month>

et voilà.

I strongly advise against "overloading" the timestamping/versioning feature
of HBase.


You would still have to load the entire series and sort it by what you
like, but that's not a problem with HBase.

Thinking in ActiveQuant, you would store each of the columns above through
its IArchiveWriter. Then you can seamlessly view/chart it in the
ActiveQuant Master Server, making it available over CSV and SOAP to your
corporate intranet.


Cheers



-- 
Ulrich Staudinger, Managing Director and Sr. Software Engineer, ActiveQuant
GmbH

P: +41 79 702 05 95
E: ustaudinger@activequant.com

http://www.activequant.com

AQ-R user? Join our mailing list:
http://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/aqr-user

Re: How would you model this in Hbase?

Posted by James Taylor <jt...@salesforce.com>.

Another approach would be to use Phoenix 
(http://github.com/forcedotcom/phoenix). You can model your schema as 
you would in the relational world, but you get the horizontal 
scalability of HBase.

     James
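
To illustrate what "model your schema as you would in the relational world" could look like for this data, here is a sketch that uses Python's built-in sqlite3 purely as a stand-in. It is not Phoenix and makes no claim about Phoenix's SQL dialect; the table name `econ`, its columns, and the helper `as_of` are assumptions. Each release is stored as its own row, so nothing is ever overwritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE econ (
        figure    TEXT NOT NULL,
        reporting TEXT NOT NULL,   -- month the value refers to
        release   TEXT NOT NULL,   -- date the value was published
        value     REAL NOT NULL,
        PRIMARY KEY (figure, reporting, release)
    )
""")
conn.executemany(
    "INSERT INTO econ (figure, reporting, release, value) "
    "VALUES ('unemployment', ?, ?, ?)",
    [
        ("2011-11-01", "2011-12-01", 1.0),
        ("2011-10-01", "2011-12-01", 2.5),
        ("2011-10-01", "2011-11-01", 2.0),
        ("2011-09-01", "2011-11-01", 3.5),
        ("2011-09-01", "2011-10-01", 3.0),
        ("2011-08-01", "2011-10-01", 4.5),
        ("2011-08-01", "2011-09-01", 4.0),
        ("2011-07-01", "2011-09-01", 5.5),
    ],
)

# For each reporting month, pick the row from the newest release
# that had already been published on the given date.
AS_OF_SQL = """
    SELECT e.reporting, e.value
    FROM econ e
    WHERE e.figure = :figure
      AND e.release = (SELECT MAX(r.release)
                       FROM econ r
                       WHERE r.figure = e.figure
                         AND r.reporting = e.reporting
                         AND r.release <= :as_of)
    ORDER BY e.reporting DESC
"""


def as_of(connection, figure, date):
    """Return (reporting month, value) pairs as they were known on `date`."""
    return connection.execute(AS_OF_SQL, {"figure": figure, "as_of": date}).fetchall()
```

Calling `as_of(conn, "unemployment", "9999-12-31")` gives the ex-post series, while an earlier date gives the state of knowledge at that time.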

Re: How would you model this in Hbase?

Posted by Michael Segel <mi...@hotmail.com>.

Overloading the time stamp aka the versions of the cell is really not a good idea. 

Yeah, I know opinions are like A.... everyone has one. ;-) 

But you have to be aware that if someone decides to delete some data... well one tombstone marker for the column, goodbye all of the versions of the cell. 
(Any ideas on a clean easy way to remove that tombstone?  ;-) 
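
The tombstone concern can be sketched with a toy model in plain Python. This is a deliberate simplification, not HBase code: it assumes that a column delete marker hides every version with a timestamp at or below the marker's timestamp for as long as the marker exists, and it ignores compaction entirely. The class `Cell` and its methods are hypothetical.

```python
class Cell:
    """Toy cell with versions and at most one column delete marker."""

    def __init__(self):
        self.versions = {}     # timestamp -> value
        self.tombstone = None  # timestamp of the delete marker, if any

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def delete_column(self, timestamp):
        # Keep the newest marker; it hides everything at or below its timestamp.
        if self.tombstone is None or timestamp > self.tombstone:
            self.tombstone = timestamp

    def get(self, as_of=None):
        visible = [ts for ts in self.versions
                   if (self.tombstone is None or ts > self.tombstone)
                   and (as_of is None or ts <= as_of)]
        return self.versions[max(visible)] if visible else None


cell = Cell()
cell.put(1, 2.0)       # initial figure
cell.put(2, 2.5)       # revised figure
before_delete = cell.get()
cell.delete_column(3)  # one marker ...
after_delete = cell.get()           # ... and both versions are hidden
history_after_delete = cell.get(as_of=1)
cell.put(2, 2.5)       # re-inserting a back-dated version does not help
after_backdated_put = cell.get()
```

Under these assumptions a single delete wipes out the whole revision history that the versioning approach relies on, which is the risk being pointed out here.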

You're better off using other methods of adding dimension to your cell. One that works well is using Avro.

When I teach a course on HBase, I cover cells in the schema design section. I talk about the ability to use versioning as a way to add dimension, and then tell the students that this really isn't a good idea ...

-Just saying... 

On Feb 6, 2013, at 3:05 PM, Ian Varley <iv...@salesforce.com> wrote:

> Alex,
> 
> This might be an interesting use of the time dimension in HBase. Every value in HBase is uniquely represented by a set of coordinates:
> 
> - table
> - row key
> - column family
> - column qualifier
> - timestamp
> 
> So, you can have two different values that have all the same coordinates, except their timestamp. So for your example, that could be:
> 
> - table: econ
> - row key: "indicatorABC"
> - column family: cf1
> - column qualifier: "reporting_2011-10-01"
> 
> first value:
> - timestamp: "2011-11-01 00:00:00.000"
> - value: 2
> 
> second value:
> - timestamp: "2011-12-01 00:00:00.000"
> - value: 2.5
> 
> I.e., if you load the data such that the timestamps on the values represent the release date, then you can model this in a natural way. By default, reads in HBase will only give you the latest value, but you can manually tell a scanner to give you "time travel" by only reporting values as of an older date; so you could say "tell me what the data would have said on 11/01".
> 
> (Also, by default, HBase only keeps a limited number of historical versions (3), but you can tell it to keep all versions.)
> 
> There are some downsides to using the time dimension explicitly like this:
> - If you backdate things and also work with deletes, you could get some weird behavior depending on when compaction runs.
> - If you have lots of versions of things, the server still has to read over these when you scan, which makes things slower. (Probably doesn't apply if you only have a couple historical versions of any given value.)
> 
> All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. TB) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate like a relational database, CouchDB, etc.
> 
> Ian
> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com

Re: How would you model this in Hbase?

Posted by Ian Varley <iv...@salesforce.com>.

Alex,

This might be an interesting use of the time dimension in HBase. Every value in HBase is uniquely represented by a set of coordinates:

 - table
 - row key
 - column family
 - column qualifier
 - timestamp

So, you can have two different values that have all the same coordinates, except their timestamp. So for your example, that could be:

 - table: econ
 - row key: "indicatorABC"
 - column family: cf1
 - column qualifier: "reporting_2011-10-01"

first value:
 - timestamp: "2011-11-01 00:00:00.000"
 - value: 2

second value:
 - timestamp: "2011-12-01 00:00:00.000"
 - value: 2.5

I.e., if you load the data such that the timestamps on the values represent the release date, then you can model this in a natural way. By default, reads in HBase will only give you the latest value, but you can manually tell a scanner to give you "time travel" by only reporting values as of an older date; so you could say "tell me what the data would have said on 11/01".
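
As a sanity check of the idea, here is a toy in-memory model of that layout, loaded with the observations from Alex's example. It is only a sketch of the semantics, not HBase client code: the read() helper and its parameters are made up, ISO date strings stand in for the millisecond timestamps HBase actually stores, and the "first release" read assumes every version is retained.

```python
# Alex's observations as (release, reporting, value).
observations = [
    ("2011-12-01", "2011-11-01", 1.0),
    ("2011-12-01", "2011-10-01", 2.5),
    ("2011-11-01", "2011-10-01", 2.0),
    ("2011-11-01", "2011-09-01", 3.5),
    ("2011-10-01", "2011-09-01", 3.0),
    ("2011-10-01", "2011-08-01", 4.5),
    ("2011-09-01", "2011-08-01", 4.0),
    ("2011-09-01", "2011-07-01", 5.5),
]

# One row ("indicatorABC"): one column per reporting month,
# one version per release date.
row = {}
for release, reporting, value in observations:
    row.setdefault("reporting_" + reporting, {})[release] = value


def read(row, as_of=None, first=False):
    """Return one value per column.

    as_of: ignore versions released after this date ("time travel").
    first: pick the oldest eligible version instead of the newest.
    """
    result = {}
    for qualifier, versions in row.items():
        eligible = sorted(ts for ts in versions if as_of is None or ts <= as_of)
        if eligible:
            result[qualifier] = versions[eligible[0] if first else eligible[-1]]
    return result


ex_post = read(row)                      # newest version of every column: series [3]
as_of_nov = read(row, as_of="2011-11-01")  # what the data said on 11/01
known_to_market = read(row, first=True)  # first release of every column: series [2]
# Note: the sample has no first release for 2011-07, so that column shows 5.5.

print(ex_post)
print(as_of_nov)
print(known_to_market)
```

The default read gives the ex-post series and the as-of read gives the time-travel view. The known-to-market series needs the oldest version of each column, so it only works if all versions are kept and requested.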

(Also, by default, HBase only keeps a limited number of historical versions (3), but you can tell it to keep all versions.)
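
To see why the version limit matters for this design, here is a small sketch of the retention rule. The put_versioned() helper is hypothetical, and it prunes eagerly on every write purely to keep the example short; it only illustrates "keep the newest N versions", not when or how a real table discards old ones.

```python
def put_versioned(versions, ts, value, max_versions=3):
    """Store value at ts, then keep only the newest max_versions entries.

    max_versions=None models a column family configured to keep everything.
    """
    versions[ts] = value
    if max_versions is not None:
        # Timestamps sorted newest first; everything past the limit is dropped.
        for old_ts in sorted(versions, reverse=True)[max_versions:]:
            del versions[old_ts]
    return versions


# Four hypothetical releases for one reporting month.
releases = [("2011-09-01", 5.5), ("2011-10-01", 5.4),
            ("2011-11-01", 5.3), ("2011-12-01", 5.2)]

limited, unlimited = {}, {}
for release, value in releases:
    put_versioned(limited, release, value)                       # default limit of 3
    put_versioned(unlimited, release, value, max_versions=None)  # keep all

print(sorted(limited))    # the first release is gone
print(sorted(unlimited))  # full revision history
```

With the limit in place the first release is the one that disappears, which is exactly the value a "known-to-market" query needs.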

There are some downsides to using the time dimension explicitly like this:
 - If you backdate things and also work with deletes, you could get some weird behavior depending on when compaction runs.
 - If you have lots of versions of things, the server still has to read over these when you scan, which makes things slower. (Probably doesn't apply if you only have a couple historical versions of any given value.)

All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. TB) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate like a relational database, CouchDB, etc.

Ian
