Posted to user@hbase.apache.org by Bradford Cross <br...@gmail.com> on 2009/04/01 04:25:32 UTC

financial time series database

Greetings,

I am prototyping a financial time series database on top of HBase and trying
to wrap my head around what a good design would look like.

As I understand it, I have rows, column families, columns and cells.

Since the only thing that HBase really "indexes" is row keys, it seems
natural to represent the row keys as the date/time.

As a simple example:

Bar data:

{
   "2009/1/17" : {
     "open":"100",
     "high":"102",
     "low":"99",
     "close":"101"
     "volume":"1000256"
   }
}


Quote data:

{
   "2009/1/17:11:23:04" : {
     "bid":"100.01",
     "ask":"100.02",
     "bidsize":"10000",
     "asksize":"100200"
   }
}
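
A minimal sketch of writing one such bar row with the HBase Java client (this
uses the current client API, which postdates this thread; the table name
"bars" and the column family "d" are made-up names for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BarWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table bars = conn.getTable(TableName.valueOf("bars"))) {
      // Row key is the date alone; zero-padding keeps keys in chronological sort order.
      Put p = new Put(Bytes.toBytes("2009/01/17"));
      byte[] d = Bytes.toBytes("d");                      // a single, hypothetical column family
      p.addColumn(d, Bytes.toBytes("open"),   Bytes.toBytes("100"));
      p.addColumn(d, Bytes.toBytes("high"),   Bytes.toBytes("102"));
      p.addColumn(d, Bytes.toBytes("low"),    Bytes.toBytes("99"));
      p.addColumn(d, Bytes.toBytes("close"),  Bytes.toBytes("101"));
      p.addColumn(d, Bytes.toBytes("volume"), Bytes.toBytes("1000256"));
      bars.put(p);
    }
  }
}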

But there are many other issues to think about.

In financial time series data we have small amounts of data within each
"observation" and we can have lots of observations.  We can have millions of
observations per time series (f.ex. all historical trade and quote data for
a particular stock since 1993) across hundreds of thousands of individual
instruments (f.ex. across all stocks that have traded since 1993).

The write patterns fit HBase nicely, because the workload is write-once and
append-only.  This is followed by loads of offline processes for simulating
trading models and such.  These query patterns look like "all quotes for all
stocks between the dates of 1/1/1996 and 12/31/2008."  So the querying is
typically across a date range, and we can further filter the query by
instrument types.
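
Under a date-keyed layout like the one above, that query becomes a plain row
range scan. A minimal sketch with the current Java client (the "quotes" table
name and zero-padded date keys are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DateRangeScan {
  // Scan every row whose date key falls between 1996/01/01 and 2008/12/31.
  static void scanRange(Table quotes) throws IOException {
    Scan s = new Scan()
        .withStartRow(Bytes.toBytes("1996/01/01"))
        .withStopRow(Bytes.toBytes("2008/12/31"), true);  // true = stop row inclusive
    try (ResultScanner rs = quotes.getScanner(s)) {
      for (Result r : rs) {
        System.out.println(Bytes.toString(r.getRow()));   // one day's observations per row
      }
    }
  }
}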

So I am not sure what makes sense for efficiency because I do not understand
HBase well enough yet.

 What kinds of mixes of rows, column families, and columns should I be
thinking about?

Does my simplistic approach make any sense?  That would mean each row is a
key-value pair where the key is the date/time and the value is the
"observation."  I suppose this leads to a "table per time series" model.
Does that make sense or is there overhead to having lots of tables?

Re: financial time series database

Posted by zsongbo <zs...@gmail.com>.
Wesley's suggestions may make sense.

Are there technical lim


On Sat, Apr 4, 2009 at 9:00 AM, Ryan Rawson <ry...@gmail.com> wrote:

> Another reason to perhaps avoid tons of versions is there is no query
> mechanism, nor will there ever be.  The mechanism is limited to asking for
> either the last N versions, or all of them.  If you are querying a date
> range, this is obviously a problem.
>
> -ryan
>
> On Fri, Apr 3, 2009 at 7:25 AM, stack <st...@duboce.net> wrote:
>
> > On Thu, Apr 2, 2009 at 9:53 PM, Wesley Chow <we...@s7labs.com> wrote:
> >
> > >
> > > Are there technical limitations to the number of different timestamps
> per
> > > cell? If it's the case that you're doing to be dealing with tens of
> > > thousands to millions of entries all at one cell, perhaps you should
> > check
> > > that to make sure it's a reasonable use case. The examples in the HBase
> > docs
> > > number the timestamps in single digits, and I don't recall any mention
> of
> > > very large numbers.
> >
> >
> >
> > Agreed.  I'd  imagine that tens of thousands of versions currently would
> > suffer in same manner as tens of thousands of columns -- hbase running
> > increasingly slower as count went up, at least until we address
> > HBASE-867<https://issues.apache.org/jira/browse/HBASE-867> "If
> > millions of columns in a column family, hbase scanner won't come
> > up<https://issues.apache.org/jira/browse/HBASE-867>
> > "
> >
> >
> > St.Ack
> >
>

Re: financial time series database

Posted by Ryan Rawson <ry...@gmail.com>.
Another reason to perhaps avoid tons of versions is there is no query
mechanism, nor will there ever be.  The mechanism is limited to asking for
either the last N versions, or all of them.  If you are querying a date
range, this is obviously a problem.

-ryan
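
As an illustration of that mechanism, asking for the last N versions of a cell
looks roughly like this in the current Java client (which postdates this
thread; the "d" family and "bid" qualifier are made-up names):

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LastNVersions {
  // Print up to the n most recent versions of the "bid" cell for one row.
  static void lastBids(Table quotes, String rowKey, int n) throws IOException {
    Get g = new Get(Bytes.toBytes(rowKey));
    g.addColumn(Bytes.toBytes("d"), Bytes.toBytes("bid"));
    g.setMaxVersions(n);
    Result r = quotes.get(g);
    for (Cell c : r.rawCells()) {
      System.out.println(c.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(c)));
    }
  }
}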

On Fri, Apr 3, 2009 at 7:25 AM, stack <st...@duboce.net> wrote:

> On Thu, Apr 2, 2009 at 9:53 PM, Wesley Chow <we...@s7labs.com> wrote:
>
> >
> > Are there technical limitations to the number of different timestamps per
> > cell? If it's the case that you're doing to be dealing with tens of
> > thousands to millions of entries all at one cell, perhaps you should
> check
> > that to make sure it's a reasonable use case. The examples in the HBase
> docs
> > number the timestamps in single digits, and I don't recall any mention of
> > very large numbers.
>
>
>
> Agreed.  I'd  imagine that tens of thousands of versions currently would
> suffer in same manner as tens of thousands of columns -- hbase running
> increasingly slower as count went up, at least until we address
> HBASE-867<https://issues.apache.org/jira/browse/HBASE-867> "If
> millions of columns in a column family, hbase scanner won't come
> up<https://issues.apache.org/jira/browse/HBASE-867>
> "
>
>
> St.Ack
>

Re: financial time series database

Posted by stack <st...@duboce.net>.
On Thu, Apr 2, 2009 at 9:53 PM, Wesley Chow <we...@s7labs.com> wrote:

>
> Are there technical limitations to the number of different timestamps per
> cell? If it's the case that you're doing to be dealing with tens of
> thousands to millions of entries all at one cell, perhaps you should check
> that to make sure it's a reasonable use case. The examples in the HBase docs
> number the timestamps in single digits, and I don't recall any mention of
> very large numbers.



Agreed.  I'd imagine that tens of thousands of versions currently would
suffer in the same manner as tens of thousands of columns -- hbase running
increasingly slower as the count went up, at least until we address
HBASE-867 <https://issues.apache.org/jira/browse/HBASE-867>, "If
millions of columns in a column family, hbase scanner won't come up."


St.Ack

Re: financial time series database

Posted by Wesley Chow <we...@s7labs.com>.
Are there technical limitations to the number of different timestamps  
per cell? If it's the case that you're going to be dealing with tens
of thousands to millions of entries all in one cell, perhaps you
should check that to make sure it's a reasonable use case. The  
examples in the HBase docs number the timestamps in single digits, and  
I don't recall any mention of very large numbers.

An alternative layout might be to append the timestamp to the  
instrument for the row key. So you might have something like:

YHOO.20090402154723012345 -> ...
YHOO.20090402154723023456 -> ...

This way, if you're appending to the database in the order your quotes  
come in, you aren't hitting the hotspot mentioned previously. You also
get to index into the table by instrument if you need to. The downside  
here is that if you have to read in data for *all* instruments for a  
specific day, there doesn't seem to be a trivial way of accomplishing  
that. You could, of course, maintain a separate database that tells  
you what the entire universe of instruments is per day.
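
A rough sketch of that key layout with the current Java client (which
postdates this thread); the helper names and exact key format are assumptions,
and the scan pulls one instrument over a time window:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QuoteKeys {
  // Row key = SYMBOL + "." + a fixed-width, sortable timestamp, as in YHOO.20090402154723012345
  static byte[] rowKey(String symbol, String sortableTimestamp) {
    return Bytes.toBytes(symbol + "." + sortableTimestamp);
  }

  // Scan all quotes for one instrument between two timestamps (start inclusive, stop exclusive).
  static void scanInstrument(Table quotes, String symbol, String from, String to) throws IOException {
    Scan s = new Scan()
        .withStartRow(rowKey(symbol, from))
        .withStopRow(rowKey(symbol, to));
    try (ResultScanner rs = quotes.getScanner(s)) {
      for (Result r : rs) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    }
  }
}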

You could also use a hybrid approach, where perhaps you have a row key  
that matches a single day, like YHOO.20090402, and then have X number  
of cells with timestamps according to the time at which that quote  
came in on that day.
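
And a sketch of that hybrid layout, again with made-up names and the current
client: one row per instrument per day, with the quote time used as the cell
timestamp (the column family would have to be configured to keep enough
versions):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HybridQuoteWriter {
  static final byte[] FAMILY = Bytes.toBytes("q");   // hypothetical family; must retain many versions

  // Row key = SYMBOL.yyyymmdd; the quote time becomes the cell timestamp.
  static void putQuote(Table quotes, String symbol, String day,
                       long quoteTimeMillis, String bid, String ask) throws IOException {
    Put p = new Put(Bytes.toBytes(symbol + "." + day));
    p.addColumn(FAMILY, Bytes.toBytes("bid"), quoteTimeMillis, Bytes.toBytes(bid));
    p.addColumn(FAMILY, Bytes.toBytes("ask"), quoteTimeMillis, Bytes.toBytes(ask));
    quotes.put(p);
  }
}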

At any rate -- if your access pattern is structured predictably (like,  
primarily reading straight through and picked by coarse grain  
properties, such as instrument and day), you might be better served  
storing the files directly in HDFS and just not bothering with HBase  
at all.
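
For the plain-HDFS route, a minimal sketch: append delimited quote records to
per-instrument, per-day files that offline jobs read straight through (the
path layout and record format are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuoteFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One file per instrument per day, read straight through by offline jobs.
    Path p = new Path("/quotes/YHOO/20090402");
    try (FSDataOutputStream out = fs.create(p)) {
      out.writeBytes("20090402154723012345\t100.01\t100.02\t10000\t100200\n");
    }
  }
}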


Wes


On Apr 2, 2009, at 11:41 AM, Bradford Cross wrote:

> Cool, so the schema I am leaning toward is:
>
> -hijack time stamp to be the time of each observation.  Use a column  
> family
> to hold all the data, and a column for each property of  each  
> observation.
>
> Since HBase sorts the timestamps descending, it seems like hijacking  
> the
> timestamps makes sense.  Any performance implications of this that I  
> should
> be aware of?
>
> Hijacking the time stamps seems to be fairly intuitive, and  
> leverages the
> time stamps which I otherwise would not really care about if I just  
> ignored
> timestamps and dumped all data including the date/time of  
> observations into
> columns.
>
> Are there any downsides to hijacking the timestamps like this?
>
>
>
> On Thu, Apr 2, 2009 at 12:13 AM, stack <st...@duboce.net> wrote:
>
>> I should also state that apart from the hbase inadequacy, your  
>> schema looks
>> good (hbase should be able to carry this schema-type w/o sweat --  
>> hopefully
>> 0.20.0).
>> St.Ack
>>
>> On Thu, Apr 2, 2009 at 9:12 AM, stack <st...@duboce.net> wrote:
>>
>>> How many columns will you have?  Until we fix
>>> https://issues.apache.org/jira/browse/HBASE-867, you are limited  
>>> regards
>>> the number of columns you can have.
>>> St.Ack
>>>
>>>
>>> On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <
>> bradford.n.cross@gmail.com
>>>> wrote:
>>>
>>>> Based on reading the hbase architecture wiki, I have changed my  
>>>> thinking
>>>> due
>>>> to the "Column Family Centric Storage."
>>>>
>>>> HBase stores column families physically close on disk, so the  
>>>> items in a
>>>> given column family should have roughly the same read/write
>>>> characteristics
>>>> and contain similar data.  Although at a conceptual level, tables  
>>>> may be
>>>> viewed as a sparse set of rows, physically they are stored on a
>> per-column
>>>> family basis. This is an important consideration for schema and
>>>> application
>>>> designers to keep in mind.
>>>>
>>>> This leads me to the thought of keeping an entire time series  
>>>> inside a
>>>> single column family.
>>>>
>>>> Options:
>>>>
>>>> Row key is a ticker symbol:
>>>> - hijack time stamp to be the time of each observation.  Use a  
>>>> column
>>>> family
>>>> to hold all the data, and a column for each property of  each
>> observation.
>>>> -don't hijack the time stamp, just ignore it.  Use a column  
>>>> family for
>> all
>>>> the data, and use an individual column for the date/time of the
>>>> observation,
>>>> and individual columns for each property of each observation.
>>>>
>>>> thoughts?
>>>>
>>>> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
>>>> <br...@gmail.com>wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> I am prototyping a financial time series database on top of  
>>>>> HBase and
>>>>> trying to head my head around what a good design would look like.
>>>>>
>>>>> As I understand it, I have rows, column families, columns and  
>>>>> cells.
>>>>>
>>>>> Since the only think that Hbase really "indexes" is row keys, it  
>>>>> seems
>>>>> natural in a way to represent the rowkeys as the date/time.
>>>>>
>>>>> As a simple example:
>>>>>
>>>>> Bar data:
>>>>>
>>>>> {
>>>>>   "2009/1/17" : {
>>>>>     "open":"100",
>>>>>     "high":"102",
>>>>>     "low":"99",
>>>>>     "close":"101"
>>>>>     "volume":"1000256"
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> Quote data:
>>>>>
>>>>> {
>>>>>   "2009/1/17:11:23:04" : {
>>>>>     "bid":"100.01",
>>>>>     "ask":"100.02",
>>>>>     "bidsize":"10000",
>>>>>     "asksize":"100200"
>>>>>   }
>>>>> }
>>>>>
>>>>> But there are many other issues to think about.
>>>>>
>>>>> In financial time series data we have small amounts of data within
>> each
>>>>> "observation" and we can have lots of observations.  We can have
>>>> millions of
>>>>> observations per time series (f.ex. all historical trade and quote
>> date
>>>> for
>>>>> a particular stock since 1993)across hundreds of thousands of
>> individual
>>>>> instruments (f.ex. across all stocks that have traded since 1993.)
>>>>>
>>>>> The write patterns fit HBase nicely, because it is a write once  
>>>>> and
>>>> append
>>>>> pattern.  This is followed by loads of offline processes for
>> simulating
>>>>> trading models and such.  These query patterns look like "all  
>>>>> quotes
>> for
>>>> all
>>>>> stocks between the dates of 1/1/996 and 12/31/2008."  So the  
>>>>> querying
>> is
>>>>> typically across a date range, and we can further filter the  
>>>>> query by
>>>>> instrument types.
>>>>>
>>>>> So I am not sure what makes sense for efficiency because I do not
>>>>> understand HBase well enough yet.
>>>>>
>>>>> What kinds of mixes of rows, column families, and columns should  
>>>>> I be
>>>>> thinking about?
>>>>>
>>>>> Does my simplistic approach make any sense?  That would mean  
>>>>> each row
>> is
>>>> a
>>>>> key-value pair where the key is is the date/time and the value  
>>>>> is the
>>>>> "observation."  I suppose this leads to a "table per time series"
>> model.
>>>>> Does that make sense or is there overhead to having lots of  
>>>>> tables?
>>>>>
>>>>
>>>
>>>
>>


Re: financial time series database

Posted by Bradford Cross <br...@gmail.com>.
Cool, so the schema I am leaning toward is:

- hijack the time stamp to be the time of each observation.  Use a column family
to hold all the data, and a column for each property of each observation.

Since HBase sorts the timestamps descending, it seems like hijacking the
timestamps makes sense.  Any performance implications of this that I should
be aware of?

Hijacking the time stamps seems fairly intuitive, and it leverages the
time stamps, which I otherwise would not really care about if I just ignored
them and dumped all the data, including the date/time of each observation, into
columns.

Are there any downsides to hijacking the timestamps like this?
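
One concrete downside worth checking in a prototype: two observations that
share the same row, column, and millisecond collapse into a single cell
version, so one of the two values is effectively lost on read. A small sketch
of that failure mode (table and column names are illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampCollision {
  static void collide(Table quotes) throws IOException {
    byte[] fam = Bytes.toBytes("d");
    byte[] col = Bytes.toBytes("bid");
    long ts = 1238688223000L;                       // two quotes in the same millisecond
    Put first = new Put(Bytes.toBytes("YHOO"));
    first.addColumn(fam, col, ts, Bytes.toBytes("100.01"));
    Put second = new Put(Bytes.toBytes("YHOO"));
    second.addColumn(fam, col, ts, Bytes.toBytes("100.02"));
    quotes.put(first);
    quotes.put(second);  // same row/column/timestamp: a read returns only one of the two values
  }
}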



On Thu, Apr 2, 2009 at 12:13 AM, stack <st...@duboce.net> wrote:

> I should also state that apart from the hbase inadequacy, your schema looks
> good (hbase should be able to carry this schema-type w/o sweat -- hopefully
> 0.20.0).
> St.Ack
>
> On Thu, Apr 2, 2009 at 9:12 AM, stack <st...@duboce.net> wrote:
>
> > How many columns will you have?  Until we fix
> > https://issues.apache.org/jira/browse/HBASE-867, you are limited regards
> > the number of columns you can have.
> > St.Ack
> >
> >
> > On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <
> bradford.n.cross@gmail.com
> > > wrote:
> >
> >> Based on reading the hbase architecture wiki, I have changed my thinking
> >> due
> >> to the "Column Family Centric Storage."
> >>
> >> HBase stores column families physically close on disk, so the items in a
> >> given column family should have roughly the same read/write
> >> characteristics
> >> and contain similar data.  Although at a conceptual level, tables may be
> >> viewed as a sparse set of rows, physically they are stored on a
> per-column
> >> family basis. This is an important consideration for schema and
> >> application
> >> designers to keep in mind.
> >>
> >> This leads me to the thought of keeping an entire time series inside a
> >> single column family.
> >>
> >> Options:
> >>
> >> Row key is a ticker symbol:
> >> - hijack time stamp to be the time of each observation.  Use a column
> >> family
> >> to hold all the data, and a column for each property of  each
> observation.
> >> -don't hijack the time stamp, just ignore it.  Use a column family for
> all
> >> the data, and use an individual column for the date/time of the
> >> observation,
> >> and individual columns for each property of each observation.
> >>
> >> thoughts?
> >>
> >> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
> >> <br...@gmail.com>wrote:
> >>
> >> > Greetings,
> >> >
> >> > I am prototyping a financial time series database on top of HBase and
> >> > trying to head my head around what a good design would look like.
> >> >
> >> > As I understand it, I have rows, column families, columns and cells.
> >> >
> >> > Since the only think that Hbase really "indexes" is row keys, it seems
> >> > natural in a way to represent the rowkeys as the date/time.
> >> >
> >> > As a simple example:
> >> >
> >> > Bar data:
> >> >
> >> > {
> >> >    "2009/1/17" : {
> >> >      "open":"100",
> >> >      "high":"102",
> >> >      "low":"99",
> >> >      "close":"101"
> >> >      "volume":"1000256"
> >> >    }
> >> > }
> >> >
> >> >
> >> > Quote data:
> >> >
> >> > {
> >> >    "2009/1/17:11:23:04" : {
> >> >      "bid":"100.01",
> >> >      "ask":"100.02",
> >> >      "bidsize":"10000",
> >> >      "asksize":"100200"
> >> >    }
> >> > }
> >> >
> >> > But there are many other issues to think about.
> >> >
> >> > In financial time series data we have small amounts of data within
> each
> >> > "observation" and we can have lots of observations.  We can have
> >> millions of
> >> > observations per time series (f.ex. all historical trade and quote
> date
> >> for
> >> > a particular stock since 1993)across hundreds of thousands of
> individual
> >> > instruments (f.ex. across all stocks that have traded since 1993.)
> >> >
> >> > The write patterns fit HBase nicely, because it is a write once and
> >> append
> >> > pattern.  This is followed by loads of offline processes for
> simulating
> >> > trading models and such.  These query patterns look like "all quotes
> for
> >> all
> >> > stocks between the dates of 1/1/996 and 12/31/2008."  So the querying
> is
> >> > typically across a date range, and we can further filter the query by
> >> > instrument types.
> >> >
> >> > So I am not sure what makes sense for efficiency because I do not
> >> > understand HBase well enough yet.
> >> >
> >> >  What kinds of mixes of rows, column families, and columns should I be
> >> > thinking about?
> >> >
> >> > Does my simplistic approach make any sense?  That would mean each row
> is
> >> a
> >> > key-value pair where the key is is the date/time and the value is the
> >> > "observation."  I suppose this leads to a "table per time series"
> model.
> >> > Does that make sense or is there overhead to having lots of tables?
> >> >
> >>
> >
> >
>

Re: financial time series database

Posted by stack <st...@duboce.net>.
I should also state that apart from the hbase inadequacy, your schema looks
good (hbase should be able to carry this schema-type w/o sweat -- hopefully
0.20.0).
St.Ack

On Thu, Apr 2, 2009 at 9:12 AM, stack <st...@duboce.net> wrote:

> How many columns will you have?  Until we fix
> https://issues.apache.org/jira/browse/HBASE-867, you are limited regards
> the number of columns you can have.
> St.Ack
>
>
> On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <bradford.n.cross@gmail.com
> > wrote:
>
>> Based on reading the hbase architecture wiki, I have changed my thinking
>> due
>> to the "Column Family Centric Storage."
>>
>> HBase stores column families physically close on disk, so the items in a
>> given column family should have roughly the same read/write
>> characteristics
>> and contain similar data.  Although at a conceptual level, tables may be
>> viewed as a sparse set of rows, physically they are stored on a per-column
>> family basis. This is an important consideration for schema and
>> application
>> designers to keep in mind.
>>
>> This leads me to the thought of keeping an entire time series inside a
>> single column family.
>>
>> Options:
>>
>> Row key is a ticker symbol:
>> - hijack time stamp to be the time of each observation.  Use a column
>> family
>> to hold all the data, and a column for each property of  each observation.
>> -don't hijack the time stamp, just ignore it.  Use a column family for all
>> the data, and use an individual column for the date/time of the
>> observation,
>> and individual columns for each property of each observation.
>>
>> thoughts?
>>
>> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
>> <br...@gmail.com>wrote:
>>
>> > Greetings,
>> >
>> > I am prototyping a financial time series database on top of HBase and
>> > trying to head my head around what a good design would look like.
>> >
>> > As I understand it, I have rows, column families, columns and cells.
>> >
>> > Since the only think that Hbase really "indexes" is row keys, it seems
>> > natural in a way to represent the rowkeys as the date/time.
>> >
>> > As a simple example:
>> >
>> > Bar data:
>> >
>> > {
>> >    "2009/1/17" : {
>> >      "open":"100",
>> >      "high":"102",
>> >      "low":"99",
>> >      "close":"101"
>> >      "volume":"1000256"
>> >    }
>> > }
>> >
>> >
>> > Quote data:
>> >
>> > {
>> >    "2009/1/17:11:23:04" : {
>> >      "bid":"100.01",
>> >      "ask":"100.02",
>> >      "bidsize":"10000",
>> >      "asksize":"100200"
>> >    }
>> > }
>> >
>> > But there are many other issues to think about.
>> >
>> > In financial time series data we have small amounts of data within each
>> > "observation" and we can have lots of observations.  We can have
>> millions of
>> > observations per time series (f.ex. all historical trade and quote date
>> for
>> > a particular stock since 1993)across hundreds of thousands of individual
>> > instruments (f.ex. across all stocks that have traded since 1993.)
>> >
>> > The write patterns fit HBase nicely, because it is a write once and
>> append
>> > pattern.  This is followed by loads of offline processes for simulating
>> > trading models and such.  These query patterns look like "all quotes for
>> all
>> > stocks between the dates of 1/1/996 and 12/31/2008."  So the querying is
>> > typically across a date range, and we can further filter the query by
>> > instrument types.
>> >
>> > So I am not sure what makes sense for efficiency because I do not
>> > understand HBase well enough yet.
>> >
>> >  What kinds of mixes of rows, column families, and columns should I be
>> > thinking about?
>> >
>> > Does my simplistic approach make any sense?  That would mean each row is
>> a
>> > key-value pair where the key is is the date/time and the value is the
>> > "observation."  I suppose this leads to a "table per time series" model.
>> > Does that make sense or is there overhead to having lots of tables?
>> >
>>
>
>

Re: financial time series database

Posted by stack <st...@duboce.net>.
How many columns will you have?  Until we fix
https://issues.apache.org/jira/browse/HBASE-867, you are limited as regards the
number of columns you can have.
St.Ack

On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross
<br...@gmail.com>wrote:

> Based on reading the hbase architecture wiki, I have changed my thinking
> due
> to the "Column Family Centric Storage."
>
> HBase stores column families physically close on disk, so the items in a
> given column family should have roughly the same read/write characteristics
> and contain similar data.  Although at a conceptual level, tables may be
> viewed as a sparse set of rows, physically they are stored on a per-column
> family basis. This is an important consideration for schema and application
> designers to keep in mind.
>
> This leads me to the thought of keeping an entire time series inside a
> single column family.
>
> Options:
>
> Row key is a ticker symbol:
> - hijack time stamp to be the time of each observation.  Use a column
> family
> to hold all the data, and a column for each property of  each observation.
> -don't hijack the time stamp, just ignore it.  Use a column family for all
> the data, and use an individual column for the date/time of the
> observation,
> and individual columns for each property of each observation.
>
> thoughts?
>
> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
> <br...@gmail.com>wrote:
>
> > Greetings,
> >
> > I am prototyping a financial time series database on top of HBase and
> > trying to head my head around what a good design would look like.
> >
> > As I understand it, I have rows, column families, columns and cells.
> >
> > Since the only think that Hbase really "indexes" is row keys, it seems
> > natural in a way to represent the rowkeys as the date/time.
> >
> > As a simple example:
> >
> > Bar data:
> >
> > {
> >    "2009/1/17" : {
> >      "open":"100",
> >      "high":"102",
> >      "low":"99",
> >      "close":"101"
> >      "volume":"1000256"
> >    }
> > }
> >
> >
> > Quote data:
> >
> > {
> >    "2009/1/17:11:23:04" : {
> >      "bid":"100.01",
> >      "ask":"100.02",
> >      "bidsize":"10000",
> >      "asksize":"100200"
> >    }
> > }
> >
> > But there are many other issues to think about.
> >
> > In financial time series data we have small amounts of data within each
> > "observation" and we can have lots of observations.  We can have millions
> of
> > observations per time series (f.ex. all historical trade and quote date
> for
> > a particular stock since 1993)across hundreds of thousands of individual
> > instruments (f.ex. across all stocks that have traded since 1993.)
> >
> > The write patterns fit HBase nicely, because it is a write once and
> append
> > pattern.  This is followed by loads of offline processes for simulating
> > trading models and such.  These query patterns look like "all quotes for
> all
> > stocks between the dates of 1/1/996 and 12/31/2008."  So the querying is
> > typically across a date range, and we can further filter the query by
> > instrument types.
> >
> > So I am not sure what makes sense for efficiency because I do not
> > understand HBase well enough yet.
> >
> >  What kinds of mixes of rows, column families, and columns should I be
> > thinking about?
> >
> > Does my simplistic approach make any sense?  That would mean each row is
> a
> > key-value pair where the key is is the date/time and the value is the
> > "observation."  I suppose this leads to a "table per time series" model.
> > Does that make sense or is there overhead to having lots of tables?
> >
>

Re: financial time series database

Posted by Bradford Cross <br...@gmail.com>.
Based on reading the hbase architecture wiki, I have changed my thinking due
to its discussion of "Column Family Centric Storage."

HBase stores column families physically close on disk, so the items in a
given column family should have roughly the same read/write characteristics
and contain similar data.  Although at a conceptual level, tables may be
viewed as a sparse set of rows, physically they are stored on a per-column
family basis. This is an important consideration for schema and application
designers to keep in mind.

This leads me to the thought of keeping an entire time series inside a
single column family.

Options:

Row key is a ticker symbol:
- hijack the time stamp to be the time of each observation.  Use a column family
to hold all the data, and a column for each property of each observation.
- don't hijack the time stamp; just ignore it.  Use a column family for all
the data, an individual column for the date/time of the observation,
and individual columns for each property of each observation.

thoughts?
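
To make the second option concrete (as one reading of it): with the row key
being just the ticker and the cell timestamp ignored, each observation needs
its own set of column qualifiers, e.g. qualifiers prefixed with the
observation time, which is presumably what drives the column-count question in
the replies. A sketch under that assumption, with the current client and
made-up names:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OptionTwoWriter {
  static final byte[] FAMILY = Bytes.toBytes("obs");   // hypothetical single family

  // One row per ticker; each observation adds its own time-prefixed columns.
  static void putQuote(Table quotes, String ticker, String obsTime,
                       String bid, String ask) throws IOException {
    Put p = new Put(Bytes.toBytes(ticker));
    p.addColumn(FAMILY, Bytes.toBytes(obsTime + ":time"), Bytes.toBytes(obsTime));
    p.addColumn(FAMILY, Bytes.toBytes(obsTime + ":bid"),  Bytes.toBytes(bid));
    p.addColumn(FAMILY, Bytes.toBytes(obsTime + ":ask"),  Bytes.toBytes(ask));
    quotes.put(p);
  }
}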

On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
<br...@gmail.com>wrote:

> Greetings,
>
> I am prototyping a financial time series database on top of HBase and
> trying to head my head around what a good design would look like.
>
> As I understand it, I have rows, column families, columns and cells.
>
> Since the only think that Hbase really "indexes" is row keys, it seems
> natural in a way to represent the rowkeys as the date/time.
>
> As a simple example:
>
> Bar data:
>
> {
>    "2009/1/17" : {
>      "open":"100",
>      "high":"102",
>      "low":"99",
>      "close":"101"
>      "volume":"1000256"
>    }
> }
>
>
> Quote data:
>
> {
>    "2009/1/17:11:23:04" : {
>      "bid":"100.01",
>      "ask":"100.02",
>      "bidsize":"10000",
>      "asksize":"100200"
>    }
> }
>
> But there are many other issues to think about.
>
> In financial time series data we have small amounts of data within each
> "observation" and we can have lots of observations.  We can have millions of
> observations per time series (f.ex. all historical trade and quote date for
> a particular stock since 1993)across hundreds of thousands of individual
> instruments (f.ex. across all stocks that have traded since 1993.)
>
> The write patterns fit HBase nicely, because it is a write once and append
> pattern.  This is followed by loads of offline processes for simulating
> trading models and such.  These query patterns look like "all quotes for all
> stocks between the dates of 1/1/996 and 12/31/2008."  So the querying is
> typically across a date range, and we can further filter the query by
> instrument types.
>
> So I am not sure what makes sense for efficiency because I do not
> understand HBase well enough yet.
>
>  What kinds of mixes of rows, column families, and columns should I be
> thinking about?
>
> Does my simplistic approach make any sense?  That would mean each row is a
> key-value pair where the key is is the date/time and the value is the
> "observation."  I suppose this leads to a "table per time series" model.
> Does that make sense or is there overhead to having lots of tables?
>

Re: financial time series database

Posted by Bradford Cross <br...@gmail.com>.
I see. So would it be better to have each row represent a time series and
then play with how the column families and columns represent individual
observations within a series?

On Mar 31, 2009 8:10 PM, "zsongbo" <zs...@gmail.com> wrote:

If the rowkey is date/time and the data arrives sequentially by date/time,
then when you load/insert data into the table, only one region (on one node) is
active receiving new data. The load performance will be bad.

On Wed, Apr 1, 2009 at 11:08 AM, zsongbo <zs...@gmail.com> wrote:

> If the rowkey is date/time and the data arrives sequentially by date/time,
> then when you load/insert data into the table, only one region (on one node)
> is active receiving new data. The load performance will be poor.

Re: financial time series database

Posted by zsongbo <zs...@gmail.com>.
If the rowkey is date/time and the data arrives sequentially by date/time,
then when you load/insert data into the table, only one region (on one node) is
active receiving new data. The load performance will be bad.

On Wed, Apr 1, 2009 at 11:08 AM, zsongbo <zs...@gmail.com> wrote:

> If the rowkey is date/time and the data is original sequential by
> date/time, when load/insert data into table, only one region (the one
> node) is active to receive new data. The load performance will be pool.
>
>
> On Wed, Apr 1, 2009 at 10:25 AM, Bradford Cross <
> bradford.n.cross@gmail.com> wrote:
>
>> Greetings,
>>
>> I am prototyping a financial time series database on top of HBase and
>> trying
>> to head my head around what a good design would look like.
>>
>> As I understand it, I have rows, column families, columns and cells.
>>
>> Since the only think that Hbase really "indexes" is row keys, it seems
>> natural in a way to represent the rowkeys as the date/time.
>>
>> As a simple example:
>>
>> Bar data:
>>
>> {
>>   "2009/1/17" : {
>>     "open":"100",
>>     "high":"102",
>>     "low":"99",
>>     "close":"101"
>>     "volume":"1000256"
>>   }
>> }
>>
>>
>> Quote data:
>>
>> {
>>   "2009/1/17:11:23:04" : {
>>     "bid":"100.01",
>>     "ask":"100.02",
>>     "bidsize":"10000",
>>     "asksize":"100200"
>>   }
>> }
>>
>> But there are many other issues to think about.
>>
>> In financial time series data we have small amounts of data within each
>> "observation" and we can have lots of observations.  We can have millions
>> of
>> observations per time series (f.ex. all historical trade and quote date
>> for
>> a particular stock since 1993)across hundreds of thousands of individual
>> instruments (f.ex. across all stocks that have traded since 1993.)
>>
>> The write patterns fit HBase nicely, because it is a write once and append
>> pattern.  This is followed by loads of offline processes for simulating
>> trading models and such.  These query patterns look like "all quotes for
>> all
>> stocks between the dates of 1/1/996 and 12/31/2008."  So the querying is
>> typically across a date range, and we can further filter the query by
>> instrument types.
>>
>> So I am not sure what makes sense for efficiency because I do not
>> understand
>> HBase well enough yet.
>>
>>  What kinds of mixes of rows, column families, and columns should I be
>> thinking about?
>>
>> Does my simplistic approach make any sense?  That would mean each row is a
>> key-value pair where the key is is the date/time and the value is the
>> "observation."  I suppose this leads to a "table per time series" model.
>> Does that make sense or is there overhead to having lots of tables?
>>
>
>

Re: financial time series database

Posted by zsongbo <zs...@gmail.com>.
If the rowkey is date/time and the data arrives sequentially by date/time,
then when you load/insert data into the table, only one region (on one node) is
active receiving new data. The load performance will be poor.

On Wed, Apr 1, 2009 at 10:25 AM, Bradford Cross
<br...@gmail.com>wrote:

> Greetings,
>
> I am prototyping a financial time series database on top of HBase and
> trying
> to head my head around what a good design would look like.
>
> As I understand it, I have rows, column families, columns and cells.
>
> Since the only think that Hbase really "indexes" is row keys, it seems
> natural in a way to represent the rowkeys as the date/time.
>
> As a simple example:
>
> Bar data:
>
> {
>   "2009/1/17" : {
>     "open":"100",
>     "high":"102",
>     "low":"99",
>     "close":"101"
>     "volume":"1000256"
>   }
> }
>
>
> Quote data:
>
> {
>   "2009/1/17:11:23:04" : {
>     "bid":"100.01",
>     "ask":"100.02",
>     "bidsize":"10000",
>     "asksize":"100200"
>   }
> }
>
> But there are many other issues to think about.
>
> In financial time series data we have small amounts of data within each
> "observation" and we can have lots of observations.  We can have millions
> of
> observations per time series (f.ex. all historical trade and quote date for
> a particular stock since 1993)across hundreds of thousands of individual
> instruments (f.ex. across all stocks that have traded since 1993.)
>
> The write patterns fit HBase nicely, because it is a write once and append
> pattern.  This is followed by loads of offline processes for simulating
> trading models and such.  These query patterns look like "all quotes for
> all
> stocks between the dates of 1/1/996 and 12/31/2008."  So the querying is
> typically across a date range, and we can further filter the query by
> instrument types.
>
> So I am not sure what makes sense for efficiency because I do not
> understand
> HBase well enough yet.
>
>  What kinds of mixes of rows, column families, and columns should I be
> thinking about?
>
> Does my simplistic approach make any sense?  That would mean each row is a
> key-value pair where the key is is the date/time and the value is the
> "observation."  I suppose this leads to a "table per time series" model.
> Does that make sense or is there overhead to having lots of tables?
>