Posted to common-user@hadoop.apache.org by Brock Judkins <br...@gmail.com> on 2009/01/08 02:03:41 UTC
Storing/retrieving time series with hadoop
Hi list,
I am researching hadoop as a possible fit for my company's data
warehousing needs. My question is whether hadoop, possibly in combination
with Hive or Pig, is a good solution for time-series data. We basically have
a ton of web analytics to store that we display both internally and
externally.
For the time being I am storing timestamped data points in a huge MySQL
table, but I know this will not scale very far (although it's holding up ok
at almost 90MM rows). I am aware that hadoop can scale insanely large
(larger than I need), but does anyone have experience using it to draw
charts based on time series with fairly low latency?
Thanks!
Brock
Re: Storing/retrieving time series with hadoop
Posted by Robert Zubek <ro...@threerings.net>.
We use Hadoop to warehouse time series data, and run analytics on them.
Being able to parallelize our analytics jobs, and scale up the cluster
as needed for the data, turned out to be a big win.
However, we rolled our own storage solution. At the time when we started
on this project, there were no good solutions for storing time series
(maybe there are right now). I investigated HBase, but it was optimized
for retrieving just the latest values, not the entire time series for
analysis. We also investigated Pig, but it was too early in the
project's life, and didn't support everything we wanted.
As for latency - with S3 it can be significant, depending on how you lay
out your data; we have a separate caching layer just to speed up data
retrieval for graph drawing. I haven't tried HDFS over clustered hard
drives, though; it might be fast enough for your purposes.
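
A minimal sketch of the kind of caching layer described above, assuming the
chart only needs hourly averages. Everything here is hypothetical and not
the actual implementation: `RAW_POINTS` and `load_points` stand in for a
slow read from S3 or HDFS, and `hourly_averages` is an illustrative name.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical raw data: (unix_seconds, value) pairs standing in for slow storage.
RAW_POINTS = [(0, 1.0), (30, 3.0), (60, 5.0), (3600, 10.0), (3660, 20.0)]


def load_points(start, end):
    """Stand-in for a slow warehouse read; returns points with start <= t < end."""
    return [(t, v) for t, v in RAW_POINTS if start <= t < end]


@lru_cache(maxsize=1024)
def hourly_averages(start, end):
    """Average the values in each hour bucket; results are cached per (start, end)."""
    buckets = defaultdict(list)
    for t, v in load_points(start, end):
        buckets[t - t % 3600].append(v)  # round down to the start of the hour
    return tuple((b, sum(vs) / len(vs)) for b, vs in sorted(buckets.items()))
```

A repeated request for the same range is then served from memory instead of
going back to storage, which is usually what matters for graph drawing.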
Cheers,
Robert
Re: Storing/retrieving time series with hadoop
Posted by Mark Chadwick <mc...@invitemedia.com>.
Brock,
I've had good luck storing time-series data with HBase. Its latency for
looking up records is orders of magnitude lower than Hadoop's MapReduce
(which is more for batch processing), yet still resides on HDFS, and has
mechanisms to let you MapReduce on your HBase data.
You may have a difficult time getting a data warehouse to fit the model of
HBase, but if you are specifically looking at Hadoop, that will be one of
your better bets.
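
To illustrate why a store that keeps rows sorted by key can answer time-range
lookups quickly, here is a toy sketch in plain Python. It does not use the
HBase API, and the thread does not say how the keys were actually laid out:
the "metric:zero-padded-timestamp" key format, `row_key`, and `SortedStore`
are all assumptions made for this example. Timestamps are assumed to be
non-negative whole seconds.

```python
import bisect


def row_key(metric, ts):
    """Build a key that sorts by metric, then by time (zero-padded seconds)."""
    return f"{metric}:{ts:010d}"


class SortedStore:
    """Toy in-memory stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of row keys
        self._values = {}  # row key -> value

    def put(self, metric, ts, value):
        key = row_key(metric, ts)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, metric, start, end):
        """Return (key, value) pairs with start <= ts < end, in time order."""
        lo = bisect.bisect_left(self._keys, row_key(metric, start))
        hi = bisect.bisect_left(self._keys, row_key(metric, end))
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

Because all points for one metric sit next to each other in key order, a
chart for one metric and one time range touches only a contiguous slice of
the keys rather than the whole table.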
-Mark Chadwick
Re: Storing/retrieving time series with hadoop
Posted by Chris K Wensel <ch...@wensel.net>.
Hey Brock
I used Cascading quite extensively with time series data.
Along with the standard function/filter/aggregator operations in the
Cascading processing model, there is what we call a "buffer".
It's really just a user-friendly Reduce that integrates well with other
operations and offers up a "sliding window" across your grouped data.
Quite useful for running averages or filling in missing intervals etc.
Plus there are handy operations for switching from text time strings
to long time stamps and back etc.
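
The sliding-window idea can be sketched in plain Python. This is not the
Cascading API, just an illustration of the three operations mentioned above
applied to one group of time-ordered points. The function names, the default
time format, and the choice to fill missing intervals with 0 are assumptions
for the example; points are assumed to fall exactly on multiples of `step`.

```python
from collections import deque
from datetime import datetime, timezone


def to_epoch(text, fmt="%Y-%m-%d %H:%M"):
    """Parse a UTC time string into whole seconds since the epoch."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def fill_gaps(points, step):
    """Yield (ts, value) for every step; missing intervals get the value 0."""
    points = sorted(points)
    if not points:
        return
    known = dict(points)
    for ts in range(points[0][0], points[-1][0] + step, step):
        yield ts, known.get(ts, 0)


def running_average(points, window):
    """Yield (ts, mean of the last `window` values) over time-ordered points."""
    recent = deque(maxlen=window)  # old values fall off the left automatically
    for ts, value in points:
        recent.append(value)
        yield ts, sum(recent) / len(recent)
```

In a grouped job, something like `running_average(fill_gaps(points, 60), 5)`
would run once per group key, in the same spot where a Reduce would see all
values for that key.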
YMMV
cheers,
ckw
--
Chris K Wensel
chris@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/