Posted to common-user@hadoop.apache.org by Sam Seigal <se...@yahoo.com> on 2011/04/14 03:12:58 UTC

Dynamic Data Sets

I have a requirement where I have large sets of data coming into a
system I own.

A single unit of data in this set has a set of immutable attributes +
a state attached to it. The state is dynamic and can change at any
time. What is the best way to run analytical queries on data of this
nature?

One way is to maintain this data in a separate store, take a
point-in-time snapshot, and then import it into HDFS for analysis
using Hadoop MapReduce. I do not see this approach scaling, since
moving the data is obviously expensive. If I were to maintain this
data directly as SequenceFiles in HDFS, how would updates work?

I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
know that HBase works around this problem through multi-version
concurrency control techniques. Is that the only option? Are there
any alternatives?

Also note that all the aggregation and analysis I want to do is
time-based, i.e. sum of x on pivot y over a day, 2 days, a week, a
month, etc. For such use cases, is it advisable to use HDFS directly,
or to use systems built on top of Hadoop like Hive or HBase?

Re: Dynamic Data Sets

Posted by Ted Dunning <td...@maprtech.com>.
HBase is very good at this kind of thing.

Depending on your aggregation needs, OpenTSDB might be interesting,
since it stores and queries against large amounts of time-ordered
data similar to what you want to do.

It isn't clear whether your data is primarily about current state or
about time-embedded state transitions. You can easily store both in
HBase, but the arrangements will be a bit different.
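
A minimal sketch of the two arrangements (table, family, and column
names here are hypothetical, and this assumes the 0.90-era HBase Java
client):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StateWriter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "entities");

    // Arrangement 1: current state only. Overwrite a single cell per
    // entity; reads always see the latest value.
    Put current = new Put(Bytes.toBytes("entity-42"));
    current.add(Bytes.toBytes("s"), Bytes.toBytes("state"),
        Bytes.toBytes("ACTIVE"));
    table.put(current);

    // Arrangement 2: time-embedded transitions. One column per event
    // timestamp, so the full state history is preserved in the row.
    long now = System.currentTimeMillis();
    Put transition = new Put(Bytes.toBytes("entity-42"));
    transition.add(Bytes.toBytes("h"), Bytes.toBytes(Long.toString(now)),
        Bytes.toBytes("ACTIVE"));
    table.put(transition);

    table.close();
  }
}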

Re: why local fs instead of hdfs

Posted by Konstantin Boudnik <co...@apache.org>.
Seems like something is setting fs.default.name programmatically.
Another possibility is that $HADOOP_CONF_DIR isn't on the classpath
in the second case.
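
A quick way to check which of the two it is (a sketch;
fs.default.name is the pre-0.21 name of the setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCheck {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml only if $HADOOP_CONF_DIR is on the classpath.
    Configuration conf = new Configuration();
    // Prints "file:///" when no cluster config was found (or when the
    // driver overrode the value programmatically before this point).
    System.out.println("fs.default.name = "
        + conf.get("fs.default.name", "file:///"));
    System.out.println("FileSystem URI  = "
        + FileSystem.get(conf).getUri());
  }
}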

Hope it helps,
  Cos

why local fs instead of hdfs

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,

A tricky problem here. When we prepare an input path, it should be a
path on HDFS by default, right? Under what conditions does it become
a path on the local file system? I followed a program which worked
well, and its input path is something like "hdfs://...". But when I
apply a similar driver class to run a different program, the input
path becomes "file:/..." and doesn't work. What is the problem?

Thanks.

-Gang

Re: Dynamic Data Sets

Posted by Sam Seigal <se...@yahoo.com>.
How does HBase compare to Hive when it comes to dynamic data sets?
Does Hive support multi-version concurrency control? I am new to
Hadoop, hence trying to get an idea of how to evaluate these
different technologies and provide concrete justifications for
choosing one over the other.

Also, I am not interested in how a state changes over time. I am only
interested in what the current state of a data unit is, so that I can
aggregate it with other data in the same state over a time range
(5000 records exist in state A on April 14th, 2000 records exist in
state B on April 13th, etc.). The results of the analysis will vary
depending on how the state changes over time.
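
A hedged sketch of that kind of count, assuming the
timestamp-as-column-name layout suggested elsewhere in this thread
(table, family, and state names are made up): scan the state table
and bucket values by state within a time window.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class StateCounts {
  public static void main(String[] args) throws Exception {
    long dayEnd = System.currentTimeMillis();
    long dayStart = dayEnd - 24L * 3600 * 1000;  // last 24 hours

    HTable states = new HTable(HBaseConfiguration.create(), "states");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("s"));

    Map<String, Integer> counts = new HashMap<String, Integer>();
    ResultScanner scanner = states.getScanner(scan);
    for (Result row : scanner) {
      for (KeyValue kv : row.raw()) {
        // Column qualifiers are millisecond timestamps stored as text.
        long ts = Long.parseLong(Bytes.toString(kv.getQualifier()));
        if (ts >= dayStart && ts < dayEnd) {
          String state = Bytes.toString(kv.getValue());
          Integer c = counts.get(state);
          counts.put(state, c == null ? 1 : c + 1);
        }
      }
    }
    scanner.close();
    states.close();
    System.out.println(counts);  // e.g. {A=5000, B=2000}
  }
}

(A full scan like this becomes a MapReduce job at scale; Hive can
express the same aggregation as a GROUP BY over an HDFS- or
HBase-backed table.)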



Re: Dynamic Data Sets

Posted by Michel Segel <mi...@hotmail.com>.
Sorry,
It appears to be a flock of us...

Ok bad pun...

I didn't see Ted's response, but it looks like we're thinking along
the same lines. I was going to ask about that, but it's really a moot
point: the size of the immutable data set doesn't really matter, and
the solution would be the same. Consider it some blob which is >= the
size of a SHA-1 hash value. In fact, that hash could be your unique key.

So you get your blob, a timestamp, and then a state value. You hash
the blob, store the blob in one table using the hash as the key, and
then store the state in a second table with the timestamp as the
column name and the hash value as the row key. Two separate tables,
because if you stored them as separate column families you might see
performance issues due to the size difference between the column
families.

This would be a pretty straightforward solution in HBase.
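
A minimal sketch of the write path for this two-table layout (table,
family, and state names are hypothetical; 0.90-era HBase client):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoTableWriter {
  public static void main(String[] args) throws Exception {
    byte[] blob = Bytes.toBytes("immutable attributes, serialized");
    // The SHA-1 hash of the blob doubles as the row key in both tables.
    byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(blob);

    // Table 1: the immutable blob, keyed by its hash.
    HTable blobs = new HTable(HBaseConfiguration.create(), "blobs");
    Put p1 = new Put(rowKey);
    p1.add(Bytes.toBytes("d"), Bytes.toBytes("blob"), blob);
    blobs.put(p1);

    // Table 2: state history -- same row key, timestamp as the column
    // name, state as the cell value.
    HTable states = new HTable(HBaseConfiguration.create(), "states");
    Put p2 = new Put(rowKey);
    p2.add(Bytes.toBytes("s"),
        Bytes.toBytes(Long.toString(System.currentTimeMillis())),
        Bytes.toBytes("STATE_A"));
    states.put(p2);

    blobs.close();
    states.close();
  }
}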

Sent from a remote device. Please excuse any typos...

Mike Segel


Re: Dynamic Data Sets

Posted by James Seigel Tynt <ja...@tynt.com>.
If all of the Seigel/Seigal/Segel gang didn't chime in, it'd be weird.

What size of data are we talking?

James


RE: Dynamic Data Sets

Posted by Michael Segel <mi...@hotmail.com>.
James,


If I understand correctly, you get a set of immutable attributes,
plus a state which can change.

If you wanted to use HBase, I'd say create a unique identifier for
your immutable attributes, then store the unique id, timestamp, and
state, assuming that you're really interested in looking at the state
change over time.

So what you end up with is one table of immutable attributes with a
unique key, and then another table where you use the same unique key
and create columns whose names are timestamps, with the state as the
value.
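
And the read side, sketched under the same assumptions (hypothetical
table/family names): fetch the row and take the lexicographically
largest timestamp column, which for fixed-width millisecond strings
is also the numerically latest.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CurrentState {
  public static String currentState(HTable states, byte[] rowKey)
      throws Exception {
    Result row = states.get(new Get(rowKey).addFamily(Bytes.toBytes("s")));
    KeyValue latest = null;
    for (KeyValue kv : row.raw()) {
      // Qualifiers sort lexicographically; 13-digit millisecond
      // timestamps of equal length sort the same way numerically.
      if (latest == null
          || Bytes.compareTo(kv.getQualifier(), latest.getQualifier()) > 0) {
        latest = kv;
      }
    }
    return latest == null ? null : Bytes.toString(latest.getValue());
  }

  public static void main(String[] args) throws Exception {
    HTable states = new HTable(HBaseConfiguration.create(), "states");
    System.out.println(currentState(states, Bytes.toBytes("some-row-key")));
    states.close();
  }
}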

HTH

-Mike

