Posted to user@hadoop.apache.org by jeff l <je...@gmail.com> on 2012/11/24 21:56:17 UTC

Best practice for storage of data that changes

Hi All,

I'm coming from the RDBMS world and am looking at hdfs for long term data
storage and analysis.

I've done some research and set up some smallish hdfs clusters with hive
for testing but I'm having a little trouble understanding how everything
fits together and was hoping someone could point me in the right direction.

I'm looking at storing two types of data:

1. Append-only data - e.g. weblogs or user logins
2. Account/User data

HDFS seems to be perfect for append-only data like #1, but I'm having
trouble figuring out what to do with data that may change frequently.

A simple example would be user data where various bits of information:
email, etc may change from day to day.  Would hbase or cassandra be the
better way to go for this type of data, and can I overlay hive over all (
hdfs, hbase, cassandra ) so that I can query the data through a single
interface?

Thanks in advance for any help.

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:
1. HBase provides range-based scans and ordered partitioning.
2. HBase is closely integrated with the Hadoop ecosystem.
3. HBase is strongly consistent, whereas Cassandra is
eventually consistent.
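
To make point 1 a bit more concrete, here is a minimal sketch of why ordered row keys make range scans cheap. It is a toy in-memory table, not the HBase client API; the class name `OrderedTable`, the key layout `user#<id>#<date>`, and the sample values are all assumptions made up for illustration.

```python
import bisect


class OrderedTable:
    """Toy table that keeps row keys sorted (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = OrderedTable()
table.put("user#0042#2012-11-24", "login")
table.put("user#0007#2012-11-25", "login")
table.put("user#0042#2012-11-25", "email_change")
table.put("user#0043#2012-11-24", "login")

# All rows for user 0042: one contiguous slice of the sorted key space.
# "~" sorts after the digits used in the date suffix, so it works as an upper bound.
events_for_42 = list(table.scan("user#0042#", "user#0042#~"))
```

Because the keys are sorted, the scan touches only the contiguous slice it needs. With hash-based partitioning, rows for one user prefix would be scattered, so the same query would generally need a secondary index or a full scan.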

As I said in my earlier email, the choice of a NoSQL solution depends on
the use case. There are subtle differences between NoSQL solutions, and each
of them has its own "sweet spot". So, pick yours after careful
evaluation.

PS: Added the HBase mailing list also since this is more about HBase.

Hope This Helps,
Anil Gupta


On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog <go...@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head for a real use case.
>
> ------------------------------
>
> *From: *"anil gupta" <an...@gmail.com>
> *To: *"common-user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace, Intuit, we did a detailed study to evaluate HBase and
> Cassandra for our use case. I will see if I can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, what is your use case? What bottleneck are you hitting with your current
> solutions? If you can share some details, the HBase community will try to
> help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql )
>> and MongoDB but don't feel any are quite right for this problem.  The
>> amount of data being stored and access requirements just don't match up
>> well.
>>
>> I was hoping to keep the stack as simple as possible and just use hdfs
>> but everything I was seeing kept pointing to the need for some other
>> datastore.  I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned, Hadoop is well suited to storing
>>> append-only data. If you want to analyze weblogs or user logins,
>>> Hadoop is a suitable solution.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First of all, I would suggest looking at your use case and then
>>> analyzing whether it really needs a NoSQL solution or not.
>>> You mentioned maintaining user data in NoSQL. Why NoSQL
>>> instead of an RDBMS? What is the size of the data? Which NoSQL features are
>>> the selling points for you?
>>>
>>> For real-time reads and writes you can have a look at Cassandra or HBase.
>>> But I would suggest taking a very close look at both of them, because
>>> each has its own advantages. So the choice will depend on
>>> your use case.
>>>
>>> One added advantage of HBase is that it has deeper integration with the
>>> Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
>>> tools. HBase has integration with Hive querying, but AFAIK it has some
>>> limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>>         As the HDFS paradigm is "write once, read many", you cannot
>>>> update files in place on HDFS.
>>>>         But for your problem, what you can do is keep the
>>>> logs/user data in HDFS with different timestamps.
>>>>         Run MapReduce jobs at regular intervals to extract the
>>>> required data from those logs and put it into HBase/Cassandra/MongoDB.
>>>>
>>>>         MongoDB read performance is quite fast, and it also supports
>>>> ad-hoc querying. You can also use the Hadoop-MongoDB connector to read/write
>>>> data to MongoDB through Hadoop MapReduce.
>>>>
>>>>         If you really need to update the HDFS files directly,
>>>> then you have to use a commercial Hadoop package like MapR, which
>>>> supports updating the HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1]. You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>>> hive for testing but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the right
>>>>>> direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data where various bits of
>>>>>> information: email, etc may change from day to day.  Would hbase or
>>>>>> cassandra be the better way to go for this type of data, and can I overlay
>>>>>> hive over all ( hdfs, hbase, cassandra ) so that I can query the data
>>>>>> through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
>
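
Mahesh's suggestion quoted above (keep timestamped change records in HDFS and periodically extract the current state into a serving store) can be sketched roughly as follows. This is plain Python standing in for the reduce step of such a job, not MapReduce code; the `Record` fields, the `latest_per_user` helper, and the sample data are assumptions made up for illustration.

```python
from collections import namedtuple

# One append-only change record, as it might be logged to HDFS (hypothetical layout).
Record = namedtuple("Record", ["user_id", "timestamp", "email"])


def latest_per_user(records):
    """Reduce append-only change records to the newest record per user."""
    latest = {}
    for rec in records:
        current = latest.get(rec.user_id)
        # ISO-8601 timestamps in the same format compare correctly as strings.
        if current is None or rec.timestamp > current.timestamp:
            latest[rec.user_id] = rec
    return latest


change_log = [
    Record("u1", "2012-11-20T10:00:00", "a@example.com"),
    Record("u2", "2012-11-21T09:30:00", "b@example.com"),
    Record("u1", "2012-11-24T21:56:17", "a.new@example.com"),
]

# The result is what a periodic job would load into HBase, Cassandra, or MongoDB.
current_state = latest_per_user(change_log)
```

The raw log is never modified, which fits the write-once model of HDFS; only the derived "current state" is rewritten on each run, and the full history stays available for analysis.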


-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:
1. HBase provides Range based scan, and ordered partitioning.
2. HBase is closely integrated with Hadoop ecosystem.
3. HBase is strongly consistent as compared to Cassandra which is
eventually consistent.

As i said earlier in my email that selection of NoSql solution depends on
the use case. There are subtle differences between NoSql solution and each
of them have their own "Sweet Spot". So, pick yours after careful
evaluation.

PS: Added the HBase mailing list also since this is more about HBase.

Hope This Helps,
Anil Gupta


On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog <go...@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head for a real use case.
>
> ------------------------------
>
> *From: *"anil gupta" <an...@gmail.com>
> *To: *"common-user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace "Intuit", we did some detailed study to evaluate HBase and
> Cassandra for our use case. I will see if i can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, What is your use case? What bottleneck are you hitting at current
> solutions? If you can share some details then HBase community will try to
> help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql )
>> and MongoDB but don't feel any are quite right for this problem.  The
>> amount of data being stored and access requirements just don't match up
>> well.
>>
>> I was hoping to keep the stack as simple as possible and just use hdfs
>> but everything I was seeing kept pointing to the need for some other
>> datastore.  I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com>wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned that Hadoop is suitable enough to store
>>> append only data. If you want to do analysis of weblogs or user logins then
>>> Hadoop is a suitable solution for it.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First, of all i would suggest you to have a look at your use case then
>>> analyze whether it really needs a NoSql solution or not.
>>> As you were talking about maintaining User Data in NoSql. Why NoSql
>>> instead of RDBMS? What is the size of data? Which NoSql features are the
>>> selling points for you?
>>>
>>> For real time read writes you can have a look at Cassandra or HBase.
>>> But, i would suggest you to have a very close look at both of them because
>>> both of them have their own advantages. So, the choice will be dependent on
>>> your use case.
>>>
>>> One added advantage with HBase is that it has a deeper integration with
>>> Hadoop ecosystem so you can do a lot of stuff on HBase data  using Hadoop
>>> Tools. HBase has integration with Hive querying but AFAIK it has some
>>> limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>>         As HDFS paradigm is "Write once and read many" you cannot be
>>>> able to update the files on HDFS.
>>>>         But for your problem what you can do is you keep the
>>>> logs/userdata in hdfs with different timestamps.
>>>>         Run some mapreduce jobs at certain intervals to extract
>>>> required data from those logs and put it to Hbase/Cassandra/Mongodb.
>>>>
>>>>         Mongodb read performance is quite faster also it supports
>>>> ad-hoc querying. Also you can use Hadoop-MongoDB connector to read/write
>>>> the data to Mongodb thru Hadoop-Mapreduce.
>>>>
>>>>         If you are very specific about updating the hdfs files directly
>>>> then you have to use any commercial Hadoop packages like MapR which
>>>> supports updating the HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1] . You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2], its not yet in the trunk I suppose!
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com>wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>>> hive for testing but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the right
>>>>>> direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data where various bits of
>>>>>> information: email, etc may change from day to day.  Would hbase or
>>>>>> cassandra be the better way to go for this type of data, and can I overlay
>>>>>> hive over all ( hdfs, hbase, cassandra ) so that I can query the data
>>>>>> through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:
1. HBase provides Range based scan, and ordered partitioning.
2. HBase is closely integrated with Hadoop ecosystem.
3. HBase is strongly consistent as compared to Cassandra which is
eventually consistent.

As i said earlier in my email that selection of NoSql solution depends on
the use case. There are subtle differences between NoSql solution and each
of them have their own "Sweet Spot". So, pick yours after careful
evaluation.

PS: Added the HBase mailing list also since this is more about HBase.

Hope This Helps,
Anil Gupta


On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog <go...@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head for a real use case.
>
> ------------------------------
>
> *From: *"anil gupta" <an...@gmail.com>
> *To: *"common-user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace "Intuit", we did some detailed study to evaluate HBase and
> Cassandra for our use case. I will see if i can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, What is your use case? What bottleneck are you hitting at current
> solutions? If you can share some details then HBase community will try to
> help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql )
>> and MongoDB but don't feel any are quite right for this problem.  The
>> amount of data being stored and access requirements just don't match up
>> well.
>>
>> I was hoping to keep the stack as simple as possible and just use hdfs
>> but everything I was seeing kept pointing to the need for some other
>> datastore.  I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com>wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned that Hadoop is suitable enough to store
>>> append only data. If you want to do analysis of weblogs or user logins then
>>> Hadoop is a suitable solution for it.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First, of all i would suggest you to have a look at your use case then
>>> analyze whether it really needs a NoSql solution or not.
>>> As you were talking about maintaining User Data in NoSql. Why NoSql
>>> instead of RDBMS? What is the size of data? Which NoSql features are the
>>> selling points for you?
>>>
>>> For real time read writes you can have a look at Cassandra or HBase.
>>> But, i would suggest you to have a very close look at both of them because
>>> both of them have their own advantages. So, the choice will be dependent on
>>> your use case.
>>>
>>> One added advantage with HBase is that it has a deeper integration with
>>> Hadoop ecosystem so you can do a lot of stuff on HBase data  using Hadoop
>>> Tools. HBase has integration with Hive querying but AFAIK it has some
>>> limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>>         As HDFS paradigm is "Write once and read many" you cannot be
>>>> able to update the files on HDFS.
>>>>         But for your problem what you can do is you keep the
>>>> logs/userdata in hdfs with different timestamps.
>>>>         Run some mapreduce jobs at certain intervals to extract
>>>> required data from those logs and put it to Hbase/Cassandra/Mongodb.
>>>>
>>>>         Mongodb read performance is quite faster also it supports
>>>> ad-hoc querying. Also you can use Hadoop-MongoDB connector to read/write
>>>> the data to Mongodb thru Hadoop-Mapreduce.
>>>>
>>>>         If you are very specific about updating the hdfs files directly
>>>> then you have to use any commercial Hadoop packages like MapR which
>>>> supports updating the HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1] . You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2], its not yet in the trunk I suppose!
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com>wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>>> hive for testing but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the right
>>>>>> direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data where various bits of
>>>>>> information: email, etc may change from day to day.  Would hbase or
>>>>>> cassandra be the better way to go for this type of data, and can I overlay
>>>>>> hive over all ( hdfs, hbase, cassandra ) so that I can query the data
>>>>>> through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:
1. HBase provides Range based scan, and ordered partitioning.
2. HBase is closely integrated with Hadoop ecosystem.
3. HBase is strongly consistent as compared to Cassandra which is
eventually consistent.

As i said earlier in my email that selection of NoSql solution depends on
the use case. There are subtle differences between NoSql solution and each
of them have their own "Sweet Spot". So, pick yours after careful
evaluation.

PS: Added the HBase mailing list also since this is more about HBase.

Hope This Helps,
Anil Gupta


On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog <go...@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head for a real use case.
>
> ------------------------------
>
> *From: *"anil gupta" <an...@gmail.com>
> *To: *"common-user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace "Intuit", we did some detailed study to evaluate HBase and
> Cassandra for our use case. I will see if i can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, What is your use case? What bottleneck are you hitting at current
> solutions? If you can share some details then HBase community will try to
> help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql )
>> and MongoDB but don't feel any are quite right for this problem.  The
>> amount of data being stored and access requirements just don't match up
>> well.
>>
>> I was hoping to keep the stack as simple as possible and just use hdfs
>> but everything I was seeing kept pointing to the need for some other
>> datastore.  I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com>wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned that Hadoop is suitable enough to store
>>> append only data. If you want to do analysis of weblogs or user logins then
>>> Hadoop is a suitable solution for it.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First, of all i would suggest you to have a look at your use case then
>>> analyze whether it really needs a NoSql solution or not.
>>> As you were talking about maintaining User Data in NoSql. Why NoSql
>>> instead of RDBMS? What is the size of data? Which NoSql features are the
>>> selling points for you?
>>>
>>> For real time read writes you can have a look at Cassandra or HBase.
>>> But, i would suggest you to have a very close look at both of them because
>>> both of them have their own advantages. So, the choice will be dependent on
>>> your use case.
>>>
>>> One added advantage with HBase is that it has a deeper integration with
>>> Hadoop ecosystem so you can do a lot of stuff on HBase data  using Hadoop
>>> Tools. HBase has integration with Hive querying but AFAIK it has some
>>> limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>>         As HDFS paradigm is "Write once and read many" you cannot be
>>>> able to update the files on HDFS.
>>>>         But for your problem what you can do is you keep the
>>>> logs/userdata in hdfs with different timestamps.
>>>>         Run some mapreduce jobs at certain intervals to extract
>>>> required data from those logs and put it to Hbase/Cassandra/Mongodb.
>>>>
>>>>         Mongodb read performance is quite faster also it supports
>>>> ad-hoc querying. Also you can use Hadoop-MongoDB connector to read/write
>>>> the data to Mongodb thru Hadoop-Mapreduce.
>>>>
>>>>         If you are very specific about updating the hdfs files directly
>>>> then you have to use any commercial Hadoop packages like MapR which
>>>> supports updating the HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1] . You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2], its not yet in the trunk I suppose!
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com>wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>>> hive for testing but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the right
>>>>>> direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data where various bits of
>>>>>> information: email, etc may change from day to day.  Would hbase or
>>>>>> cassandra be the better way to go for this type of data, and can I overlay
>>>>>> hive over all ( hdfs, hbase, cassandra ) so that I can query the data
>>>>>> through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:
1. HBase provides Range based scan, and ordered partitioning.
2. HBase is closely integrated with Hadoop ecosystem.
3. HBase is strongly consistent as compared to Cassandra which is
eventually consistent.

As i said earlier in my email that selection of NoSql solution depends on
the use case. There are subtle differences between NoSql solution and each
of them have their own "Sweet Spot". So, pick yours after careful
evaluation.

PS: Added the HBase mailing list also since this is more about HBase.

Hope This Helps,
Anil Gupta


On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog <go...@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head for a real use case.
>
> ------------------------------
>
> *From: *"anil gupta" <an...@gmail.com>
> *To: *"common-user@hadoop.apache.org" <us...@hadoop.apache.org>
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace "Intuit", we did some detailed study to evaluate HBase and
> Cassandra for our use case. I will see if i can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, What is your use case? What bottleneck are you hitting at current
> solutions? If you can share some details then HBase community will try to
> help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql )
>> and MongoDB but don't feel any are quite right for this problem.  The
>> amount of data being stored and access requirements just don't match up
>> well.
>>
>> I was hoping to keep the stack as simple as possible and just use hdfs
>> but everything I was seeing kept pointing to the need for some other
>> datastore.  I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com>wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned that Hadoop is suitable enough to store
>>> append only data. If you want to do analysis of weblogs or user logins then
>>> Hadoop is a suitable solution for it.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First, of all i would suggest you to have a look at your use case then
>>> analyze whether it really needs a NoSql solution or not.
>>> As you were talking about maintaining User Data in NoSql. Why NoSql
>>> instead of RDBMS? What is the size of data? Which NoSql features are the
>>> selling points for you?
>>>
>>> For real time read writes you can have a look at Cassandra or HBase.
>>> But, i would suggest you to have a very close look at both of them because
>>> both of them have their own advantages. So, the choice will be dependent on
>>> your use case.
>>>
>>> One added advantage with HBase is that it has a deeper integration with
>>> Hadoop ecosystem so you can do a lot of stuff on HBase data  using Hadoop
>>> Tools. HBase has integration with Hive querying but AFAIK it has some
>>> limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>>         As HDFS paradigm is "Write once and read many" you cannot be
>>>> able to update the files on HDFS.
>>>>         But for your problem what you can do is you keep the
>>>> logs/userdata in hdfs with different timestamps.
>>>>         Run some mapreduce jobs at certain intervals to extract
>>>> required data from those logs and put it to Hbase/Cassandra/Mongodb.
>>>>
>>>>         Mongodb read performance is quite faster also it supports
>>>> ad-hoc querying. Also you can use Hadoop-MongoDB connector to read/write
>>>> the data to Mongodb thru Hadoop-Mapreduce.
>>>>
>>>>         If you are very specific about updating the hdfs files directly
>>>> then you have to use any commercial Hadoop packages like MapR which
>>>> supports updating the HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1] . You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2], its not yet in the trunk I suppose!
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com>wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>>> hive for testing but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the right
>>>>>> direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data where various bits of
>>>>>> information: email, etc may change from day to day.  Would hbase or
>>>>>> cassandra be the better way to go for this type of data, and can I overlay
>>>>>> hive over all ( hdfs, hbase, cassandra ) so that I can query the data
>>>>>> through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
>


-- 
Thanks & Regards,
Anil Gupta
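
A minimal sketch of the approach Mahesh describes in the quoted text above (user records kept append-only with timestamps, and a periodic job that extracts the current state). It is written in plain Python rather than as a real MapReduce job, and the record fields (user_id, email, updated_at) and the function name latest_per_user are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class UserRecord(NamedTuple):
    """One timestamped version of a user's data (hypothetical layout)."""
    user_id: str
    email: str
    updated_at: int  # epoch seconds when this version was written


def latest_per_user(records: Iterable[UserRecord]) -> Dict[str, UserRecord]:
    """Keep only the newest version of each user.

    This plays the role of a reduce step keyed on user_id: every version
    is appended to storage, and the job picks the latest one per key.
    """
    current: Dict[str, UserRecord] = {}
    for rec in records:
        seen = current.get(rec.user_id)
        if seen is None or rec.updated_at > seen.updated_at:
            current[rec.user_id] = rec
    return current


# Append-only history: u1 changed email between the two snapshots.
snapshots = [
    UserRecord("u1", "old@example.com", 100),
    UserRecord("u2", "b@example.com", 150),
    UserRecord("u1", "new@example.com", 200),
]

current = latest_per_user(snapshots)
print(current["u1"].email)
```

The result of such a job is what would then be loaded into HBase, Cassandra or MongoDB for real-time lookups, while the full history stays in HDFS for analysis.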

Re: Best practice for storage of data that changes

Posted by Michael Segel <mi...@hotmail.com>.

Here's the simple thing to consider... 

If you are running M/R jobs against the data... HBase hands down is the winner. 

If you are looking at a stand alone cluster ... Cassandra wins. HBase is still a fickle beast.

Of course I just bottom lined it.  :-) 


On Nov 29, 2012, at 10:51 PM, Lance Norskog <go...@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few head-to-head for a real use case.
> 
> From: "anil gupta" <an...@gmail.com>
> To: "common-user@hadoop.apache.org" <us...@hadoop.apache.org>
> Sent: Wednesday, November 28, 2012 11:01:55 AM
> Subject: Re: Best practice for storage of data that changes
> 
> Hi Jeff,
> 
> At my workplace, Intuit, we did a detailed study to evaluate HBase and Cassandra for our use case. I will see if I can post the comparative study on my public blog or on this mailing list.
> 
> BTW, what is your use case? What bottleneck are you hitting with your current solutions? If you can share some details, the HBase community will try to help you out.
> 
> Thanks,
> Anil Gupta
> 
> 
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:
> Hi,
> 
> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql ) and MongoDB but don't feel any are quite right for this problem.  The amount of data being stored and access requirements just don't match up well.
> 
> I was hoping to keep the stack as simple as possible and just use hdfs but everything I was seeing kept pointing to the need for some other datastore.  I'll check out both HBase and Cassandra.
> 
> Thanks for the feedback.
> 
> 
> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com> wrote:
> Hi Jeff,
> 
> My two cents below:
> 
> 1st use case: Append-only data - e.g. weblogs or user logins
> As others have already mentioned, Hadoop is well suited to storing append-only data. If you want to analyze weblogs or user logins, Hadoop is a suitable solution.
> 
> 
> 2nd use case: Account/User data
> First of all, I would suggest looking at your use case and analyzing whether it really needs a NoSQL solution.
> You were talking about maintaining user data in NoSQL: why NoSQL instead of an RDBMS? What is the size of the data? Which NoSQL features are the selling points for you?
> 
> For real-time reads and writes you can have a look at Cassandra or HBase. But I would suggest taking a very close look at both, because each has its own advantages. The choice will depend on your use case.
> 
> One added advantage of HBase is its deeper integration with the Hadoop ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase has integration with Hive querying, but AFAIK it has some limitations.
> 
> HTH,
> Anil Gupta
> 
> 
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <ba...@gmail.com> wrote:
> Hi Jeff,
> 
>         As the HDFS paradigm is "write once, read many", you cannot update files on HDFS.
>         For your problem, you can keep the logs/user data in HDFS with different timestamps.
>         Run MapReduce jobs at regular intervals to extract the required data from those logs and load it into HBase/Cassandra/MongoDB.
> 
>         MongoDB read performance is quite fast, and it supports ad-hoc querying. You can also use the Hadoop-MongoDB connector to read/write data to MongoDB through Hadoop MapReduce.
> 
>         If you specifically need to update HDFS files directly, you have to use a commercial Hadoop package such as MapR, which supports updating the HDFS files.
> 
> Best,
> Mahesh Balija,
> Calsoft Labs.
> 
> 
> 
> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <bh...@gmail.com> wrote:
> Hi Jeff,
> 
> Please look at [1]. You can store your data in HBase tables and query them normally just by mapping them to Hive tables. Regarding Cassandra support, please follow JIRA [2]; it's not yet in the trunk, I suppose!
> 
> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
> [2] https://issues.apache.org/jira/browse/HIVE-1434
> 
> Thanks,
> 
> 
> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
> Hi All,
> 
> I'm coming from the RDBMS world and am looking at hdfs for long term data storage and analysis.
> 
> I've done some research and set up some smallish hdfs clusters with hive for testing but I'm having a little trouble understanding how everything fits together and was hoping someone could point me in the right direction.
> 
> I'm looking at storing two types of data:
> 
> 1. Append-only data - e.g. weblogs or user logins
> 2. Account/User data
> 
> HDFS seems to be perfect for append-only data like #1, but I'm having trouble figuring out what to do with data that may change frequently.
> 
> A simple example would be user data where various bits of information: email, etc may change from day to day.  Would hbase or cassandra be the better way to go for this type of data, and can I overlay hive over all ( hdfs, hbase, cassandra ) so that I can query the data through a single interface?
> 
> Thanks in advance for any help.
> 
> 
> 
> -- 
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v
> 
> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta
> 
> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta

Re: Best practice for storage of data that changes

Posted by Lance Norskog <go...@gmail.com>.

Please! There are lots of blogs etc. about the two, but very few head-to-head for a real use case. 

----- Original Message -----

| From: "anil gupta" <an...@gmail.com>
| To: "common-user@hadoop.apache.org" <us...@hadoop.apache.org>
| Sent: Wednesday, November 28, 2012 11:01:55 AM
| Subject: Re: Best practice for storage of data that changes

| Hi Jeff,

| At my workplace "Intuit", we did some detailed study to evaluate
| HBase and Cassandra for our use case. I will see if i can post the
| comparative study on my public blog or on this mailing list.

| BTW, What is your use case? What bottleneck are you hitting at
| current solutions? If you can share some details then HBase
| community will try to help you out.

| Thanks,
| Anil Gupta

| On Wed, Nov 28, 2012 at 9:55 AM, jeff l < jeff.pubmail@gmail.com > wrote:

| | Hi,
| |
| | I have quite a bit of experience with RDBMSs ( Oracle, Postgres,
| | Mysql ) and MongoDB but don't feel any are quite right for this
| | problem. The amount of data being stored and access requirements
| | just don't match up well.
| |
| | I was hoping to keep the stack as simple as possible and just use
| | hdfs but everything I was seeing kept pointing to the need for some
| | other datastore. I'll check out both HBase and Cassandra.
| |
| | Thanks for the feedback.
| |
| | On Sun, Nov 25, 2012 at 1:11 PM, anil gupta < anilgupta84@gmail.com > wrote:

| | | Hi Jeff,
| | |
| | | My two cents below:
| | |
| | | 1st use case: Append-only data - e.g. weblogs or user logins
| | | As others have already mentioned, Hadoop is well suited to storing
| | | append-only data. If you want to analyze weblogs or user logins,
| | | Hadoop is a suitable solution.
| | |
| | | 2nd use case: Account/User data
| | | First of all, I would suggest looking at your use case and
| | | analyzing whether it really needs a NoSQL solution.
| | | You were talking about maintaining user data in NoSQL: why NoSQL
| | | instead of an RDBMS? What is the size of the data? Which NoSQL
| | | features are the selling points for you?
| | |
| | | For real-time reads and writes you can have a look at Cassandra or
| | | HBase. But I would suggest taking a very close look at both,
| | | because each has its own advantages. The choice will depend on
| | | your use case.
| | |
| | | One added advantage of HBase is its deeper integration with the
| | | Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
| | | tools. HBase has integration with Hive querying, but AFAIK it has
| | | some limitations.
| | |
| | | HTH,
| | | Anil Gupta

| | | On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija < balijamahesh.mca@gmail.com > wrote:

| | | | Hi Jeff,
| | | |
| | | | As the HDFS paradigm is "write once, read many", you cannot
| | | | update files on HDFS.
| | | | For your problem, you can keep the logs/user data in HDFS with
| | | | different timestamps.
| | | | Run MapReduce jobs at regular intervals to extract the required
| | | | data from those logs and load it into HBase/Cassandra/MongoDB.
| | | |
| | | | MongoDB read performance is quite fast, and it supports ad-hoc
| | | | querying. You can also use the Hadoop-MongoDB connector to
| | | | read/write data to MongoDB through Hadoop MapReduce.
| | | |
| | | | If you specifically need to update HDFS files directly, you have
| | | | to use a commercial Hadoop package such as MapR, which supports
| | | | updating the HDFS files.
| | | |
| | | | Best,
| | | | Mahesh Balija,
| | | | Calsoft Labs.

| | | | On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada < bharathvissapragada1990@gmail.com > wrote:

| | | | | Hi Jeff,
| | | | |
| | | | | Please look at [1]. You can store your data in HBase tables and
| | | | | query them normally just by mapping them to Hive tables. Regarding
| | | | | Cassandra support, please follow JIRA [2]; it's not yet in the
| | | | | trunk, I suppose!
| | | | |
| | | | | [1] https://cwiki.apache.org/Hive/hbaseintegration.html
| | | | | [2] https://issues.apache.org/jira/browse/HIVE-1434
| | | | |
| | | | | Thanks,

| | | | | On Sun, Nov 25, 2012 at 2:26 AM, jeff l < jeff.pubmail@gmail.com > wrote:

| | | | | | Hi All,
| | | | | |
| | | | | | I'm coming from the RDBMS world and am looking at hdfs for long
| | | | | | term data storage and analysis.
| | | | | |
| | | | | | I've done some research and set up some smallish hdfs clusters
| | | | | | with hive for testing but I'm having a little trouble
| | | | | | understanding how everything fits together and was hoping someone
| | | | | | could point me in the right direction.
| | | | | |
| | | | | | I'm looking at storing two types of data:
| | | | | |
| | | | | | 1. Append-only data - e.g. weblogs or user logins
| | | | | | 2. Account/User data
| | | | | |
| | | | | | HDFS seems to be perfect for append-only data like #1, but I'm
| | | | | | having trouble figuring out what to do with data that may change
| | | | | | frequently.
| | | | | |
| | | | | | A simple example would be user data where various bits of
| | | | | | information: email, etc may change from day to day. Would hbase
| | | | | | or cassandra be the better way to go for this type of data, and
| | | | | | can I overlay hive over all ( hdfs, hbase, cassandra ) so that I
| | | | | | can query the data through a single interface?
| | | | | |
| | | | | | Thanks in advance for any help.
| | | | | --
| | | | | Regards,
| | | | | Bharath .V
| | | | | w: http://researchweb.iiit.ac.in/~bharath.v

| | | --
| | | Thanks & Regards,
| | | Anil Gupta

| --
| Thanks & Regards,
| Anil Gupta

Re: Best practice for storage of data that changes

Posted by Lance Norskog <go...@gmail.com>.

Please! There are lots of blogs etc. about the two, but very few head-to-head for a real use case. 

----- Original Message -----

| From: "anil gupta" <an...@gmail.com>
| To: "common-user@hadoop.apache.org" <us...@hadoop.apache.org>
| Sent: Wednesday, November 28, 2012 11:01:55 AM
| Subject: Re: Best practice for storage of data that changes

| Hi Jeff,

| At my workplace "Intuit", we did some detailed study to evaluate
| HBase and Cassandra for our use case. I will see if i can post the
| comparative study on my public blog or on this mailing list.

| BTW, What is your use case? What bottleneck are you hitting at
| current solutions? If you can share some details then HBase
| community will try to help you out.

| Thanks,
| Anil Gupta

| On Wed, Nov 28, 2012 at 9:55 AM, jeff l < jeff.pubmail@gmail.com >
| wrote:

| | Hi,
| 

| | I have quite a bit of experience with RDBMSs ( Oracle, Postgres,
| | Mysql ) and MongoDB but don't feel any are quite right for this
| | problem. The amount of data being stored and access requirements
| | just don't match up well.
| 

| | I was hoping to keep the stack as simple as possible and just use
| | hdfs but everything I was seeing kept pointing to the need for some
| | other datastore. I'll check out both HBase and Cassandra.
| 

| | Thanks for the feedback.
| 

| | On Sun, Nov 25, 2012 at 1:11 PM, anil gupta < anilgupta84@gmail.com
| | >
| | wrote:
| 

| | | Hi Jeff,
| | |
| | | My two cents below:
| | |
| | | 1st use case: Append-only data - e.g. weblogs or user logins
| | | As others have already mentioned, Hadoop is well suited to storing
| | | append-only data. If you want to analyze weblogs or user logins,
| | | Hadoop is a suitable solution.
| | |
| | | 2nd use case: Account/User data
| | | First of all, I would suggest looking closely at your use case and
| | | analyzing whether it really needs a NoSQL solution. You were
| | | talking about maintaining user data in NoSQL: why NoSQL instead of
| | | an RDBMS? What is the size of the data? Which NoSQL features are
| | | the selling points for you?
| | |
| | | For real-time reads and writes you can look at Cassandra or HBase.
| | | I would suggest examining both closely, because each has its own
| | | advantages; the choice will depend on your use case.
| | |
| | | One added advantage of HBase is its deeper integration with the
| | | Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
| | | tools. HBase also integrates with Hive for querying, but AFAIK
| | | that integration has some limitations.
| | |
| | | HTH,
| | | Anil Gupta
| | |
| | | On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
| | | balijamahesh.mca@gmail.com > wrote:
| | |
| | | | Hi Jeff,
| | | |
| | | | Since the HDFS paradigm is "write once, read many", you cannot
| | | | update files in place on HDFS.
| | | | What you can do for your problem is keep the logs/user data in
| | | | HDFS with different timestamps, then run MapReduce jobs at
| | | | regular intervals to extract the required data from those logs
| | | | and load it into HBase/Cassandra/MongoDB.
| | | |
| | | | MongoDB read performance is quite fast, and it also supports
| | | | ad-hoc querying. You can also use the Hadoop-MongoDB connector
| | | | to read and write MongoDB data through Hadoop MapReduce.
| | | |
| | | | If you specifically need to update HDFS files directly, you will
| | | | have to use a commercial Hadoop distribution such as MapR, which
| | | | supports updating files.
| | | |
| | | | Best,
| | | | Mahesh Balija,
| | | | Calsoft Labs.
| | | |
| | | | On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
| | | | bharathvissapragada1990@gmail.com > wrote:
| | | |
| | | | | Hi Jeff,
| | | | |
| | | | | Please look at [1]. You can store your data in HBase tables and
| | | | | query them normally just by mapping them to Hive tables.
| | | | | Regarding Cassandra support, please follow JIRA [2]; it's not
| | | | | yet in trunk, I suppose!
| | | | |
| | | | | [1] https://cwiki.apache.org/Hive/hbaseintegration.html
| | | | | [2] https://issues.apache.org/jira/browse/HIVE-1434
| | | | |
| | | | | Thanks,
| | | | |
| | | | | On Sun, Nov 25, 2012 at 2:26 AM, jeff l < jeff.pubmail@gmail.com >
| | | | | wrote:
| | | | |
| | | | | | Hi All,
| | | | | |
| | | | | | I'm coming from the RDBMS world and am looking at hdfs for long
| | | | | | term data storage and analysis.
| | | | | |
| | | | | | I've done some research and set up some smallish hdfs clusters
| | | | | | with hive for testing but I'm having a little trouble
| | | | | | understanding how everything fits together and was hoping
| | | | | | someone could point me in the right direction.
| | | | | |
| | | | | | I'm looking at storing two types of data:
| | | | | |
| | | | | | 1. Append-only data - e.g. weblogs or user logins
| | | | | | 2. Account/User data
| | | | | |
| | | | | | HDFS seems to be perfect for append-only data like #1, but I'm
| | | | | | having trouble figuring out what to do with data that may
| | | | | | change frequently.
| | | | | |
| | | | | | A simple example would be user data where various bits of
| | | | | | information: email, etc may change from day to day. Would hbase
| | | | | | or cassandra be the better way to go for this type of data, and
| | | | | | can I overlay hive over all ( hdfs, hbase, cassandra ) so that
| | | | | | I can query the data through a single interface?
| | | | | |
| | | | | | Thanks in advance for any help.
| | | | |
| | | | | --
| | | | | Regards,
| | | | | Bharath .V
| | | | | w: http://researchweb.iiit.ac.in/~bharath.v
| | |
| | | --
| | | Thanks & Regards,
| | | Anil Gupta

| --
| Thanks & Regards,
| Anil Gupta
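
The batch approach Mahesh describes in the quoted thread (keep timestamped
user records in HDFS, then periodically extract the current state) can be
sketched roughly as follows. This is only an illustration in plain Python,
not a MapReduce job; the UserSnapshot record and its fields are hypothetical
names chosen for the example, and the in-memory list stands in for files in
HDFS.

```python
from typing import Dict, Iterable, NamedTuple


class UserSnapshot(NamedTuple):
    """One immutable record, as it might be appended to a file (hypothetical layout)."""
    user_id: str
    timestamp: int   # e.g. seconds since the epoch
    email: str


def latest_per_user(records: Iterable[UserSnapshot]) -> Dict[str, UserSnapshot]:
    """Reduce an append-only history to the newest record for each user."""
    current: Dict[str, UserSnapshot] = {}
    for rec in records:
        seen = current.get(rec.user_id)
        # Keep the record with the highest timestamp; on a tie the first one seen wins.
        if seen is None or rec.timestamp > seen.timestamp:
            current[rec.user_id] = rec
    return current


history = [
    UserSnapshot("u1", 100, "old@example.com"),
    UserSnapshot("u2", 110, "bob@example.com"),
    UserSnapshot("u1", 200, "new@example.com"),   # a later change for u1
]

state = latest_per_user(history)
print(state["u1"].email)
```

Nothing is ever updated in place here: changes are appended as new records,
and the "current" view is recomputed from the history, which is what makes
the idea compatible with write-once storage.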

Re: Best practice for storage of data that changes

Posted by Lance Norskog <go...@gmail.com>.

Please do! There are lots of blog posts about the two, but very few head-to-head comparisons for a real use case.

----- Original Message -----

| From: "anil gupta" <an...@gmail.com>
| To: "common-user@hadoop.apache.org" <us...@hadoop.apache.org>
| Sent: Wednesday, November 28, 2012 11:01:55 AM
| Subject: Re: Best practice for storage of data that changes

| Hi Jeff,

| At my workplace, Intuit, we did a detailed study to evaluate
| HBase and Cassandra for our use case. I will see if I can post the
| comparative study on my public blog or on this mailing list.

| BTW, what is your use case? What bottlenecks are you hitting with
| your current solutions? If you can share some details, the HBase
| community will try to help you out.

| Thanks,
| Anil Gupta

| On Wed, Nov 28, 2012 at 9:55 AM, jeff l < jeff.pubmail@gmail.com >
| wrote:

| | Hi,
| |
| | I have quite a bit of experience with RDBMSs ( Oracle, Postgres,
| | Mysql ) and MongoDB but don't feel any are quite right for this
| | problem. The amount of data being stored and access requirements
| | just don't match up well.
| |
| | I was hoping to keep the stack as simple as possible and just use
| | hdfs but everything I was seeing kept pointing to the need for some
| | other datastore. I'll check out both HBase and Cassandra.
| |
| | Thanks for the feedback.
| |
| | On Sun, Nov 25, 2012 at 1:11 PM, anil gupta < anilgupta84@gmail.com >
| | wrote:
| |
| | | Hi Jeff,
| | |
| | | My two cents below:
| | |
| | | 1st use case: Append-only data - e.g. weblogs or user logins
| | | As others have already mentioned, Hadoop is well suited to storing
| | | append-only data. If you want to analyze weblogs or user logins,
| | | Hadoop is a suitable solution.
| | |
| | | 2nd use case: Account/User data
| | | First of all, I would suggest looking closely at your use case and
| | | analyzing whether it really needs a NoSQL solution. You were
| | | talking about maintaining user data in NoSQL: why NoSQL instead of
| | | an RDBMS? What is the size of the data? Which NoSQL features are
| | | the selling points for you?
| | |
| | | For real-time reads and writes you can look at Cassandra or HBase.
| | | I would suggest examining both closely, because each has its own
| | | advantages; the choice will depend on your use case.
| | |
| | | One added advantage of HBase is its deeper integration with the
| | | Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
| | | tools. HBase also integrates with Hive for querying, but AFAIK
| | | that integration has some limitations.
| | |
| | | HTH,
| | | Anil Gupta
| | |
| | | On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
| | | balijamahesh.mca@gmail.com > wrote:
| | |
| | | | Hi Jeff,
| | | |
| | | | Since the HDFS paradigm is "write once, read many", you cannot
| | | | update files in place on HDFS.
| | | | What you can do for your problem is keep the logs/user data in
| | | | HDFS with different timestamps, then run MapReduce jobs at
| | | | regular intervals to extract the required data from those logs
| | | | and load it into HBase/Cassandra/MongoDB.
| | | |
| | | | MongoDB read performance is quite fast, and it also supports
| | | | ad-hoc querying. You can also use the Hadoop-MongoDB connector
| | | | to read and write MongoDB data through Hadoop MapReduce.
| | | |
| | | | If you specifically need to update HDFS files directly, you will
| | | | have to use a commercial Hadoop distribution such as MapR, which
| | | | supports updating files.
| | | |
| | | | Best,
| | | | Mahesh Balija,
| | | | Calsoft Labs.
| | | |
| | | | On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
| | | | bharathvissapragada1990@gmail.com > wrote:
| | | |
| | | | | Hi Jeff,
| | | | |
| | | | | Please look at [1]. You can store your data in HBase tables and
| | | | | query them normally just by mapping them to Hive tables.
| | | | | Regarding Cassandra support, please follow JIRA [2]; it's not
| | | | | yet in trunk, I suppose!
| | | | |
| | | | | [1] https://cwiki.apache.org/Hive/hbaseintegration.html
| | | | | [2] https://issues.apache.org/jira/browse/HIVE-1434
| | | | |
| | | | | Thanks,
| | | | |
| | | | | On Sun, Nov 25, 2012 at 2:26 AM, jeff l < jeff.pubmail@gmail.com >
| | | | | wrote:
| | | | |
| | | | | | Hi All,
| | | | | |
| | | | | | I'm coming from the RDBMS world and am looking at hdfs for long
| | | | | | term data storage and analysis.
| | | | | |
| | | | | | I've done some research and set up some smallish hdfs clusters
| | | | | | with hive for testing but I'm having a little trouble
| | | | | | understanding how everything fits together and was hoping
| | | | | | someone could point me in the right direction.
| | | | | |
| | | | | | I'm looking at storing two types of data:
| | | | | |
| | | | | | 1. Append-only data - e.g. weblogs or user logins
| | | | | | 2. Account/User data
| | | | | |
| | | | | | HDFS seems to be perfect for append-only data like #1, but I'm
| | | | | | having trouble figuring out what to do with data that may
| | | | | | change frequently.
| | | | | |
| | | | | | A simple example would be user data where various bits of
| | | | | | information: email, etc may change from day to day. Would hbase
| | | | | | or cassandra be the better way to go for this type of data, and
| | | | | | can I overlay hive over all ( hdfs, hbase, cassandra ) so that
| | | | | | I can query the data through a single interface?
| | | | | |
| | | | | | Thanks in advance for any help.
| | | | |
| | | | | --
| | | | | Regards,
| | | | | Bharath .V
| | | | | w: http://researchweb.iiit.ac.in/~bharath.v
| | |
| | | --
| | | Thanks & Regards,
| | | Anil Gupta

| --
| Thanks & Regards,
| Anil Gupta
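
For the second use case in the quoted thread (account data that changes from
day to day), it may help to see why a keyed store copes with updates where a
write-once file does not. The toy model below is a sketch only: it is not the
HBase client API, and VersionedTable, put, get and history are hypothetical
names. It assumes the general idea that each write adds a new timestamped
version of a cell and a read returns the newest one.

```python
import itertools
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class VersionedTable:
    """Toy in-memory table: each (row, column) cell keeps timestamped versions."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)
        self._clock = itertools.count(1)   # stand-in for a real write timestamp

    def put(self, row: str, column: str, value: str) -> int:
        """Record a new version; earlier versions are never modified."""
        ts = next(self._clock)
        self._cells[(row, column)].append((ts, value))
        return ts

    def get(self, row: str, column: str) -> Optional[str]:
        """Return the newest value, or None if the cell was never written."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return max(versions)[1]   # tuples compare by timestamp first

    def history(self, row: str, column: str) -> List[Tuple[int, str]]:
        """Return all versions of a cell, oldest first."""
        return sorted(self._cells.get((row, column), []))


users = VersionedTable()
users.put("user42", "info:email", "old@example.com")
users.put("user42", "info:email", "new@example.com")
print(users.get("user42", "info:email"))
print(users.history("user42", "info:email"))
```

An "update" in this model is just another write under the same key, so the
caller sees mutable data even though no stored version is ever rewritten.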

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Jeff,

At my workplace, Intuit, we did a detailed study to evaluate HBase and
Cassandra for our use case. I will see if I can post the comparative study
on my public blog or on this mailing list.

BTW, what is your use case? What bottlenecks are you hitting with your
current solutions? If you can share some details, the HBase community will
try to help you out.

Thanks,
Anil Gupta


On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:

> Hi,
>
> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql )
> and MongoDB but don't feel any are quite right for this problem.  The
> amount of data being stored and access requirements just don't match up
> well.
>
> I was hoping to keep the stack as simple as possible and just use hdfs but
> everything I was seeing kept pointing to the need for some other datastore.
>  I'll check out both HBase and Cassandra.
>
> Thanks for the feedback.
>
>
> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> My two cents below:
>>
>> 1st use case: Append-only data - e.g. weblogs or user logins
>> As others have already mentioned, Hadoop is well suited to storing
>> append-only data. If you want to analyze weblogs or user logins, Hadoop
>> is a suitable solution.
>>
>>
>> 2nd use case: Account/User data
>> First of all, I would suggest looking closely at your use case and
>> analyzing whether it really needs a NoSQL solution. You were talking
>> about maintaining user data in NoSQL: why NoSQL instead of an RDBMS?
>> What is the size of the data? Which NoSQL features are the selling
>> points for you?
>>
>> For real-time reads and writes you can look at Cassandra or HBase. I
>> would suggest examining both closely, because each has its own
>> advantages; the choice will depend on your use case.
>>
>> One added advantage of HBase is its deeper integration with the Hadoop
>> ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase
>> also integrates with Hive for querying, but AFAIK that integration has
>> some limitations.
>>
>> HTH,
>> Anil Gupta
>>
>>
>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>> balijamahesh.mca@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Since the HDFS paradigm is "write once, read many", you cannot update
>>> files in place on HDFS.
>>> What you can do for your problem is keep the logs/user data in HDFS with
>>> different timestamps, then run MapReduce jobs at regular intervals to
>>> extract the required data from those logs and load it into
>>> HBase/Cassandra/MongoDB.
>>>
>>> MongoDB read performance is quite fast, and it also supports ad-hoc
>>> querying. You can also use the Hadoop-MongoDB connector to read and write
>>> MongoDB data through Hadoop MapReduce.
>>>
>>> If you specifically need to update HDFS files directly, you will have to
>>> use a commercial Hadoop distribution such as MapR, which supports
>>> updating files.
>>>
>>> Best,
>>> Mahesh Balija,
>>> Calsoft Labs.
>>>
>>>
>>>
>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>> bharathvissapragada1990@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> Please look at [1]. You can store your data in HBase tables and query
>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
>>>>
>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>> data storage and analysis.
>>>>>
>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>> hive for testing but I'm having a little trouble understanding how
>>>>> everything fits together and was hoping someone could point me in the right
>>>>> direction.
>>>>>
>>>>> I'm looking at storing two types of data:
>>>>>
>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>> 2. Account/User data
>>>>>
>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>
>>>>> A simple example would be user data where various bits of information:
>>>>> email, etc may change from day to day.  Would hbase or cassandra be the
>>>>> better way to go for this type of data, and can I overlay hive over all (
>>>>> hdfs, hbase, cassandra ) so that I can query the data through a single
>>>>> interface?
>>>>>
>>>>> Thanks in advance for any help.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bharath .V
>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>


-- 
Thanks & Regards,
Anil Gupta
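
Jeff's original question about querying everything "through a single
interface" comes down to joining the append-only data with the current
account data at query time. The sketch below shows that join in plain Python
as an illustration of the idea, not of Hive itself; the function name, the
tuple layout and the "<unknown>" placeholder are assumptions made for the
example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def logins_by_current_email(
    login_events: Iterable[Tuple[str, int]],
    current_users: Dict[str, str],
) -> Counter:
    """Count logins, labelled with each user's *current* email.

    login_events  -- append-only (user_id, timestamp) pairs, e.g. from log files
    current_users -- user_id -> latest email, e.g. from a keyed, updatable store
    """
    counts: Counter = Counter()
    for user_id, _timestamp in login_events:
        # Users missing from the mutable store are grouped under a placeholder.
        email = current_users.get(user_id, "<unknown>")
        counts[email] += 1
    return counts


events = [("u1", 100), ("u2", 105), ("u1", 230)]
users = {"u1": "new@example.com", "u2": "bob@example.com"}
print(logins_by_current_email(events, users))
```

The login history never changes, while the lookup table can be replaced at
any time; the query layer is the only place where the two meet.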

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Jeff,

At my workplace, Intuit, we did a detailed study evaluating HBase and
Cassandra for our use case. I will see if I can post the comparative study
on my public blog or on this mailing list.

BTW, what is your use case? What bottlenecks are you hitting with your
current solutions? If you can share some details, the HBase community will
try to help you out.

Thanks,
Anil Gupta


On Wed, Nov 28, 2012 at 9:55 AM, jeff l <je...@gmail.com> wrote:

> Hi,
>
> I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL)
> and MongoDB, but I don't feel any of them is quite right for this problem.
> The amount of data being stored and the access requirements just don't
> match up well.
>
> I was hoping to keep the stack as simple as possible and just use HDFS, but
> everything I saw kept pointing to the need for some other datastore.
> I'll check out both HBase and Cassandra.
>
> Thanks for the feedback.
>
>
> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> My two cents below:
>>
>> 1st use case: Append-only data - e.g. weblogs or user logins
>> As others have already mentioned, Hadoop is well suited to storing
>> append-only data. If you want to analyze weblogs or user logins, Hadoop
>> is a suitable solution.
>>
>>
>> 2nd use case: Account/User data
>> First of all, I would suggest looking closely at your use case and
>> analyzing whether it really needs a NoSQL solution.
>> You mentioned maintaining user data in NoSQL: why NoSQL instead of an
>> RDBMS? What is the size of the data? Which NoSQL features are the
>> selling points for you?
>>
>> For real-time reads and writes you can have a look at Cassandra or HBase.
>> However, I would suggest looking very closely at both, because each has
>> its own advantages. The choice will depend on your use case.
>>
>> One added advantage of HBase is its deeper integration with the Hadoop
>> ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase
>> also integrates with Hive for querying, but AFAIK that has some
>> limitations.
>>
>> HTH,
>> Anil Gupta
>>
>>
>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>> balijamahesh.mca@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>>         Because HDFS follows a "write once, read many" model, you cannot
>>> update files in place on HDFS.
>>>         For your problem, you can instead keep the logs/user data in
>>> HDFS as files tagged with different timestamps.
>>>         Then run MapReduce jobs at regular intervals to extract the
>>> required data from those files and load it into HBase, Cassandra, or
>>> MongoDB.
>>>
>>>         MongoDB reads are quite fast, and it supports ad-hoc querying.
>>> You can also use the Hadoop-MongoDB connector to read and write MongoDB
>>> data through Hadoop MapReduce.
>>>
>>>         If you specifically need to update the files directly, you would
>>> have to use a commercial Hadoop distribution such as MapR, which supports
>>> updating files in place.
>>>
>>> Best,
>>> Mahesh Balija,
>>> Calsoft Labs.
>>>
>>>
>>>
>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>> bharathvissapragada1990@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> Please look at [1]. You can store your data in HBase tables and query
>>>> them normally by mapping them to Hive tables. Regarding Cassandra
>>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose.
>>>>
>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>> data storage and analysis.
>>>>>
>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>> hive for testing but I'm having a little trouble understanding how
>>>>> everything fits together and was hoping someone could point me in the right
>>>>> direction.
>>>>>
>>>>> I'm looking at storing two types of data:
>>>>>
>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>> 2. Account/User data
>>>>>
>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>
>>>>> A simple example would be user data where various bits of information:
>>>>> email, etc may change from day to day.  Would hbase or cassandra be the
>>>>> better way to go for this type of data, and can I overlay hive over all (
>>>>> hdfs, hbase, cassandra ) so that I can query the data through a single
>>>>> interface?
>>>>>
>>>>> Thanks in advance for any help.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bharath .V
>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>


-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by jeff l <je...@gmail.com>.

Hi,

I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL)
and MongoDB, but I don't feel any of them is quite right for this problem.
The amount of data being stored and the access requirements just don't
match up well.

I was hoping to keep the stack as simple as possible and just use HDFS, but
everything I saw kept pointing to the need for some other datastore.
I'll check out both HBase and Cassandra.

Thanks for the feedback.


On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <an...@gmail.com> wrote:

> Hi Jeff,
>
> My two cents below:
>
> 1st use case: Append-only data - e.g. weblogs or user logins
> As others have already mentioned, Hadoop is well suited to storing
> append-only data. If you want to analyze weblogs or user logins, Hadoop
> is a suitable solution.
>
>
> 2nd use case: Account/User data
> First of all, I would suggest looking closely at your use case and
> analyzing whether it really needs a NoSQL solution.
> You mentioned maintaining user data in NoSQL: why NoSQL instead of an
> RDBMS? What is the size of the data? Which NoSQL features are the
> selling points for you?
>
> For real-time reads and writes you can have a look at Cassandra or HBase.
> However, I would suggest looking very closely at both, because each has
> its own advantages. The choice will depend on your use case.
>
> One added advantage of HBase is its deeper integration with the Hadoop
> ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase
> also integrates with Hive for querying, but AFAIK that has some
> limitations.
>
> HTH,
> Anil Gupta
>
>
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <balijamahesh.mca@gmail.com
> > wrote:
>
>> Hi Jeff,
>>
>>         Because HDFS follows a "write once, read many" model, you cannot
>> update files in place on HDFS.
>>         For your problem, you can instead keep the logs/user data in
>> HDFS as files tagged with different timestamps.
>>         Then run MapReduce jobs at regular intervals to extract the
>> required data from those files and load it into HBase, Cassandra, or
>> MongoDB.
>>
>>         MongoDB reads are quite fast, and it supports ad-hoc querying.
>> You can also use the Hadoop-MongoDB connector to read and write MongoDB
>> data through Hadoop MapReduce.
>>
>>         If you specifically need to update the files directly, you would
>> have to use a commercial Hadoop distribution such as MapR, which supports
>> updating files in place.
>>
>> Best,
>> Mahesh Balija,
>> Calsoft Labs.
>>
>>
>>
>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>> bharathvissapragada1990@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Please look at [1]. You can store your data in HBase tables and query
>>> them normally by mapping them to Hive tables. Regarding Cassandra
>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose.
>>>
>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>
>>> Thanks,
>>>
>>>
>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>> data storage and analysis.
>>>>
>>>> I've done some research and set up some smallish hdfs clusters with
>>>> hive for testing but I'm having a little trouble understanding how
>>>> everything fits together and was hoping someone could point me in the right
>>>> direction.
>>>>
>>>> I'm looking at storing two types of data:
>>>>
>>>> 1. Append-only data - e.g. weblogs or user logins
>>>> 2. Account/User data
>>>>
>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>> trouble figuring out what to do with data that may change frequently.
>>>>
>>>> A simple example would be user data where various bits of information:
>>>> email, etc may change from day to day.  Would hbase or cassandra be the
>>>> better way to go for this type of data, and can I overlay hive over all (
>>>> hdfs, hbase, cassandra ) so that I can query the data through a single
>>>> interface?
>>>>
>>>> Thanks in advance for any help.
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Bharath .V
>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Jeff,

My two cents below:

1st use case: Append-only data - e.g. weblogs or user logins
As others have already mentioned, Hadoop is well suited to storing
append-only data. If you want to analyze weblogs or user logins, Hadoop
is a suitable solution.
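
To make that concrete, here is a minimal Python sketch of the map and
reduce steps for counting logins per user. It runs in memory rather than on
a cluster, and the tab-separated log layout, the sample lines, and the
function names are all assumptions made for illustration, not anything
Hadoop prescribes.

```python
from collections import Counter


def map_login(line):
    """Map step: emit (user_id, 1) for one 'timestamp<TAB>user_id' log line."""
    _timestamp, user_id = line.rstrip("\n").split("\t")
    return user_id, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each user."""
    totals = Counter()
    for user_id, count in pairs:
        totals[user_id] += count
    return dict(totals)


# Hypothetical append-only login log; new lines are only ever added.
log_lines = [
    "2012-11-24T21:56:17\talice",
    "2012-11-25T09:40:00\tbob",
    "2012-11-25T13:11:00\talice",
]

print(reduce_counts(map_login(line) for line in log_lines))
# {'alice': 2, 'bob': 1}
```

Because the input is never modified, the same job can simply be rerun over
a larger set of files as more log data arrives.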

2nd use case: Account/User data
First of all, I would suggest looking closely at your use case and
analyzing whether it really needs a NoSQL solution.
You mentioned maintaining user data in NoSQL: why NoSQL instead of an
RDBMS? What is the size of the data? Which NoSQL features are the selling
points for you?

For real-time reads and writes you can have a look at Cassandra or HBase.
However, I would suggest looking very closely at both, because each has
its own advantages. The choice will depend on your use case.
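
As a rough picture of why a keyed store suits data that changes, here is a
toy in-memory model in Python: each update is written as a new timestamped
version under a row key, and a read returns the newest one. This is only a
sketch under my own assumptions; the class and method names are
hypothetical and are not the HBase or Cassandra client API.

```python
from collections import defaultdict


class ToyKeyedStore:
    """Toy model of a keyed store with timestamped cells (illustration only)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Record a new version of one column instead of rewriting a file."""
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


store = ToyKeyedStore()
store.put("user42", "email", "old@example.com", timestamp=1)
store.put("user42", "email", "new@example.com", timestamp=2)
print(store.get("user42", "email"))  # new@example.com
```

The point of the sketch is the access pattern: a user's email can change
from day to day without touching any other record.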

One added advantage of HBase is its deeper integration with the Hadoop
ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase
also integrates with Hive for querying, but AFAIK that has some
limitations.
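
For a feel of what the Hive mapping looks like, here is a small Python
helper that renders the kind of DDL described on the Hive HBaseIntegration
wiki page. The storage handler class and the hbase.columns.mapping and
hbase.table.name properties come from that page; the table names, the
"info" column family, and the choice to type every column as STRING are
assumptions for illustration only.

```python
def hive_hbase_ddl(hive_table, hbase_table, key_column, column_map):
    """Render Hive DDL that maps an existing HBase table.

    column_map maps each Hive column name to an HBase "family:qualifier".
    Every column is declared STRING here, which is a simplification.
    """
    hive_columns = [f"{key_column} STRING"]
    hbase_mapping = [":key"]  # ":key" binds the first Hive column to the row key
    for hive_col, hbase_col in column_map.items():
        hive_columns.append(f"{hive_col} STRING")
        hbase_mapping.append(hbase_col)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({', '.join(hive_columns)})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{','.join(hbase_mapping)}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Hypothetical names: a Hive table "users" over an HBase table "users_hbase".
print(hive_hbase_ddl("users", "users_hbase", "user_id", {"email": "info:email"}))
```

The mapping string is joined without spaces on purpose, since whitespace
inside it would be treated as part of the column names.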

HTH,
Anil Gupta

On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija
<ba...@gmail.com> wrote:

> Hi Jeff,
>
>         Because HDFS follows a "write once, read many" model, you cannot
> update files in place on HDFS.
>         For your problem, you can instead keep the logs/user data in
> HDFS as files tagged with different timestamps.
>         Then run MapReduce jobs at regular intervals to extract the
> required data from those files and load it into HBase, Cassandra, or
> MongoDB.
>
>         MongoDB reads are quite fast, and it supports ad-hoc querying.
> You can also use the Hadoop-MongoDB connector to read and write MongoDB
> data through Hadoop MapReduce.
>
>         If you specifically need to update the files directly, you would
> have to use a commercial Hadoop distribution such as MapR, which supports
> updating files in place.
>
> Best,
> Mahesh Balija,
> Calsoft Labs.
>
>
>
> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
> bharathvissapragada1990@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> Please look at [1]. You can store your data in HBase tables and query
>> them normally by mapping them to Hive tables. Regarding Cassandra
>> support, please follow JIRA [2]; it's not yet in trunk, I suppose.
>>
>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>
>> Thanks,
>>
>>
>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>> data storage and analysis.
>>>
>>> I've done some research and set up some smallish hdfs clusters with hive
>>> for testing but I'm having a little trouble understanding how everything
>>> fits together and was hoping someone could point me in the right direction.
>>>
>>> I'm looking at storing two types of data:
>>>
>>> 1. Append-only data - e.g. weblogs or user logins
>>> 2. Account/User data
>>>
>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>> trouble figuring out what to do with data that may change frequently.
>>>
>>> A simple example would be user data where various bits of information:
>>> email, etc may change from day to day.  Would hbase or cassandra be the
>>> better way to go for this type of data, and can I overlay hive over all (
>>> hdfs, hbase, cassandra ) so that I can query the data through a single
>>> interface?
>>>
>>> Thanks in advance for any help.
>>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w:http://researchweb.iiit.ac.in/~bharath.v
>>
>
>

-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by anil gupta <an...@gmail.com>.

Hi Jeff,

My two cents below:

1st use case: Append-only data - e.g. weblogs or user logins
As others have already mentioned, Hadoop is well suited to storing
append-only data. If you want to analyze weblogs or user logins, Hadoop
is a suitable solution.

2nd use case: Account/User data
First of all, I would suggest looking closely at your use case and
analyzing whether it really needs a NoSQL solution.
You mentioned maintaining user data in NoSQL: why NoSQL instead of an
RDBMS? What is the size of the data? Which NoSQL features are the selling
points for you?

For real-time reads and writes you can have a look at Cassandra or HBase.
However, I would suggest looking very closely at both, because each has
its own advantages. The choice will depend on your use case.

One added advantage of HBase is its deeper integration with the Hadoop
ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase
also integrates with Hive for querying, but AFAIK that has some
limitations.

HTH,
Anil Gupta

On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <ba...@gmail.com> wrote:

> Hi Jeff,
>
>         As HDFS paradigm is "Write once and read many" you cannot be able
> to update the files on HDFS.
>         But for your problem what you can do is you keep the logs/userdata
> in hdfs with different timestamps.
>         Run some mapreduce jobs at certain intervals to extract required
> data from those logs and put it to Hbase/Cassandra/Mongodb.
>
>         Mongodb read performance is quite faster also it supports ad-hoc
> querying. Also you can use Hadoop-MongoDB connector to read/write the data
> to Mongodb thru Hadoop-Mapreduce.
>
>         If you are very specific about updating the hdfs files directly
> then you have to use any commercial Hadoop packages like MapR which
> supports updating the HDFS files.
>
> Best,
> Mahesh Balija,
> Calsoft Labs.
>
>
>
> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
> bharathvissapragada1990@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> Please look at [1] . You can store your data in HBase tables and query
>> them normally just by mapping them to Hive tables. Regarding Cassandra
>> support, please follow JIRA [2], its not yet in the trunk I suppose!
>>
>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>
>> Thanks,
>>
>>
>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>> data storage and analysis.
>>>
>>> I've done some research and set up some smallish hdfs clusters with hive
>>> for testing but I'm having a little trouble understanding how everything
>>> fits together and was hoping someone could point me in the right direction.
>>>
>>> I'm looking at storing two types of data:
>>>
>>> 1. Append-only data - e.g. weblogs or user logins
>>> 2. Account/User data
>>>
>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>> trouble figuring out what to do with data that may change frequently.
>>>
>>> A simple example would be user data where various bits of information:
>>> email, etc may change from day to day.  Would hbase or cassandra be the
>>> better way to go for this type of data, and can I overlay hive over all (
>>> hdfs, hbase, cassandra ) so that I can query the data through a single
>>> interface?
>>>
>>> Thanks in advance for any help.
>>>
>>
>>
>>
>> --
>> Regards,
>> Bharath .V
>> w:http://researchweb.iiit.ac.in/~bharath.v
>>
>
>

-- 
Thanks & Regards,
Anil Gupta

Re: Best practice for storage of data that changes

Posted by Mahesh Balija <ba...@gmail.com>.

Hi Jeff,

        Since the HDFS paradigm is "write once, read many", you cannot
update files in place on HDFS.
        For your problem, what you can do is keep the logs/user data in
HDFS with different timestamps.
        Then run MapReduce jobs at regular intervals to extract the required
data from those logs and load it into HBase/Cassandra/MongoDB.
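
As a rough illustration of the extraction step, here is a minimal Python
sketch of the logic such a job might apply: given timestamped versions of
user records, keep only the newest version per user. The record layout
(user id, timestamp, fields) and the function name are assumptions made for
this example; a real job would run this per key in a reducer and write the
result to the chosen store.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (user_id, timestamp, fields).
Record = Tuple[str, int, Dict[str, str]]


def latest_per_user(records: Iterable[Record]) -> Dict[str, Dict[str, str]]:
    """Keep only the newest version of each user's record."""
    newest: Dict[str, Tuple[int, Dict[str, str]]] = {}
    for user_id, ts, fields in records:
        # Replace the stored version only if this one is strictly newer.
        if user_id not in newest or ts > newest[user_id][0]:
            newest[user_id] = (ts, fields)
    return {user_id: fields for user_id, (_, fields) in newest.items()}


# Illustrative data only: two dated versions of u1, one of u2.
snapshots = [
    ("u1", 20121124, {"email": "old@example.com"}),
    ("u2", 20121124, {"email": "b@example.com"}),
    ("u1", 20121125, {"email": "new@example.com"}),
]
current = latest_per_user(snapshots)
print(current["u1"]["email"])
```

The older versions stay untouched in HDFS, so the history remains available
for offline analysis while the serving store holds only the current state.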

        MongoDB read performance is quite fast, and it supports ad-hoc
querying. You can also use the Hadoop-MongoDB connector to read/write data
to MongoDB through Hadoop MapReduce.

        If you specifically need to update the HDFS files directly, then
you would have to use a commercial Hadoop package such as MapR, which
supports updating the files.

Best,
Mahesh Balija,
Calsoft Labs.


On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
bharathvissapragada1990@gmail.com> wrote:

> Hi Jeff,
>
> Please look at [1] . You can store your data in HBase tables and query
> them normally just by mapping them to Hive tables. Regarding Cassandra
> support, please follow JIRA [2], its not yet in the trunk I suppose!
>
> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
> [2] https://issues.apache.org/jira/browse/HIVE-1434
>
> Thanks,
>
>
> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:
>
>> Hi All,
>>
>> I'm coming from the RDBMS world and am looking at hdfs for long term data
>> storage and analysis.
>>
>> I've done some research and set up some smallish hdfs clusters with hive
>> for testing but I'm having a little trouble understanding how everything
>> fits together and was hoping someone could point me in the right direction.
>>
>> I'm looking at storing two types of data:
>>
>> 1. Append-only data - e.g. weblogs or user logins
>> 2. Account/User data
>>
>> HDFS seems to be perfect for append-only data like #1, but I'm having
>> trouble figuring out what to do with data that may change frequently.
>>
>> A simple example would be user data where various bits of information:
>> email, etc may change from day to day.  Would hbase or cassandra be the
>> better way to go for this type of data, and can I overlay hive over all (
>> hdfs, hbase, cassandra ) so that I can query the data through a single
>> interface?
>>
>> Thanks in advance for any help.
>>
>
>
>
> --
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v
>

Re: Best practice for storage of data that changes

Posted by bharath vissapragada <bh...@gmail.com>.

Hi Jeff,

Please look at [1]. You can store your data in HBase tables and query them
normally just by mapping them to Hive tables. Regarding Cassandra support,
please follow JIRA [2]; it's not yet in the trunk, I suppose!

[1] https://cwiki.apache.org/Hive/hbaseintegration.html
[2] https://issues.apache.org/jira/browse/HIVE-1434
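
To make the mapping idea concrete, here is a small Python sketch that only
builds the Hive DDL string for an HBase-backed table. The storage handler
class and the hbase.columns.mapping / hbase.table.name properties are the
ones described in the Hive HBase integration page [1]; the table name,
column names, the column family "info" and the helper function itself are
made up for illustration, and every column is typed STRING for simplicity.

```python
from typing import Dict


def hbase_backed_hive_ddl(hive_table: str, hbase_table: str,
                          key_column: str, columns: Dict[str, str]) -> str:
    """Build a CREATE EXTERNAL TABLE statement mapping Hive columns to HBase.

    `columns` maps a Hive column name to an HBase "family:qualifier".
    The row key is always mapped through the special ":key" entry.
    """
    hive_cols = [f"{key_column} STRING"] + [f"{name} STRING" for name in columns]
    col_list = ", ".join(hive_cols)
    mapping = ",".join([":key"] + list(columns.values()))
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_list})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical names: a Hive table "users" over an HBase table "users_hbase".
ddl = hbase_backed_hive_ddl(
    "users", "users_hbase", "user_id",
    {"email": "info:email", "name": "info:name"},
)
print(ddl)
```

The generated statement would be run from the Hive CLI; after that, ordinary
SELECT queries against the Hive table read from the HBase table.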

Thanks,

On Sun, Nov 25, 2012 at 2:26 AM, jeff l <je...@gmail.com> wrote:

> Hi All,
>
> I'm coming from the RDBMS world and am looking at hdfs for long term data
> storage and analysis.
>
> I've done some research and set up some smallish hdfs clusters with hive
> for testing but I'm having a little trouble understanding how everything
> fits together and was hoping someone could point me in the right direction.
>
> I'm looking at storing two types of data:
>
> 1. Append-only data - e.g. weblogs or user logins
> 2. Account/User data
>
> HDFS seems to be perfect for append-only data like #1, but I'm having
> trouble figuring out what to do with data that may change frequently.
>
> A simple example would be user data where various bits of information:
> email, etc may change from day to day.  Would hbase or cassandra be the
> better way to go for this type of data, and can I overlay hive over all (
> hdfs, hbase, cassandra ) so that I can query the data through a single
> interface?
>
> Thanks in advance for any help.
>



-- 
Regards,
Bharath .V
w:http://researchweb.iiit.ac.in/~bharath.v

Re: Best practice for storage of data that changes

Posted by Peyman Mohajerian <mo...@gmail.com>.

If you need to run fast queries on your 'Account/User data' then you have to
use a NoSQL solution. If your only constraint is frequent updates, you may
still manage to keep the data in HDFS: just rewrite it every time there is a
change. So the key consideration is whether you want to run fast queries or
you are fine with slow offline queries of the HDFS data.
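
A minimal sketch of the "rewrite it on every change" approach, in Python.
It uses a local JSON file as a stand-in for a file on HDFS, which is an
assumption for the sake of a runnable example: the whole snapshot is read,
the updates are applied, a complete new copy is written next to it, and the
new copy then replaces the old one. The function name and file layout are
hypothetical.

```python
import json
import os
import tempfile
from typing import Dict


def rewrite_snapshot(path: str, updates: Dict[str, Dict[str, str]]) -> None:
    """Apply updates by writing a full new copy of the snapshot file."""
    with open(path) as f:
        users = json.load(f)
    for user_id, fields in updates.items():
        # Merge changed fields into the existing record, or create a new one.
        users.setdefault(user_id, {}).update(fields)
    # Write the complete new version beside the old one, then swap it in.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(users, f)
    os.replace(tmp_path, path)


# Illustrative data only.
workdir = tempfile.mkdtemp()
snapshot_path = os.path.join(workdir, "users.json")
with open(snapshot_path, "w") as f:
    json.dump({"u1": {"email": "old@example.com"}}, f)

rewrite_snapshot(snapshot_path, {"u1": {"email": "new@example.com"}})
with open(snapshot_path) as f:
    result = json.load(f)
print(result)
```

The cost of each update grows with the size of the whole data set, which is
why this only makes sense when slow, offline queries are acceptable.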

On Sat, Nov 24, 2012 at 12:56 PM, jeff l <je...@gmail.com> wrote:

> Hi All,
>
> I'm coming from the RDBMS world and am looking at hdfs for long term data
> storage and analysis.
>
> I've done some research and set up some smallish hdfs clusters with hive
> for testing but I'm having a little trouble understanding how everything
> fits together and was hoping someone could point me in the right direction.
>
> I'm looking at storing two types of data:
>
> 1. Append-only data - e.g. weblogs or user logins
> 2. Account/User data
>
> HDFS seems to be perfect for append-only data like #1, but I'm having
> trouble figuring out what to do with data that may change frequently.
>
> A simple example would be user data where various bits of information:
> email, etc may change from day to day.  Would hbase or cassandra be the
> better way to go for this type of data, and can I overlay hive over all (
> hdfs, hbase, cassandra ) so that I can query the data through a single
> interface?
>
> Thanks in advance for any help.
>
