Posted to user@cassandra.apache.org by Tharindu Mathew <mc...@gmail.com> on 2011/08/29 21:47:05 UTC

Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Hi,

I have a system already running in which I define a simple data flow (using
a custom data flow language) and configure jobs to run against stored data.
I use Quartz to schedule and run these jobs, and the data lives in various
data stores (mainly Cassandra, but some data exists in an RDBMS such as
MySQL as well).

With scalability in mind, and given the existing support for standard data
flow languages in the form of Pig and HiveQL, I plan to move my system to
Hadoop.

I've seen some efforts to integrate Cassandra and Hadoop. I've been reading
up and am still contemplating how to make this change.

It would be great to hear the recommended approach for doing this on Hadoop
with the integration of Cassandra and other RDBMSs. For example, a sample
task that already runs on the system is "once every hour, get rows from
column family X, aggregate the data in columns A, B and C, write the result
back to column family Y, and record the details of the last aggregated row
in a MySQL table".
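[Editor's note: as a rough sketch of what that hourly task's logic amounts to,
here is a minimal Python version. All names are hypothetical, and the
Cassandra and MySQL I/O is replaced by in-memory stand-ins; a real job would
do a column-family read/write and a JDBC INSERT instead.]

```python
from datetime import datetime, timezone

def run_hourly_job(cf_x_rows, cf_y, mysql_checkpoint):
    """Aggregate columns A, B and C from column family X into Y,
    then record the last aggregated row key (hypothetical schema)."""
    totals = {"A": 0, "B": 0, "C": 0}
    last_key = None
    for key, columns in cf_x_rows:          # rows fetched from CF X
        for col in totals:
            totals[col] += columns.get(col, 0)
        last_key = key
    # Write the aggregate back under an hour-bucketed row key in CF Y.
    bucket = datetime.now(timezone.utc).strftime("%Y%m%d%H")
    cf_y[bucket] = totals
    # Record progress in the RDBMS (stand-in for a MySQL INSERT).
    mysql_checkpoint["last_row"] = last_key
    return totals
```

On Hadoop the same shape maps naturally to a scheduled MapReduce or Pig job:
the per-row loop becomes the map phase and the summing becomes the reduce.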

Thanks in advance.

-- 
Regards,

Tharindu

Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Tharindu Mathew <mc...@gmail.com>.
Thanks Jeremy. These will be really useful.

On Wed, Aug 31, 2011 at 12:12 AM, Jeremy Hanna
<je...@gmail.com>wrote:

> I've tried to help out with some UDFs and references that help with our use
> case: https://github.com/jeromatron/pygmalion/
>
> There are some brisk docs on pig as well that might be helpful:
> http://www.datastax.com/docs/0.8/brisk/about_pig
>
> On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:
>
> > Thanks Jeremy for your response. That gives me some encouragement, that I
> might be on that right track.
> >
> > I think I need to try out more stuff before coming to a conclusion on
> Brisk.
> >
> > For Pig operations over Cassandra, I only could find
> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any
> other resource that you can point me to? There seems to be a lack of samples
> on this subject.
> >
> > On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna <
> jeremy.hanna1234@gmail.com> wrote:
> > FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> potentially move to Brisk because of the simplicity of operations there.
> >
> > Not sure what you mean about the true power of Hadoop.  In my mind the
> true power of Hadoop is the ability to parallelize jobs and send each task
> to where the data resides.  HDFS exists to enable that.  Brisk is just
> another HDFS compatible implementation.  If you're already storing your data
> in Cassandra and are looking to use Hadoop with it, then I would seriously
> consider using Brisk.
> >
> > That said, Cassandra with Hadoop works fine.
> >
> > On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for your response.
> > >
> > > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <dj...@gmail.com>
> wrote:
> > >
> > >> Hi Tharindu, try having a look at Brisk(
> > >> http://www.datastax.com/products/brisk) it integrates Hadoop with
> > >> Cassandra and is shipped with Hive for SQL analysis. You can then
> install
> > >> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in
> order
> > >> to enable data import/export between Hadoop and MySQL.
> > >> Does this sound ok to you ?
> > >>
> > >> These do sound ok. But I was looking at using something from Apache
> itself.
> > >
> > > Brisk sounds nice, but I feel that disregarding HDFS and totally
> switching
> > > to Cassandra is not the right thing to do. Just my opinion there. I
> feel we
> > > are not using the true power of Hadoop then.
> > >
> > > I feel Pig has more integration with Cassandra, so I might take a look
> > > there.
> > >
> > > Whichever I choose, I will contribute the code back to the Apache
> projects I
> > > use. Here's a sample data analysis I do with my language. Maybe, there
> is no
> > > generic way to do what I want to do.
> > >
> > >
> > >
> > > <get name="NodeId">
> > > <index name="ServerName" start="" end=""/>
> > > <!--<index name="nodeId" start="AS" end="FB"/>-->
> > > <!--<groupBy index="nodeId"/>-->
> > > <granularity index="timeStamp" type="hour"/>
> > > </get>
> > >
> > > <lookup name="Event"/>
> > >
> > > <aggregate>
> > > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeResult" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > > <get name="NodeResult">
> > > <index name="ServerName" start="" end=""/>
> > > <groupBy index="ServerName"/>
> > > </get>
> > >
> > > <aggregate>
> > > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeAccumilator" indexRow="allKeys"/>
> > >
> > > <log/>
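[Editor's note: reading the <aggregate> element in the DSL sample above,
CUMULATIVE appears to mean a running sum over the selected rows and AVG a
mean. A minimal Python sketch of that interpretation (mine, not the actual
engine's):]

```python
def aggregate(rows, measures):
    """rows: list of dicts (one per row); measures: name ->
    'CUMULATIVE' (sum) or 'AVG' (mean), mirroring the <measure> elements."""
    out = {}
    for name, agg_type in measures.items():
        values = [r[name] for r in rows if name in r]
        if not values:
            continue  # measure absent from every row
        total = sum(values)
        out[name] = total if agg_type == "CUMULATIVE" else total / len(values)
    return out
```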
> > >
> > >
> > >> 2011/8/29 Tharindu Mathew <mc...@gmail.com>
> > >>
> > >>> Hi,
> > >>>
> > >>> I have an already running system where I define a simple data flow
> (using
> > >>> a simple custom data flow language) and configure jobs to run against
> stored
> > >>> data. I use quartz to schedule and run these jobs and the data exists
> on
> > >>> various data stores (mainly Cassandra but some data exists in RDBMS
> like
> > >>> mysql as well).
> > >>>
> > >>> Thinking about scalability and already existing support for standard
> data
> > >>> flow languages in the form of Pig and HiveQL, I plan to move my
> system to
> > >>> Hadoop.
> > >>>
> > >>> I've seen some efforts on the integration of Cassandra and Hadoop.
> I've
> > >>> been reading up and still am contemplating on how to make this
> change.
> > >>>
> > >>> It would be great to hear the recommended approach of doing this on
> Hadoop
> > >>> with the integration of Cassandra and other RDBMS. For example, a
> sample
> > >>> task that already runs on the system is "once in every hour, get rows
> from
> > >>> column family X, aggregate data in columns A, B and C and write back
> to
> > >>> column family Y, and enter details of last aggregated row into a
> table in
> > >>> mysql"
> > >>>
> > >>> Thanks in advance.
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> Tharindu
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> *Eric Djatsa Yota*
> > >> *Double degree MsC Student in Computer Science Engineering and
> > >> Communication Networks
> > >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> > >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> > >> djatsaedy@gmail.com
> > >> *Tel : 0601791859*
> > >>
> > >>
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Tharindu
> >
> >
> >
> >
> > --
> > Regards,
> >
> > Tharindu
>
>


-- 
Regards,

Tharindu

Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Jeremy Hanna <je...@gmail.com>.
I've tried to help out with some UDFs and references that help with our use case: https://github.com/jeromatron/pygmalion/

There are some Brisk docs on Pig as well that might be helpful: http://www.datastax.com/docs/0.8/brisk/about_pig

On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:

> Thanks Jeremy for your response. That gives me some encouragement, that I might be on that right track.
> 
> I think I need to try out more stuff before coming to a conclusion on Brisk.
> 
> For Pig operations over Cassandra, I only could find http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any other resource that you can point me to? There seems to be a lack of samples on this subject.
> 
> On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna <je...@gmail.com> wrote:
> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to potentially move to Brisk because of the simplicity of operations there.
> 
> Not sure what you mean about the true power of Hadoop.  In my mind the true power of Hadoop is the ability to parallelize jobs and send each task to where the data resides.  HDFS exists to enable that.  Brisk is just another HDFS compatible implementation.  If you're already storing your data in Cassandra and are looking to use Hadoop with it, then I would seriously consider using Brisk.
> 
> That said, Cassandra with Hadoop works fine.
> 
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> 
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <dj...@gmail.com> wrote:
> >
> >> Hi Tharindu, try having a look at Brisk(
> >> http://www.datastax.com/products/brisk) it integrates Hadoop with
> >> Cassandra and is shipped with Hive for SQL analysis. You can then install
> >> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order
> >> to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you ?
> >>
> >> These do sound ok. But I was looking at using something from Apache itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally switching
> > to Cassandra is not the right thing to do. Just my opinion there. I feel we
> > are not using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look
> > there.
> >
> > Whichever I choose, I will contribute the code back to the Apache projects I
> > use. Here's a sample data analysis I do with my language. Maybe, there is no
> > generic way to do what I want to do.
> >
> >
> >
> > <get name="NodeId">
> > <index name="ServerName" start="" end=""/>
> > <!--<index name="nodeId" start="AS" end="FB"/>-->
> > <!--<groupBy index="nodeId"/>-->
> > <granularity index="timeStamp" type="hour"/>
> > </get>
> >
> > <lookup name="Event"/>
> >
> > <aggregate>
> > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeResult" indexRow="allKeys"/>
> >
> > <log/>
> >
> > <get name="NodeResult">
> > <index name="ServerName" start="" end=""/>
> > <groupBy index="ServerName"/>
> > </get>
> >
> > <aggregate>
> > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeAccumilator" indexRow="allKeys"/>
> >
> > <log/>
> >
> >
> >> 2011/8/29 Tharindu Mathew <mc...@gmail.com>
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow (using
> >>> a simple custom data flow language) and configure jobs to run against stored
> >>> data. I use quartz to schedule and run these jobs and the data exists on
> >>> various data stores (mainly Cassandra but some data exists in RDBMS like
> >>> mysql as well).
> >>>
> >>> Thinking about scalability and already existing support for standard data
> >>> flow languages in the form of Pig and HiveQL, I plan to move my system to
> >>> Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
> >>> been reading up and still am contemplating on how to make this change.
> >>>
> >>> It would be great to hear the recommended approach of doing this on Hadoop
> >>> with the integration of Cassandra and other RDBMS. For example, a sample
> >>> task that already runs on the system is "once in every hour, get rows from
> >>> column family X, aggregate data in columns A, B and C and write back to
> >>> column family Y, and enter details of last aggregated row into a table in
> >>> mysql"
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>>
> >>
> >>
> >>
> >> --
> >> *Eric Djatsa Yota*
> >> *Double degree MsC Student in Computer Science Engineering and
> >> Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> >> djatsaedy@gmail.com
> >> *Tel : 0601791859*
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Tharindu
> 
> 
> 
> 
> -- 
> Regards,
> 
> Tharindu


Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Tharindu Mathew <mc...@gmail.com>.
Thanks Jeremy for your response. That gives me some encouragement that I
might be on the right track.

I think I need to try out more stuff before coming to a conclusion on Brisk.

For Pig operations over Cassandra, I could only find
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any
other resources that you can point me to? There seems to be a lack of
samples on this subject.

On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
<je...@gmail.com>wrote:

> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> potentially move to Brisk because of the simplicity of operations there.
>
> Not sure what you mean about the true power of Hadoop.  In my mind the true
> power of Hadoop is the ability to parallelize jobs and send each task to
> where the data resides.  HDFS exists to enable that.  Brisk is just another
> HDFS compatible implementation.  If you're already storing your data in
> Cassandra and are looking to use Hadoop with it, then I would seriously
> consider using Brisk.
>
> That said, Cassandra with Hadoop works fine.
>
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
>
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <dj...@gmail.com>
> wrote:
> >
> >> Hi Tharindu, try having a look at Brisk(
> >> http://www.datastax.com/products/brisk) it integrates Hadoop with
> >> Cassandra and is shipped with Hive for SQL analysis. You can then
> install
> >> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in
> order
> >> to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you ?
> >>
> >> These do sound ok. But I was looking at using something from Apache
> itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally
> switching
> > to Cassandra is not the right thing to do. Just my opinion there. I feel
> we
> > are not using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look
> > there.
> >
> > Whichever I choose, I will contribute the code back to the Apache
> projects I
> > use. Here's a sample data analysis I do with my language. Maybe, there is
> no
> > generic way to do what I want to do.
> >
> >
> >
> > <get name="NodeId">
> > <index name="ServerName" start="" end=""/>
> > <!--<index name="nodeId" start="AS" end="FB"/>-->
> > <!--<groupBy index="nodeId"/>-->
> > <granularity index="timeStamp" type="hour"/>
> > </get>
> >
> > <lookup name="Event"/>
> >
> > <aggregate>
> > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeResult" indexRow="allKeys"/>
> >
> > <log/>
> >
> > <get name="NodeResult">
> > <index name="ServerName" start="" end=""/>
> > <groupBy index="ServerName"/>
> > </get>
> >
> > <aggregate>
> > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeAccumilator" indexRow="allKeys"/>
> >
> > <log/>
> >
> >
> >> 2011/8/29 Tharindu Mathew <mc...@gmail.com>
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow
> (using
> >>> a simple custom data flow language) and configure jobs to run against
> stored
> >>> data. I use quartz to schedule and run these jobs and the data exists
> on
> >>> various data stores (mainly Cassandra but some data exists in RDBMS
> like
> >>> mysql as well).
> >>>
> >>> Thinking about scalability and already existing support for standard
> data
> >>> flow languages in the form of Pig and HiveQL, I plan to move my system
> to
> >>> Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
> >>> been reading up and still am contemplating on how to make this change.
> >>>
> >>> It would be great to hear the recommended approach of doing this on
> Hadoop
> >>> with the integration of Cassandra and other RDBMS. For example, a
> sample
> >>> task that already runs on the system is "once in every hour, get rows
> from
> >>> column family X, aggregate data in columns A, B and C and write back to
> >>> column family Y, and enter details of last aggregated row into a table
> in
> >>> mysql"
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>>
> >>
> >>
> >>
> >> --
> >> *Eric Djatsa Yota*
> >> *Double degree MsC Student in Computer Science Engineering and
> >> Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> >> djatsaedy@gmail.com
> >> *Tel : 0601791859*
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Tharindu
>
>


-- 
Regards,

Tharindu

Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Tharindu Mathew <mc...@gmail.com>.
Thanks Jeremy for your response. That gives me some encouragement that I
might be on the right track.

I think I need to try out more stuff before coming to a conclusion on Brisk.

For Pig operations over Cassandra, I could only find
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any
other resources you can point me to? There seems to be a lack of samples on
this subject.

On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
<je...@gmail.com>wrote:

> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> potentially move to Brisk because of the simplicity of operations there.
>
> Not sure what you mean about the true power of Hadoop.  In my mind the true
> power of Hadoop is the ability to parallelize jobs and send each task to
> where the data resides.  HDFS exists to enable that.  Brisk is just another
> HDFS compatible implementation.  If you're already storing your data in
> Cassandra and are looking to use Hadoop with it, then I would seriously
> consider using Brisk.
>
> That said, Cassandra with Hadoop works fine.
>
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
>
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <dj...@gmail.com>
> wrote:
> >
> >> Hi Tharindu, try having a look at Brisk(
> >> http://www.datastax.com/products/brisk) it integrates Hadoop with
> >> Cassandra and is shipped with Hive for SQL analysis. You can then
> install
> >> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in
> order
> >> to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you ?
> >>
> >> These do sound ok. But I was looking at using something from Apache
> itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally
> switching
> > to Cassandra is not the right thing to do. Just my opinion there. I feel
> we
> > are not using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look
> > there.
> >
> > Whichever I choose, I will contribute the code back to the Apache
> projects I
> > use. Here's a sample data analysis I do with my language. Maybe, there is
> no
> > generic way to do what I want to do.
> >
> >
> >
> > <get name="NodeId">
> > <index name="ServerName" start="" end=""/>
> > <!--<index name="nodeId" start="AS" end="FB"/>-->
> > <!--<groupBy index="nodeId"/>-->
> > <granularity index="timeStamp" type="hour"/>
> > </get>
> >
> > <lookup name="Event"/>
> >
> > <aggregate>
> > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeResult" indexRow="allKeys"/>
> >
> > <log/>
> >
> > <get name="NodeResult">
> > <index name="ServerName" start="" end=""/>
> > <groupBy index="ServerName"/>
> > </get>
> >
> > <aggregate>
> > <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeAccumilator" indexRow="allKeys"/>
> >
> > <log/>
> >
> >
> >> 2011/8/29 Tharindu Mathew <mc...@gmail.com>
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow
> (using
> >>> a simple custom data flow language) and configure jobs to run against
> stored
> >>> data. I use quartz to schedule and run these jobs and the data exists
> on
> >>> various data stores (mainly Cassandra but some data exists in RDBMS
> like
> >>> mysql as well).
> >>>
> >>> Thinking about scalability and already existing support for standard
> data
> >>> flow languages in the form of Pig and HiveQL, I plan to move my system
> to
> >>> Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
> >>> been reading up and still am contemplating on how to make this change.
> >>>
> >>> It would be great to hear the recommended approach of doing this on
> Hadoop
> >>> with the integration of Cassandra and other RDBMS. For example, a
> sample
> >>> task that already runs on the system is "once in every hour, get rows
> from
> >>> column family X, aggregate data in columns A, B and C and write back to
> >>> column family Y, and enter details of last aggregated row into a table
> in
> >>> mysql"
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>>
> >>
> >>
> >>
> >> --
> >> *Eric Djatsa Yota*
> >> *Double degree MsC Student in Computer Science Engineering and
> >> Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> >> djatsaedy@gmail.com
> >> *Tel : 0601791859*
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Tharindu
>
>


-- 
Regards,

Tharindu

Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Jeremy Hanna <je...@gmail.com>.
FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to potentially move to Brisk because of the simplicity of operations there.

Not sure what you mean about the true power of Hadoop.  In my mind the true power of Hadoop is the ability to parallelize jobs and send each task to where the data resides.  HDFS exists to enable that.  Brisk is just another HDFS compatible implementation.  If you're already storing your data in Cassandra and are looking to use Hadoop with it, then I would seriously consider using Brisk.

That said, Cassandra with Hadoop works fine.

On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:

> Hi Eric,
> 
> Thanks for your response.
> 
> On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <dj...@gmail.com> wrote:
> 
>> Hi Tharindu, try having a look at Brisk(
>> http://www.datastax.com/products/brisk) it integrates Hadoop with
>> Cassandra and is shipped with Hive for SQL analysis. You can then install
>> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order
>> to enable data import/export between Hadoop and MySQL.
>> Does this sound ok to you ?
>> 
>> These do sound ok. But I was looking at using something from Apache itself.
> 
> Brisk sounds nice, but I feel that disregarding HDFS and totally switching
> to Cassandra is not the right thing to do. Just my opinion there. I feel we
> are not using the true power of Hadoop then.
> 
> I feel Pig has more integration with Cassandra, so I might take a look
> there.
> 
> Whichever I choose, I will contribute the code back to the Apache projects I
> use. Here's a sample data analysis I do with my language. Maybe, there is no
> generic way to do what I want to do.
> 
> 
> 
> <get name="NodeId">
> <index name="ServerName" start="" end=""/>
> <!--<index name="nodeId" start="AS" end="FB"/>-->
> <!--<groupBy index="nodeId"/>-->
> <granularity index="timeStamp" type="hour"/>
> </get>
> 
> <lookup name="Event"/>
> 
> <aggregate>
> <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> <measure name="MaximumResponseTime" aggregationType="AVG"/>
> </aggregate>
> 
> <put name="NodeResult" indexRow="allKeys"/>
> 
> <log/>
> 
> <get name="NodeResult">
> <index name="ServerName" start="" end=""/>
> <groupBy index="ServerName"/>
> </get>
> 
> <aggregate>
> <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> <measure name="MaximumResponseTime" aggregationType="AVG"/>
> </aggregate>
> 
> <put name="NodeAccumilator" indexRow="allKeys"/>
> 
> <log/>
> 
> 
>> 2011/8/29 Tharindu Mathew <mc...@gmail.com>
>> 
>>> Hi,
>>> 
>>> I have an already running system where I define a simple data flow (using
>>> a simple custom data flow language) and configure jobs to run against stored
>>> data. I use quartz to schedule and run these jobs and the data exists on
>>> various data stores (mainly Cassandra but some data exists in RDBMS like
>>> mysql as well).
>>> 
>>> Thinking about scalability and already existing support for standard data
>>> flow languages in the form of Pig and HiveQL, I plan to move my system to
>>> Hadoop.
>>> 
>>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
>>> been reading up and still am contemplating on how to make this change.
>>> 
>>> It would be great to hear the recommended approach of doing this on Hadoop
>>> with the integration of Cassandra and other RDBMS. For example, a sample
>>> task that already runs on the system is "once in every hour, get rows from
>>> column family X, aggregate data in columns A, B and C and write back to
>>> column family Y, and enter details of last aggregated row into a table in
>>> mysql"
>>> 
>>> Thanks in advance.
>>> 
>>> --
>>> Regards,
>>> 
>>> Tharindu
>>> 
>> 
>> 
>> 
>> --
>> *Eric Djatsa Yota*
>> *Double degree MsC Student in Computer Science Engineering and
>> Communication Networks
>> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
>> *Intern at AMADEUS S.A.S Sophia Antipolis*
>> djatsaedy@gmail.com
>> *Tel : 0601791859*
>> 
>> 
> 
> 
> -- 
> Regards,
> 
> Tharindu


Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Tharindu Mathew <mc...@gmail.com>.
Hi Eric,

Thanks for your response.

On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <dj...@gmail.com> wrote:

> Hi Tharindu, try having a look at Brisk(
> http://www.datastax.com/products/brisk) it integrates Hadoop with
> Cassandra and is shipped with Hive for SQL analysis. You can then install
> Sqoop(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order
> to enable data import/export between Hadoop and MySQL.
> Does this sound ok to you ?
>
> These do sound ok. But I was looking at using something from Apache itself.

Brisk sounds nice, but I feel that disregarding HDFS and totally switching
to Cassandra is not the right thing to do. Just my opinion there. I feel we
are not using the true power of Hadoop then.

I feel Pig has more integration with Cassandra, so I might take a look
there.

Whichever I choose, I will contribute the code back to the Apache projects I
use. Here's a sample data analysis I do with my language. Maybe, there is no
generic way to do what I want to do.



<get name="NodeId">
<index name="ServerName" start="" end=""/>
<!--<index name="nodeId" start="AS" end="FB"/>-->
<!--<groupBy index="nodeId"/>-->
<granularity index="timeStamp" type="hour"/>
</get>

<lookup name="Event"/>

<aggregate>
<measure name="RequestCount" aggregationType="CUMULATIVE"/>
<measure name="ResponseCount" aggregationType="CUMULATIVE"/>
<measure name="MaximumResponseTime" aggregationType="AVG"/>
</aggregate>

<put name="NodeResult" indexRow="allKeys"/>

<log/>

<get name="NodeResult">
<index name="ServerName" start="" end=""/>
<groupBy index="ServerName"/>
</get>

<aggregate>
<measure name="RequestCount" aggregationType="CUMULATIVE"/>
<measure name="ResponseCount" aggregationType="CUMULATIVE"/>
<measure name="MaximumResponseTime" aggregationType="AVG"/>
</aggregate>

<put name="NodeAccumilator" indexRow="allKeys"/>

<log/>
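The CUMULATIVE and AVG semantics of the <aggregate> step above, applied per
hour bucket as in the <granularity> element, can be sketched in plain Python.
All names and sample rows here are illustrative only; this is not code from
Cassandra, Pig, Hive, or the custom data flow engine being discussed.

```python
# Sketch of the <aggregate> step: CUMULATIVE measures are summed and AVG
# measures averaged over rows grouped into hour buckets (illustrative only).
from collections import defaultdict
from datetime import datetime

rows = [
    {"timeStamp": "2011-08-30T10:05:00", "RequestCount": 4,
     "ResponseCount": 4, "MaximumResponseTime": 120},
    {"timeStamp": "2011-08-30T10:40:00", "RequestCount": 6,
     "ResponseCount": 5, "MaximumResponseTime": 80},
    {"timeStamp": "2011-08-30T11:10:00", "RequestCount": 3,
     "ResponseCount": 3, "MaximumResponseTime": 200},
]

def hour_bucket(ts):
    # "hour" granularity: truncate the timestamp to the hour
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").strftime("%Y-%m-%dT%H:00")

def aggregate(rows, cumulative, avg):
    # Collect each measure's values per hour bucket, then reduce:
    # sum for CUMULATIVE measures, arithmetic mean for AVG measures.
    buckets = defaultdict(lambda: defaultdict(list))
    for r in rows:
        b = hour_bucket(r["timeStamp"])
        for m in cumulative + avg:
            buckets[b][m].append(r[m])
    return {b: {m: (sum(v) if m in cumulative else sum(v) / len(v))
                for m, v in measures.items()}
            for b, measures in buckets.items()}

agg = aggregate(rows,
                cumulative=["RequestCount", "ResponseCount"],
                avg=["MaximumResponseTime"])
```

In a Pig or Hive port, the hour bucketing would become the GROUP BY key and
the two reductions would become SUM() and AVG() over each group.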


> 2011/8/29 Tharindu Mathew <mc...@gmail.com>
>
>> Hi,
>>
>> I have an already running system where I define a simple data flow (using
>> a simple custom data flow language) and configure jobs to run against stored
>> data. I use quartz to schedule and run these jobs and the data exists on
>> various data stores (mainly Cassandra but some data exists in RDBMS like
>> mysql as well).
>>
>> Thinking about scalability and already existing support for standard data
>> flow languages in the form of Pig and HiveQL, I plan to move my system to
>> Hadoop.
>>
>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
>> been reading up and still am contemplating on how to make this change.
>>
>> It would be great to hear the recommended approach of doing this on Hadoop
>> with the integration of Cassandra and other RDBMS. For example, a sample
>> task that already runs on the system is "once in every hour, get rows from
>> column family X, aggregate data in columns A, B and C and write back to
>> column family Y, and enter details of last aggregated row into a table in
>> mysql"
>>
>> Thanks in advance.
>>
>> --
>> Regards,
>>
>> Tharindu
>>
>
>
>
> --
> *Eric Djatsa Yota*
> *Double degree MsC Student in Computer Science Engineering and
> Communication Networks
> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> *Intern at AMADEUS S.A.S Sophia Antipolis*
> djatsaedy@gmail.com
> *Tel : 0601791859*
>
>


-- 
Regards,

Tharindu

Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Posted by Eric Djatsa <dj...@gmail.com>.
Hi Tharindu, try having a look at Brisk
(http://www.datastax.com/products/brisk); it integrates Hadoop with Cassandra
and ships with Hive for SQL analysis. You can then install Sqoop
(http://www.cloudera.com/downloads/sqoop/) on top of Hadoop to enable data
import/export between Hadoop and MySQL.
Does this sound ok to you?

2011/8/29 Tharindu Mathew <mc...@gmail.com>

> Hi,
>
> I have an already running system where I define a simple data flow (using a
> simple custom data flow language) and configure jobs to run against stored
> data. I use quartz to schedule and run these jobs and the data exists on
> various data stores (mainly Cassandra but some data exists in RDBMS like
> mysql as well).
>
> Thinking about scalability and already existing support for standard data
> flow languages in the form of Pig and HiveQL, I plan to move my system to
> Hadoop.
>
> I've seen some efforts on the integration of Cassandra and Hadoop. I've
> been reading up and still am contemplating on how to make this change.
>
> It would be great to hear the recommended approach of doing this on Hadoop
> with the integration of Cassandra and other RDBMS. For example, a sample
> task that already runs on the system is "once in every hour, get rows from
> column family X, aggregate data in columns A, B and C and write back to
> column family Y, and enter details of last aggregated row into a table in
> mysql"
>
> Thanks in advance.
>
> --
> Regards,
>
> Tharindu
>



-- 
*Eric Djatsa Yota*
*Double degree MsC Student in Computer Science Engineering and Communication
Networks
Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
*Intern at AMADEUS S.A.S Sophia Antipolis*
djatsaedy@gmail.com
*Tel : 0601791859*
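The Sqoop suggestion above amounts to invoking the `sqoop import` tool
against the MySQL instance. The sketch below only builds the command line;
the JDBC URL, table name, target directory, and username are placeholders,
and the flags (standard in Sqoop 1.x) should be checked against the
installed Sqoop version before use.

```python
# Build (but do not run) a Sqoop 1.x import command that pulls a MySQL
# table into HDFS. All concrete values below are placeholders.
def sqoop_import_cmd(jdbc_url, table, target_dir, username):
    # --connect / --table / --target-dir / --username are Sqoop 1.x
    # import-tool arguments.
    return ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir,
            "--username", username]

cmd = sqoop_import_cmd("jdbc:mysql://dbhost/analytics", "node_results",
                       "/user/hadoop/node_results", "hadoop")
# On a configured Hadoop node, subprocess.run(cmd) would launch the transfer.
```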