Posted to user@cassandra.apache.org by Bhuvan Rawal <bh...@gmail.com> on 2016/03/04 09:26:52 UTC

How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Hi,

We would like to create an additional C* data center for batch processing
using Spark on CFS. We would like to limit this DC exclusively to Spark
operations and have the application servers continue fetching data from the
OLTP data centers.

Is there a way to configure this?

Regards,
Bhuvan

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Bhuvan Rawal <bh...@gmail.com>.
Thanks for the correction, Jon. (At most 2,000 queries *per cluster* for
serving 100 searches.)

On Mon, Mar 7, 2016 at 11:47 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> If you're doing 100 searches a second each machine will be serving at most
> 100 requests per second, not 2000.
>
> On Mon, Mar 7, 2016 at 10:13 AM Bhuvan Rawal <bh...@gmail.com> wrote:
>
>> Well thats certainly true, there are these points worth discussing here :
>>
>> 1. Scatter Gather queries - Especially if the cluster size is large. Say
>> we have a 20 node cluster, and we are searching 100 times a second. then
>> effectively coordinator would be hitting each node 2000 times (20*100) That
>> factor will only increase as the number of node goes higher. Im sure having
>> a centralized index alleviates that problem.
>> 2. High Cardinality (For columns like email / phone number)
>> 3. Low Cardinality (Boolean column or any column with limited set of
>> available options).
>>
>> SASI seems to be a good solution for Like queries this doc
>> <https://github.com/apache/cassandra/blob/trunk/doc/SASI.md> looks
>> really promising. But wouldn't it be better to tackle the use cases of
>> search differently than from data storage ones, from a design standpoint?
>>
>> On Sun, Mar 6, 2016 at 9:14 PM, Jack Krupansky <ja...@gmail.com>
>> wrote:
>>
>>> I don't have any direct personal experience with Stratio. It will all
>>> depend on your queries and your data cardinality - some queries are fine
>>> with secondary indexes while other are quite poor. Ditto for Lucene and
>>> Solr.
>>>
>>> It is also worth noting that the new SASI feature of Cassandra supports
>>> keyword and prefix/suffix search. But it doesn't support multi-column ad
>>> hoc queries, which is what people tend to use Lucene and Solr for. So,
>>> again, it all depends on your queries and your data cardinality.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Mar 6, 2016 at 1:29 AM, Bhuvan Rawal <bh...@gmail.com>
>>> wrote:
>>>
>>>> Yes Jack, we are rolling out with Stratio right now, we will assess the
>>>> performance benefit it yields and can go for ElasticSearch/Solr later.
>>>>
>>>> As per your experience how does Stratio perform vis-a-vis Secondary
>>>> Indexes?
>>>>
>>>> On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky <
>>>> jack.krupansky@gmail.com> wrote:
>>>>
>>>>> You haven't been clear about how you intend to add Solr. You can also
>>>>> use Stratio or Stargate for basic Lucene search if you don't want need full
>>>>> Solr support and want to stick to open source rather than go with DSE
>>>>> Search for Solr.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal <bh...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Sean and Nirmallaya.
>>>>>>
>>>>>> @Jack, We are going with DSC right now and plan to use spark and
>>>>>> later solr over the analytics DC. The use case is to have  olap and oltp
>>>>>> workloads separated and not intertwine them, whether it is achieved by
>>>>>> creating a new DC or a new cluster altogether. From Nirmallaya's and Sean's
>>>>>> answer I could understand that its easily achievable by creating a separate
>>>>>> DC, app client will need to be made DC aware and it should not make a
>>>>>> coordinator in dc3. And same goes for spark configuration, it should read
>>>>>> from 3rd DC. Correct me if I'm wrong.
>>>>>>
>>>>>> On Mar 4, 2016 7:55 PM, "Jack Krupansky" <ja...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > DataStax Enterprise (DSE) should be fine for three or even four
>>>>>> data centers in the same cluster. Or are you talking about some custom Solr
>>>>>> implementation?
>>>>>> >
>>>>>> > -- Jack Krupansky
>>>>>> >
>>>>>> > On Fri, Mar 4, 2016 at 9:21 AM, <SE...@homedepot.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Sure. Just add a new DC. Alter your keyspaces with a new
>>>>>> replication factor for that DC. Run repairs on the new DC to get the data
>>>>>> streamed. Then make sure your clients only connect to the DC(s) that they
>>>>>> need.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> Separation of workloads is one of the key powers of a Cassandra
>>>>>> cluster.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> You may want to look at different configurations for the analytics
>>>>>> cluster – smaller replication factor, more memory per node, more disk per
>>>>>> node, perhaps less vnodes. Others may chime in with their experience.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> Sean Durity
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> From: Bhuvan Rawal [mailto:bhu1rawal@gmail.com]
>>>>>> >> Sent: Friday, March 04, 2016 3:27 AM
>>>>>> >> To: user@cassandra.apache.org
>>>>>> >> Subject: How to create an additional cluster in Cassandra
>>>>>> exclusively for Analytics Purpose
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> We would like to create an additional C* data center for batch
>>>>>> processing using spark on CFS. We would like to limit this DC exclusively
>>>>>> for Spark operations and would like to continue the Application Servers to
>>>>>> continue fetching data from OLTP.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> Is there any way to configure the same?
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> ​
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >>
>>>>>> >> Bhuvan
>>>>>> >>
>>>>>> >>
>>>>>> >> ________________________________
>>>>>> >>
>>>>>> >> The information in this Internet Email is confidential and may be
>>>>>> legally privileged. It is intended solely for the addressee. Access to this
>>>>>> Email by anyone else is unauthorized. If you are not the intended
>>>>>> recipient, any disclosure, copying, distribution or any action taken or
>>>>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>>>>> When addressed to our clients any opinions or advice contained in this
>>>>>> Email are subject to the terms and conditions expressed in any applicable
>>>>>> governing The Home Depot terms of business or client engagement letter. The
>>>>>> Home Depot disclaims all responsibility and liability for the accuracy and
>>>>>> content of this attachment and for any damages or losses arising from any
>>>>>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>>>>>> items of a destructive nature, which may be contained in this attachment
>>>>>> and shall not be liable for direct, indirect, consequential or special
>>>>>> damages in connection with this e-mail message or its attachment.
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
If you're doing 100 searches a second, each machine will be serving at
most 100 requests per second, not 2,000.
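
Spelled out with the thread's numbers (assumed figures only, 20 nodes and
100 searches a second):

public class ScatterGatherMath {
    public static void main(String[] args) {
        int nodes = 20;               // assumed cluster size from the thread
        int searchesPerSecond = 100;  // assumed search rate from the thread

        // Each scatter-gather search consults every node once, so the cluster
        // as a whole absorbs nodes * searchesPerSecond node-level requests.
        int clusterWide = nodes * searchesPerSecond; // 2,000/s across the cluster
        int perNode = clusterWide / nodes;           // 100/s on each node

        System.out.println(clusterWide + " requests/s cluster-wide, "
                + perNode + " per node");
    }
}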

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Bhuvan Rawal <bh...@gmail.com>.
Well, that's certainly true. There are a few points worth discussing here:

1. Scatter-gather queries - especially if the cluster size is large. Say we
have a 20-node cluster and we are searching 100 times a second; then
effectively the coordinator would be hitting each node 2,000 times (20*100).
That factor will only increase as the number of nodes grows. I'm sure having
a centralized index alleviates that problem.
2. High cardinality (for columns like email / phone number).
3. Low cardinality (a boolean column, or any column with a limited set of
available options).

SASI seems to be a good solution for LIKE queries; this doc
<https://github.com/apache/cassandra/blob/trunk/doc/SASI.md> looks really
promising. But wouldn't it be better, from a design standpoint, to handle
the search use cases differently from the data storage ones?

On Sun, Mar 6, 2016 at 9:14 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> I don't have any direct personal experience with Stratio. It will all
> depend on your queries and your data cardinality - some queries are fine
> with secondary indexes while other are quite poor. Ditto for Lucene and
> Solr.
>
> It is also worth noting that the new SASI feature of Cassandra supports
> keyword and prefix/suffix search. But it doesn't support multi-column ad
> hoc queries, which is what people tend to use Lucene and Solr for. So,
> again, it all depends on your queries and your data cardinality.
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 1:29 AM, Bhuvan Rawal <bh...@gmail.com> wrote:
>
>> Yes Jack, we are rolling out with Stratio right now, we will assess the
>> performance benefit it yields and can go for ElasticSearch/Solr later.
>>
>> As per your experience how does Stratio perform vis-a-vis Secondary
>> Indexes?
>>
>> On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky <jack.krupansky@gmail.com
>> > wrote:
>>
>>> You haven't been clear about how you intend to add Solr. You can also
>>> use Stratio or Stargate for basic Lucene search if you don't want need full
>>> Solr support and want to stick to open source rather than go with DSE
>>> Search for Solr.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal <bh...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Sean and Nirmallaya.
>>>>
>>>> @Jack, We are going with DSC right now and plan to use spark and later
>>>> solr over the analytics DC. The use case is to have  olap and oltp
>>>> workloads separated and not intertwine them, whether it is achieved by
>>>> creating a new DC or a new cluster altogether. From Nirmallaya's and Sean's
>>>> answer I could understand that its easily achievable by creating a separate
>>>> DC, app client will need to be made DC aware and it should not make a
>>>> coordinator in dc3. And same goes for spark configuration, it should read
>>>> from 3rd DC. Correct me if I'm wrong.
>>>>
>>>> On Mar 4, 2016 7:55 PM, "Jack Krupansky" <ja...@gmail.com>
>>>> wrote:
>>>> >
>>>> > DataStax Enterprise (DSE) should be fine for three or even four data
>>>> centers in the same cluster. Or are you talking about some custom Solr
>>>> implementation?
>>>> >
>>>> > -- Jack Krupansky
>>>> >
>>>> > On Fri, Mar 4, 2016 at 9:21 AM, <SE...@homedepot.com> wrote:
>>>> >>
>>>> >> Sure. Just add a new DC. Alter your keyspaces with a new replication
>>>> factor for that DC. Run repairs on the new DC to get the data streamed.
>>>> Then make sure your clients only connect to the DC(s) that they need.
>>>> >>
>>>> >>
>>>> >>
>>>> >> Separation of workloads is one of the key powers of a Cassandra
>>>> cluster.
>>>> >>
>>>> >>
>>>> >>
>>>> >> You may want to look at different configurations for the analytics
>>>> cluster – smaller replication factor, more memory per node, more disk per
>>>> node, perhaps less vnodes. Others may chime in with their experience.
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Sean Durity
>>>> >>
>>>> >>
>>>> >>
>>>> >> From: Bhuvan Rawal [mailto:bhu1rawal@gmail.com]
>>>> >> Sent: Friday, March 04, 2016 3:27 AM
>>>> >> To: user@cassandra.apache.org
>>>> >> Subject: How to create an additional cluster in Cassandra
>>>> exclusively for Analytics Purpose
>>>> >>
>>>> >>
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >>
>>>> >>
>>>> >> We would like to create an additional C* data center for batch
>>>> processing using spark on CFS. We would like to limit this DC exclusively
>>>> for Spark operations and would like to continue the Application Servers to
>>>> continue fetching data from OLTP.
>>>> >>
>>>> >>
>>>> >>
>>>> >> Is there any way to configure the same?
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> ​
>>>> >>
>>>> >> Regards,
>>>> >>
>>>> >> Bhuvan
>>>> >>
>>>> >>
>>>> >> ________________________________
>>>> >>
>>>> >> The information in this Internet Email is confidential and may be
>>>> legally privileged. It is intended solely for the addressee. Access to this
>>>> Email by anyone else is unauthorized. If you are not the intended
>>>> recipient, any disclosure, copying, distribution or any action taken or
>>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>>> When addressed to our clients any opinions or advice contained in this
>>>> Email are subject to the terms and conditions expressed in any applicable
>>>> governing The Home Depot terms of business or client engagement letter. The
>>>> Home Depot disclaims all responsibility and liability for the accuracy and
>>>> content of this attachment and for any damages or losses arising from any
>>>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>>>> items of a destructive nature, which may be contained in this attachment
>>>> and shall not be liable for direct, indirect, consequential or special
>>>> damages in connection with this e-mail message or its attachment.
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Jack Krupansky <ja...@gmail.com>.
I don't have any direct personal experience with Stratio. It will all
depend on your queries and your data cardinality - some queries are fine
with secondary indexes while others are quite poor. Ditto for Lucene and
Solr.

It is also worth noting that the new SASI feature of Cassandra supports
keyword and prefix/suffix search. But it doesn't support multi-column ad
hoc queries, which is what people tend to use Lucene and Solr for. So,
again, it all depends on your queries and your data cardinality.
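
For reference, a minimal sketch of a SASI prefix index and a LIKE query
(this assumes Cassandra 3.4+ with SASI available; the keyspace, table and
column names are made up, and the CQL is run through the Java driver purely
for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SasiLikeQuerySketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // A SASI index on a single text column; PREFIX mode serves LIKE 'foo%'.
            session.execute("CREATE CUSTOM INDEX IF NOT EXISTS users_last_name_sasi "
                    + "ON users (last_name) "
                    + "USING 'org.apache.cassandra.index.sasi.SASIIndex' "
                    + "WITH OPTIONS = {'mode': 'PREFIX'}");

            // Prefix search answered by the SASI index.
            ResultSet rs = session.execute(
                    "SELECT user_id, last_name FROM users WHERE last_name LIKE 'Kru%'");
            for (Row row : rs) {
                System.out.println(row.getUUID("user_id") + "  " + row.getString("last_name"));
            }
        }
    }
}

As noted above, though, it isn't a substitute for Lucene/Solr when it comes
to ad hoc multi-column queries.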

-- Jack Krupansky

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Bhuvan Rawal <bh...@gmail.com>.
Yes Jack, we are rolling out with Stratio right now; we will assess the
performance benefit it yields and can go for Elasticsearch/Solr later.

In your experience, how does Stratio perform vis-a-vis secondary indexes?

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Jack Krupansky <ja...@gmail.com>.
You haven't been clear about how you intend to add Solr. You can also use
Stratio or Stargate for basic Lucene search if you don't need full Solr
support and want to stick to open source rather than go with DSE Search
for Solr.

-- Jack Krupansky

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Bhuvan Rawal <bh...@gmail.com>.
Thanks, Sean and Nirmallya.

@Jack, we are going with DSC right now and plan to use Spark, and later
Solr, over the analytics DC. The use case is to keep the OLAP and OLTP
workloads separated and not intertwined, whether that is achieved by
creating a new DC or a new cluster altogether. From Nirmallya's and Sean's
answers I understand that this is easily achievable by creating a separate
DC: the app client will need to be made DC-aware so that it never picks a
coordinator in DC3, and the same goes for the Spark configuration, which
should read only from the 3rd DC. Correct me if I'm wrong.
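
For the Spark side, this is roughly what I have in mind, a sketch only (it
assumes the DataStax spark-cassandra-connector; the contact point, DC name,
keyspace and table are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class AnalyticsDcReadSketch {
    public static void main(String[] args) {
        // The master URL is expected to come from spark-submit.
        SparkConf conf = new SparkConf()
                .setAppName("analytics-on-dc3")
                // Contact point(s) inside the analytics DC.
                .set("spark.cassandra.connection.host", "10.0.3.1")
                // Pin the connector to the analytics DC so Spark jobs never
                // touch the OLTP data centers.
                .set("spark.cassandra.connection.local_dc", "DC3")
                // LOCAL_ONE keeps reads on the local DC's replicas.
                .set("spark.cassandra.input.consistency.level", "LOCAL_ONE");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long rows = javaFunctions(sc)
                    .cassandraTable("my_keyspace", "my_table")
                    .count();
            System.out.println("Rows visible from DC3: " + rows);
        }
    }
}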

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Jack Krupansky <ja...@gmail.com>.
DataStax Enterprise (DSE) should be fine for three or even four data
centers in the same cluster. Or are you talking about some custom Solr
implementation?

-- Jack Krupansky

RE: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by SE...@homedepot.com.
Sure. Just add a new DC. Alter your keyspaces with a new replication factor for that DC. Run repairs on the new DC to get the data streamed. Then make sure your clients only connect to the DC(s) that they need.
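
For example, the keyspace change could look something like this, a sketch
via the Java driver (the keyspace name, DC names and replica counts are
only placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class AddAnalyticsDcSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.1.1").build();
             Session session = cluster.connect()) {

            // Add replicas in the new analytics DC (here DC3, with a smaller
            // replication factor than the OLTP DCs, as suggested below).
            session.execute("ALTER KEYSPACE my_keyspace WITH replication = {"
                    + "'class': 'NetworkTopologyStrategy', "
                    + "'DC1': 3, 'DC2': 3, 'DC3': 2}");

            // Existing data still has to be streamed to the new DC, e.g. with
            // 'nodetool rebuild -- DC1' on each new node (or via repairs, as above).
        }
    }
}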

Separation of workloads is one of the key powers of a Cassandra cluster.

You may want to look at different configurations for the analytics
cluster – smaller replication factor, more memory per node, more disk per
node, perhaps fewer vnodes. Others may chime in with their experience.


Sean Durity

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Posted by Nirmallya Mukherjee <ni...@yahoo.com>.
You cannot have a cluster within a cluster in C*. In my opinion you have 2
choices -

1. Have a DC3 and replicate data to it. In your app code, include a
DC-aware RR policy so that the coordinator is chosen from DC1 or DC2 and
not from DC3 (see the sketch below).
2. In case you want to have a different cluster for OLAP and search, then
modify your app DAO to insert into two separate clusters. In this case, for
all practical purposes, these are 2 completely independent C* clusters with
no relationship to each other.

Option 1 is the preferred choice in my view.
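
A sketch of what option 1 looks like on the application side, assuming the
DataStax Java driver 3.x (the contact point, DC name and keyspace are
placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class OltpClientSketch {
    public static void main(String[] args) {
        // Pin the OLTP application to its own data center; by default the
        // DC-aware policy uses no remote hosts, so coordinators are never
        // chosen from DC3.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.1.1")   // a node in DC1
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc("DC1")
                                .build()))
                .build();
             Session session = cluster.connect("my_keyspace")) {

            System.out.println("Connected; coordinators are picked from DC1 only.");
        }
    }
}
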
Thanks,
Nirmallya
