Posted to user@cassandra.apache.org by Benyi Wang <be...@gmail.com> on 2015/01/08 01:32:25 UTC

How to bulkload into a specific data center?

I set up two virtual data centers, one for analytics and one for the REST
service. The analytics data center sits on top of the Hadoop cluster. I want
to bulk load my ETL results into the analytics data center so that the REST
service won't take the heavy load. I'm using CQLTableInputFormat in my
Spark application, and I gave the nodes in the analytics data center as the
initial contact addresses.

However, I found my jobs were connecting to the REST service data center.

How can I specify the data center?

Re: How to bulkload into a specific data center?

Posted by Benyi Wang <be...@gmail.com>.
On Fri, Jan 9, 2015 at 3:55 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang <be...@gmail.com> wrote:
>
>>
>>    - Is it possible to modify SSTableLoader to allow it to access only
>>    one data center?
>>
> Even if you only write to nodes in DC A, if you replicate that data to DC
> B, it will have to travel over the WAN anyway. What are you trying to avoid?
>
>

I'm lucky that those are virtual data centers on the same LAN.

I just don't want a load burst in the "service" virtual data center,
because it may degrade the REST service. I'm trying to load data into the
"analytics" virtual data center, then let Cassandra "slowly" replicate the
data into the "service" virtual data center. It is OK for the REST service
to read somewhat stale data while replication catches up.

I'm wondering if I should just use sstableloader's throttle speed in Mbits
to solve my problem?
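
For what it's worth, sstableloader does expose a throttle flag; assuming
placeholder hosts and paths, something like

    sstableloader --nodes analytics-node1 --throttle 100 \
        /path/to/my_keyspace/my_table

should cap streaming at roughly 100 Mbit/s (check the exact flag names
against your Cassandra version).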

>> Because I may load ~100 million records, I think spark-cassandra-connector
>> might be too slow. I'm wondering if the "Copy the sstables / nodetool
>> refresh" method described in
>> http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a
>> good choice. I'm still a newbie to Cassandra, and I could not understand
>> what the author said on that page.
>>
>
> The author of that post is as wise as he is modest... ;D
>
>
>> One of my questions is:
>>
>> * When I run a Spark job in YARN mode, the sstables are created in the
>> YARN working directory.
>> * Assume I have a way to copy the files into the Cassandra data directory
>> on the same node.
>> * Because the data are distributed across all of the analytics data
>> center's nodes, each one has only part of the sstables: node A has part A,
>> node B has part B. If I run refresh on each node, eventually node A will
>> have parts A and B, and node B will have parts A and B too. Am I right?
>>
>
> I'm not sure I fully understand your question, but...
>
> In order to run refresh without having to immediately run cleanup, you
> need to have SSTables which contain data only for ranges owned by the node
> you are loading them onto.
>
> So for a RF=3, N=3 cluster without vnodes (simple case), data is naturally
> on every node.
>
> For RF=3, N=6 cluster A B C D E F, node C contains :
>
> - Third replica for A.
> - Second replica for B.
> - First replica for C.
>
> In order for you to generate the correct SSTable, you need to understand
> all 3 replicas that should be there. With vnodes and nodes joining and
> parting, this becomes more difficult.
>
> That's why people tend to use SSTableLoader and the streaming interface:
> with SSTableLoader, Cassandra takes input which might live on any replica
> and sends it to the appropriate nodes.
>
> =Rob
> http://twitter.com/rcolidba
>

I'd better stay with SSTableLoader. Thanks for your explanation.

Re: How to bulkload into a specific data center?

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang <be...@gmail.com> wrote:

>
>    - Is it possible to modify SSTableLoader to allow it to access only
>    one data center?
>
> Even if you only write to nodes in DC A, if you replicate that data to DC
> B, it will have to travel over the WAN anyway. What are you trying to avoid?


> Because I may load ~100 million records, I think spark-cassandra-connector
> might be too slow. I'm wondering if the "Copy the sstables / nodetool
> refresh" method described in
> http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a
> good choice. I'm still a newbie to Cassandra, and I could not understand
> what the author said on that page.
>

The author of that post is as wise as he is modest... ;D


> One of my questions is:
>
> * When I run a Spark job in YARN mode, the sstables are created in the
> YARN working directory.
> * Assume I have a way to copy the files into the Cassandra data directory
> on the same node.
> * Because the data are distributed across all of the analytics data
> center's nodes, each one has only part of the sstables: node A has part A,
> node B has part B. If I run refresh on each node, eventually node A will
> have parts A and B, and node B will have parts A and B too. Am I right?
>

I'm not sure I fully understand your question, but...

In order to run refresh without having to immediately run cleanup, you need
to have SSTables which contain data only for ranges owned by the node you
are loading them onto.

So for a RF=3, N=3 cluster without vnodes (simple case), data is naturally
on every node.

For RF=3, N=6 cluster A B C D E F, node C contains :

- Third replica for A.
- Second replica for B.
- First replica for C.

In order for you to generate the correct SSTable, you need to understand
all 3 replicas that should be there. With vnodes and nodes joining and
parting, this becomes more difficult.
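
To make the placement concrete, here is a toy sketch (my own illustration,
not Cassandra's actual code, and it ignores vnodes) of SimpleStrategy-style
placement for that RF=3, N=6 ring, in Scala:

// A replica set is the range's owning node plus the next RF-1 nodes
// clockwise around the ring.
object ReplicaPlacementSketch {
  def replicasFor(ring: Vector[String], owner: Int, rf: Int): Vector[String] =
    Vector.tabulate(rf)(i => ring((owner + i) % ring.size))

  def main(args: Array[String]): Unit = {
    val ring = Vector("A", "B", "C", "D", "E", "F")
    // Print every replica set that includes node C
    for (owner <- ring.indices) {
      val replicas = replicasFor(ring, owner, 3)
      if (replicas.contains("C"))
        println(s"range of ${ring(owner)} -> ${replicas.mkString(", ")}")
    }
  }
}

Running it prints exactly the three replica sets above: C holds the first
replica of its own range, the second replica of B's, and the third replica
of A's.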

That's why people tend to use SSTableLoader and the streaming interface:
with SSTableLoader, Cassandra takes input which might live on any replica
and sends it to the appropriate nodes.

=Rob
http://twitter.com/rcolidba

Re: How to bulkload into a specific data center?

Posted by Benyi Wang <be...@gmail.com>.
Hi Ryan,

Thanks for your reply. Now I understand how SSTableLoader works.

   - If I understand correctly, the current o.a.c.io.sstable.SSTableLoader
   doesn't use LOCAL_ONE or LOCAL_QUORUM. Is that right?
   - Is it possible to modify SSTableLoader to allow it to access only one
   data center?

Because I may load ~100 million records, I think spark-cassandra-connector
might be too slow. I'm wondering if the "Copy the sstables / nodetool
refresh" method described in
http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a
good choice. I'm still a newbie to Cassandra, and I could not understand
what the author said on that page. One of my questions is:

* When I run a Spark job in YARN mode, the sstables are created in the YARN
working directory.
* Assume I have a way to copy the files into the Cassandra data directory on
the same node.
* Because the data are distributed across all of the analytics data center's
nodes, each one has only part of the sstables: node A has part A, node B has
part B. If I run refresh on each node, eventually node A will have parts A
and B, and node B will have parts A and B too. Am I right?
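
(For reference, I assume the refresh step means running something like

    nodetool refresh my_keyspace my_table

on each node after the files have been copied into that table's data
directory; the keyspace and table names here are placeholders.)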

Thanks.

On Thu, Jan 8, 2015 at 6:34 AM, Ryan Svihla <rs...@foundev.pro> wrote:

> Just noticed you'd sent this to the dev list; this is a question for the
> user list only. Please do not send questions of this type to the developer
> list.
>
> On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla <rs...@foundev.pro> wrote:
>
> > The nature of replication factor is such that writes will go wherever
> > there is replication. If you want responses to be faster and don't want
> > the REST data center involved in the Spark job's response path, I suggest
> > using a CQL driver with LOCAL_ONE or LOCAL_QUORUM consistency level (look
> > at the spark cassandra connector here
> > https://github.com/datastax/spark-cassandra-connector ). While write
> > traffic will still be replicated to the REST service data center, because
> > you do want those results available, you will not be waiting on the
> > remote data center to respond "successful".
> >
> > Final point: bulk loading sends a copy per replica across the wire, so
> > let's say you have RF=3 in each data center; that means bulk loading will
> > send out 6 copies from that client at once, whereas normal mutations via
> > Thrift or CQL writes go out between data centers as one copy, and then
> > that node forwards on to the other replicas. This means inter-data-center
> > traffic in this case would be 3x more with the bulk loader than with a
> > traditional CQL or Thrift based client.
> >
> >
> >
> > On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <be...@gmail.com> wrote:
> >
> >> I set up two virtual data centers, one for analytics and one for the
> >> REST service. The analytics data center sits on top of the Hadoop
> >> cluster. I want to bulk load my ETL results into the analytics data
> >> center so that the REST service won't take the heavy load. I'm using
> >> CQLTableInputFormat in my Spark application, and I gave the nodes in the
> >> analytics data center as the initial contact addresses.
> >>
> >> However, I found my jobs were connecting to the REST service data
> >> center.
> >>
> >> How can I specify the data center?
> >>
> >
> >
> >
> > --
> >
> > Thanks,
> > Ryan Svihla
> >
> >
>
>
> --
>
> Thanks,
> Ryan Svihla
>

Re: How to bulkload into a specific data center?

Posted by Ryan Svihla <rs...@foundev.pro>.
Just noticed you'd sent this to the dev list; this is a question for the
user list only. Please do not send questions of this type to the developer
list.

On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla <rs...@foundev.pro> wrote:

> The nature of replication factor is such that writes will go wherever
> there is replication. If you want responses to be faster and don't want
> the REST data center involved in the Spark job's response path, I suggest
> using a CQL driver with LOCAL_ONE or LOCAL_QUORUM consistency level (look
> at the spark cassandra connector here
> https://github.com/datastax/spark-cassandra-connector ). While write
> traffic will still be replicated to the REST service data center, because
> you do want those results available, you will not be waiting on the remote
> data center to respond "successful".
>
> Final point: bulk loading sends a copy per replica across the wire, so
> let's say you have RF=3 in each data center; that means bulk loading will
> send out 6 copies from that client at once, whereas normal mutations via
> Thrift or CQL writes go out between data centers as one copy, and then
> that node forwards on to the other replicas. This means inter-data-center
> traffic in this case would be 3x more with the bulk loader than with a
> traditional CQL or Thrift based client.
>
>
>
> On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <be...@gmail.com> wrote:
>
>> I set up two virtual data centers, one for analytics and one for the REST
>> service. The analytics data center sits on top of the Hadoop cluster. I
>> want to bulk load my ETL results into the analytics data center so that
>> the REST service won't take the heavy load. I'm using CQLTableInputFormat
>> in my Spark application, and I gave the nodes in the analytics data center
>> as the initial contact addresses.
>>
>> However, I found my jobs were connecting to the REST service data center.
>>
>> How can I specify the data center?
>>
>
>
>
> --
>
> Thanks,
> Ryan Svihla
>
>


-- 

Thanks,
Ryan Svihla


Re: How to bulkload into a specific data center?

Posted by Ryan Svihla <rs...@foundev.pro>.
The nature of replication factor is such that writes will go wherever there
is replication. If you want responses to be faster and don't want the REST
data center involved in the Spark job's response path, I suggest using a
CQL driver with LOCAL_ONE or LOCAL_QUORUM consistency level (look at the
spark cassandra connector here
https://github.com/datastax/spark-cassandra-connector ). While write
traffic will still be replicated to the REST service data center, because
you do want those results available, you will not be waiting on the remote
data center to respond "successful".
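
As a rough sketch of what that looks like from Spark (the node addresses,
DC name, keyspace, and table below are placeholders, and the option names
should be checked against your connector version), in Scala:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object BulkWriteLocalDc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("etl-load")
      // Contact points in the analytics DC (placeholder addresses)
      .set("spark.cassandra.connection.host", "analytics-node1,analytics-node2")
      // Keep driver traffic pinned to the analytics DC (assumed DC name)
      .set("spark.cassandra.connection.local_dc", "ANALYTICS")
      // Ack once a quorum in the local DC has written; the REST DC still
      // receives its replicas asynchronously
      .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")

    val sc = new SparkContext(conf)
    // Stand-in for the real ETL output
    val results = sc.parallelize(Seq((1, "a"), (2, "b")))
    results.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))
  }
}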

Final point: bulk loading sends a copy per replica across the wire, so let's
say you have RF=3 in each data center; that means bulk loading will send out
6 copies from that client at once, whereas normal mutations via Thrift or
CQL writes go out between data centers as one copy, and then that node
forwards on to the other replicas. This means inter-data-center traffic in
this case would be roughly 3x more with the bulk loader (3 copies crossing
to the remote data center instead of 1) than with a traditional CQL or
Thrift based client.



On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <be...@gmail.com> wrote:

> I set up two virtual data centers, one for analytics and one for the REST
> service. The analytics data center sits on top of the Hadoop cluster. I
> want to bulk load my ETL results into the analytics data center so that
> the REST service won't take the heavy load. I'm using CQLTableInputFormat
> in my Spark application, and I gave the nodes in the analytics data center
> as the initial contact addresses.
>
> However, I found my jobs were connecting to the REST service data center.
>
> How can I specify the data center?
>



-- 

Thanks,
Ryan Svihla
