Posted to user@cassandra.apache.org by "platon.tema" <pl...@yandex.ru> on 2014/09/16 10:58:22 UTC

Direct IO with Spark and Hadoop over Cassandra

Hi.

As I see, massive data processing tools (map/reduce) for C* data include:

connectors
- Calliope http://tuplejump.github.io/calliope/
- DataStax Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
- Stratio Deep https://github.com/Stratio/stratio-deep
- other free/commercial

runtime (job management and infrastructure)
- Spark
- Hadoop

But if I'm not mistaken, all these solutions use the network for data
loading. In the best case, a logic instance (some "job") runs on the same
node where the corresponding token range was found.

Why can't this logic use direct C* IO (reading SSTables from disk)? Any
cons?
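
For reference, the network path I mean looks roughly like this with the
DataStax connector (a minimal sketch; the keyspace "ks" and table "events"
are made-up placeholders):

// Rough sketch: reading a C* table over the network with the
// DataStax Spark Cassandra Connector (1.x API).
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object ConnectorReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("connector-read-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // The connector splits the RDD by token range and prefers executors
    // on the node that owns each range, but rows are still pulled over
    // the CQL native protocol (network), not read from SSTables on disk.
    val rows = sc.cassandraTable("ks", "events")
    println(s"rows read over the network: ${rows.count()}")

    sc.stop()
  }
}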

Some time ago I read an article (still can't find it) about academic
research in which Hadoop was modified to support this direct IO mode.
According to those benchmarks, direct IO gave a significant performance
increase.

Re: Direct IO with Spark and Hadoop over Cassandra

Posted by "platon.tema" <pl...@yandex.ru>.
Yes, updates and deletes are trouble. At the moment, to collect updates
we refresh the result data with a query to C* (Java driver) before
reporting to the user. For deletes, we can skip them during scanning, by
TTL for example (not tested yet).
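
For the update-refresh step, something like this sketch (keyspace, table
and column names are made up; DataStax Java driver 2.x):

// Sketch: re-read each candidate key over CQL so the reported value
// reflects later updates that a raw SSTable scan would have missed.
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object RefreshSketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("ks")

    // ttl(value) is selected too, so a caller could also drop rows that
    // are about to expire (the TTL idea above); not done here.
    val stmt = session.prepare(
      "SELECT key, value, writetime(value), ttl(value) FROM events WHERE key = ?")

    val candidateKeys = Seq("k1", "k2") // keys produced by the scan job
    val fresh = candidateKeys.flatMap { k =>
      session.execute(stmt.bind(k)).all().asScala.map { row =>
        (row.getString("key"), row.getString("value"))
      }
    }
    fresh.foreach(println)

    cluster.close()
  }
}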

On 09/16/2014 04:53 PM, moshe.kranc@barclays.com wrote:
>
> You will also have to read/resolve multiple row instances (if you 
> update records) and tombstones (if you delete records) yourself.
>
> *From:* platon.tema [mailto:platon.tema@yandex.ru]
> *Sent:* Tuesday, September 16, 2014 1:51 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Direct IO with Spark and Hadoop over Cassandra
>
> Thanks.
>
> But 1) can be overcome with the C* API for the commitlog and memtables,
> or with mixed access (direct IO + traditional connectors, or pure CQL if
> the data model allows; we experimented with this).
>
> 2) is more complex for a universal solution. In our case C* is used
> without replication (RF=1) because of the huge data size (replication is
> too expensive).
>
> On 09/16/2014 03:40 PM, DuyHai Doan wrote:
>
>     If you access the C* SSTables directly from those frameworks, you
>     will:
>
>     1) miss live data that is in memory and not yet flushed to disk
>
>     2) skip the Dynamo layer of C*, which is responsible for data
>     consistency
>
>     On 16 Sept 2014 10:58, "platon.tema" <platon.tema@yandex.ru
>     <ma...@yandex.ru>> wrote:
>
>     Hi.
>
>     As I see, massive data processing tools (map/reduce) for C* data
>     include:
>
>     connectors
>     - Calliope http://tuplejump.github.io/calliope/
>     - DataStax Spark Cassandra Connector
>     https://github.com/datastax/spark-cassandra-connector
>     - Stratio Deep https://github.com/Stratio/stratio-deep
>     - other free/commercial
>
>     runtime (job management and infrastructure)
>     - Spark
>     - Hadoop
>
>     But if I'm not mistaken, all these solutions use the network for
>     data loading. In the best case, a logic instance (some "job") runs
>     on the same node where the corresponding token range was found.
>
>     Why can't this logic use direct C* IO (reading SSTables from
>     disk)? Any cons?
>
>     Some time ago I read an article (still can't find it) about
>     academic research in which Hadoop was modified to support this
>     direct IO mode. According to those benchmarks, direct IO gave a
>     significant performance increase.
>


RE: Direct IO with Spark and Hadoop over Cassandra

Posted by mo...@barclays.com.
You will also have to read/resolve multiple row instances (if you update records) and tombstones (if you delete records) yourself.
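
For intuition, the merge looks roughly like this toy sketch (CellVersion
is a made-up type, not a Cassandra class): keep the newest version per
cell, and drop the cell if that newest version is a tombstone.

// Toy last-write-wins reconciliation a direct SSTable reader must redo
// itself when the same cell appears in several SSTables.
case class CellVersion(key: String, column: String, value: Option[String],
                       writeTimestampMicros: Long, isTombstone: Boolean)

object ReconcileSketch {
  def reconcile(cells: Seq[CellVersion]): Map[(String, String), String] =
    cells
      .groupBy(c => (c.key, c.column))
      .flatMap { case (cellKey, versions) =>
        val winner = versions.maxBy(_.writeTimestampMicros) // last write wins
        if (winner.isTombstone) None else winner.value.map(v => cellKey -> v)
      }

  def main(args: Array[String]): Unit = {
    val cells = Seq(
      CellVersion("row1", "status", Some("old"), 1L, isTombstone = false),
      CellVersion("row1", "status", Some("new"), 2L, isTombstone = false), // update
      CellVersion("row2", "status", Some("x"),   1L, isTombstone = false),
      CellVersion("row2", "status", None,        3L, isTombstone = true)   // delete
    )
    println(reconcile(cells)) // Map((row1,status) -> new)
  }
}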

From: platon.tema [mailto:platon.tema@yandex.ru]
Sent: Tuesday, September 16, 2014 1:51 PM
To: user@cassandra.apache.org
Subject: Re: Direct IO with Spark and Hadoop over Cassandra

Thanks.

But 1) can be overcome with the C* API for the commitlog and memtables, or with mixed access (direct IO + traditional connectors, or pure CQL if the data model allows; we experimented with this).

2) is more complex for a universal solution. In our case C* is used without replication (RF=1) because of the huge data size (replication is too expensive).
On 09/16/2014 03:40 PM, DuyHai Doan wrote:

If you access the C* SSTables directly from those frameworks, you will:

1) miss live data that is in memory and not yet flushed to disk

2) skip the Dynamo layer of C*, which is responsible for data consistency
On 16 Sept 2014 10:58, "platon.tema" <pl...@yandex.ru> wrote:
Hi.

As I see, massive data processing tools (map/reduce) for C* data include:

connectors
- Calliope http://tuplejump.github.io/calliope/
- DataStax Spark Cassandra Connector https://github.com/datastax/spark-cassandra-connector
- Stratio Deep https://github.com/Stratio/stratio-deep
- other free/commercial

runtime (job management and infrastructure)
- Spark
- Hadoop

But if I'm not mistaken, all these solutions use the network for data loading. In the best case, a logic instance (some "job") runs on the same node where the corresponding token range was found.

Why can't this logic use direct C* IO (reading SSTables from disk)? Any cons?

Some time ago I read an article (still can't find it) about academic research in which Hadoop was modified to support this direct IO mode. According to those benchmarks, direct IO gave a significant performance increase.



Re: Direct IO with Spark and Hadoop over Cassandra

Posted by "platon.tema" <pl...@yandex.ru>.
Thanks.

But 1) can be overcome with the C* API for the commitlog and memtables, or
with mixed access (direct IO + traditional connectors, or pure CQL if the
data model allows; we experimented with this).

2) is more complex for a universal solution. In our case C* is used without
replication (RF=1) because of the huge data size (replication is too
expensive).

On 09/16/2014 03:40 PM, DuyHai Doan wrote:
>
> If you access the C* SSTables directly from those frameworks, you will:
>
> 1) miss live data that is in memory and not yet flushed to disk
>
> 2) skip the Dynamo layer of C*, which is responsible for data consistency
>
> On 16 Sept 2014 10:58, "platon.tema" <platon.tema@yandex.ru
> <ma...@yandex.ru>> wrote:
>
>     Hi.
>
>     As I see, massive data processing tools (map/reduce) for C* data
>     include:
>
>     connectors
>     - Calliope http://tuplejump.github.io/calliope/
>     - DataStax Spark Cassandra Connector
>     https://github.com/datastax/spark-cassandra-connector
>     - Stratio Deep https://github.com/Stratio/stratio-deep
>     - other free/commercial
>
>     runtime (job management and infrastructure)
>     - Spark
>     - Hadoop
>
>     But if I'm not mistaken, all these solutions use the network for
>     data loading. In the best case, a logic instance (some "job") runs
>     on the same node where the corresponding token range was found.
>
>     Why can't this logic use direct C* IO (reading SSTables from
>     disk)? Any cons?
>
>     Some time ago I read an article (still can't find it) about
>     academic research in which Hadoop was modified to support this
>     direct IO mode. According to those benchmarks, direct IO gave a
>     significant performance increase.
>


Re: Direct IO with Spark and Hadoop over Cassandra

Posted by DuyHai Doan <do...@gmail.com>.
If you access the C* SSTables directly from those frameworks, you will:

1) miss live data that is in memory and not yet flushed to disk

2) skip the Dynamo layer of C*, which is responsible for data consistency
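
For 1), one partial mitigation is to flush memtables to SSTables right
before the scan (writes that arrive after the flush are still missed). A
minimal sketch, assuming nodetool is available locally and using
placeholder keyspace/table names:

// Sketch: force a memtable flush before a direct SSTable scan.
import scala.sys.process._

object FlushBeforeScan {
  def main(args: Array[String]): Unit = {
    val exitCode = Seq("nodetool", "flush", "ks", "events").!
    if (exitCode != 0)
      sys.error(s"nodetool flush failed with exit code $exitCode")
    // ... direct SSTable scan would start here ...
  }
}
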
On 16 Sept 2014 10:58, "platon.tema" <pl...@yandex.ru> wrote:

> Hi.
>
> As I see, massive data processing tools (map/reduce) for C* data include:
>
> connectors
> - Calliope http://tuplejump.github.io/calliope/
> - DataStax Spark Cassandra Connector
> https://github.com/datastax/spark-cassandra-connector
> - Stratio Deep https://github.com/Stratio/stratio-deep
> - other free/commercial
>
> runtime (job management and infrastructure)
> - Spark
> - Hadoop
>
> But if I'm not mistaken, all these solutions use the network for data
> loading. In the best case, a logic instance (some "job") runs on the same
> node where the corresponding token range was found.
>
> Why can't this logic use direct C* IO (reading SSTables from disk)? Any
> cons?
>
> Some time ago I read an article (still can't find it) about academic
> research in which Hadoop was modified to support this direct IO mode.
> According to those benchmarks, direct IO gave a significant performance
> increase.
>