
Posted to dev@cassandra.apache.org by fr...@orange.com on 2014/04/23 15:15:50 UTC

Request: easy access to local data features

Hi,


We are using Cassandra at Orange to manage a big sparse matrix on a cluster of servers.

On this database we want to run a sparse matrix factorization algorithm.



We need to parallelize this matrix factorization algorithm, for instance by computing the factorization model row by row.

So we want to distribute the computation of the rows across the servers.

A natural way to do this would be to apply the algorithm on each server, using the local rows that are stored by this server.

As the factorization model is also distributed, there is no need to merge the results (no need for any kind of "reduce phase").

So there is no need for Hadoop.

Cassandra, plus the distributed algorithm running on each server, could be sufficient.



The problem is that accessing local data is currently not easy with the Cassandra API:

- There is a token() function that allows iterating over local rows.

- But this token() function works well only with the one-token-per-server partitioning scheme of Cassandra; with the 256-virtual-token partitioning scheme, it becomes very difficult to access local rows efficiently.

- Unfortunately, it seems that the one-token-per-server partitioning scheme is not recommended and may become deprecated, since the latter scheme is more efficient for cluster management.
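
To make the points above concrete, here is a minimal sketch of what "iterating over local rows with token()" involves. It is an illustration under stated assumptions, not an existing Cassandra API: the function names primary_ranges and range_queries, the ring layout, the node names and the table ks.matrix with key column row_id are all hypothetical. It only covers the ranges a node owns as primary, not the ranges it holds as a replica.

```python
def primary_ranges(ring, node):
    """Return the (start, end] token ranges whose end token is owned by `node`.

    `ring` is a hypothetical dict mapping each token to its owning node.
    """
    tokens = sorted(ring)
    ranges = []
    for i, end in enumerate(tokens):
        if ring[end] == node:
            # For i == 0 the index -1 wraps around to the last token of the ring.
            start = tokens[i - 1]
            ranges.append((start, end))
    return ranges


def range_queries(table, key, start, end):
    """Build the CQL statements selecting rows whose token lies in (start, end]."""
    select = f"SELECT * FROM {table} WHERE "
    if start < end:
        return [f"{select}token({key}) > {start} AND token({key}) <= {end}"]
    # Wrapping range: one query above `start`, one up to and including `end`.
    return [
        f"{select}token({key}) > {start}",
        f"{select}token({key}) <= {end}",
    ]


# Hypothetical four-token ring in which node "a" owns two tokens.
ring = {-100: "a", 0: "b", 100: "a", 200: "c"}
for start, end in primary_ranges(ring, "a"):
    for query in range_queries("ks.matrix", "row_id", start, end):
        print(query)
```

With one token per server this yields one range per node; with 256 virtual tokens the same loop yields 256 ranges per node, which is where the extra effort comes from.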



We believe that easy access to local data could be a key feature for Cassandra, enabling implicit parallelization strategies for many classes of algorithms and classical processes.

To provide this key feature, all that is needed is an easy, transparent and sustainable function to access local data (local tables). This function would simply have to remain compatible with future partitioning schemes.



Do you think this request could be a priority for Cassandra?

If so, when and how do you plan to provide this feature, so that we can adapt our developments?



Many thanks for considering my request,

Best Regards,



Frank Meyer.

Research Engineer

Orange Labs - Lannion


Frank Meyer
France Telecom OLPS/UCE/CRM-DA/PROF (LD128)
2 avenue Pierre Marzin 22307 Lannion Cedex
E-mail : franck.meyer@orange.com<ma...@orange-ftgroup.com>
Telephone : +33 (0)2 96 05 28 89
http://www.francetelecom.com/rd


Re: Request: easy access to local data features

Posted by Cyril Scetbon <cy...@free.fr>.

Hi Frank,

You could also use Hadoop with no reducer, or with IdentityReducer. This ensures data locality as long as you start a task tracker on the Cassandra nodes where the data resides.

Concerning the difficulty of getting tokens in a vnodes environment, that is what the Hadoop core functions do. You can have a look at https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/hadoop/AbstractColumnFamilyInputFormat.java#L114 which shows how splits are enumerated. And since you get the endpoint addresses of each split, you can choose where to launch your code.
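
Once the splits and their endpoint addresses are known, choosing where to launch the code could look roughly like the sketch below. This is not the logic of AbstractColumnFamilyInputFormat itself, only an illustration of the idea: the function name assign_splits, the (start_token, end_token, endpoints) tuple layout and the host names are hypothetical.

```python
def assign_splits(splits):
    """Assign each split to one of its replica endpoints, balancing the load.

    `splits` is a list of (start_token, end_token, endpoints) tuples, where
    `endpoints` lists the hosts that store the rows of that split.
    Returns a dict mapping each chosen host to its list of (start, end) ranges.
    """
    plan = {}
    for start, end, endpoints in splits:
        # Pick the replica with the fewest splits so far (first one wins ties).
        target = min(endpoints, key=lambda host: len(plan.get(host, ())))
        plan.setdefault(target, []).append((start, end))
    return plan


# Hypothetical splits: two are replicated on both hosts, one only on host "b".
splits = [
    (0, 10, ["a", "b"]),
    (10, 20, ["a", "b"]),
    (20, 30, ["b"]),
]
for host, ranges in assign_splits(splits).items():
    print(host, ranges)
```

Each worker then only reads token ranges stored on its own host, so the reads stay local without a reduce phase.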

Regards.
-- 
Cyril SCETBON
(Orange Sophia Antipolis)
