Posted to user@cassandra.apache.org by Maciej Miklas <ma...@googlemail.com> on 2011/11/17 22:08:24 UTC

Data Model Design for Login Service

Hello all,

I need your help designing the structure for a simple login service. It
contains about 100.000.000 customers, and each one can have about 10
different logins - this results in 1.000.000.000 different logins.

Each customer record contains the following data:
- one or more login names as strings, each at most 20 UTF-8 characters long
- ID as long - one customer has exactly one ID
- gender
- birth date
- name
- password as an MD5 hash
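
The record above can be sketched as a plain data structure; this is a minimal
illustration in Python (the class and field names are assumptions for the
sketch, not an actual schema):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Customer:
    # One customer ID (a long); one or more login names, each <= 20 UTF-8 chars.
    id: int
    logins: list
    gender: str
    birthdate: str        # e.g. "1987.11.09"
    name: str
    pwd: str              # MD5 hex digest of the password

def md5_hex(password: str) -> str:
    # Hash a password the way the 'pwd' field expects: MD5 hex digest.
    return hashlib.md5(password.encode("utf-8")).hexdigest()

alfred = Customer(
    id=1122,
    logins=["alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de"],
    gender="MALE",
    birthdate="1987.11.09",
    name="Alfred Tester",
    pwd=md5_hex("secret"),
)
```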

The login process needs to find a user by login name.
Data in Cassandra is duplicated (denormalized) - this is necessary to obtain
all required login data in a single call. We also expect low write traffic
and heavy read traffic, so round trips for reading data should be avoided.
Below I've described two possible Cassandra data models based on an example:
we have two users, where the first user has three logins and the second user
has two logins.

A) Skinny rows
 - the row key is the login name - this is the main search criterion
 - login data is duplicated - each possible login is stored as a single row
which contains all user data - 10 logins for a single customer create 10
rows, where each row has a different key and the same content

    // the first 3 rows have different keys and the same duplicated data
        alfred.tester@xyz.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa
        },
        alfred@aad.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa
        },
        alf@dd.de {
          id: 1122
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa
        },

    // the following two rows again share the same data for the second customer
        manfred@xyz.de {
          id: 1133
          gender: MALE
          birthdate: 1997.02.01
          name: Manfredus Maximus
          pwd: e44c504ff16c8fcd2fe8c74bb492adda
        },
        roberrto@xyz.de {
          id: 1133
          gender: MALE
          birthdate: 1997.02.01
          name: Manfredus Maximus
          pwd: e44c504ff16c8fcd2fe8c74bb492adda
        }
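
Modeled as a key-value lookup, option A is one map keyed by login name, where
every alias stores a full copy of the user data. A rough sketch in plain
Python (the dict stands in for the Cassandra column family; this is not
driver code):

```python
# Option A sketch: one row per login name, user data duplicated per alias.
user_data = {
    "id": 1122,
    "gender": "MALE",
    "birthdate": "1987.11.09",
    "name": "Alfred Tester",
    "pwd": "e72c504dc16c8fcd2fe8c74bb492affa",
}

# Keys are row keys (login names); each alias holds its own full copy.
logins_cf = {
    login: dict(user_data)
    for login in ("alfred.tester@xyz.de", "alfred@aad.de", "alf@dd.de")
}

def authenticate(login, pwd_md5):
    # Single lookup by row key: one read, no extra round trips.
    row = logins_cf.get(login)
    return row is not None and row["pwd"] == pwd_md5

print(authenticate("alf@dd.de", "e72c504dc16c8fcd2fe8c74bb492affa"))  # True
```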

B) Rows grouped by alphabetical prefix
- The number of rows is limited - for example, to the first letter of the
login name
- Each row contains all logins which begin with the row key - the row with
key 'a' contains all logins which begin with 'a'
- Data might be unbalanced, but we avoid skinny rows - this might have a
positive performance impact (??)
- To avoid super columns, each row contains columns directly, where the
column name is the user login and the column value is the corresponding data
in some serialized form (I would like it to be human readable)

    a {
        alfred.tester@xyz.de: "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",

        alfred@aad.de: "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",

        alf@dd.de: "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
      },

    m {
        manfred@xyz.de: "1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
      },

    r {
        roberrto@xyz.de: "1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
      }
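
Option B can be sketched the same way: buckets keyed by the first letter,
with the ';'-separated serialized form from the example parsed on read (plain
Python standing in for the column family; field order is an assumption taken
from the example):

```python
# Option B sketch: rows bucketed by the first letter of the login name;
# column values use the ';'-separated serialized form from the example.
FIELDS = ("id", "gender", "birthdate", "name", "pwd")

buckets = {
    "a": {
        "alfred.tester@xyz.de": "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
        "alfred@aad.de": "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
        "alf@dd.de": "1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
    },
    "m": {
        "manfred@xyz.de": "1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda",
    },
}

def lookup(login):
    # Read the bucket row (key = first letter), then the login column.
    row = buckets.get(login[0])
    if row is None or login not in row:
        return None
    return dict(zip(FIELDS, row[login].split(";")))

print(lookup("manfred@xyz.de")["name"])  # Manfredus Maximus
```

Note that the whole bucket row still has to be addressed for every read, and
the value must be parsed client-side.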

Which solution is better, especially for read performance? Or do you have a
better idea?

Thanks,
Maciej

Re: Data Model Design for Login Service

Posted by Maciej Miklas <ma...@googlemail.com>.
I will follow exactly this solution - thanks :)

On Fri, Nov 18, 2011 at 9:53 PM, David Jeske <da...@gmail.com> wrote:

> [...]

Re: Data Model Design for Login Service

Posted by David Jeske <da...@gmail.com>.
On Thu, Nov 17, 2011 at 1:08 PM, Maciej Miklas <ma...@googlemail.com> wrote:

> A) Skinny rows
>  - row key contains login name - this is the main search criteria
>  - login data is replicated - each possible login is stored as single row
> which contains all user data - 10 logins for single customer create 10
> rows, where each row has different key and the same content
>

To me this seems reasonable. Remember, because of your duplication of the
data values, you will want a quick way to find all the logins for a given ID,
so you will also want to store a separate dataset like:

1122 {
     alfred.tester@xyz.de =1    (where the login is a column key)
     alfred@aad.de =1
}

When you do an update, you'll need to fetch the entire row for the user-id
and then update all copies of the data. This can create problems if the data
is out of sync (which it will be at certain times because of eventual
consistency, and might be if something bad happens).
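
The update flow described above can be sketched as: read the login set for
the ID from the separate dataset, then rewrite every per-login copy
(illustrative plain Python, not driver code; the names are assumptions):

```python
# Sketch of the update path: a reverse index (id -> set of logins) tells us
# which duplicated rows must all be rewritten when user data changes.
logins_by_id = {1122: {"alfred.tester@xyz.de", "alfred@aad.de"}}

# Per-login copies of the user data (option A's duplicated rows).
logins_cf = {
    login: {"id": 1122, "name": "Alfred Tester"}
    for login in logins_by_id[1122]
}

def update_user(user_id, changes):
    # Fan the update out to every copy. Without a transaction the copies can
    # be briefly out of sync, which is the risk noted above.
    for login in logins_by_id.get(user_id, ()):
        logins_cf[login].update(changes)

update_user(1122, {"name": "Alfred T. Tester"})
print(logins_cf["alfred@aad.de"]["name"])  # Alfred T. Tester
```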

...the other option, of course, is to add a login-name indirection. You would
have only one copy of the user data, stored by ID, and then you would store a
separate mapping from login name to ID. Of course, this would require two
round trips to get the user information from a login name, which is
something I know you said you didn't want to do.
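
For comparison, the indirection variant looks like this (a minimal sketch in
plain Python; two dict lookups stand in for the two round trips):

```python
# Indirection sketch: one canonical user row per ID, plus a login -> ID map.
# Each read costs two lookups (two round trips against a real cluster).
login_to_id = {"alfred.tester@xyz.de": 1122, "alfred@aad.de": 1122}
users_by_id = {1122: {"name": "Alfred Tester", "gender": "MALE"}}

def find_user(login):
    user_id = login_to_id.get(login)            # round trip 1: resolve ID
    if user_id is None:
        return None
    return users_by_id.get(user_id)             # round trip 2: fetch user

print(find_user("alfred@aad.de")["name"])  # Alfred Tester
```

Updates now touch a single canonical row, so the consistency problem above
goes away at the cost of the extra read.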

Re: Data Model Design for Login Service

Posted by Maxim Potekhin <po...@bnl.gov>.
1122: {
           gender: MALE
           birthdate: 1987.11.09
           name: Alfred Tester
           pwd: e72c504dc16c8fcd2fe8c74bb492affa
           alias1: alfred.tester@xyz.de
           alias2: alfred@aad.de
           alias3: alf@dd.de
          }

...and you can use secondary indexes to query on anything.

Maxim


On 11/17/2011 4:08 PM, Maciej Miklas wrote:
> [...]


RE: Data Model Design for Login Service

Posted by Dan Hendry <da...@gmail.com>.
Your first approach, skinny rows, will almost certainly be the better solution, although it never hurts to experiment for yourself. Even on low-end hardware (for the sake of argument, EC2 m1.smalls), a few million rows is basically nothing (again though, I encourage you to verify for yourself). For read-heavy workloads, skinny rows allow more effective use of the key cache and possibly the row cache. I advise caution when using the row cache, however: I have never found it useful (in 0.7 and 0.8 at least), as it introduces too much memory pressure for generally random read workloads. Benchmark against your specific case.

 

Dan

 

From: Maciej Miklas [mailto:mac.miklas@googlemail.com] 
Sent: November-17-11 16:08
To: user@cassandra.apache.org
Subject: Data Model Design for Login Service

 

[...]



Re: Data Model Design for Login Service

Posted by Mohit Anchlia <mo...@gmail.com>.
Secondary indexes in Cassandra are not a good fit for high-cardinality values.
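
A back-of-the-envelope sketch of the cardinality point (plain Python
dictionaries standing in for an index, not Cassandra code): over a nearly
unique column, an index has one entry per row and is as large as the data it
indexes, while a low-cardinality column yields a few entries with many rows
each.

```python
# Compare index shapes for a high-cardinality column (login) versus a
# low-cardinality column (gender) over 1000 synthetic users.
users = {uid: {"login": f"user{uid}@xyz.de",
               "gender": "MALE" if uid % 2 else "FEMALE"}
         for uid in range(1000)}

def build_index(column):
    # Map each distinct column value to the set of matching row keys.
    index = {}
    for uid, row in users.items():
        index.setdefault(row[column], set()).add(uid)
    return index

login_idx = build_index("login")    # 1000 entries, ~1 row each: no win
gender_idx = build_index("gender")  # 2 entries, 500 rows each: useful
print(len(login_idx), len(gender_idx))  # 1000 2
```

With one matching row per indexed value, the index buys nothing over keying
the data by login directly, which is what option A does.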

On Fri, Nov 18, 2011 at 7:14 AM, Dan Hendry <da...@gmail.com> wrote:
> I believe they are not limited to repeating values, but the Datastax
> docs[1] on secondary indexes certainly seem to indicate they would be a
> poor fit for this case (high read load, many unique values).
>
>
>
> [1] http://www.datastax.com/docs/1.0/ddl/indexes
>
>
>
> Dan
>
>
>
> From: Maciej Miklas [mailto:mac.miklas@googlemail.com]
> Sent: November-18-11 1:39
> To: user@cassandra.apache.org
> Subject: Re: Data Model Design for Login Service
>
>
>
> But a secondary index is limited to repeating values like enums. In my
> case I would have a performance issue, right?
>
> On 18.11.2011, at 02:08, Maxim Potekhin <po...@bnl.gov> wrote:
> [...]

RE: Data Model Design for Login Service

Posted by Dan Hendry <da...@gmail.com>.
I believe they are not limited to repeating values, but the Datastax docs[1] on secondary indexes certainly seem to indicate they would be a poor fit for this case (high read load, many unique values).

 

[1] http://www.datastax.com/docs/1.0/ddl/indexes

 

Dan

 

From: Maciej Miklas [mailto:mac.miklas@googlemail.com] 
Sent: November-18-11 1:39
To: user@cassandra.apache.org
Subject: Re: Data Model Design for Login Service

 

But a secondary index is limited to repeating values like enums. In my case I would have a performance issue, right?


On 18.11.2011, at 02:08, Maxim Potekhin <po...@bnl.gov> wrote:

1122: {
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa
          alias1: alfred.tester@xyz.de
          alias2: alfred@aad.de
          alias3: alf@dd.de
         }

...and you can use secondary indexes to query on anything.

Maxim


On 11/17/2011 4:08 PM, Maciej Miklas wrote: 

[...]


Re: Data Model Design for Login Service

Posted by Maciej Miklas <ma...@googlemail.com>.
But a secondary index is limited to repeating values like enums. In my case I
would have a performance issue, right?

On 18.11.2011, at 02:08, Maxim Potekhin <po...@bnl.gov> wrote:

 1122: {
          gender: MALE
          birthdate: 1987.11.09
          name: Alfred Tester
          pwd: e72c504dc16c8fcd2fe8c74bb492affa
          alias1: alfred.tester@xyz.de
          alias2: alfred@aad.de
          alias3: alf@dd.de
         }

...and you can use secondary indexes to query on anything.

Maxim


On 11/17/2011 4:08 PM, Maciej Miklas wrote:

[...]