Posted to commits@cassandra.apache.org by "Dexter Fryar (JIRA)" <ji...@apache.org> on 2011/04/07 20:50:05 UTC

[jira] [Created] (CASSANDRA-2436) Secondary Index Updates Invalidate Data Set

Secondary Index Updates Invalidate Data Set
-------------------------------------------

                 Key: CASSANDRA-2436
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2436
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.7.4
         Environment: RedHat Linux 5.5 - OS is not important here.
            Reporter: Dexter Fryar
            Priority: Blocker


Creating a column family with an index, a validator, and a default validator, then later renaming or dropping the index, results in read errors and an unreadable data set.

Updating the CF to restore the old index does not resolve the problem. You can insert/write all you want, but reads will fail if you come across a row that included one of these cases. The only workaround that I've been able to use is to know exactly what the columns/changes were prior to the CF change and iterate through all the rows, inserting the same column name with a NULL value. One problem here is that you __must__ know what the row keys are, because you can't do a read to get them.


1) create a secondary index on a column with a validator and a default validator
2) insert a row
3) read and verify the row
4) update the CF/index/name/validator
5) read the CF and get an error (CLI or Pycassa)


CLI Commands to create the row and CF/Index

create column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: colour, validation_class: LongType, index_type: KEYS}];

set cf_testing['key']['colour']='1234';
list cf_testing;

update column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: color, validation_class: LongType, index_type: KEYS}];


ERROR from the CLI:

list cf_testing;
Using default limit of 100
-------------------
RowKey: key
invalid UTF8 bytes 00000000000004d2
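
For illustration (this sketch is not part of the original report): the hex string in the error appears to be the value 1234 as written under the LongType validator. The snippet below assumes LongType stores a signed 64-bit big-endian integer, decodes the bytes on that assumption, and then checks whether the same bytes are well-formed UTF-8.

```python
import binascii
import struct

# Hex string copied from the CLI error message above.
raw = binascii.unhexlify("00000000000004d2")

# Assumption: LongType values are stored as signed 64-bit big-endian integers.
(value,) = struct.unpack(">q", raw)
print(value)

# The same bytes are not well-formed UTF-8: 0xd2 starts a two-byte
# sequence, but no continuation byte follows it.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)
```

Under that assumption, the column still holds the long that was inserted; only the type used to interpret it on read has changed.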



Here is the Pycassa client code that shows this error too.

badindex.py

#!/usr/local/bin/python2.7
# Reads the single row 'key' from cf_testing to reproduce the read error.

import sys

import pycassa

def main():
  try:
    keyspace = "badindex"
    serverPoolList = ['localhost:9160']
    pool = pycassa.connect(keyspace, serverPoolList)
  except Exception:
    print "couldn't get a connection"
    sys.exit(1)

  cfname = "cf_testing"
  cf = pycassa.ColumnFamily(pool, cfname)
  # Range read limited to the one row inserted from the CLI.
  results = cf.get_range(start='key', finish='key', row_count=1)
  for key, columns in results:
    print key, '=>', columns

if __name__ == "__main__":
  main()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (CASSANDRA-2436) Secondary Index Updates Invalidate Data Set

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-2436.
---------------------------------------

    Resolution: Not A Problem

default_validation_class means "all data that isn't explicitly in column_metadata conforms to this data type."  So you've violated that.  You have two options:

- set d_v_c to BytesType (the default)
- leave the column definition alone, but only drop the index part (maybe this is what you were trying to do, but you changed from "colour" to "color")

More generally, note that best practice is to only use d_v_c in CFs with dynamic column names.  I.e., if you know what the columns are going to be in the CF ahead of time as you do here, you shouldn't use d_v_c.
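
To make the rule above concrete, here is a rough Python model of the lookup it describes: a column named in column_metadata uses its own validator, and every other column falls back to default_validation_class. The function names (validate_long, validate_utf8, validate_bytes, read_column) are hypothetical stand-ins for illustration, not Cassandra's actual implementation, and LongType is assumed to be an 8-byte big-endian integer.

```python
import struct

# Hypothetical stand-ins for the LongType, UTF8Type and BytesType validators.
def validate_long(raw):
    if len(raw) != 8:
        raise ValueError("expected 8 bytes")
    return struct.unpack(">q", raw)[0]

def validate_utf8(raw):
    return raw.decode("utf-8")

def validate_bytes(raw):
    return raw

def read_column(name, raw, column_metadata, default_validator):
    """Use the column's own validator if it is listed, else the default."""
    validator = column_metadata.get(name, default_validator)
    return validator(raw)

# Bytes assumed to be stored by: set cf_testing['key']['colour']='1234';
stored = struct.pack(">q", 1234)

# Original definition: 'colour' is listed, so it is read as a long.
before = read_column("colour", stored, {"colour": validate_long}, validate_utf8)

# After the update only 'color' is listed, so 'colour' falls back to UTF8Type.
try:
    read_column("colour", stored, {"color": validate_long}, validate_utf8)
    failed = False
except UnicodeDecodeError:
    failed = True

# First option: with a BytesType default, the old column reads back as raw bytes.
as_bytes = read_column("colour", stored, {"color": validate_long}, validate_bytes)
```

In this model the stored bytes never change; the read fails only because the renamed metadata entry no longer covers the existing column.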

> Secondary Index Updates Invalidate Data Set
> -------------------------------------------
>
>                 Key: CASSANDRA-2436
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2436
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.4
>         Environment: RedHat Linux 5.5 - OS is not important here.
>            Reporter: Dexter Fryar
>            Priority: Blocker
>              Labels: index, indexed, indexing, read, write
