Posted to commits@cassandra.apache.org by "Jeremy Hanna (JIRA)" <ji...@apache.org> on 2010/06/26 03:12:52 UTC
[jira] Issue Comment Edited: (CASSANDRA-1042) ColumnFamilyRecordReader returns duplicate rows

    [ https://issues.apache.org/jira/browse/CASSANDRA-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882790#action_12882790 ] 

Jeremy Hanna edited comment on CASSANDRA-1042 at 6/25/10 9:12 PM:
------------------------------------------------------------------

Good point.

From what I could tell in this instance, it would go through the input splits, and on the last input split it would have an incorrect last value.  So it would go back through and take that value to the end of the input list.  I would imagine that is where it had wrapped.  I'm not sure why it had the incorrect last value as the last value in the wrapped input split, though.  If someone is wiser than I in these matters, please chime in.  But it appears that normalizing how the splits are done, so that no single split wraps internally, solves the problem.
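
A minimal sketch of that kind of normalization, not the attached patch itself: it assumes integer tokens on a ring, a range that covers (start, end], and treats start >= end as a wrapping range. The function name unwrap_range and the min_token/max_token parameters are hypothetical and only serve the illustration.

```python
def unwrap_range(start, end, min_token, max_token):
    """Return the non-wrapping pieces that cover the ring range (start, end].

    Assumption for this sketch: start >= end means the range wraps around
    the ring, and start == end means it covers the whole ring.
    """
    if start < end:
        # Already a plain range; nothing to normalize.
        return [(start, end)]

    pieces = []
    if start < max_token:
        # Part that runs from the start token up to the top of the ring.
        pieces.append((start, max_token))
    if min_token < end:
        # Part that runs from the bottom of the ring up to the end token.
        pieces.append((min_token, end))
    return pieces
```

With a toy ring of 0..100, unwrap_range(50, 10, 0, 100) yields the two pieces (50, 100) and (0, 10), so a reader that handles each piece separately never sees a range whose end lies before its start.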

To reproduce easily with a small dataset: without the patch applied, run word_count_setup with only 10 values for text3.  That is usually enough to manifest the problem when running wordcount.

Also, if the wrap can be detected when creating the splits, as with this patch, then it stands to reason that wrapping could also be detected when reading the rows in the ColumnFamilyRecordReader.  That could be another way to resolve it.  But I think it's six of one, half a dozen of the other when it comes to the solution.

Like I said, I'm not certain why that incorrect ordering happens on the wrapped split.

      was (Author: jeromatron):
    Good point.

From what I could tell in this instance, it would go through the input splits and on the last input split, it would have an incorrect last value.  So it would go back through and take that value to the end of the input list.  I would imagine that is where it had wrapped.  I'm not sure why it had the incorrect last value as the last value in the last input split though.  If someone is wiser than I in these matters, please chime in.  But it appears that normalizing how the splits are done so one split does not wrap internally, it solves the problem.

To reproduce easily and with a small dataset: If you don't apply the patch and run the word_count_setup with only 10 values for text3, usually that will be enough to manifest the problem when running wordcount.

Also, I might think that if the wrap could be detected when creating the splits, as with this patch, then it makes sense that wrapping could be detected when reading the rows in the ColumnFamilyRecordReader.  That could be another way to resolve it.  But I think it's sixes when it comes to the solution.

Like I said, I'm not certain why that incorrect ordering happens on the last split.
  
> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>
>                 Key: CASSANDRA-1042
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1042
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>            Assignee: Jeremy Hanna
>             Fix For: 0.6.4
>
>         Attachments: 1042-0_6.txt, Cassandra-1042-0_6-branch.patch.txt, CASSANDRA-1042-trunk.patch.txt, cassandra.tar.gz
>
>
> There's a bug in ColumnFamilyRecordReader that appears when processing a single split (which happens in most tests that have a small number of rows), and potentially in other cases.  When the start and end tokens of the split are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (first token is last token from
> previous, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration  (first token is last token from
> previous, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which
> were duplicates.
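
The example in the issue description can be replayed with a small simulation. This is a sketch under stated assumptions, not Cassandra code: the local get_range_slices function below is a hypothetical stand-in for the real call, modeled as returning every token in (start, end] sorted in ascending token order, with start >= end treated as a wrapping range. The read_split loop and its max_batches guard are likewise illustrative.

```python
# Tokens of the 5 rows and the split token, copied from the issue description.
TOKENS = sorted([
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
])
SPLIT_TOKEN = 53193025635115934196771903670925341736


def get_range_slices(start, end, tokens=TOKENS):
    """Hypothetical stand-in for the real call (an assumption of this sketch)."""
    if start < end:
        return [t for t in tokens if start < t <= end]
    # Wrapping range: matches are returned in ascending token order,
    # not in ring order beginning at the start token.
    return [t for t in tokens if t > start or t <= end]


def read_split(start, end, max_batches=10):
    """Page through a split the way the issue describes: the first token of
    each iteration is the last token from the previous one, and the end
    token is unchanged."""
    rows = []
    for _ in range(max_batches):  # guard so the sketch always terminates
        batch = get_range_slices(start, end)
        if not batch:
            break
        rows.extend(batch)
        start = batch[-1]
    return rows


rows = read_split(SPLIT_TOKEN, SPLIT_TOKEN)
print(len(rows), len(rows) - len(set(rows)))
```

Under these assumptions the loop sees batches of 5, 2 and 0 rows, matching the 7 rows and 2 duplicates reported above.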
