You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@batchee.apache.org by "Scott Kurz (JIRA)" <ji...@apache.org> on 2014/11/10 06:38:34 UTC

[jira] [Commented] (BATCHEE-54) After retry-with-rollback, 1-at-a-time processing continues for the rest of the step. Numerous areas where ChunkStepController's sequence of calls is out-of-synch with spec.

    [ https://issues.apache.org/jira/browse/BATCHEE-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204379#comment-14204379 ] 

Scott Kurz commented on BATCHEE-54:
-----------------------------------

Here is the patch I promised.  I made an effort to port over to BatchEE. 

Sorry for the separate changes all rolled into one.   I'll try to explain them piece-by-piece, and if you want to undo/unravel some I won't take it personally.

As you might guess, at times, I just guessed you inherited a problem from jbatch when I saw the code looked the same without necessarily proving that was the case.

Anyway, here are the issues:

1)  The biggest single issue is the fact that, after retry-with-rollback puts us into "single-item-chunk" processing, we never revert back to normal processing.      

I didn't think the existing approach of passing the state back and forth across read/process/write was worth building on, so I moved the state into ChunkStatus, SingleItemStatus inner classes instead to make this work.   (E.g. we try to keep track of the failing item, for read and process exceptions anyway, and revert to normal chunk processing once we've passed the failing item).

2)  There are several pieces of the spec-defined sequence that, unfortunately, jbatch did not implement correctly according to the spec.  

2a) E.g. Sec. 11.8 declares the order (we weren't following):

f. <->[ChunkListener.afterChunk] // thread A
g. <->[ItemReader.checkpointInfo]

2b)  The writer is supposed to be closed before the reader (https://java.net/bugzilla/show_bug.cgi?id=6456) 

2c) CheckpointAlgorithm.beginCheckpoint was called at the wrong time (https://java.net/bugzilla/show_bug.cgi?id=5873)

2d) Sec. 11.9 declares the order in "Rollback Procedure"

1. <->ItemWriter.close // thread A
2. <->ItemReader.close // thread A
3. [ChunkListener.onError] // thread A
4. [rollback transaction]

We weren't following this order

2e) There appear to be some duplicated calls to the listeners, especially in the retry-without-rollback routines.   The "process" calls look particularly wrong, and there's a duplicate in "read" apparently too.   I admit I didn't write tests here, this is my visual observation.

Anyway I cleaned this up and removed the apparent dups.

3)  The one test change I had to make was to the PartitionMetricsTest.   I think the expected rollback metric count of '2' simply assumed that the RI was doing the correct thing on a single partition.   But I believe there's actually only one rollback on one partition here, so the expected count should really be '1' (and my patch changes the test accordingly).

4) A couple other "side effects" that come along with this change.

a) metric counts are "rolled back" along with retry-with-rollback.    My fix doesn't update the count until the chunk completes, so if it's rolled back we only increment the rollback metric, not the read, write, etc. metrics.

b) zero-item chunk.   For 15 items, with item-count = 5, we have 4 commits, not 3.  This reflects the notion discussed in producing 1.0 Rev A MR that we consider this last chunk a "zero-item chunk" rather than "not a chunk".     Although.. we don't call the writer.   Maybe still takes some getting used to, but there it is for now.

5) One other "random" fix... to PartitionedStepController ... you can ignore this for now.. it's truly unrelated but avoids a hang when restarting a step when all its partitions have already completed on an earlier execution.

6) Why all the changes to the checkpoint algorithm proxy, deletion of CheckpointAlgorithmFactory, etc?    Well.. in refactoring the way that we'd been re-creating CheckpointManager on retry-with-rollback.. I just kept getting confused about how we were calculating the checkpoint as custom or not.    I took the refactor far enough to make sure retry-with-rollback worked.    To me it seemed a little better... it might not look any better to you though.

7) Finally, there may be a small impact to exception handling (the rollback().. not retry-with-rollback, but rollback on non-retryable Throwable... I think I basically kept what you had, but please review.

----

OK, that's a lot to sort through, and the code itself is even more of a chore to review, but would be happy to try to explain it more. 

Scott

> After retry-with-rollback, 1-at-a-time processing continues for the rest of the step.   Numerous areas where ChunkStepController's sequence of calls is out-of-synch with spec.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: BATCHEE-54
>                 URL: https://issues.apache.org/jira/browse/BATCHEE-54
>             Project: BatchEE
>          Issue Type: Bug
>          Components: jbatch-core
>    Affects Versions: 0.3-incubating
>            Reporter: Scott Kurz
>            Priority: Minor
>              Labels: batch
>             Fix For: 0.3-incubating
>
>         Attachments: sk.batchee-54.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)