Posted to dev@flume.apache.org by "Jonathan Hsieh (JIRA)" <ji...@apache.org> on 2011/08/17 23:57:27 UTC

[jira] [Created] (FLUME-745) Race condition in RollSink.

Race condition in RollSink.
---------------------------

                 Key: FLUME-745
                 URL: https://issues.apache.org/jira/browse/FLUME-745
             Project: Flume
          Issue Type: Bug
    Affects Versions: v0.9.5
            Reporter: Jonathan Hsieh


There is a race condition present when rotating sinks in a roller.  It is fairly rare but can cause an agent or collector to hang.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated FLUME-745:
---------------------------------

    Description: There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.    (was: There is a race condition present when rotating sinks in a roller.  It is fairly rare but can cause an agent or collector to hang.  )
        Summary: Fix Race condition in NaiveFileWALDeco and retransmit logic  (was: Race condition in RollSink.)

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Updated] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated FLUME-745:
---------------------------------

    Status: Patch Available  (was: Open)

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Commented] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088010#comment-13088010 ] 

jiraposter@reviews.apache.org commented on FLUME-745:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1599/
-----------------------------------------------------------

Review request for Flume, Arvind Prabhakar and Eric Sammer.


Summary
-------

commit 80ffaeebead83de9d7b0af55f38bd2dfe62ad931
Author: Jonathan Hsieh <jm...@apache.org>
Date:   Thu Aug 18 14:05:41 2011 -0700

    FLUME-745: Race condition in NaiveFileWALDeco and retransmit logic
    
    - Setup test to run for a long time exacerbating potential race every 10ms.
    - Made test runnable from command line for arbitrary iterations
    - Eliminated possible memory leak by remove WALdata entry after completing e2eacked
    - NaiveFileWALDeco to use object lock


This addresses bug flume-745.
    https://issues.apache.org/jira/browse/flume-745


Diffs
-----

  flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b 
  flume-core/src/test/java/com/cloudera/flume/agent/durability/TestFlumeNodeWALNotifierRacy.java PRE-CREATION 

Diff: https://reviews.apache.org/r/1599/diff


Testing
-------

All tests pass except for known flakies.  Ran to 500000 iterations (over 30 minutes of retry attempts every 10ms) and passed.


Thanks,

jmhsieh



> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Commented] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100105#comment-13100105 ] 

Jonathan Hsieh commented on FLUME-745:
--------------------------------------

The unit test that beats up on the synchronization and potential race can be run manually with the following command (it executes until 100k messages and rotations have been handled):

'flume class com.cloudera.flume.agent.durability.TestFlumeNodeWALNotifierRacy 100000'

The test will attempt to inject retry attempts every 10ms.  
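
For readers without the patch at hand, the general shape of such a test can be sketched as follows. This is a hypothetical harness, not the contents of TestFlumeNodeWALNotifierRacy: the class RacyHarnessSketch, its methods, and the counters are illustrative assumptions. It only shows the pattern described above: a main loop handling a command-line-supplied number of iterations while a second thread injects a "retry" every 10ms against the same lock-protected state.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical harness; NOT the actual TestFlumeNodeWALNotifierRacy code.
final class RacyHarnessSketch {
  private final Object lock = new Object();
  private long handled = 0; // guarded by lock
  private long retries = 0; // guarded by lock

  // Stand-in for the normal append/rotate path.
  void handleOne() {
    synchronized (lock) {
      handled++;
    }
  }

  // Stand-in for a retransmit triggered by a retry timeout.
  void retry() {
    synchronized (lock) {
      retries++;
    }
  }

  long handled() {
    synchronized (lock) {
      return handled;
    }
  }

  // Handles 'iterations' units of work while a timer thread calls retry()
  // every 'periodMs' milliseconds; returns the number of units handled.
  static long run(long iterations, long periodMs) {
    RacyHarnessSketch target = new RacyHarnessSketch();
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    timer.scheduleAtFixedRate(target::retry, periodMs, periodMs, TimeUnit.MILLISECONDS);
    try {
      for (long i = 0; i < iterations; i++) {
        target.handleOne();
      }
    } finally {
      timer.shutdownNow(); // stop injecting retries
    }
    return target.handled();
  }

  public static void main(String[] args) {
    // Iteration count comes from the command line, defaulting to 100000.
    long iterations = args.length > 0 ? Long.parseLong(args[0]) : 100000L;
    long done = run(iterations, 10);
    System.out.println("handled " + done + " of " + iterations);
  }
}
```

A real stress test would replace handleOne() and retry() with calls into the component under test; the value of the pattern is that a long run gives the timer thread many chances to land in the middle of a state transition.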

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>              Labels: wal
>             Fix For: v0.9.5
>
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Commented] (FLUME-745) Race condition in RollSink.

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086625#comment-13086625 ] 

Jonathan Hsieh commented on FLUME-745:
--------------------------------------

  @Test
  public void testRotateRaciness() throws IOException, InterruptedException {

    CounterSink cnt = new CounterSink("count");
    int millis = 10; // rotate the agent wal entries every 10 millis.
    int events = 10000;
    int iters = 10;
    int expected = events * iters;
    MemorySinkSource mem = MemorySinkSource.cannedData("blah", events);

    NaiveFileWALDeco wal = (NaiveFileWALDeco) NaiveFileWALDeco
        .builderEndToEndDir().create(LogicalNodeContext.testingContext(),
            millis);
    wal.setSink(cnt);
    try {
      wal.open();
      for (int i = 0; i < iters; i++) {
        EventUtil.dumpAll(mem, wal);
        mem.open(); // goes back to beginning for another loop
      }
    } finally {
      wal.close();
      Log.info("Counted " + cnt.getCount() + " of " + expected + " expected");
    }
    assertEquals(expected, cnt.getCount()); // JUnit order: expected first, then actual
  }

In testing, the race causes an exit/exception after 20k-30k rotations.  I'm pretty convinced the problem is in the roller.

> Race condition in RollSink.
> ---------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>
> There is a race condition present when rotating sinks in a roller.  It is fairly rare but can cause an agent or collector to hang.


        

[jira] [Assigned] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh reassigned FLUME-745:
------------------------------------

    Assignee: Jonathan Hsieh

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Commented] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088093#comment-13088093 ] 

jiraposter@reviews.apache.org commented on FLUME-745:
-----------------------------------------------------



bq.  On 2011-08-19 23:07:35, Eric Sammer wrote:
bq.  > This *looks* right to me, but I can't be really sure. There are some blocks that are unprotected by the lock and I'm not sure why. In some cases, inner classes are static and sometimes not and both share the lock; complicated.

I promise to do a refactor.  I think structurally this started one way (a single wal manager would serve multiple flows) but eventually became one wal manager per flow. Because of that some of the logic and pieces are inverted.  


- jmhsieh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1599/#review1576
-----------------------------------------------------------


On 2011-08-19 21:57:07, jmhsieh wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1599/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-08-19 21:57:07)
bq.  
bq.  
bq.  Review request for Flume, Arvind Prabhakar and Eric Sammer.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  commit 80ffaeebead83de9d7b0af55f38bd2dfe62ad931
bq.  Author: Jonathan Hsieh <jm...@apache.org>
bq.  Date:   Thu Aug 18 14:05:41 2011 -0700
bq.  
bq.      FLUME-745: Race condition in NaiveFileWALDeco and retransmit logic
bq.      
bq.      - Setup test to run for a long time exacerbating potential race every 10ms.
bq.      - Made test runnable from command line for arbitrary iterations
bq.      - Eliminated possible memory leak by remove WALdata entry after completing e2eacked
bq.      - NaiveFileWALDeco to use object lock
bq.  
bq.  
bq.  This addresses bug flume-745.
bq.      https://issues.apache.org/jira/browse/flume-745
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    flume-core/src/main/java/com/cloudera/flume/agent/durability/NaiveFileWALManager.java e7d5c8b 
bq.    flume-core/src/test/java/com/cloudera/flume/agent/durability/TestFlumeNodeWALNotifierRacy.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/1599/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  All tests pass except for known flakies.  Ran to 500000 iterations (over 30 minutes of retry attempts every 10ms) and passed.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  jmhsieh
bq.  
bq.



> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Commented] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087466#comment-13087466 ] 

Jonathan Hsieh commented on FLUME-745:
--------------------------------------

There was a code snippet posted previously, but it is incorrect.

The main issue is that synchronization in the NaiveFileWALManager isn't handled properly between the normal state machine and another thread that triggers a retransmit due to a retry timeout.
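
As a rough illustration of this kind of race (not the actual NaiveFileWALManager code), the sketch below guards every state transition with a single object lock, so the normal state machine and a retry-timeout thread cannot interleave in the middle of a transition. The class WalStateSketch, the State values, and all method names are hypothetical and chosen only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and states are illustrative, not Flume's real API.
final class WalStateSketch {
  enum State { WRITING, LOGGED, SENDING, SENT }

  private final Object lock = new Object();
  private final Map<String, State> table = new HashMap<>(); // guarded by lock

  // Start tracking a new WAL entry.
  void begin(String tag) {
    synchronized (lock) {
      table.put(tag, State.WRITING);
    }
  }

  // Check-then-act under one lock: the transition only happens if the entry
  // is still in the state the caller expects.
  boolean transition(String tag, State from, State to) {
    synchronized (lock) {
      if (table.get(tag) != from) {
        return false; // another thread already moved (or removed) this entry
      }
      table.put(tag, to);
      return true;
    }
  }

  // Retry path, e.g. called from a timeout thread: only a SENT entry may be
  // moved back to LOGGED for retransmission.
  boolean retry(String tag) {
    return transition(tag, State.SENT, State.LOGGED);
  }

  // Ack path: drop the entry once it has been acknowledged, so the table
  // does not grow without bound.
  boolean acked(String tag) {
    synchronized (lock) {
      if (table.get(tag) != State.SENT) {
        return false;
      }
      table.remove(tag);
      return true;
    }
  }

  int pending() {
    synchronized (lock) {
      return table.size();
    }
  }
}
```

Because both the normal path and the retry path go through the same lock and re-check the current state, a retry that fires at an unexpected moment is rejected rather than corrupting the entry's state.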

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Updated] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated FLUME-745:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Updated] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated FLUME-745:
---------------------------------

    Fix Version/s: v0.9.5
           Labels: wal  (was: )

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>              Labels: wal
>             Fix For: v0.9.5
>
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Updated] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated FLUME-745:
---------------------------------

    Attachment: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch

review here: https://reviews.apache.org/r/1599/

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Closed] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh closed FLUME-745.
--------------------------------


I think this patch fixed one of the major sources of duplicate data when using E2E mode.


> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>              Labels: wal
>             Fix For: v0.9.5
>
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Commented] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088059#comment-13088059 ] 

jiraposter@reviews.apache.org commented on FLUME-745:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1599/#review1576
-----------------------------------------------------------

Ship it!


This *looks* right to me, but I can't be really sure. There are some blocks that are unprotected by the lock and I'm not sure why. In some cases, inner classes are static and sometimes not and both share the lock; complicated.

- Eric





> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>         Attachments: 0001-FLUME-745-Race-condition-in-NaiveFileWALDeco-and-ret.patch
>
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.


        

[jira] [Updated] (FLUME-745) Fix Race condition in NaiveFileWALDeco and retransmit logic

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated FLUME-745:
---------------------------------

    Comment: was deleted

(was:   @Test
  public void testRotateRaciness() throws IOException, InterruptedException {

    CounterSink cnt = new CounterSink("count");
    int millis = 10; // rotate the agent wal entries every 10 millis.
    int events = 10000;
    int iters = 10;
    int expected = events * iters;
    MemorySinkSource mem = MemorySinkSource.cannedData("blah", events);

    NaiveFileWALDeco wal = (NaiveFileWALDeco) NaiveFileWALDeco
        .builderEndToEndDir().create(LogicalNodeContext.testingContext(),
            millis);
    wal.setSink(cnt);
    try {
      wal.open();
      for (int i = 0; i < iters; i++) {
        EventUtil.dumpAll(mem, wal);
        mem.open(); // goes back to beginning for another loop
      }
    } finally {
      wal.close();
      Log.info("Counted " + cnt.getCount() + " of " + expected + " expected");
    }
    assertEquals(cnt.getCount(), expected);
  }

In testing, race causes an exit/exception after 20k-30k rotations.  I'm pretty convinced the problem is in the roller.)

> Fix Race condition in NaiveFileWALDeco and retransmit logic
> -----------------------------------------------------------
>
>                 Key: FLUME-745
>                 URL: https://issues.apache.org/jira/browse/FLUME-745
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v0.9.5
>            Reporter: Jonathan Hsieh
>
> There is a race condition in the state transitions that happen in the NaiveFileWALDeco and retransmits.  This condition is fairly rare, but when it occurs it causes an agent or collector to hang.
