Posted to issues@metron.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/01/26 00:12:27 UTC

[jira] [Commented] (METRON-672) SolrIndexingIntegrationTest fails intermittently

    [ https://issues.apache.org/jira/browse/METRON-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838867#comment-15838867 ] 

ASF GitHub Bot commented on METRON-672:
---------------------------------------

GitHub user cestella opened a pull request:

    https://github.com/apache/incubator-metron/pull/424

    METRON-672: SolrIndexingIntegrationTest fails intermittently

    This failure is due to a change in default behavior when indexing was split off into a separate configuration file: in particular, the default batch size was changed from `5` to `1`.  This, by itself, is not a problem, but the `IndexingIntegrationTest` (the base class for the Solr and Elasticsearch integration tests):
    * submits the configs
    * starts the indexing topology
    * writes the input data
    
    The input data may be written before the topology or the configuration has fully loaded, especially if the machine running the unit tests is under load (as with Travis).  As a result, the first record may be processed with the default batch size (of 1) and be written out immediately, because the indexing configs haven't been loaded into ZooKeeper just yet.  Eventually the configs load and the batch size is set to `5`.  Meanwhile we've written 10 records and expect 10 in return, but because the first record was written out immediately and the next 5 went out as a full batch, the remaining 4 are left pending in the `BulkMessageWriterBolt`.
    
    So, the failure scenario is as follows (a small sketch of the batching arithmetic follows the list):
    * Message 1 is received before the indexing config has loaded, so the batch size is 1 and it is written out immediately
    * Messages 2 - 5 are each received after the indexing config has loaded, so the batch size is 5 and they queue up
    * Message 6 is received and the batch of 5 is written out
    * Messages 7 - 10 are received but never fill a batch, so we time out waiting for them to be written out
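    To make the arithmetic concrete, here is a self-contained toy simulation (not Metron's actual `BulkMessageWriterBolt` code) of a batch writer whose batch size flips from the default of 1 to the configured 5 after the first message; it ends with 6 messages flushed and 4 pending, matching the scenario above:
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    
    // Toy simulation of the race: message 1 sees the default batch size (1) because the
    // indexing config hasn't loaded yet; every later message sees the configured size (5).
    public class BatchRaceSketch {
      public static void main(String[] args) {
        List<Integer> pending = new ArrayList<>();
        int flushed = 0;
        for (int msg = 1; msg <= 10; msg++) {
          int batchSize = (msg == 1) ? 1 : 5;  // hypothetical stand-in for the config lookup
          pending.add(msg);
          if (pending.size() >= batchSize) {
            flushed += pending.size();
            pending.clear();
          }
        }
        // Prints "flushed=6 pending=4": messages 7-10 never fill a batch, so the test times out.
        System.out.println("flushed=" + flushed + " pending=" + pending.size());
      }
    }
    ```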
    
    The fix, implemented in this PR, is to ensure that we don't write messages to Kafka until the configs are loaded (a rough sketch of the idea follows).
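    For reference, the general shape of that guard could look like the sketch below.  This is not the actual PR code: the Curator calls are real, but the helper name and the ZooKeeper path are hypothetical placeholders for wherever the indexing configs land.
    
    ```java
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    
    public class WaitForConfigSketch {
      // Block until the indexing config node exists in ZooKeeper (or the deadline passes),
      // so the test only starts writing input to Kafka after the configs have loaded.
      public static void waitForIndexingConfig(String zkConnect, String configPath, long timeoutMs)
          throws Exception {
        CuratorFramework client =
            CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
          long deadline = System.currentTimeMillis() + timeoutMs;
          while (client.checkExists().forPath(configPath) == null) {
            if (System.currentTimeMillis() > deadline) {
              throw new IllegalStateException("Indexing config never appeared at " + configPath);
            }
            Thread.sleep(500);  // brief pause before polling ZooKeeper again
          }
        } finally {
          client.close();
        }
      }
    }
    ```
    
    A test would call something like `waitForIndexingConfig(zkConnect, "/metron/indexing/test", 30000)` (path is illustrative only) between submitting the configs and writing the input data.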

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron METRON-672

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/424.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #424
    
----

----


> SolrIndexingIntegrationTest fails intermittently
> ------------------------------------------------
>
>                 Key: METRON-672
>                 URL: https://issues.apache.org/jira/browse/METRON-672
>             Project: Metron
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Justin Leet
>            Assignee: Casey Stella
>
> Adapted from a dev list conversation
> h4. Initial Error in Travis
> Jon noted this in the Travis builds
> {code}
> `test(org.apache.metron.solr.integration.SolrIndexingIntegrationTest):
> Took too long to complete: 150582 > 150000`, more details below:
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 166.167 sec <<< FAILURE!
> test(org.apache.metron.solr.integration.SolrIndexingIntegrationTest)
> Time elapsed: 166.071 sec  <<< ERROR!
> {code}
> h4. Additional Notes
> Casey was able to reproduce this locally (but not in his IDE). A couple of details are in the dev list excerpt below.
> Fixing this should ideally include adding more detailed logging to help avoid these issues in the future.  Unfortunately, Casey notes that in this case adding logging seems to make the issue rarer.  Still, logging that makes the flow understandable (with the logging level tuned as appropriate) would help resolve similar issues in the future.
> h4. Dev List Excerpt
> Per Casey:
> {quote}
> Ok, so now I'm concerned that this isn't a fluke.  Here's an excerpt from
> the failing logs on Travis for my PR with substantially longer timeouts (
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/194575474/log.txt)
> {code}
> Running org.apache.metron.solr.integration.SolrIndexingIntegrationTest
> 0 vs 10 vs 0
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Processed target/indexingIntegrationTest/hdfs/test/enrichment-null-0-0-1485200689038.json
> 10 vs 10 vs 6
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 317.056 sec <<< FAILURE!
> test(org.apache.metron.solr.integration.SolrIndexingIntegrationTest)
> Time elapsed: 316.949 sec  <<< ERROR!
> java.lang.RuntimeException: Took too long to complete: 300783 > 300000
>         at org.apache.metron.integration.ComponentRunner.process(ComponentRunner.java:131)
>         at org.apache.metron.indexing.integration.IndexingIntegrationTest.test(IndexingIntegrationTest.java:173)
> {code}
> I'm getting the impression that this isn't the timeout and we have a
> mystery on our hands.  Each of those "10 vs 10 vs 6" lines happens 15
> seconds apart.  That line means that it read 10 entries from Kafka, 10
> entries from the indexed data, and 6 entries from HDFS.  It's those 6
> entries that are the problem.  Also of note, this does not seem to
> happen to me locally AND it's not consistent on Travis.  Given all
> that, I'd say that it's a problem with the HDFS Writer not getting
> flushed, but I verified that it is indeed flushed per message.
> One more thing, just for posterity here: it always freezes at 6 records
> written to HDFS.  That's the reason I thought it was a flushing issue.
> One thing that I would caution, though, is that this is likely a heisenbug.
> The additional logging I added earlier made it less likely to occur.  It seems
> more likely to occur on Travis than locally, and I made it happen by
> repeatedly running mvn install on metron-solr (after an mvn install of the
> whole project).
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)