Posted to commits@daffodil.apache.org by "Dave Thompson (Jira)" <ji...@apache.org> on 2023/02/01 22:24:00 UTC

[jira] [Closed] (DAFFODIL-2751) Occasional network timeout exceptions can hang a CI job now

     [ https://issues.apache.org/jira/browse/DAFFODIL-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Thompson closed DAFFODIL-2751.
-----------------------------------

Verified that the specified commits (2fd1f1947094d8da3db3b34463ad7038a01c857a and 490048f23bd3c1ea2f90a42631a38d01f64cf543) are included in the latest pull from the daffodil repository.

Verified, via review, that the changes identified in the commit comments were implemented.

Verified that the sbt test suites of the affected daffodil subprojects executed successfully, including the added tests.

Verified the nightly test schemas compile and save successfully.

Verified the nightly test suite executes successfully.

If the issue recurs, the ticket can be reopened.

> Occasional network timeout exceptions can hang a CI job now
> -----------------------------------------------------------
>
>                 Key: DAFFODIL-2751
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2751
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: 3.5.0
>            Reporter: John Interrante
>            Assignee: Steve Lawrence
>            Priority: Minor
>             Fix For: 3.5.0
>
>
> Please see these 2 runs in GitHub Actions:
> [Add Daffodil Developer Guide · apache/daffodil@9d114c3 (github.com)|https://github.com/apache/daffodil/actions/runs/3464760904/jobs/5786683343]
> [Add Daffodil Developer Guide · apache/daffodil@0bc99e6 (github.com)|https://github.com/apache/daffodil/actions/runs/3475210535/jobs/5809186675]
> In both runs, one job hung for 5 hours 54 minutes, so GitHub Actions had to kill it.  Both jobs were running on the same runner (Java 8, Scala 2.12.17, ubuntu-20.04) and had failed the following unit tests with the same error message:
> org.apache.daffodil.io.TestInputSourceDataInputStream8.networkReadPartial1 
> org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection1
> org.apache.daffodil.io.TestSocketPairTestRig.testHangDetection2
> org.apache.daffodil.io.TestSocketPairTestRig.testSocketPairTestRig1
> failed: java.util.concurrent.TimeoutException: Futures timed out after [1000 milliseconds], took 1.002 sec
> The rest of the jobs ran all of the unit tests successfully without any timeout exceptions.  An occasional timeout exception has failed 1 out of 6 jobs in a run before, but it had never caused the job to hang (the job simply terminated after running the unit tests).
> I do not think there was a change in the GitHub Actions runner.  I checked the last CI job on the main branch ([Update sbt to 1.8.0 · apache/daffodil@6d4b2b6 (github.com)|https://github.com/apache/daffodil/actions/runs/3462161126/jobs/5780684309]), and the runner version numbers in the setup job details were the same.  We have had several CI jobs since the recent changes to the integration tests, so it seems unlikely those changes had anything to do with the new hangs, even though a JVM can hang at exit when non-daemon threads are still running.
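The two mechanisms mentioned in the description (a bounded wait that ends in `java.util.concurrent.TimeoutException`, and leftover non-daemon threads keeping a JVM alive) can be illustrated with a minimal Java sketch. This is not Daffodil's test code or the fix from the commits above; the class name `DaemonTimeoutSketch`, the helper `timesOut`, and the thread name are hypothetical. The task waits on a latch that is never released, standing in for a socket peer that never sends data. Note that Java's `Future.get` throws the same exception class as the failures above, but without Scala's "Futures timed out after [...]" message, which comes from Scala's `Await`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DaemonTimeoutSketch {

    // Hypothetical helper: runs a task that never finishes on a daemon worker
    // thread, waits at most timeoutMs for it, and returns true if the wait timed out.
    static boolean timesOut(long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hypothetical-worker");
            // A daemon thread does not prevent the JVM from exiting; a non-daemon
            // thread left blocked here could keep the JVM (and a CI job) running.
            t.setDaemon(true);
            return t;
        });
        CountDownLatch neverReleased = new CountDownLatch(1);
        Future<Object> result = pool.submit(() -> {
            neverReleased.await(); // simulates a peer that never sends data
            return null;
        });
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true; // same exception class as in the CI failures
        } finally {
            result.cancel(true); // interrupt the blocked task
            pool.shutdownNow();  // stop the worker thread explicitly
        }
    }

    public static void main(String[] args) throws Exception {
        // Waits about 1000 ms, matching the timeout in the failures, then prints true.
        System.out.println(timesOut(1000));
    }
}
```

Cancelling the future and calling `shutdownNow()` in `finally` would clean up the worker even without the daemon flag; marking the thread as a daemon is a second safeguard, so that a cleanup path that is skipped cannot leave the JVM waiting on a blocked thread.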



--
This message was sent by Atlassian Jira
(v8.20.10#820010)