Posted to issues@flink.apache.org by "Robert Metzger (Jira)" <ji...@apache.org> on 2021/01/25 10:30:00 UTC
[jira] [Comment Edited] (FLINK-19158) Revisit java e2e download timeouts

    [ https://issues.apache.org/jira/browse/FLINK-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271228#comment-17271228 ] 

Robert Metzger edited comment on FLINK-19158 at 1/25/21, 10:29 AM:
-------------------------------------------------------------------

The problem still persists. Here is a case from a PR build: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12400&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee. That build already contains my last fix.

I should have added a log statement for each retry, and we should probably also print the reason why wget failed. The current logs are not very helpful:

{code}
05:34:30,679 [                main] INFO  org.apache.flink.tests.util.cache.PersistingDownloadCache    [] - Downloading https://archive.apache.org/dist/hbase/1.4.3/hbase-1.4.3-bin.tar.gz.
05:46:30,701 [                main] ERROR org.apache.flink.tests.util.hbase.SQLClientHBaseITCase       [] - 
--------------------------------------------------------------------------------
Test testHBase[0: hbase-version:1.4.3](org.apache.flink.tests.util.hbase.SQLClientHBaseITCase) failed with:
java.io.IOException: Process ([wget, -q, -P, /home/vsts/work/1/e2e_cache/downloads/1598516010, --timeout, 240, https://archive.apache.org/dist/hbase/1.4.3/hbase-1.4.3-bin.tar.gz]) exceeded timeout (600000) or number of retries (3).
{code}
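
A minimal sketch of what per-retry logging with a per-attempt timeout could look like. It assumes a hypothetical helper class, RetryingProcessRunner, which is not Flink's AutoClosableProcess; the class name, method name, and message format are all made up for illustration. Each failed attempt logs its number, the reason (timeout or exit code), and whatever the process printed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not part of Flink's test utilities. */
final class RetryingProcessRunner {

    /**
     * Runs the command up to maxAttempts times, each with its own timeout.
     * Returns the 1-based number of the attempt that succeeded.
     */
    public static int runWithRetry(
            List<String> command, int maxAttempts, Duration perAttemptTimeout)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Capture stdout and stderr in a file so the failure reason can be logged.
            File output = File.createTempFile("process-output", ".log");
            try {
                Process process = new ProcessBuilder(command)
                        .redirectErrorStream(true)
                        .redirectOutput(output)
                        .start();
                String reason;
                if (!process.waitFor(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS)) {
                    // A stuck process is killed so that the next attempt can start.
                    process.destroyForcibly().waitFor();
                    reason = "timed out after " + perAttemptTimeout;
                } else if (process.exitValue() == 0) {
                    return attempt;
                } else {
                    reason = "exit code " + process.exitValue();
                }
                String captured = new String(
                        Files.readAllBytes(output.toPath()), StandardCharsets.UTF_8).trim();
                String message = String.format(
                        "Attempt %d/%d of %s failed (%s); output: %s",
                        attempt, maxAttempts, command, reason, captured);
                // One log line per failed attempt, instead of a single error at the end.
                System.err.println(message);
                lastFailure = new IOException(message);
            } finally {
                output.delete();
            }
        }
        throw lastFailure;
    }
}
```

Note that with wget's -q flag the captured output would be empty; -nv keeps error messages while dropping the progress output.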

I'm not sure it makes sense to go down the rabbit hole of fixing this (by using fallback mirrors). I'd rather rely on one common method of binary distribution (Docker images) and make that distribution as reliable as possible.

I'll leave this to the maintainers of the HBase test.


> Revisit java e2e download timeouts
> ----------------------------------
>
>                 Key: FLINK-19158
>                 URL: https://issues.apache.org/jira/browse/FLINK-19158
>             Project: Flink
>          Issue Type: Improvement
>          Components: Build System
>    Affects Versions: 1.12.0
>            Reporter: Robert Metzger
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 1.12.0, 1.13.0
>
>
> Consider this failed test case
> {code}
> Test testHBase(org.apache.flink.tests.util.hbase.SQLClientHBaseITCase) is running.
> --------------------------------------------------------------------------------
> 09:38:38,719 [                main] INFO  org.apache.flink.tests.util.cache.PersistingDownloadCache    [] - Downloading https://archive.apache.org/dist/hbase/1.4.3/hbase-1.4.3-bin.tar.gz.
> 09:40:38,732 [                main] ERROR org.apache.flink.tests.util.hbase.SQLClientHBaseITCase       [] - 
> --------------------------------------------------------------------------------
> Test testHBase(org.apache.flink.tests.util.hbase.SQLClientHBaseITCase) failed with:
> java.io.IOException: Process ([wget, -q, -P, /home/vsts/work/1/e2e_cache/downloads/1598516010, https://archive.apache.org/dist/hbase/1.4.3/hbase-1.4.3-bin.tar.gz]) exceeded timeout (120000) or number of retries (3).
> 	at org.apache.flink.tests.util.AutoClosableProcess$AutoClosableProcessBuilder.runBlockingWithRetry(AutoClosableProcess.java:148)
> 	at org.apache.flink.tests.util.cache.AbstractDownloadCache.getOrDownload(AbstractDownloadCache.java:127)
> 	at org.apache.flink.tests.util.cache.PersistingDownloadCache.getOrDownload(PersistingDownloadCache.java:36)
> 	at org.apache.flink.tests.util.hbase.LocalStandaloneHBaseResource.setupHBaseDist(LocalStandaloneHBaseResource.java:76)
> 	at org.apache.flink.tests.util.hbase.LocalStandaloneHBaseResource.before(LocalStandaloneHBaseResource.java:70)
> 	at org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:46)
> {code}
> It seems that the download was not retried; it may have been stuck. I propose setting a timeout per try and increasing the total timeout from 2 to 5 minutes.
> This example is from: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6267&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529



--
This message was sent by Atlassian Jira
(v8.3.4#803005)