Posted to user@spark.apache.org by Karin Valisova <ka...@datapine.com> on 2017/05/17 10:13:50 UTC
Parquet file amazon s3a timeout
Hello!
I'm working with some parquet files saved on amazon service and loading
them to dataframe with
Dataset<Row> df = spark.read().parquet(parquetFileLocation);
however, after some time I get a "Timeout waiting for connection from
pool" exception. I hope I'm not mistaken, but I think there's a limit on
how long any open s3a connection can be held, and I have enough local
memory to just load the file and close the connection.
Is it possible to specify some option when reading the parquet to store the
data locally and release the connection? Or any other ideas on how to solve
the problem?
Thank you very much,
have a nice day!
Karin
Re: Parquet file amazon s3a timeout
Posted by Steve Loughran <st...@hortonworks.com>.
On 17 May 2017, at 11:13, Karin Valisova <ka...@datapine.com> wrote:

> Hello!
> I'm working with some parquet files saved on amazon service and loading them to a dataframe with
> Dataset<Row> df = spark.read().parquet(parquetFileLocation);
> however, after some time I get a "Timeout waiting for connection from pool" exception. I hope I'm not mistaken, but I think there's a limit on how long any open s3a connection can be held, and I have enough local memory to just load the file and close the connection.
1. Which version of the Hadoop binaries? You should be using Hadoop 2.7.x for S3A to start working properly (see https://issues.apache.org/jira/browse/HADOOP-11571 for the list of issues).
2. If you move up to 2.7 and still see the exception, can you paste the full stack trace?
> Is it possible to specify some option when reading the parquet to store the data locally and release the connection? Or any other ideas on how to solve the problem?
If the problem is still there with the Hadoop 2.7 binaries, then there are some thread-pool options related to the AWS transfer manager, some other pooling going on, and the setting fs.s3a.connection.maximum to play with:
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
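As a rough sketch of what "playing with" that setting might look like: Spark forwards any spark.hadoop.* configuration key to the underlying Hadoop configuration, so the pool limit can be raised at submit time. The value 100 and the job/jar names below are hypothetical examples, not recommendations from this thread.

```shell
# Sketch only: raise the S3A connection pool limit for one job.
# spark.hadoop.* keys are copied into the Hadoop Configuration,
# so this sets fs.s3a.connection.maximum for the s3a filesystem.
spark-submit \
  --conf spark.hadoop.fs.s3a.connection.maximum=100 \
  --class com.example.MyJob \
  my-job.jar
```

The same key can equally be set programmatically via spark.sparkContext().hadoopConfiguration().setInt("fs.s3a.connection.maximum", 100) before the first s3a read.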
though as usual, people are always finding new corner cases that deadlock. Here I suspect https://issues.apache.org/jira/browse/HADOOP-13826, which is fixed in Hadoop 2.8+.
-Steve