Posted to user@spark.apache.org by Karin Valisova <ka...@datapine.com> on 2017/05/17 10:13:50 UTC
Parquet file amazon s3a timeout
Hello!
I'm working with some parquet files saved on amazon service and loading
them to dataframe with
Dataset<Row> df = spark.read().parquet(parquetFileLocation);
however, after some time I get a "Timeout waiting for connection from
pool" exception. I hope I'm not mistaken, but I think there's a limit on
how long any open s3a connection can be held, and I have enough local
memory to just load the file and close the connection.
Is it possible to specify some option when reading the parquet to store the
data locally and release the connection? Or any other ideas on how to solve
the problem?
Thank you very much,
have a nice day!
Karin
Re: Parquet file amazon s3a timeout
Posted by Steve Loughran <st...@hortonworks.com>.
On 17 May 2017, at 11:13, Karin Valisova <ka...@datapine.com> wrote:

> Hello!
> I'm working with some parquet files saved on amazon service and loading them to a dataframe with
> Dataset<Row> df = spark.read().parquet(parquetFileLocation);
> however, after some time I get a "Timeout waiting for connection from pool" exception. I hope I'm not mistaken, but I think there's a limit on how long any open s3a connection can be held, and I have enough local memory to just load the file and close the connection.
1. Which version of the Hadoop binaries? You should be using Hadoop 2.7.x for S3A to start working properly (see https://issues.apache.org/jira/browse/HADOOP-11571 for the list of issues).
2. If you move up to 2.7 and still see the exception, can you paste the full stack trace?
> Is it possible to specify some option when reading the parquet to store the data locally and release the connection? Or any other ideas on how to solve the problem?
If the problem is still there with the Hadoop 2.7 binaries, then there are some thread-pool options related to the AWS transfer manager, some other pooling going on, and the setting fs.s3a.connection.maximum to play with:
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
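As a rough sketch of what "playing with" that setting might look like: Spark forwards any spark.hadoop.* configuration key to the underlying Hadoop configuration, so the pool limit can be raised at submit time. The value 100 and the job/jar names below are hypothetical examples, not recommendations from this thread.

```shell
# Sketch only: raise the S3A connection pool limit for one job.
# spark.hadoop.* keys are copied into the Hadoop Configuration,
# so this sets fs.s3a.connection.maximum for the s3a filesystem.
spark-submit \
  --conf spark.hadoop.fs.s3a.connection.maximum=100 \
  --class com.example.MyJob \
  my-job.jar
```

The same key can equally be set programmatically via spark.sparkContext().hadoopConfiguration().setInt("fs.s3a.connection.maximum", 100) before the first s3a read.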
though as usual, people are always finding new corner cases that deadlock. Here I suspect https://issues.apache.org/jira/browse/HADOOP-13826, which is fixed in Hadoop 2.8+.
-Steve