
Posted to user@flink.apache.org by "Marchant, Hayden " <ha...@citi.com> on 2018/02/01 08:21:45 UTC

RE: S3 for state backend in Flink 1.4.0

Edward,

We are using Object Storage for checkpointing. I'd like to point out that we were seeing performance problems using the S3 protocol. Btw, we had quite a few problems using the flink-s3-fs-hadoop jar with Object Storage and had to do some ugly hacking to get it working. We recently discovered an alternative connector developed by IBM Research called Stocator. It's a streaming writer and performs better than the S3 protocol.

Here is a link to the library: https://github.com/SparkTC/stocator, and a blog post explaining it: http://www.spark.tc/stocator-the-fast-lane-connecting-object-stores-to-spark/

Good luck!!

-----Original Message-----
From: Edward Rojas [mailto:edward.rojascl@gmail.com] 
Sent: Wednesday, January 31, 2018 3:02 PM
To: user@flink.apache.org
Subject: RE: S3 for state backend in Flink 1.4.0

Hi,

We are having a similar problem when trying to use Flink 1.4.0 with IBM Object Storage for reading and writing data. 

We followed
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/aws.html
and the suggestion on https://issues.apache.org/jira/browse/FLINK-851.

We copied the flink-s3-fs-hadoop jar from the opt/ folder to the lib/ folder and added the following configuration to flink-conf.yaml:

s3.access-key: <ACCESS_KEY>
s3.secret-key: <SECRET_KEY>
s3.endpoint: s3.us-south.objectstorage.softlayer.net 

With this we can read from IBM Object Storage without any problem when using env.readTextFile("s3://flink-test/flink-test.txt");

But we are having problems when trying to write. 
We use a Kafka consumer to read from the bus, do some processing, and then save some data to Object Storage.

When using stream.writeAsText("s3://flink-test/data.txt").setParallelism(1);
the file is created, but only when the job finishes (or we stop it). But we need to save the data without stopping the job, so we are trying to use a Sink.

But when using a BucketingSink, we get the error:
java.io.IOException: No FileSystem for scheme: s3
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2798)
	at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.createHadoopFileSystem(BucketingSink.java:1196)
	at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:411)
	at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:355)
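
The top frames of this trace are Hadoop's own FileSystem lookup, called from BucketingSink.createHadoopFileSystem, so the sink appears to resolve the s3 scheme through Hadoop directly rather than through the file system that the flink-s3-fs-hadoop jar provides to Flink. The following is a toy sketch of that situation in plain Java, not Flink or Hadoop code. The Registry class, its methods, and the "shaded-s3-filesystem" label are hypothetical names, and the idea that the two lookups are independent is an inference from the stack trace rather than a confirmed description of the internals.

```java
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy model only: two independent scheme registries (hypothetical, not real Flink/Hadoop classes).
class SchemeLookupSketch {

    // Hypothetical stand-in for a lookup table from URI scheme to file system implementation.
    static final class Registry {
        private final Map<String, String> implByScheme = new HashMap<>();

        void register(String scheme, String impl) {
            implByScheme.put(scheme, impl);
        }

        String resolve(URI uri) throws IOException {
            String impl = implByScheme.get(uri.getScheme());
            if (impl == null) {
                // Same message shape as the error reported in the thread.
                throw new IOException("No FileSystem for scheme: " + uri.getScheme());
            }
            return impl;
        }
    }

    // Returns the resolved implementation name, or the error message if lookup fails.
    static String describe(Registry registry, URI uri) {
        try {
            return registry.resolve(uri);
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        Registry flinkSide = new Registry();   // assumed: where the shaded jar registers s3
        Registry hadoopSide = new Registry();  // assumed: what the sink consults, with no s3 entry
        flinkSide.register("s3", "shaded-s3-filesystem");

        URI target = URI.create("s3://flink-test/data.txt");
        System.out.println("read path:  " + describe(flinkSide, target));
        System.out.println("sink path:  " + describe(hadoopSide, target));
    }
}
```

Under these assumptions, reading via env.readTextFile succeeds because it goes through the first lookup, while the sink fails because the second lookup has no entry for s3.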


Do you have any idea how we could make it work using a Sink?

Thanks,
Regards,

Edward



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: S3 for state backend in Flink 1.4.0

Posted by Stephan Ewen <se...@apache.org>.

A heads up on this front:

  - For state backends during checkpointing, I would suggest using
flink-s3-fs-presto, which is quite a bit faster than flink-s3-fs-hadoop
because it avoids a bunch of unnecessary metadata operations.

  - We have started work on re-writing the Bucketing Sink to make it work
with the shaded S3 filesystems (like flink-s3-fs-presto). We are also
adding a more powerful internal abstraction that uses multipart uploads for
faster incremental persistence of result chunks on checkpoints. This should
be in 1.6, happy to share more as soon as it is out...
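
As a rough illustration of persisting result chunks incrementally at checkpoints, here is a toy in-memory model in plain Java. It is not the planned Flink abstraction and it does not call any S3 API: the class MultipartSinkSketch and all of its methods are hypothetical, and "uploading a part" is simulated by moving buffered bytes into a list. Real object stores also impose part-size rules that this sketch ignores.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical names): records are buffered, and each checkpoint
// turns the buffered bytes into one persisted "part" without ending the job.
class MultipartSinkSketch {
    private final ByteArrayOutputStream inProgress = new ByteArrayOutputStream();
    private final List<byte[]> persistedParts = new ArrayList<>();

    // Append one record to the part currently being built.
    void write(String record) {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        inProgress.write(bytes, 0, bytes.length);
    }

    // Simulated checkpoint: persist whatever is buffered as a new part.
    // Returns the number of parts persisted so far.
    int onCheckpoint() {
        if (inProgress.size() > 0) {
            persistedParts.add(inProgress.toByteArray());
            inProgress.reset();
        }
        return persistedParts.size();
    }

    // Bytes that would survive a failure in this model (persisted parts only).
    int durableBytes() {
        int total = 0;
        for (byte[] part : persistedParts) {
            total += part.length;
        }
        return total;
    }

    // Simulated completion: flush the remainder and stitch all parts together in order.
    String complete() {
        onCheckpoint();
        ByteArrayOutputStream whole = new ByteArrayOutputStream();
        for (byte[] part : persistedParts) {
            whole.write(part, 0, part.length);
        }
        return new String(whole.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MultipartSinkSketch sink = new MultipartSinkSketch();
        sink.write("first");
        sink.write("second");
        System.out.println("durable before checkpoint: " + sink.durableBytes());
        sink.onCheckpoint();
        System.out.println("durable after checkpoint:  " + sink.durableBytes());
        sink.write("third");
        System.out.print(sink.complete());
    }
}
```

The point of the model is the contrast with writeAsText as described earlier in the thread: data becomes durable at each checkpoint instead of only when the job finishes.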


On Wed, Feb 7, 2018 at 3:56 PM, Marchant, Hayden <ha...@citi.com>
wrote:

> We actually got it working. Essentially, it's an implementation of
> Hadoop's FileSystem, and was written with the idea that it can be used with
> Spark (since it has broader adoption than Flink as of now). We managed to
> get it configured, and found the latency to be much lower than with the
> s3 connector. There are far fewer copy operations etc. happening
> under the hood when using this native API, which explains the better
> performance.
>
> Happy to provide assistance offline if you're interested.
>
> Thanks
> Hayden
>
> -----Original Message-----
> From: Edward Rojas [mailto:edward.rojascl@gmail.com]
> Sent: Thursday, February 01, 2018 6:09 PM
> To: user@flink.apache.org
> Subject: RE: S3 for state backend in Flink 1.4.0
>
> Hi Hayden,
>
> It seems like a good alternative. But I see it's intended to work with
> Spark. Did you manage to get it working with Flink?
>
> I ran some tests but I get several errors when trying to create a file, either
> for checkpointing or saving data.
>
> Thanks in advance,
> Regards,
> Edward
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>

RE: S3 for state backend in Flink 1.4.0

Posted by "Marchant, Hayden " <ha...@citi.com>.

We actually got it working. Essentially, it's an implementation of Hadoop's FileSystem, and was written with the idea that it can be used with Spark (since it has broader adoption than Flink as of now). We managed to get it configured, and found the latency to be much lower than with the s3 connector. There are far fewer copy operations etc. happening under the hood when using this native API, which explains the better performance.

Happy to provide assistance offline if you're interested.

Thanks
Hayden

-----Original Message-----
From: Edward Rojas [mailto:edward.rojascl@gmail.com] 
Sent: Thursday, February 01, 2018 6:09 PM
To: user@flink.apache.org
Subject: RE: S3 for state backend in Flink 1.4.0

Hi Hayden,

It seems like a good alternative. But I see it's intended to work with Spark. Did you manage to get it working with Flink?

I ran some tests but I get several errors when trying to create a file, either for checkpointing or saving data.

Thanks in advance,
Regards,
Edward



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

RE: S3 for state backend in Flink 1.4.0

Posted by Edward Rojas <ed...@gmail.com>.

Hi Hayden,

It seems like a good alternative. But I see it's intended to work with
Spark. Did you manage to get it working with Flink?

I ran some tests but I get several errors when trying to create a file, either
for checkpointing or saving data.

Thanks in advance,
Regards,
Edward



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/