Posted to user@beam.apache.org by Sameer Abhyankar <sa...@google.com> on 2018/06/21 19:06:25 UTC

Issues with self executing jar and FileSystems API

Hello!

I am trying to package a Beam Dataflow pipeline as a self-executing jar by
following the instructions at
<https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar>.
However, I am running into a weird issue when attempting to execute this
jar.

My pipeline needs to read a file (an Avro schema, .avsc) from GCS outside
of a PCollection before it starts working with PCollections. To do that I
use the FileSystems API. This works perfectly fine when I execute the
pipeline via mvn compile exec:java ..

However, if I attempt to run this as a jar, it appears to treat the GCS
path as local and fails with a FileNotFoundException.

Exception in thread "main" java.io.FileNotFoundException: /some/local/filesystem/path/myproject/gs:/my-gcs-bucket/schema/my-schema.avsc (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.beam.sdk.io.LocalFileSystem.open(LocalFileSystem.java:113)
    at org.apache.beam.sdk.io.LocalFileSystem.open(LocalFileSystem.java:78)
    at org.apache.beam.sdk.io.FileSystems.open(FileSystems.java:262)

(Note that the input path I pass is correct and has the double slash, e.g.
--inputPath=gs://my-gcs-bucket/schema/my-schema.avsc, but the path in the
error message seems to have one slash stripped out.)

Any pointers on what might be causing this?

Thanks,
- Sameer

Re: Issues with self executing jar and FileSystems API

Posted by Sameer Abhyankar <sa...@google.com>.
That was it! Thanks Lukasz. I had to use a custom assembly to get around
this. Thanks!

Re: Issues with self executing jar and FileSystems API

Posted by Lukasz Cwik <lc...@google.com>.
The FileSystems API uses a ServiceLoader[1] to find Apache Beam FileSystem
implementations. The ServiceLoader works by finding "service" files on the
classpath containing a list of classes implementing the Apache Beam
FileSystem API. The way in which you are creating an executable jar is
likely dropping or incorrectly merging service files. The most common case
is that you are using the Maven shade plugin and you haven't configured it
to use the services file resource transformer[2]. If you are packaging your
executable jar a different way, you'll want to look up the documentation
for your tool and see how it can properly deal with the service files.
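
To make "dropping or incorrectly merging" concrete, here is a minimal
plain-Java sketch of the difference between a correct merge of service
files and a packaging step where one jar's file overwrites another's. It
is only an illustration, not what any build tool actually runs: the class
ServiceFileMerge and its methods are hypothetical, and the contents of the
two service files are assumed for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: mimics, in plain Java, what merging service files means.
class ServiceFileMerge {

    /** Provider class names listed in one service file; comments and blank lines are skipped. */
    static List<String> providers(String serviceFileContent) {
        List<String> names = new ArrayList<>();
        for (String line : serviceFileContent.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    /** Correct merge: keep the entries from every jar, dropping duplicates. */
    static List<String> merge(List<String> serviceFiles) {
        Set<String> merged = new LinkedHashSet<>();
        for (String content : serviceFiles) {
            merged.addAll(providers(content));
        }
        return new ArrayList<>(merged);
    }

    /** Naive packaging: the last jar's file overwrites the earlier ones (list must be non-empty). */
    static List<String> lastWins(List<String> serviceFiles) {
        return providers(serviceFiles.get(serviceFiles.size() - 1));
    }

    public static void main(String[] args) {
        // Assumed contents of META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
        // as shipped by two different jars on the classpath.
        String gcpJar = "org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemRegistrar\n";
        String coreJar = "# core SDK\norg.apache.beam.sdk.io.LocalFileSystemRegistrar\n";
        List<String> files = Arrays.asList(gcpJar, coreJar);

        System.out.println("merged:    " + merge(files));
        System.out.println("last wins: " + lastWins(files));
    }
}
```

In the "last wins" case only the local registrar is left in the jar, so a
ServiceLoader lookup would no longer see the GCS implementation. That would
be consistent with a gs:// path ending up in LocalFileSystem, though the
exact cause depends on how the jar was built.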

1: https://docs.oracle.com/javase/7/docs/api/java/util/ServiceLoader.html
2:
https://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer
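
For the Maven shade plugin case, a pom.xml fragment along these lines is
one way to enable the services transformer described in [2]. Treat it as a
sketch: the main class com.example.MyPipeline is a placeholder, the plugin
version is left out on purpose, and the rest of your build may need other
settings.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Merges META-INF/services files from all jars instead of keeping only one -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          <!-- Sets Main-Class in the manifest; com.example.MyPipeline is a placeholder -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.MyPipeline</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After rebuilding, the merged file can be checked by listing the jar
contents and looking at
META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar.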
