Posted to user@orc.apache.org by Juan Carlos Blanco Martínez <jc...@gmail.com> on 2019/01/16 11:46:09 UTC

Reading S3 objects in ORC format from Lambda

Hi,
I'm trying to read S3 objects in ORC format and parse their content from
an AWS Lambda function. I obtain the object content through:

S3ObjectInputStream s3ObjectInputStream =
    amazonS3.getObject(request).getObjectContent();

where S3ObjectInputStream extends java.io.InputStream.

I found that the ORC format was designed for the Hadoop ecosystem, and
although Spark and Presto can read data in that format, there is no
support for reading those files outside of a distributed processing
framework. I've checked org.apache.orc.impl.ReaderImpl, and it is tied to
org.apache.hadoop.fs.Path.
Even if there were an S3-based path class extending
org.apache.hadoop.fs.Path, I would still have to instantiate
org.apache.orc.impl.ReaderImpl and add the Hadoop dependencies to the
Lambda's zip, which I would rather avoid.
Is there a lightweight library that would let me read ORC data, either
from a file or directly from a java.io.InputStream?
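
For context on why a plain java.io.InputStream is awkward here: an ORC
file keeps its metadata in a tail at the end of the file, and the final
byte holds the postscript length, so a reader needs random access rather
than a forward-only stream. The sketch below is only an illustration of
that gap, using nothing beyond the JDK. It spools a stream to a temporary
file and reads the last byte. The class and method names are hypothetical,
the sample bytes are not a valid ORC file, and it assumes the temp
directory is writable (on Lambda that is normally /tmp). It does not parse
ORC and does not remove the Hadoop dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of ORC or the AWS SDK.
class OrcSpoolSketch {

    // Copy a forward-only stream (e.g. the S3 object content) to a local
    // temp file so that it can be read with random access afterwards.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("object-", ".orc");
        try (InputStream src = in) {
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    // The last byte of an ORC file stores the length of the postscript.
    // Reading it requires seeking to the end, which a stream cannot do.
    static int postscriptLength(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() == 0) {
                throw new IOException("empty file");
            }
            raf.seek(raf.length() - 1);
            return raf.readUnsignedByte();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for the S3 object content;
        // this is NOT a valid ORC file, only the final byte matters here.
        byte[] fake = {'O', 'R', 'C', 0, 0, 7};
        Path tmp = spoolToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println("postscript length byte: " + postscriptLength(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With the object spooled to a local file, any reader that needs to seek
could then be pointed at that file instead of the stream.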

Thanks.
Regards.

Juan Carlos Blanco Martínez