Posted to user@orc.apache.org by Asha Andrade <as...@yahoo.com> on 2020/03/10 18:03:55 UTC

Unable to use OrcFile.createReader to read from S3

I am having trouble reading an ORC file from S3 via OrcFile.createReader. I am using hive-exec-2.2.0.jar at the moment and am wondering if this is supported at all. Am I missing any configuration settings? See the code below. Any help will be appreciated.
String accessKey = "***";
String secretKey = "***";

Configuration configuration = new Configuration();
configuration.set("fs.s3.awsAccessKeyId", accessKey);
configuration.set("fs.s3.awsSecretAccessKey", secretKey);
configuration.set("fs.defaultFS", "s3://<bucket>");
// configuration.set("fs.default.name", "s3://<bucket>");
// configuration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem");

FileSystem fs = FileSystem.get(configuration);
Reader reader = OrcFile.createReader(new Path("/some/path/file.orc"),
        OrcFile.readerOptions(configuration).filesystem(fs));

The S3 object was created as follows:

aws s3api put-object --bucket <bucket> --key /some/path/file.orc --body file.orc --tagging jira=<id> --content-type="binary/octet-stream"
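As an aside, the fs.s3.* keys above configure the legacy block-based S3FileSystem, while the S3A connector (hadoop-aws, fs.s3a.* keys) reads plain S3 objects directly. A minimal sketch of the S3A route, untested against this setup and assuming hadoop-aws plus its AWS SDK dependency (matching the Hadoop version) are on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;

public class S3AOrcRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // S3A uses its own credential keys, not the fs.s3.* ones
        conf.set("fs.s3a.access.key", "***");
        conf.set("fs.s3a.secret.key", "***");

        // A fully qualified s3a:// path avoids touching fs.defaultFS
        Path path = new Path("s3a://<bucket>/some/path/file.orc");
        Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
        System.out.println(reader.getNumberOfRows());
    }
}
```

This reads the object exactly as it was uploaded with put-object, with no extra metadata required.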
This would throw an S3FileSystemException ("Not a Hadoop S3 file") from checkMetadata:
private void checkMetadata(S3Object object) throws S3FileSystemException, S3ServiceException {
    String name = (String) object.getMetadata("fs");
    if (!"Hadoop".equals(name)) {
        throw new S3FileSystemException("Not a Hadoop S3 file.");
    } else {
        String type = (String) object.getMetadata("fs-type");
        if (!"block".equals(type)) {
            throw new S3FileSystemException("Not a block file.");
        } else {
            String dataVersion = (String) object.getMetadata("fs-version");
            if (!"1".equals(dataVersion)) {
                throw new VersionMismatchException("1", dataVersion);
            }
        }
    }
}

Updated the object to include --metadata as follows:

aws s3api put-object --bucket <bucket> --key /some/path/file.orc --body file.orc --tagging jira=<id> --content-type="binary/octet-stream" --metadata="fs=Hadoop,fs-type=block,fs-version=1"
Of course, after this it barfs when reading the data, probably because the file formats differ(?):

From INode in package org.apache.hadoop.fs.s3:

public static INode deserialize(InputStream in) throws IOException {
    if (in == null) {
        return null;
    } else {
        DataInputStream dataIn = new DataInputStream(in);
        INode.FileType fileType = FILE_TYPES[dataIn.readByte()];
        // ...

dataIn.readByte() is returning a larger value (FILE_TYPES is an array of size 2). Attached is a snapshot of the backtrace.
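That larger value is consistent with the object being a plain ORC file rather than S3FileSystem's block format: an ORC file begins with the ASCII magic bytes "ORC", so the first readByte() yields 'O' (79), far outside an index range of 2. A minimal self-contained sketch (the three-byte header here stands in for the start of a real ORC file):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class MagicByteDemo {
    public static void main(String[] args) throws IOException {
        // A plain ORC file starts with the magic bytes "ORC".
        byte[] header = {'O', 'R', 'C'};
        DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(header));

        // INode.deserialize treats the first byte as an index into
        // FILE_TYPES (length 2); 'O' = 79 is out of range.
        int fileType = dataIn.readByte();
        System.out.println(fileType);                           // prints 79
        System.out.println(fileType < 2 ? "valid" : "out of range");
    }
}
```

So the metadata workaround only gets past checkMetadata; the bytes themselves are not in the block format the legacy s3:// filesystem expects.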