Posted to user@orc.apache.org by Ashwin Jayaprakash <as...@gmail.com> on 2017/06/23 02:03:22 UTC

Why is Apache Orc RecordReader.searchArgument() not filtering correctly?

(This is a cross-post. I did not get any response on Stack Overflow:
https://stackoverflow.com/questions/44691416/why-is-apache-orc-recordreader-searchargument-not-filtering-correctly
I'm hoping someone can help me get to the bottom of the issue.)

Here is a simple program that:

   1. Writes records into an Orc file
   2. Then tries to read the file using predicate pushdown (searchArgument)

Questions:

   1. Is this the right way to use predicate pushdown in Orc?
   2. The read(..) method seems to return all the records, completely
   ignoring the searchArgument. Why is that?

*Notes:*

I have not been able to find any useful unit test that demonstrates how
predicate pushdown works in Orc (Orc on GitHub
<https://github.com/apache/orc/tree/9175b3e22742b5d4537f072165b863c78de23db5/java/core/src/test/org/apache/orc>).
Nor have I been able to find any clear documentation on this feature. I tried
looking at the Spark
<https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala>
and Presto
<https://github.com/prestodb/presto/blob/master/presto-orc/src/test/java/com/facebook/presto/orc/TestCachingOrcDataSource.java#L192>
code, but I was not able to find anything useful.

The code below is a modified version of
https://github.com/melanio/codecheese-blog-examples/tree/master/orc-examples/src/main/java/codecheese/blog/examples/orc

// Imports assume orc-core together with the hive-storage-api and Hadoop jars.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf.Type;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.Reader.Options;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class TestRoundTrip {

    public static void main(String[] args) throws IOException {
        final String file = "tmp/test-round-trip.orc";
        // Remove any file left over from a previous run.
        new File(file).delete();

        final long highestX = 10000L;
        final Configuration conf = new Configuration();

        write(file, highestX, conf);
        read(file, highestX, conf);
    }

    private static void read(String file, long highestX, Configuration conf) throws IOException {
        Reader reader = OrcFile.createReader(
                new Path(file),
                OrcFile.readerOptions(conf)
        );

        // Ask only for the row where x == highestX - 1000,
        // so exactly one row is expected in the output.
        Options readerOptions = new Options(conf)
                .searchArgument(
                        SearchArgumentFactory
                                .newBuilder()
                                .equals("x", Type.LONG, highestX - 1000)
                                .build(),
                        new String[]{"x"}
                );
        RecordReader rows = reader.rows(readerOptions);
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();

        while (rows.nextBatch(batch)) {
            LongColumnVector x = (LongColumnVector) batch.cols[0];
            LongColumnVector y = (LongColumnVector) batch.cols[1];

            for (int r = 0; r < batch.size; r++) {
                long xValue = x.vector[r];
                long yValue = y.vector[r];

                System.out.println(xValue + ", " + yValue);
            }
        }
        rows.close();
    }

    private static void write(String file, long highestX, Configuration conf) throws IOException {
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        Writer writer = OrcFile.createWriter(
                new Path(file),
                OrcFile.writerOptions(conf).setSchema(schema)
        );

        // Write rows (x, y) = (r, 3r) for r in [0, highestX).
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];
        for (int r = 0; r < highestX; ++r) {
            int row = batch.size++;
            x.vector[row] = r;
            y.vector[row] = r * 3;
            // If the batch is full, write it out and start over.
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        // Flush the final, partially filled batch.
        if (batch.size != 0) {
            writer.addRowBatch(batch);
            batch.reset();
        }
        writer.close();
    }
}


Thanks,
Ashwin.