Posted to user@orc.apache.org by Scott Wells <sc...@gmail.com> on 2016/12/06 14:52:33 UTC
Unable to write string data into ORC file (or at least read it back)
I'm trying to create a little utility to convert CSV files into ORC files.
I've noticed that the resulting ORC files don't seem quite correct,
though. In an effort to create a simple reproducible test case, I just
changed the "Writing/Reading ORC Files" examples here:
https://orc.apache.org/docs/core-java.html
to create a file based on a pair of strings instead of integers. The first
issue I hit is that TypeDescription.fromString() isn't available in 2.1.0,
so I did the following instead:
TypeDescription schema = TypeDescription.createStruct()
    .addField("first", TypeDescription.createString())
    .addField("last", TypeDescription.createString());
Then I changed the loop as follows:
BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
for (int r = 0; r < 10; ++r)
{
    String firstName = ("First-" + r).intern();
    String lastName = ("Last-" + (r * 3)).intern();
    ...
}
The file writes without errors, and if I write it with no compression, I
can see the data using "strings my-file.orc". However, when I then try to
read the data back from the file and print out the resulting batches to the
console, I get the following:
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
Any insights about what I may be doing wrong here would be greatly
appreciated!
Regards,
Scott
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
I found the problem. Basically BytesColumnVector.stringifyValue is broken.
I'll update ORC-115.
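Until that fix lands, batch contents can be printed by hand from the columnar layout. Below is a minimal sketch of reading a row back out, assuming the standard BytesColumnVector representation (public byte[][] vector, int[] start, int[] length fields, where row i is the slice vector[i][start[i] .. start[i]+length[i])); plain arrays stand in for the real class so it runs without the ORC jars on the classpath:

```java
import java.nio.charset.StandardCharsets;

// Row i of a columnar string vector is the byte slice
// vector[i][start[i] .. start[i]+length[i]). These plain arrays mirror
// BytesColumnVector's public fields, so no ORC dependency is needed.
public class ManualStringRead {
    static String rowToString(byte[][] vector, int[] start, int[] length, int row) {
        return new String(vector[row], start[row], length[row], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two rows referencing slices of one shared backing buffer,
        // which is exactly what setRef permits.
        byte[] shared = "First-0First-1".getBytes(StandardCharsets.UTF_8);
        byte[][] vector = { shared, shared };
        int[] start  = { 0, 7 };
        int[] length = { 7, 7 };
        System.out.println(rowToString(vector, start, length, 0)); // First-0
        System.out.println(rowToString(vector, start, length, 1)); // First-1
    }
}
```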
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
It looks like your writer is correct. Maybe the VectorizedRowBatch.toString
is wonky. Can you try printing the output using the standard dumper:
% java -jar tools/target/orc-tools-1.2.2-uber.jar data my-file.orc
Thanks,
Owen
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Scott Wells <sc...@gmail.com>.
Thanks, Owen. I'd tried using references but it didn't resolve the issue.
Here's the code:
========================================================
new File("my-file.orc").delete();

Configuration conf = new Configuration();
TypeDescription schema =
    TypeDescription.fromString("struct<x:int,str:string>");
Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema));

VectorizedRowBatch writeBatch = schema.createRowBatch();
LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
for (int r = 0; r < 10; ++r)
{
    int row = writeBatch.size++;
    x.vector[row] = r;
    byte[] strBytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
    str.setRef(row, strBytes, 0, strBytes.length);

    // If the batch is full, write it out and start over.
    if (writeBatch.size == writeBatch.getMaxSize())
    {
        writer.addRowBatch(writeBatch);
        writeBatch.reset();
    }
}
if (writeBatch.size > 0)
{
    writer.addRowBatch(writeBatch);
}
writer.close();

Reader reader = OrcFile.createReader(new Path("my-file.orc"),
    OrcFile.readerOptions(conf));

RecordReader rows = reader.rows();
VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
while (rows.nextBatch(readBatch))
{
    System.out.println(readBatch);
}
rows.close();
========================================================
and here's the result of running it:
[0, " "]
[1, " "]
[2, " "]
[3, " "]
[4, " "]
[5, " "]
[6, " "]
[7, " "]
[8, " "]
[9, " "]
Any idea why the strings are coming back empty? Am I missing something on
the reader? For what it's worth, I've tried to put this ORC file into S3
for access via Hive/PrestoDB (using AWS' new Athena service) and it also
doesn't like it.
Thanks again!
Scott
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
As an example of why having the code be executable is a good idea, I
noticed that I was dropping the last batch and needed to add:
if (batch.size != 0) {
    writer.addRowBatch(batch);
}
before the close.
.. Owen
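The batch-and-flush control flow described above can be sketched with a row counter standing in for the real Writer (the method and parameter names here are hypothetical; only the control flow matches the example):

```java
// Skeleton of the fill/flush loop: populate up to maxSize rows per batch,
// flush when full, and flush any partial batch before closing. A counter
// stands in for Writer.addRowBatch so the logic runs without ORC.
public class BatchFlush {
    static int writeRows(int totalRows, int maxSize) {
        int written = 0;
        int batchSize = 0;                  // plays the role of VectorizedRowBatch.size
        for (int r = 0; r < totalRows; ++r) {
            batchSize++;                    // populate one row
            if (batchSize == maxSize) {     // batch full: write it out and reset
                written += batchSize;
                batchSize = 0;
            }
        }
        if (batchSize != 0) {               // the easily forgotten final flush
            written += batchSize;
        }
        return written;
    }

    public static void main(String[] args) {
        // With the default 1024-row batches, a 10-row file is written
        // entirely by the final flush; dropping it loses every row.
        System.out.println(writeRows(10, 1024)); // 10
    }
}
```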
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
You need to call setRef on the BytesColumnVectors. The relevant part is:
byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
y.setRef(row, buffer, 0, buffer.length);
I've created a gist with the example modified to do one int and one string,
here:
https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
I realized that we should include the example code in the code base and
created ORC-116.
.. Owen
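The key point is that setRef stores a reference to the caller's buffer rather than copying it, so each row needs its own byte[] (which getBytes() conveniently allocates). A self-contained sketch of that semantics, with plain arrays standing in for BytesColumnVector so it runs without the ORC jars:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrates why setRef needs a fresh buffer per row: it records a
// reference, not a copy. vector[][] stands in for BytesColumnVector.vector.
public class SetRefSemantics {
    // Safe pattern: a new array per row, as in the gist.
    static byte[][] fillSafely(int rows) {
        byte[][] vector = new byte[rows][];
        for (int r = 0; r < rows; ++r) {
            byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
            vector[r] = buffer;  // what setRef effectively records
        }
        return vector;
    }

    // Broken pattern: one reused scratch buffer; every row ends up
    // aliasing the same bytes, i.e. the last value written.
    static byte[][] fillWithSharedBuffer(int rows) {
        byte[][] vector = new byte[rows][];
        byte[] scratch = new byte[16];
        for (int r = 0; r < rows; ++r) {
            byte[] value = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
            Arrays.fill(scratch, (byte) 0);
            System.arraycopy(value, 0, scratch, 0, value.length);
            vector[r] = scratch;  // all rows reference the same mutable buffer
        }
        return vector;
    }

    public static void main(String[] args) {
        // Row 0 read back after filling three rows:
        System.out.println(new String(fillSafely(3)[0], StandardCharsets.UTF_8));            // Last-0
        System.out.println(new String(fillWithSharedBuffer(3)[0], StandardCharsets.UTF_8).trim()); // Last-6
    }
}
```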