Posted to user@orc.apache.org by Scott Wells <sc...@gmail.com> on 2016/12/06 14:52:33 UTC
Unable to write string data into ORC file (or at least read it back)
I'm trying to create a little utility to convert CSV files into ORC files.
I've noticed that the resulting ORC files don't seem quite correct,
though. In an effort to create a simple reproducible test case, I just
changed the "Writing/Reading ORC Files" examples here:
https://orc.apache.org/docs/core-java.html
to create a file based on a pair of strings instead of integers. The first
issue I hit is that TypeDescription.fromString() isn't available in 2.1.0,
so I did the following instead:
TypeDescription schema = TypeDescription.createStruct()
    .addField("first", TypeDescription.createString())
    .addField("last", TypeDescription.createString());
Then I changed the loop as follows:
BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
for (int r = 0; r < 10; ++r)
{
    String firstName = ("First-" + r).intern();
    String lastName = ("Last-" + (r * 3)).intern();
    ...
}
The file writes without errors, and if I write it with no compression, I
can see the data using "strings my-file.orc". However, when I then try to
read the data back from the file and print out the resulting batches to the
console, I get the following:
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
[" ", " "]
Any insights about what I may be doing wrong here would be greatly
appreciated!
Regards,
Scott
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
I found the problem. Basically BytesColumnVector.stringifyValue is broken.
I'll update ORC-115.
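Until that fix lands, batch contents can be printed by hand from the columnar layout. Below is a minimal sketch of reading a row back out, assuming the standard BytesColumnVector representation (public byte[][] vector, int[] start, int[] length fields, where row i is the slice vector[i][start[i] .. start[i]+length[i])); plain arrays stand in for the real class so it runs without the ORC jars on the classpath:

```java
import java.nio.charset.StandardCharsets;

// Row i of a columnar string vector is the byte slice
// vector[i][start[i] .. start[i]+length[i]). These plain arrays mirror
// BytesColumnVector's public fields, so no ORC dependency is needed.
public class ManualStringRead {
    static String rowToString(byte[][] vector, int[] start, int[] length, int row) {
        return new String(vector[row], start[row], length[row], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two rows referencing slices of one shared backing buffer,
        // which is exactly what setRef permits.
        byte[] shared = "First-0First-1".getBytes(StandardCharsets.UTF_8);
        byte[][] vector = { shared, shared };
        int[] start  = { 0, 7 };
        int[] length = { 7, 7 };
        System.out.println(rowToString(vector, start, length, 0)); // First-0
        System.out.println(rowToString(vector, start, length, 1)); // First-1
    }
}
```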
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
It looks like your writer is correct. Maybe the VectorizedRowBatch.toString
is wonky. Can you try printing the output using the standard dumper:
% java -jar tools/target/orc-tools-1.2.2-uber.jar data my-file.orc
Thanks,
Owen
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Scott Wells <sc...@gmail.com>.
Thanks, Owen. I'd tried using references but it didn't resolve the issue.
Here's the code:
========================================================
new File("my-file.orc").delete();

Configuration conf = new Configuration();
TypeDescription schema =
    TypeDescription.fromString("struct<x:int,str:string>");
Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema));

VectorizedRowBatch writeBatch = schema.createRowBatch();
LongColumnVector x = (LongColumnVector) writeBatch.cols[0];
BytesColumnVector str = (BytesColumnVector) writeBatch.cols[1];
for (int r = 0; r < 10; ++r)
{
    int row = writeBatch.size++;
    x.vector[row] = r;
    byte[] strBytes = ("String-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
    str.setRef(row, strBytes, 0, strBytes.length);

    // If the batch is full, write it out and start over.
    if (writeBatch.size == writeBatch.getMaxSize())
    {
        writer.addRowBatch(writeBatch);
        writeBatch.reset();
    }
}
if (writeBatch.size > 0)
{
    writer.addRowBatch(writeBatch);
}
writer.close();

Reader reader = OrcFile.createReader(new Path("my-file.orc"),
    OrcFile.readerOptions(conf));

RecordReader rows = reader.rows();
VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
while (rows.nextBatch(readBatch))
{
    System.out.println(readBatch);
}
rows.close();
========================================================
and here's the result of running it:
[0, " "]
[1, " "]
[2, " "]
[3, " "]
[4, " "]
[5, " "]
[6, " "]
[7, " "]
[8, " "]
[9, " "]
Any idea why the strings are coming back empty? Am I missing something on
the reader? For what it's worth, I've tried to put this ORC file into S3
for access via Hive/PrestoDB (using AWS' new Athena service) and it also
doesn't like it.
Thanks again!
Scott
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
As an example of why having the code be executable is a good idea, I
noticed that I was dropping the last batch and needed to add:
if (batch.size != 0) {
    writer.addRowBatch(batch);
}
before the close.
.. Owen
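The batch-and-flush control flow described above can be sketched with a row counter standing in for the real Writer (the method and parameter names here are hypothetical; only the control flow matches the example):

```java
// Skeleton of the fill/flush loop: populate up to maxSize rows per batch,
// flush when full, and flush any partial batch before closing. A counter
// stands in for Writer.addRowBatch so the logic runs without ORC.
public class BatchFlush {
    static int writeRows(int totalRows, int maxSize) {
        int written = 0;
        int batchSize = 0;                  // plays the role of VectorizedRowBatch.size
        for (int r = 0; r < totalRows; ++r) {
            batchSize++;                    // populate one row
            if (batchSize == maxSize) {     // batch full: write it out and reset
                written += batchSize;
                batchSize = 0;
            }
        }
        if (batchSize != 0) {               // the easily forgotten final flush
            written += batchSize;
        }
        return written;
    }

    public static void main(String[] args) {
        // With the default 1024-row batches, a 10-row file is written
        // entirely by the final flush; dropping it loses every row.
        System.out.println(writeRows(10, 1024)); // 10
    }
}
```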
Re: Unable to write string data into ORC file (or at least read it back)
Posted by Owen O'Malley <om...@apache.org>.
You need to call setRef on the BytesColumnVectors. The relevant part is:
byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
y.setRef(row, buffer, 0, buffer.length);
I've created a gist with the example modified to do one int and one string,
here:
https://gist.github.com/omalley/75093e104381ab9d157313993afcbbdf
I realized that we should include the example code in the code base and
created ORC-116.
.. Owen
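The key point is that setRef stores a reference to the caller's buffer rather than copying it, so each row needs its own byte[] (which getBytes() conveniently allocates). A self-contained sketch of that semantics, with plain arrays standing in for BytesColumnVector so it runs without the ORC jars:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrates why setRef needs a fresh buffer per row: it records a
// reference, not a copy. vector[][] stands in for BytesColumnVector.vector.
public class SetRefSemantics {
    // Safe pattern: a new array per row, as in the gist.
    static byte[][] fillSafely(int rows) {
        byte[][] vector = new byte[rows][];
        for (int r = 0; r < rows; ++r) {
            byte[] buffer = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
            vector[r] = buffer;  // what setRef effectively records
        }
        return vector;
    }

    // Broken pattern: one reused scratch buffer; every row ends up
    // aliasing the same bytes, i.e. the last value written.
    static byte[][] fillWithSharedBuffer(int rows) {
        byte[][] vector = new byte[rows][];
        byte[] scratch = new byte[16];
        for (int r = 0; r < rows; ++r) {
            byte[] value = ("Last-" + (r * 3)).getBytes(StandardCharsets.UTF_8);
            Arrays.fill(scratch, (byte) 0);
            System.arraycopy(value, 0, scratch, 0, value.length);
            vector[r] = scratch;  // all rows reference the same mutable buffer
        }
        return vector;
    }

    public static void main(String[] args) {
        // Row 0 read back after filling three rows:
        System.out.println(new String(fillSafely(3)[0], StandardCharsets.UTF_8));            // Last-0
        System.out.println(new String(fillWithSharedBuffer(3)[0], StandardCharsets.UTF_8).trim()); // Last-6
    }
}
```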