Posted to user@orc.apache.org by Sivaprasanna <si...@gmail.com> on 2020/04/05 15:58:07 UTC

Query regarding usage of VectorizedRowBatch

Hello,

Context: I am working on a solution to enable bulk writing for the ORC format
in Apache Flink[1], which is a stream processing framework.

The scenario is this: Flink receives elements/records (which could be any
Java type) one at a time, and we want to write them in bulk to get the real
benefit of ORC. To solve this, I have tried two approaches:

1. As and when an element is received, convert that single element into a
VectorizedRowBatch and call writer.addRowBatch(rowBatch). This happens for
every incoming record, meaning a new VectorizedRowBatch is created per
record and added through addRowBatch() one by one.
2. As and when an element is received, add it to a list. When we want to
write, create a single VectorizedRowBatch, iterate over the buffered
elements, transform each record into the batch's ColumnVectors, and add
everything to that one VectorizedRowBatch (a rough sketch of this approach
follows below).
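
For reference, here is a minimal sketch of that second approach. The schema,
field names, and Record type are made up for illustration only; the real
Flink element type and schema would differ.

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class BufferedOrcWriteSketch {
  // Hypothetical record type standing in for the buffered Flink elements.
  static class Record {
    long id;
    String name;
    Record(long id, String name) { this.id = id; this.name = name; }
  }

  static void writeBuffered(List<Record> buffer, Path path) throws Exception {
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");
    Writer writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(new Configuration()).setSchema(schema));
    // One batch for the whole buffer; the default capacity is 1024 rows.
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector idCol = (LongColumnVector) batch.cols[0];
    BytesColumnVector nameCol = (BytesColumnVector) batch.cols[1];
    for (Record r : buffer) {
      int row = batch.size++;
      idCol.vector[row] = r.id;
      nameCol.setVal(row, r.name.getBytes(StandardCharsets.UTF_8));
      if (batch.size == batch.getMaxSize()) {   // flush when the batch is full
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {                      // flush the trailing partial batch
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}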

In both approaches, I saw that the records went into one stripe and that all
the records were intact in the output ORC files. I also wasn't able to find
any significant difference in file size between the two approaches. So I want
to understand the differences and trade-offs between them. Are there any
differences w.r.t. compression between these two approaches?

[1] https://issues.apache.org/jira/browse/FLINK-10114

Thanks,
Sivaprasanna

Re: Query regarding usage of VectorizedRowBatch

Posted by Owen O'Malley <ow...@gmail.com>.
Sivaprasanna,
   As Gopal writes, there will only be minor differences in the file, caused
by when the writer checks whether it is time to finish the current stripe.

Generally, the fewer allocations your code causes, the faster it will be.
You can look at the mapred code for another example of building a
VectorizedRowBatch row by row.
https://github.com/apache/orc/blob/76547648fe36b7d93638dc2712057eb511248094/java/mapreduce/src/java/org/apache/orc/mapred/OrcMapredRecordWriter.java#L251
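
Roughly, the pattern in that code is: allocate one batch up front, append each
incoming row to it, and call addRowBatch() only when the batch fills up. A
sketch of that shape follows; the two-column schema and the write() signature
are assumptions for illustration, not the actual mapred code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

/** Row-at-a-time facade over the batch API; one batch is allocated once and reused. */
class RowByRowOrcWriter implements AutoCloseable {
  private final Writer writer;
  private final VectorizedRowBatch batch;

  RowByRowOrcWriter(Writer writer, TypeDescription schema) {
    this.writer = writer;
    this.batch = schema.createRowBatch();   // allocated once, reset after each flush
  }

  /** Append one row; hand the batch to the ORC writer only when it is full. */
  void write(long id, String name) throws IOException {
    int row = batch.size++;
    ((LongColumnVector) batch.cols[0]).vector[row] = id;
    ((BytesColumnVector) batch.cols[1]).setVal(row, name.getBytes(StandardCharsets.UTF_8));
    if (batch.size == batch.getMaxSize()) {
      writer.addRowBatch(batch);
      batch.reset();
    }
  }

  @Override
  public void close() throws IOException {
    if (batch.size != 0) {          // flush any trailing partial batch
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}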

.. Owen


Re: Query regarding usage of VectorizedRowBatch

Posted by Gopal V <go...@notmysock.org>.
Hi,


>     2. As and when the element is received, add them to a list and when
>     we want to write, create an instance of VectorizedRowBatch and
>     iterate over the list containing the elements and transform each
>     record into ColumnVectors and add to the same VectorizedRowBatch
>     previously created.

There is no difference in the files produced by either approach; building the
row batch on the Flink side has a slight advantage in memory management.

Here's how the Hive writer maintains compatibility with row-by-row writes.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java#L307

>     [1] https://issues.apache.org/jira/browse/FLINK-10114

Will watch this, cheers.

For streaming IO performance, there was a special mode added to ORC 
within hive-streaming.

 
HIVE_ORC_DELTA_STREAMING_OPTIMIZATIONS_ENABLED(
    "hive.exec.orc.delta.streaming.optimizations.enabled", false,
    "Whether to enable streaming optimizations for ORC delta files. This will disable ORC's internal indexes,\n" +
    "disable compression, enable fast encoding and disable dictionary encoding."),
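
Outside of Hive, a roughly similar setup can be approximated directly through
the ORC writer options. This is only a sketch of the same idea (no row-group
indexes, no compression, speed-oriented encoding, dictionary encoding
effectively off via the "orc.dictionary.key.threshold" knob); it is not what
the Hive flag does internally, and the file path and schema are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class StreamingFriendlyWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Dictionary encoding is governed by a threshold; 0 effectively disables it.
    conf.setDouble("orc.dictionary.key.threshold", 0.0);

    TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/streaming.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .rowIndexStride(0)                                  // no row-group indexes
            .compress(CompressionKind.NONE)                     // no compression
            .encodingStrategy(OrcFile.EncodingStrategy.SPEED)); // favor encoding speed
    writer.close();
  }
}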

Cheers,
Gopal


Re: Query regarding usage of VectorizedRowBatch

Posted by Sivaprasanna <si...@gmail.com>.
Bump.

I would really appreciate it if someone could help me with this.

Cheers,
Sivaprasanna
