Posted to user@flink.apache.org by Krzysztof Chmielewski <kr...@gmail.com> on 2022/01/10 13:59:00 UTC

ParquetColumnarRowInputFormat - parameter description

Hi,
I would like to ask for some more details regarding
three ParquetColumnarRowInputFormat constructor parameters.

The parameters are:
batchSize,
isUtcTimestamp,
isCaseSensitive

The parameter names give some hint about their purpose, but there is no
description in the docs (Java, Flink page).

Could you provide me with some information about the batching process and
the other two boolean flags?

Regards,
Krzysztof Chmielewski

Re: ParquetColumnarRowInputFormat - parameter description

Posted by Krzysztof Chmielewski <kr...@gmail.com>.
Thank you Fabian,

I have one follow-up question.

You wrote:


*isUtcTimestamp denotes whether timestamps should be represented as SQL UTC
timestamps.*

Question:
So, if *isUtcTimestamp* is set to false, how are timestamps represented?

Regards,
Krzysztof Chmielewski

On Tue, 25 Jan 2022 at 11:56, Fabian Paul <fp...@apache.org> wrote:

> Hi Krzysztof,
>
> sorry for the late reply. The community is very busy at the moment
> with the final two weeks of Flink 1.15.
>
> The parameters you have mentioned are mostly relevant for the internal
> conversion or representation from Parquet types to Flink's SQL type
> system.
>
> - isUtcTimestamp denotes whether timestamps should be represented as
> SQL UTC timestamps
> - batchSize is the internal number of rows that are put into one
> vector. Vectors are used internally in Flink SQL for performance
> reasons, to enable faster execution on batches; e.g. for Hive we use
> the default value in [1]
> - isCaseSensitive determines whether the field/column names from
> Parquet are matched to columns in Flink case-sensitively
>
> I have also included @jingsonglee0@gmail.com who is more familiar with
> the Parquet format.
>
> Best,
> Fabian
>
> [1]
> https://github.com/apache/flink/blob/d8a031c2b7d7b73fe38a3f894913d3dcaa5a4111/flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/columnar/vector/VectorizedColumnBatch.java#L46

Re: ParquetColumnarRowInputFormat - parameter description

Posted by Fabian Paul <fp...@apache.org>.
Hi Krzysztof,

sorry for the late reply. The community is very busy at the moment
with the final two weeks of Flink 1.15.

The parameters you have mentioned are mostly relevant for the internal
conversion or representation from Parquet types to Flink's SQL type
system.

- isUtcTimestamp denotes whether timestamps should be represented as
SQL UTC timestamps
- batchSize is the internal number of rows that are put into one
vector. Vectors are used internally in Flink SQL for performance
reasons, to enable faster execution on batches; e.g. for Hive we use
the default value in [1]
- isCaseSensitive determines whether the field/column names from
Parquet are matched to columns in Flink case-sensitively

I have also included @jingsonglee0@gmail.com who is more familiar with
the Parquet format.
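
To make the isUtcTimestamp bullet more concrete, here is a minimal sketch
that uses only JDK classes and none of Flink's reader code. It assumes that
with the flag set to true the stored instant is read as a UTC wall-clock
value, and that with the flag set to false it is read in the JVM's local
time zone. The second half is an assumption about the reader's behaviour
and should be checked against the Flink version in use. The class
UtcTimestampSketch and the helper toWallClock are hypothetical names.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class UtcTimestampSketch {

    // Hypothetical helper: converts a stored epoch-millisecond value into a
    // wall-clock timestamp. The zone choice models the assumed meaning of
    // the flag; it is not taken from Flink's source.
    static LocalDateTime toWallClock(long epochMillis, boolean isUtcTimestamp, ZoneId localZone) {
        Instant instant = Instant.ofEpochMilli(epochMillis);
        ZoneId zone = isUtcTimestamp ? ZoneOffset.UTC : localZone;
        return LocalDateTime.ofInstant(instant, zone);
    }

    public static void main(String[] args) {
        long stored = Instant.parse("2022-01-25T10:56:00Z").toEpochMilli();
        ZoneId warsaw = ZoneId.of("Europe/Warsaw");

        // Same stored value, two different wall-clock readings.
        System.out.println(toWallClock(stored, true, warsaw));  // 2022-01-25T10:56
        System.out.println(toWallClock(stored, false, warsaw)); // 2022-01-25T11:56
    }
}
```

Under these assumptions the flag only matters when the JVM's zone differs
from UTC; on a UTC machine both calls return the same value.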

Best,
Fabian

[1] https://github.com/apache/flink/blob/d8a031c2b7d7b73fe38a3f894913d3dcaa5a4111/flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/columnar/vector/VectorizedColumnBatch.java#L46
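
As a rough illustration of what "rows put into one vector" means, the
sketch below splits a single column into fixed-size chunks in plain Java.
It is not Flink's VectorizedColumnBatch; the class BatchSizeSketch and the
method toVectors are hypothetical, and the batch size of 3 is chosen only
to keep the output short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSizeSketch {

    // Splits one column's values into vectors holding at most batchSize entries.
    static List<int[]> toVectors(int[] columnValues, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> vectors = new ArrayList<>();
        for (int start = 0; start < columnValues.length; start += batchSize) {
            int end = Math.min(start + batchSize, columnValues.length);
            vectors.add(Arrays.copyOfRange(columnValues, start, end));
        }
        return vectors;
    }

    public static void main(String[] args) {
        int[] column = {1, 2, 3, 4, 5, 6, 7};
        // Downstream code would process one vector at a time rather than row by row.
        for (int[] vector : toVectors(column, 3)) {
            System.out.println(Arrays.toString(vector));
        }
    }
}
```

A larger batch size means fewer, bigger vectors per file, so fewer
hand-offs to downstream code but more memory held per batch.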

On Mon, Jan 24, 2022 at 4:32 PM Krzysztof Chmielewski
<kr...@gmail.com> wrote:
>
> Hi,
> I would like to bump this up a little bit.
>
> isCaseSensitive is rather clear: if it is false, column names in the Parquet file are matched case-insensitively.
> batchSize - how many records we read from the Parquet file before passing them to the upper classes, right?
>
> Could someone describe what the timestamp flag does, with some examples?
>
> Regards,
> Krzysztof Chmielewski

Re: ParquetColumnarRowInputFormat - parameter description

Posted by Krzysztof Chmielewski <kr...@gmail.com>.
Hi,
I would like to bump this up a little bit.

isCaseSensitive is rather clear: if it is false, column names in the
Parquet file are matched case-insensitively.
batchSize - how many records we read from the Parquet file before passing
them to the upper classes, right?

Could someone describe what the timestamp flag does, with some examples?
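
The reading of isCaseSensitive above can be sketched in plain Java as
follows. This is only an illustration of name matching under that
assumption, not the lookup Flink performs; the class
CaseSensitivitySketch and the method findParquetField are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CaseSensitivitySketch {

    // Returns the Parquet field that matches the requested Flink column name, if any.
    static Optional<String> findParquetField(
            List<String> parquetFields, String flinkColumn, boolean isCaseSensitive) {
        for (String field : parquetFields) {
            boolean matches = isCaseSensitive
                    ? field.equals(flinkColumn)
                    : field.equalsIgnoreCase(flinkColumn);
            if (matches) {
                return Optional.of(field);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> parquetFields = Arrays.asList("userId", "eventTime");

        // Case-sensitive lookup finds nothing for a differently cased name.
        System.out.println(findParquetField(parquetFields, "USERID", true).isPresent());  // false
        // Case-insensitive lookup resolves it to the stored field name.
        System.out.println(findParquetField(parquetFields, "USERID", false).get());       // userId
    }
}
```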

Regards,
Krzysztof Chmielewski

