Posted to dev@parquet.apache.org by iv...@gmail.com on 2018/08/02 17:59:21 UTC
num_level in Parquet Cpp library & how to add a JSON field?
Hi,
I’m creating a parquet file using the parquet C++ library. I’ve been looking for answers online but still can’t figure out the following questions.
1. What does num_levels mean in the WriteBatch method?
WriteBatch(int64_t num_levels, const int16_t* def_levels,
const int16_t* rep_levels,
const typename ParquetType::c_type* values)
2. How do I create a field for the JSON datatype? According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, JSON is not considered a nested datatype. To create a field for JSON data, what primitive type should it be? The link says "binary primitive type"; does that mean "Type::BYTE_ARRAY"?
PrimitiveNode::Make("JSON_field", Repetition::REQUIRED, Type:: ?, LogicalType::JSON)
Any help is appreciated!
Thanks,
Ivy
Re: num_level in Parquet Cpp library & how to add a JSON field?
Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Ivy,
> Is there any way to read the data in logical format? I want to
> check whether my final output is correct.
I usually use the parquet-cli from the parquet-mr project to check if my file is written correctly. This should give you much more informative output.
Simple usage:
git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn -DskipTests=true package
cd parquet-cli
mvn dependency:copy-dependencies
java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta <path-to-parquet-file>
Note that these commands may not all work out of the box for you. If anything breaks, I highly recommend reading parquet-mr's READMEs.
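For a quick manual check of the example values, it may help to recall that TIME_MILLIS annotates an int32 holding milliseconds since midnight, so the primitive values 0 to 499 correspond to the first half second of the day. The sketch below shows that conversion. FormatTimeMillis is a hypothetical helper written for illustration only; it is not part of parquet-cpp or parquet-cli.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not a parquet-cpp API): render a TIME_MILLIS value,
// i.e. an int32 count of milliseconds since midnight, as HH:MM:SS.mmm.
std::string FormatTimeMillis(int32_t millis) {
  // A valid time of day lies in [0, 24h).
  assert(millis >= 0 && millis < 24 * 60 * 60 * 1000);
  const int hours = static_cast<int>(millis / 3600000);
  const int minutes = static_cast<int>((millis / 60000) % 60);
  const int seconds = static_cast<int>((millis / 1000) % 60);
  const int ms = static_cast<int>(millis % 1000);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d.%03d", hours, minutes, seconds, ms);
  return std::string(buf);
}
```

With this helper, the stored value 499 would read as 00:00:00.499, which is what a reader that applies the logical type would be expected to show.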
Uwe
Re: num_level in Parquet Cpp library & how to add a JSON field?
Posted by iv...@gmail.com.
Hi Uwe,
Thank you for the quick reply! That was very helpful.
I have another question regarding your low-level api example here https://github.com/apache/parquet-cpp/blob/master/examples/low-level-api/reader-writer.cc.
In the "int32_field", the logical type is TIME_MILLIS and we put dummy data 0-499 (as int32_t) into this field. When I read the output parquet file using the "parquet_reader" executable (in the /parquet-cpp/build/latest folder), the values are still shown in the int32_t primitive format (0-499) instead of the TIME_MILLIS logical format. Is this expected? Does the executable only read parquet files using the primitive data types?
Is there any way to read the data in logical format? I want to check whether my final output is correct.
Thanks!
-Ivy
Re: num_level in Parquet Cpp library & how to add a JSON field?
Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Ivy,
"primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not yet seen anyone use the JSON field with parquet-cpp but the JSON type is simply a binary string with an annotation so I would expect everything to just work.
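On the first question, a rough sketch may help, assuming a flat OPTIONAL int32 column (maximum definition level 1, no repetition levels). Under that assumption, num_levels is the number of entries in def_levels, one per row including nulls, while values holds only the non-null entries. Row, LeveledBatch and FlattenOptionalInt32 are hypothetical names used for illustration; they are not part of parquet-cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical input row for a flat OPTIONAL int32 column.
struct Row {
  bool is_null;
  int32_t value;  // ignored when is_null is true
};

// Hypothetical container for the arrays a WriteBatch-style call expects.
struct LeveledBatch {
  std::vector<int16_t> def_levels;  // one entry per row: 1 = present, 0 = null
  std::vector<int32_t> values;      // non-null values only, in row order
};

// Split rows into definition levels and densely packed values.
LeveledBatch FlattenOptionalInt32(const std::vector<Row>& rows) {
  LeveledBatch batch;
  for (const Row& row : rows) {
    if (row.is_null) {
      batch.def_levels.push_back(0);  // null: a level is recorded, no value
    } else {
      batch.def_levels.push_back(1);  // defined at the maximum definition level
      batch.values.push_back(row.value);
    }
  }
  // The number of levels always equals the number of rows.
  assert(batch.def_levels.size() == rows.size());
  return batch;
}
```

In this sketch the call would receive def_levels.size() as num_levels, def_levels.data() as def_levels, a null rep_levels pointer, and values.data() as values. For a REQUIRED field such as the one in the question, the maximum definition level is 0, so both level pointers can be null and num_levels is simply the number of values.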
Uwe