Posted to dev@parquet.apache.org by iv...@gmail.com on 2018/08/02 17:59:21 UTC

num_level in Parquet Cpp library & how to add a JSON field?

Hi, 
I’m creating a parquet file using the parquet C++ library. I’ve been looking for answers online but still can’t figure out the following questions.

1. What does num_levels mean in the WriteBatch method?
 WriteBatch(int64_t num_levels, const int16_t* def_levels,
                    const int16_t* rep_levels,
                    const typename ParquetType::c_type* values)

2. How do I create a field for the JSON data type? According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, it seems JSON is not considered a nested data type. Which primitive type should a JSON field use? The document says “binary primitive type”; does that mean "Type::BYTE_ARRAY"?
	PrimitiveNode::Make("JSON_field", Repetition::REQUIRED, Type:: ?, LogicalType::JSON)
	
Any help is appreciated! 
Thanks,
Ivy


Re: num_level in Parquet Cpp library & how to add a JSON field?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Ivy,

> Is there any way to read the data in logical format? I want to
> check whether my final output is correct.

I usually use the parquet-cli from the parquet-mr project to check if my file is written correctly. This should give you much more informative output.

Simple usage:

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn -DskipTests=true package
cd parquet-cli
mvn dependency:copy-dependencies
java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta <path-to-parquet-file>

Note that these commands may not all work out of the box for you. If anything breaks, I highly recommend reading parquet-mr's READMEs.
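For a quick manual check of the TIME_MILLIS column without any extra tooling: per the LogicalTypes.md document linked in the original question, TIME_MILLIS annotates an int32 that holds milliseconds after midnight, so the raw values 0-499 are the first half second of the day. The following is a minimal C++ sketch of that decoding. FormatTimeMillis is a hypothetical helper written for illustration and is not part of parquet-cpp or parquet-cli.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not a parquet-cpp API): render a TIME_MILLIS value,
// i.e. an int32 count of milliseconds after midnight, as HH:MM:SS.mmm.
inline std::string FormatTimeMillis(int32_t millis) {
  // A valid time of day lies in [0, 24h).
  assert(millis >= 0 && millis < 24 * 60 * 60 * 1000);
  const int hours = static_cast<int>(millis / 3600000);
  const int minutes = static_cast<int>((millis / 60000) % 60);
  const int seconds = static_cast<int>((millis / 1000) % 60);
  const int ms = static_cast<int>(millis % 1000);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d.%03d", hours, minutes, seconds, ms);
  return std::string(buf);
}
```

Applied to the values a reader prints as plain integers, this gives a readable time of day, which is enough to confirm by eye that the stored numbers are what was intended.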

Uwe


Re: num_level in Parquet Cpp library & how to add a JSON field?

Posted by iv...@gmail.com.
Hi Uwe, 

Thank you for the quick reply! That was very helpful.

I have another question about your low-level API example at https://github.com/apache/parquet-cpp/blob/master/examples/low-level-api/reader-writer.cc.
In "int32_field", the logical type is TIME_MILLIS and the example writes dummy values 0-499 (as int32_t) into the field. When I read the output Parquet file with the "parquet_reader" executable (in the /parquet-cpp/build/latest folder), the values are still shown in the primitive int32_t format (0-499) rather than in the TIME_MILLIS logical format. Is this expected? Does the executable only display Parquet data as primitive types?

Is there any way to read the data in logical format? I want to check whether my final output is correct.

Thanks!
-Ivy

On 2018/08/03 13:46:15, "Uwe L. Korn" <uw...@xhochy.com> wrote: 
> Hello Ivy,
> 
> "primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not yet seen anyone use the JSON field with parquet-cpp but the JSON type is simply a binary string with an annotation so I would expect everything to just work.
> 
> Uwe

Re: num_level in Parquet Cpp library & how to add a JSON field?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Ivy,

"primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not yet seen anyone use the JSON type with parquet-cpp, but it is simply a binary string with an annotation, so I would expect everything to just work.
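The first question, about num_levels, is not answered in this thread. As a hedged sketch of the usual convention: for a flat OPTIONAL column, num_levels counts the definition-level entries (one per row, nulls included), while the values array holds only the non-null values, so it can be shorter than num_levels. For a REQUIRED, non-repeated column the two counts are equal. The C++ below builds such inputs for an int32 column with maximum definition level 1. LevelBatch and MakeOptionalInt32Batch are hypothetical names used for illustration and are not part of parquet-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not a parquet-cpp type) for the arrays that the
// WriteBatch signature quoted in the question takes.
struct LevelBatch {
  std::vector<int16_t> def_levels;  // one entry per row, null or not
  std::vector<int32_t> values;      // non-null values only, densely packed
  int64_t num_levels() const { return static_cast<int64_t>(def_levels.size()); }
};

// Build the inputs for a flat OPTIONAL int32 column (max definition level 1).
// rows[i] is ignored wherever is_valid[i] is false.
inline LevelBatch MakeOptionalInt32Batch(const std::vector<int32_t>& rows,
                                         const std::vector<bool>& is_valid) {
  assert(rows.size() == is_valid.size());
  LevelBatch batch;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (is_valid[i]) {
      batch.def_levels.push_back(1);  // value present
      batch.values.push_back(rows[i]);
    } else {
      batch.def_levels.push_back(0);  // null: a level entry but no value
    }
  }
  return batch;
}

// Assumed usage with a column writer obtained elsewhere (not shown here):
//   writer->WriteBatch(batch.num_levels(), batch.def_levels.data(),
//                      /*rep_levels=*/nullptr, batch.values.data());
```

With three rows of which one is null, this yields num_levels of 3 and two values, which is the distinction the parameter name is pointing at.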

Uwe
