Posted to dev@parquet.apache.org by Bhavesh K Shah <Bh...@bitwiseglobal.com> on 2015/06/29 08:09:05 UTC

BigDecimal & Date Datatype - How to use with parquet-cascading?

Hi,
I am trying to use the BigDecimal and Date datatypes with parquet-cascading. I have written a sample job, but it throws an exception whenever I try to use a BigDecimal or Date type in the "message type" passed to ParquetTupleScheme; I cannot find a way to express those two types there. Below is the sample code:

import java.math.BigDecimal;
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.hadoop.BigDecimalSerialization;
import cascading.tuple.hadoop.TupleSerializationProps;
import cascading.tuple.type.DateType;
import parquet.cascading.ParquetTupleScheme; // org.apache.parquet.cascading in newer releases

public class ReadWriteParquet {

    static String textInputPath = "inputOutput/input/in1.txt";
    static String parquetOutputPath = "inputOutput/output/parquet-out";
    static String textOutputPath = "inputOutput/output/text-out";

    public static void main(String[] args) throws Exception {
        ReadWriteParquet.write();
        ReadWriteParquet.read();
    }

    // Reads the Parquet output back, groups by Branch, and writes delimited text.
    private static void read() {
        Scheme parquetinput = new ParquetTupleScheme(new Fields("Name",
                "College", "Branch", "Age", "Doj", "BigDeci"));
        Scheme textoutput = new TextDelimited(true, ",");

        Tap source = new Hfs(parquetinput, parquetOutputPath);
        Tap sink = new Hfs(textoutput, textOutputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Read Parquet");
        pipe = new GroupBy(pipe, new Fields("Branch"));

        Properties hadoopProps = new Properties();
        AppProps.setApplicationJarClass(hadoopProps, ReadWriteParquet.class);
        TupleSerializationProps.addSerialization(hadoopProps,
                BigDecimalSerialization.class.getName());

        FlowDef flowdef = FlowDef.flowDef().addSource(pipe, source)
                .addTailSink(pipe, sink);
        HadoopFlowConnector hd = new HadoopFlowConnector(hadoopProps);
        hd.connect(flowdef).complete();
    }

    // Reads delimited text and writes it out as Parquet using the message type below.
    private static void write() {
        DateType dateType = new DateType("dd/MM/yyyy");
        Fields fields = new Fields("Name", "College", "Branch", "Age", "Doj",
                "BigDeci").applyTypes(String.class, String.class, String.class,
                Integer.class, dateType, BigDecimal.class);

        Scheme input = new TextDelimited(fields, true, ",");

        // Doj is declared as int64 and BigDeci as double only to make the job
        // run; what I actually want is a date and a decimal here.
        Scheme parquetout = new ParquetTupleScheme(fields, fields,
                "message ReadWriteParquet {"
                + " required binary Name;"
                + " required binary College;"
                + " required binary Branch;"
                + " optional int64 Age;"
                + " required int64 Doj;"
                + " required double BigDeci; }");

        Tap source = new Hfs(input, textInputPath);
        Tap sink = new Hfs(parquetout, parquetOutputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Write Parquet");

        FlowDef flowdef = FlowDef.flowDef().addSource(pipe, source)
                .addTailSink(pipe, sink);
        new HadoopFlowConnector().connect(flowdef).complete();
    }
}

In the code above, you can see that in the "message type" I have mapped the date column to int64 (since there is no date datatype available) and the BigDecimal column to double. With int64 the dates are written out as long values, but I want the dates written in a particular format. The same goes for BigDecimal: to get the job to run I mapped it to double, but I want to keep it as a proper decimal.

Is there any way to handle the Date and BigDecimal datatypes in parquet-cascading while writing data? Please let me know if there is any workaround.


Thanks,
Bhavesh


Re: BigDecimal & Date Datatype - How to use with parquet-cascading?

Posted by Ryan Blue <bl...@cloudera.com>.
Hi Bhavesh,

The support for BigDecimal and Date is still evolving. For parquet-avro, 
we're adding support in upstream Avro and will be adding the translation 
once that is released in 1.7.8.

For parquet-cascading, I'm not sure. I'm not very familiar with 
cascading, though I think someone was working on Avro support. I assume 
with Avro support in parquet-cascading and Avro support for those types, 
it would work. Otherwise, I think the best option is to use a conversion 
in your Cascading code and set your Parquet schema correctly for other 
consumers to pick up the types.
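
Untested, but roughly what I mean (the ToParquetTypes class, the field names, and DECIMAL(18,2) below are placeholders for illustration only): a Function in front of the Parquet sink could turn the date into days-since-epoch and the BigDecimal into its unscaled long, and the schema string could carry the DATE and DECIMAL annotations so other readers know how to interpret those columns.

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.concurrent.TimeUnit;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

// Untested sketch: class name, field names, and the (18, 2) decimal are assumptions.
public class ToParquetTypes extends BaseOperation implements Function {

    public ToParquetTypes() {
        // declares the two replacement values this function emits
        super(2, new Fields("Doj", "BigDeci"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry args = functionCall.getArguments();

        // DateType coerces "dd/MM/yyyy" to millis since epoch;
        // Parquet's DATE annotation expects days since epoch in an int32
        int days = (int) TimeUnit.MILLISECONDS.toDays(args.getLong("Doj"));

        // DECIMAL stores only the unscaled value; the scale is fixed in the schema
        BigDecimal dec = (BigDecimal) args.getObject("BigDeci");
        long unscaled = dec.setScale(2, RoundingMode.HALF_UP)
                .unscaledValue().longValueExact();

        functionCall.getOutputCollector().add(new Tuple(days, unscaled));
    }
}

It would be wired in before the sink with something like

    pipe = new Each(pipe, new Fields("Doj", "BigDeci"), new ToParquetTypes(), Fields.REPLACE);

and a schema string along the lines of

    message ReadWriteParquet {
      required binary Name;
      required binary College;
      required binary Branch;
      optional int64 Age;
      required int32 Doj (DATE);
      required int64 BigDeci (DECIMAL(18,2));
    }

Whether the DATE/DECIMAL annotations survive the write depends on your parquet-cascading version, so treat this as a starting point rather than something I've verified.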

For conversions, you can use the ones that are in Avro. See [1] for 
Decimal and [2] for Dates and times.

rb


[1]: 
https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/Conversions.java
[2]: 
https://github.com/rdblue/avro/blob/6a046d45/lang/java/avro/src/main/java/org/apache/avro/data/TimeConversions.java
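
If it helps, here is a rough, untested sketch of what using the decimal conversion from [1] could look like once it's in a release (the precision and scale are made up, and this assumes the API as it currently stands on trunk):

import java.math.BigDecimal;
import java.nio.ByteBuffer;

import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalConversionExample {
    public static void main(String[] args) {
        // an Avro bytes schema annotated as decimal(18, 2); values are placeholders
        Schema decimalSchema = LogicalTypes.decimal(18, 2)
                .addToSchema(Schema.create(Schema.Type.BYTES));

        Conversions.DecimalConversion conversion = new Conversions.DecimalConversion();

        // big-endian two's-complement bytes of the unscaled value, which is the
        // same representation Parquet's DECIMAL annotation uses for binary columns
        ByteBuffer bytes = conversion.toBytes(new BigDecimal("1234.56"),
                decimalSchema, decimalSchema.getLogicalType());

        System.out.println(bytes.remaining() + " bytes for the unscaled value");
    }
}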



-- 
Ryan Blue
Software Engineer
Cloudera, Inc.