Posted to user@pig.apache.org by Danfeng Li <dl...@operasolutions.com> on 2012/08/22 01:38:00 UTC

runtime exception when load and store multiple files using avro in pig

I ran into this strange problem when trying to load multiple text-formatted files and convert them into Avro format using Pig. However, if I read and convert one file at a time in separate runs, everything is fine. The error message is the following:

2012-08-21 19:15:32,964 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in union ["null","long"]
                at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
                at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
                at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
                at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
                at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapB

my code is
set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
   id:long,
   f1:long,
   f2:chararray,
   f3:float,
   f4:float,
   f5:float,
   f6:float,
   f7:float,
   f8:float,
   f9:float,
   f10:float,
   f11:float,
   f12:float);
store set1 into '$output_dir/set1.avro'
using org.apache.pig.piggybank.storage.avro.AvroStorage();

set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
   id : int,
   date : chararray);
store set2 into '$output_dir/set2.avro'
using org.apache.pig.piggybank.storage.avro.AvroStorage();

The first file is converted fine, but the second one fails. The error comes from the 2nd field in the 2nd file, but the strange thing is that I don't even have "long" in my schema, while the error message shows ["null","long"].

I use pig 0.10.0 and avro-1.7.1.jar.

I wonder whether this is a bug or I've missed something.

Thanks.
Dan

Here's set1.txt
827352|740214|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
827353|740214|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
827354|740214|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
827193|739576|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
827194|739576|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
827195|739576|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
827355|740215|Long|51|1.776868012839072|113652088.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.164818333642054|0
827356|740215|Short|34|-2.360589090333165|-150988074.9471841|0|0|-0.00868330219442376|-0.008616238065508337|-0.008616238065508375|-0.5943698959308671|-0.02690679230502523
827357|740215|Total|85|-0.5837210774940929|63962032.00527128|0|0|0.01084217439253325|0.01095407961533095|0.0109540796153309|0.5704484377111866|-0.02690679230502523
827202|739590|Long|53|1.777568428360522|113696888.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.156653489849146|0

Here's the set2.txt
1|1980-01-01 00:00:00.000
2|1980-01-02 00:00:00.000
3|1980-01-03 00:00:00.000
4|1980-01-04 00:00:00.000
5|1980-01-07 00:00:00.000
6|1980-01-08 00:00:00.000
7|1980-01-09 00:00:00.000
8|1980-01-10 00:00:00.000
9|1980-01-11 00:00:00.000
10|1980-01-14 00:00:00.000

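[Editor's note: the failure above is a union-type mismatch — an Avro union like ["null","long"] accepts only a null or an integer value, never a string such as "1980-01-01 00:00:00.000". The following stdlib-only Python sketch is an assumed simplification of Avro's union resolution, added for illustration; it is not Avro's actual code.]

```python
# Simplified sketch of how an Avro union accepts or rejects a datum.
# Real Avro schema resolution is much richer; this mirrors only the
# failing case from the stack trace above.

CHECKS = {
    "null": lambda d: d is None,
    "long": lambda d: isinstance(d, int) and not isinstance(d, bool),
    "string": lambda d: isinstance(d, str),
}

def matches_union(datum, union):
    """Return True if the datum fits any branch of the union schema."""
    return any(CHECKS[branch](datum) for branch in union)

# The writer's schema carried ["null","long"], so the date string fails:
print(matches_union("1980-01-01 00:00:00.000", ["null", "long"]))    # False
print(matches_union("1980-01-01 00:00:00.000", ["null", "string"]))  # True
```

This is exactly the shape of the error: the datum is a perfectly good chararray, but it is being validated against a schema that expects a long.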

Re: runtime exception when load and store multiple files using avro in pig

Posted by Alan Gates <ga...@hortonworks.com>.
Moving it into core makes sense to me, as Avro is a format we should be supporting.

Alan.

On Aug 21, 2012, at 6:03 PM, Cheolsoo Park wrote:

> Hi Dan,
> 
> Glad to hear that it worked. I totally agree that AvroStorage can be
> improved. In fact, it was written for Pig 0.7, so it can be written much
> nicer now.
> 
> Only concern that I have is backward compatibility. That is, if I change
> syntax (I wanted so badly while working on AvroStorage recently), it will
> break backward compatibility. What I have been thinking is to
> rewrite AvroStorage in core Pig like HBaseStorage. For
> backward compatibility, we may keep the old version in Piggybank for a
> while and eventually retire it.
> 
> I am wondering what other people think. Please let me know if it is not a
> good idea to move AvroStorage to core Pig from Piggybank.
> 
> Thanks,
> Cheolsoo
> 
> On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <dl...@operasolutions.com> wrote:
> 
>> Thanks, Cheolsoo. That solve my problems.
>> 
>> It will be nice if pig can do this automatically when there are multiple
>> avrostorage in the code. Otherwise, we have to manually track the numbers.
>> 
>> Dan
>> 
>> -----Original Message-----
>> From: Cheolsoo Park [mailto:cheolsoo@cloudera.com]
>> Sent: Tuesday, August 21, 2012 5:06 PM
>> To: user@pig.apache.org
>> Subject: Re: runtime exception when load and store multiple files using
>> avro in pig
>> 
>> Hi Danfeng,
>> 
>> The "long" is from the 1st AvroStorage store in your script. The
>> AvroStorage has very funny syntax regarding multiple stores. To apply
>> different avro schemas to multiple stores, you have to specify their
>> "index" as follows:
>> 
>> set1 = load 'input1.txt' using PigStorage('|') as ( ... ); *store set1
>> into 'set1' using
>> org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');*
>> 
>> set2 = load 'input2.txt' using PigStorage('|') as ( .. ); *store set2 into
>> 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index',
>> '2');*
>> 
>> As can be seen, I added the 'index' parameters.
>> 
>> What AvroStorage does is to construct the following string in the frontend:
>> 
>> "1#<1st avro schema>,2#<2nd avro schema>"
>> 
>> and pass it to backend via UdfContext. Now in backend, tasks parse this
>> string to get output schema for each store.
>> 
>> Thanks,
>> Cheolsoo
>> 
>> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <dl...@operasolutions.com>
>> wrote:
>> 
>>> I run into this strange problem when try to load multiple text
>>> formatted files and convert them into avro format using pig. However,
>>> if I read and convert one file at a time in separated runs, everything
>>> is fine. The error message is following
>>> 
>>> 2012-08-21 19:15:32,964 [main] ERROR
>>> org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to
>>> recreate exception from backed error:
>>> org.apache.avro.file.DataFileWriter$AppendWriteException:
>>> java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in
>>> union ["null","long"]
>>>                at
>>> org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
>>>                at
>>> 
>> org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
>>>                at
>>> 
>> org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
>>>                at
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>>>                at
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>>>                at
>>> 
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
>>>                at
>>> 
>> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>                at
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
>>>                at
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGeneri
>>> cMapB
>>> 
>>> my code is
>>> set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
>>>   id:long,
>>>   f1:long,
>>>   f2:chararray,
>>>   f3:float,
>>>   f4:float,
>>>   f5:float,
>>>   f6:float,
>>>   f7:float,
>>>   f8:float,
>>>   f9:float,
>>>   f10:float,
>>>   f11:float,
>>>   f12:float);
>>> store set1 into '$output_dir/set1.avro'
>>> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> 
>>> set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
>>>   id : int,
>>>   date : chararray);
>>> store set2 into '$output_dir/set2.avro'
>>> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> 
>>> The first file is converted fine, but the 2nd one is failed. The error
>>> is coming from the 2nd field in the 2nd file, but the strange thing is
>>> that I don't even have "long" in my schema while the error message is
>>> showing ["null","long"].
>>> 
>>> I use pig 0.10.0 and avro-1.7.1.jar.
>>> 
>>> I wonder if this is a bug or I missed something.
>>> 
>>> Thanks.
>>> Dan
>>> 
>>> Here's set1.txt
>>> 
>>> 827352|740214|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
>>> 827353|740214|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
>>> 827354|740214|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
>>> 827193|739576|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
>>> 827194|739576|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
>>> 827195|739576|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
>>> 827355|740215|Long|51|1.776868012839072|113652088.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.164818333642054|0
>>> 827356|740215|Short|34|-2.360589090333165|-150988074.9471841|0|0|-0.00868330219442376|-0.008616238065508337|-0.008616238065508375|-0.5943698959308671|-0.02690679230502523
>>> 827357|740215|Total|85|-0.5837210774940929|63962032.00527128|0|0|0.01084217439253325|0.01095407961533095|0.0109540796153309|0.5704484377111866|-0.02690679230502523
>>> 827202|739590|Long|53|1.777568428360522|113696888.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.156653489849146|0
>>> 
>>> Here's the set2.txt
>>> 1|1980-01-01 00:00:00.000
>>> 2|1980-01-02 00:00:00.000
>>> 3|1980-01-03 00:00:00.000
>>> 4|1980-01-04 00:00:00.000
>>> 5|1980-01-07 00:00:00.000
>>> 6|1980-01-08 00:00:00.000
>>> 7|1980-01-09 00:00:00.000
>>> 8|1980-01-10 00:00:00.000
>>> 9|1980-01-11 00:00:00.000
>>> 10|1980-01-14 00:00:00.000
>>> 
>>> 
>> 


RE: runtime exception when load and store multiple files using avro in pig

Posted by Danfeng Li <dl...@operasolutions.com>.
Hi, Cheolsoo,

If we could allow strings as indexes, it would stay backward compatible and would also let us separate the schemas without needing to track the numbers.

Thanks.
Dan
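[Editor's note: Dan's suggestion could be sketched as below. This is purely hypothetical — AvroStorage does not accept string indexes — but it shows how a string-keyed schema map would remove the manual numbering. The encode/decode helpers and the placeholder schema names are invented for illustration.]

```python
# Hypothetical: a string-keyed variant of AvroStorage's numeric-index
# scheme, keyed by store name so scripts need not keep indexes in sync.
# Placeholder schemas contain no commas; real Avro schemas are JSON and
# would need proper escaping rather than a naive split.

def encode(schemas):
    # mirrors the real "1#<schema>,2#<schema>" wire format, but string-keyed
    return ",".join(f"{key}#{schema}" for key, schema in sorted(schemas.items()))

def decode(blob):
    return dict(pair.split("#", 1) for pair in blob.split(","))

blob = encode({"set1": "long-schema", "set2": "string-schema"})
print(decode(blob)["set2"])  # string-schema
```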

-----Original Message-----
From: Cheolsoo Park [mailto:cheolsoo@cloudera.com] 
Sent: Tuesday, August 21, 2012 6:04 PM
To: user@pig.apache.org
Subject: Re: runtime exception when load and store multiple files using avro in pig

Hi Dan,

Glad to hear that it worked. I totally agree that AvroStorage can be improved. In fact, it was written for Pig 0.7, so it can be written much nicer now.

Only concern that I have is backward compatibility. That is, if I change syntax (I wanted so badly while working on AvroStorage recently), it will break backward compatibility. What I have been thinking is to rewrite AvroStorage in core Pig like HBaseStorage. For backward compatibility, we may keep the old version in Piggybank for a while and eventually retire it.

I am wondering what other people think. Please let me know if it is not a good idea to move AvroStorage to core Pig from Piggybank.

Thanks,
Cheolsoo

On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <dl...@operasolutions.com> wrote:

> Thanks, Cheolsoo. That solve my problems.
>
> It will be nice if pig can do this automatically when there are 
> multiple avrostorage in the code. Otherwise, we have to manually track the numbers.
>
> Dan
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:cheolsoo@cloudera.com]
> Sent: Tuesday, August 21, 2012 5:06 PM
> To: user@pig.apache.org
> Subject: Re: runtime exception when load and store multiple files 
> using avro in pig
>
> Hi Danfeng,
>
> The "long" is from the 1st AvroStorage store in your script. The 
> AvroStorage has very funny syntax regarding multiple stores. To apply 
> different avro schemas to multiple stores, you have to specify their 
> "index" as follows:
>
> set1 = load 'input1.txt' using PigStorage('|') as ( ... ); *store set1 
> into 'set1' using 
> org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');*
>
> set2 = load 'input2.txt' using PigStorage('|') as ( .. ); *store set2 
> into 'set2' using 
> org.apache.pig.piggybank.storage.avro.AvroStorage('index',
> '2');*
>
> As can be seen, I added the 'index' parameters.
>
> What AvroStorage does is to construct the following string in the frontend:
>
> "1#<1st avro schema>,2#<2nd avro schema>"
>
> and pass it to backend via UdfContext. Now in backend, tasks parse 
> this string to get output schema for each store.
>
> Thanks,
> Cheolsoo
>
> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <dl...@operasolutions.com>
> wrote:
>
> > I run into this strange problem when try to load multiple text 
> > formatted files and convert them into avro format using pig. 
> > However, if I read and convert one file at a time in separated runs, 
> > everything is fine. The error message is following
> >
> > 2012-08-21 19:15:32,964 [main] ERROR 
> > org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to 
> > recreate exception from backed error:
> > org.apache.avro.file.DataFileWriter$AppendWriteException:
> > java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in 
> > union ["null","long"]
> >                 at
> > org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
> >                 at
> >
> org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvr
> oRecordWriter.java:49)
> >                 at
> >
> org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.
> java:612)
> >                 at
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutput
> Format$PigRecordWriter.write(PigOutputFormat.java:139)
> >                 at
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutput
> Format$PigRecordWriter.write(PigOutputFormat.java:98)
> >                 at
> >
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTas
> k.java:531)
> >                 at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutp
> utContext.java:80)
> >                 at
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnl
> y$Map.collect(PigMapOnly.java:48)
> >                 at
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGene
> > ri
> > cMapB
> >
> > my code is
> > set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
> >    id:long,
> >    f1:long,
> >    f2:chararray,
> >    f3:float,
> >    f4:float,
> >    f5:float,
> >    f6:float,
> >    f7:float,
> >    f8:float,
> >    f9:float,
> >    f10:float,
> >    f11:float,
> >    f12:float);
> > store set1 into '$output_dir/set1.avro'
> > using org.apache.pig.piggybank.storage.avro.AvroStorage();
> >
> > set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
> >    id : int,
> >    date : chararray);
> > store set2 into '$output_dir/set2.avro'
> > using org.apache.pig.piggybank.storage.avro.AvroStorage();
> >
> > The first file is converted fine, but the 2nd one is failed. The 
> > error is coming from the 2nd field in the 2nd file, but the strange 
> > thing is that I don't even have "long" in my schema while the error 
> > message is showing ["null","long"].
> >
> > I use pig 0.10.0 and avro-1.7.1.jar.
> >
> > I wonder if this is a bug or I missed something.
> >
> > Thanks.
> > Dan
> >
> > Here's set1.txt
> >
> > 827352|740214|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
> > 827353|740214|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
> > 827354|740214|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
> > 827193|739576|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
> > 827194|739576|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
> > 827195|739576|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
> > 827355|740215|Long|51|1.776868012839072|113652088.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.164818333642054|0
> > 827356|740215|Short|34|-2.360589090333165|-150988074.9471841|0|0|-0.00868330219442376|-0.008616238065508337|-0.008616238065508375|-0.5943698959308671|-0.02690679230502523
> > 827357|740215|Total|85|-0.5837210774940929|63962032.00527128|0|0|0.01084217439253325|0.01095407961533095|0.0109540796153309|0.5704484377111866|-0.02690679230502523
> > 827202|739590|Long|53|1.777568428360522|113696888.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.156653489849146|0
> >
> > Here's the set2.txt
> > 1|1980-01-01 00:00:00.000
> > 2|1980-01-02 00:00:00.000
> > 3|1980-01-03 00:00:00.000
> > 4|1980-01-04 00:00:00.000
> > 5|1980-01-07 00:00:00.000
> > 6|1980-01-08 00:00:00.000
> > 7|1980-01-09 00:00:00.000
> > 8|1980-01-10 00:00:00.000
> > 9|1980-01-11 00:00:00.000
> > 10|1980-01-14 00:00:00.000
> >
> >
>

Re: runtime exception when load and store multiple files using avro in pig

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Dan,

Glad to hear that it worked. I totally agree that AvroStorage can be improved. In fact, it was written for Pig 0.7, so it could be written much more cleanly now.

The only concern I have is backward compatibility. That is, if I change the syntax (which I badly wanted to do while working on AvroStorage recently), it will break backward compatibility. What I have been thinking is to rewrite AvroStorage in core Pig, like HBaseStorage. For backward compatibility, we could keep the old version in Piggybank for a while and eventually retire it.

I am wondering what other people think. Please let me know if it is not a good idea to move AvroStorage from Piggybank into core Pig.

Thanks,
Cheolsoo

On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <dl...@operasolutions.com> wrote:

> Thanks, Cheolsoo. That solve my problems.
>
> It will be nice if pig can do this automatically when there are multiple
> avrostorage in the code. Otherwise, we have to manually track the numbers.
>
> Dan
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:cheolsoo@cloudera.com]
> Sent: Tuesday, August 21, 2012 5:06 PM
> To: user@pig.apache.org
> Subject: Re: runtime exception when load and store multiple files using
> avro in pig
>
> Hi Danfeng,
>
> The "long" is from the 1st AvroStorage store in your script. The
> AvroStorage has very funny syntax regarding multiple stores. To apply
> different avro schemas to multiple stores, you have to specify their
> "index" as follows:
>
> set1 = load 'input1.txt' using PigStorage('|') as ( ... ); *store set1
> into 'set1' using
> org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');*
>
> set2 = load 'input2.txt' using PigStorage('|') as ( .. ); *store set2 into
> 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index',
> '2');*
>
> As can be seen, I added the 'index' parameters.
>
> What AvroStorage does is to construct the following string in the frontend:
>
> "1#<1st avro schema>,2#<2nd avro schema>"
>
> and pass it to backend via UdfContext. Now in backend, tasks parse this
> string to get output schema for each store.
>
> Thanks,
> Cheolsoo
>
> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <dl...@operasolutions.com>
> wrote:
>
> > I run into this strange problem when try to load multiple text
> > formatted files and convert them into avro format using pig. However,
> > if I read and convert one file at a time in separated runs, everything
> > is fine. The error message is following
> >
> > 2012-08-21 19:15:32,964 [main] ERROR
> > org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to
> > recreate exception from backed error:
> > org.apache.avro.file.DataFileWriter$AppendWriteException:
> > java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in
> > union ["null","long"]
> >                 at
> > org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
> >                 at
> >
> org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
> >                 at
> >
> org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
> >                 at
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> >                 at
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> >                 at
> >
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
> >                 at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >                 at
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> >                 at
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGeneri
> > cMapB
> >
> > my code is
> > set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
> >    id:long,
> >    f1:long,
> >    f2:chararray,
> >    f3:float,
> >    f4:float,
> >    f5:float,
> >    f6:float,
> >    f7:float,
> >    f8:float,
> >    f9:float,
> >    f10:float,
> >    f11:float,
> >    f12:float);
> > store set1 into '$output_dir/set1.avro'
> > using org.apache.pig.piggybank.storage.avro.AvroStorage();
> >
> > set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
> >    id : int,
> >    date : chararray);
> > store set2 into '$output_dir/set2.avro'
> > using org.apache.pig.piggybank.storage.avro.AvroStorage();
> >
> > The first file is converted fine, but the 2nd one is failed. The error
> > is coming from the 2nd field in the 2nd file, but the strange thing is
> > that I don't even have "long" in my schema while the error message is
> > showing ["null","long"].
> >
> > I use pig 0.10.0 and avro-1.7.1.jar.
> >
> > I wonder if this is a bug or I missed something.
> >
> > Thanks.
> > Dan
> >
> > Here's set1.txt
> >
> > 827352|740214|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
> > 827353|740214|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
> > 827354|740214|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
> > 827193|739576|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
> > 827194|739576|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
> > 827195|739576|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
> > 827355|740215|Long|51|1.776868012839072|113652088.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.164818333642054|0
> > 827356|740215|Short|34|-2.360589090333165|-150988074.9471841|0|0|-0.00868330219442376|-0.008616238065508337|-0.008616238065508375|-0.5943698959308671|-0.02690679230502523
> > 827357|740215|Total|85|-0.5837210774940929|63962032.00527128|0|0|0.01084217439253325|0.01095407961533095|0.0109540796153309|0.5704484377111866|-0.02690679230502523
> > 827202|739590|Long|53|1.777568428360522|113696888.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.156653489849146|0
> >
> > Here's the set2.txt
> > 1|1980-01-01 00:00:00.000
> > 2|1980-01-02 00:00:00.000
> > 3|1980-01-03 00:00:00.000
> > 4|1980-01-04 00:00:00.000
> > 5|1980-01-07 00:00:00.000
> > 6|1980-01-08 00:00:00.000
> > 7|1980-01-09 00:00:00.000
> > 8|1980-01-10 00:00:00.000
> > 9|1980-01-11 00:00:00.000
> > 10|1980-01-14 00:00:00.000
> >
> >
>

RE: runtime exception when load and store multiple files using avro in pig

Posted by Danfeng Li <dl...@operasolutions.com>.
Thanks, Cheolsoo. That solved my problem.

It would be nice if Pig could do this automatically when there are multiple AvroStorage stores in the script. Otherwise, we have to track the numbers manually.

Dan

-----Original Message-----
From: Cheolsoo Park [mailto:cheolsoo@cloudera.com] 
Sent: Tuesday, August 21, 2012 5:06 PM
To: user@pig.apache.org
Subject: Re: runtime exception when load and store multiple files using avro in pig

Hi Danfeng,

The "long" comes from the 1st AvroStorage store in your script. AvroStorage has rather quirky syntax for multiple stores: to apply different Avro schemas to multiple stores, you have to specify their "index" as follows:

set1 = load 'input1.txt' using PigStorage('|') as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage('|') as ( .. );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');

As can be seen, I added the 'index' parameters.

What AvroStorage does is to construct the following string in the frontend:

"1#<1st avro schema>,2#<2nd avro schema>"

and pass it to the backend via UDFContext. In the backend, tasks then parse this string to get the output schema for each store.
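[Editor's note: that frontend/backend hand-off can be sketched in a few lines of Python. This is illustrative only, not AvroStorage's actual code; the placeholder schemas are taken from the string format quoted above and contain no commas, whereas real JSON schemas would need a proper parser rather than a naive split.]

```python
# Sketch: the frontend builds "1#<schema>,2#<schema>"; each backend task
# parses it and looks up the schema for its own store index.

def build_schema_string(schemas_by_index):
    # frontend side: serialize one schema per store index
    return ",".join(f"{i}#{s}" for i, s in sorted(schemas_by_index.items()))

def schema_for_store(schema_string, index):
    # backend side: parse the string and pick out this store's schema
    table = dict(pair.split("#", 1) for pair in schema_string.split(","))
    return table[str(index)]

s = build_schema_string({1: "<1st avro schema>", 2: "<2nd avro schema>"})
print(schema_for_store(s, 2))  # <2nd avro schema>
```

Without the explicit 'index' argument, both stores resolve to the same entry, which is why the date string from set2 was validated against set1's ["null","long"] field.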

Thanks,
Cheolsoo

On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <dl...@operasolutions.com> wrote:

> I run into this strange problem when try to load multiple text 
> formatted files and convert them into avro format using pig. However, 
> if I read and convert one file at a time in separated runs, everything 
> is fine. The error message is following
>
> 2012-08-21 19:15:32,964 [main] ERROR
> org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to 
> recreate exception from backed error:
> org.apache.avro.file.DataFileWriter$AppendWriteException:
> java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in 
> union ["null","long"]
>                 at
> org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
>                 at
> org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
>                 at
> org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
>                 at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>                 at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>                 at
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
>                 at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>                 at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
>                 at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapB
>
> my code is
> set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
>    id:long,
>    f1:long,
>    f2:chararray,
>    f3:float,
>    f4:float,
>    f5:float,
>    f6:float,
>    f7:float,
>    f8:float,
>    f9:float,
>    f10:float,
>    f11:float,
>    f12:float);
> store set1 into '$output_dir/set1.avro'
> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
>    id : int,
>    date : chararray);
> store set2 into '$output_dir/set2.avro'
> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> The first file is converted fine, but the 2nd one fails. The error
> comes from the 2nd field in the 2nd file, but the strange thing is
> that I don't even have "long" in my schema, while the error message
> shows ["null","long"].
>
> I use pig 0.10.0 and avro-1.7.1.jar.
>
> I wonder if this is a bug or if I missed something.
>
> Thanks.
> Dan
>
> Here's set1.txt
>
> 827352|740214|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
>
> 827353|740214|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
>
> 827354|740214|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
>
> 827193|739576|Long|26|0.08731795012183759|1661335.541733333|0|0|0.001057865808239878|0.001059541098077884|0.001059541098077821|0.0514156486228232|0.001043980181757539
>
> 827194|739576|Short|12|-0.05967910581502997|-1135471.22271|0|0|-0.001185620143839061|-0.001187497751909232|-0.001187497751909183|-0.0747641932858414|-0.0001307449002148424
>
> 827195|739576|Total|38|0.02763884430680765|19026277.40819863|0|0|-0.0001277543355991829|-0.0001279566538313473|-0.0001279566538313626|-0.02334854466301821|0.0009132352815426966
>
> 827355|740215|Long|51|1.776868012839072|113652088.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.164818333642054|0
>
> 827356|740215|Short|34|-2.360589090333165|-150988074.9471841|0|0|-0.00868330219442376|-0.008616238065508337|-0.008616238065508375|-0.5943698959308671|-0.02690679230502523
>
> 827357|740215|Total|85|-0.5837210774940929|63962032.00527128|0|0|0.01084217439253325|0.01095407961533095|0.0109540796153309|0.5704484377111866|-0.02690679230502523
>
> 827202|739590|Long|53|1.777568428360522|113696888.7063555|0|0|0.01952547658695701|0.0195703176808393|0.01957031768083928|1.156653489849146|0
>
> Here's the set2.txt
> 1|1980-01-01 00:00:00.000
> 2|1980-01-02 00:00:00.000
> 3|1980-01-03 00:00:00.000
> 4|1980-01-04 00:00:00.000
> 5|1980-01-07 00:00:00.000
> 6|1980-01-08 00:00:00.000
> 7|1980-01-09 00:00:00.000
> 8|1980-01-10 00:00:00.000
> 9|1980-01-11 00:00:00.000
> 10|1980-01-14 00:00:00.000
>
>
