Posted to user@avro.apache.org by Alex Holmes <gr...@gmail.com> on 2011/09/21 13:34:58 UTC

Pig duplicate records

Hi all,

I have a simple schema

{"name": "Record", "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"}
  ]
}

which I use to write 2 records to an Avro file, and my reader code
(which reads the file and dumps the records) verifies that there are 2
records in the file:

Record@1e9e5c73[name=r1,id=1]
Record@ed42d08[name=r2,id=2]

When using this file with pig and AvroStorage, pig seems to think
there are 4 records:

grunt> REGISTER /app/hadoop/lib/avro-1.5.4.jar;
grunt> REGISTER /app/pig-0.9.0/contrib/piggybank/java/piggybank.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/json-simple-1.1.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-core-asl-1.6.0.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-mapper-asl-1.6.0.jar;
grunt> raw = LOAD 'test.v1.avro' USING
org.apache.pig.piggybank.storage.avro.AvroStorage;
grunt> dump raw;
..
Input(s):
Successfully read 4 records (825 bytes) from:
"hdfs://localhost:9000/user/aholmes/test.v1.avro"

Output(s):
Successfully stored 4 records (46 bytes) in:
"hdfs://localhost:9000/tmp/temp2039109003/tmp1924774585"

Counters:
Total records written : 4
Total bytes written : 46
..
(r1,1)
(r2,2)
(r1,1)
(r2,2)
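One way to cross-check the container file itself, independent of both Pig and the Java reader, is to sum the per-block record counts: Avro data files store an explicit record count at the head of every block, so no record decoding is needed. Below is a minimal pure-Python sketch of that check; `make_test_file` is a hypothetical stand-in that builds a toy two-record file of the same shape as `test.v1.avro` (the actual writer code isn't shown in this thread), and the parser handles only uncompressed files with positive metadata-map counts, which is what common writers emit.

```python
import io
import json
import os
import tempfile

MAGIC = b"Obj\x01"  # Avro object container file magic (format version 1)

SCHEMA = json.dumps({
    "name": "Record", "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "id", "type": "int"},
    ],
})

def write_long(out, n):
    """Write a long as a zig-zag varint, the encoding Avro uses for counts."""
    z = (n << 1) ^ (n >> 63)
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return

def read_long(inp):
    """Read a zig-zag varint long; return None at clean EOF."""
    first = inp.read(1)
    if not first:
        return None
    b = first[0]
    acc, shift = b & 0x7F, 7
    while b & 0x80:
        b = inp.read(1)[0]
        acc |= (b & 0x7F) << shift
        shift += 7
    return (acc >> 1) ^ -(acc & 1)

def write_bytes(out, data):
    write_long(out, len(data))
    out.write(data)

def make_test_file(path):
    """Build a toy two-record Avro file shaped like test.v1.avro."""
    sync = os.urandom(16)
    with open(path, "wb") as f:
        f.write(MAGIC)
        write_long(f, 2)  # metadata map: one run of two entries
        write_bytes(f, b"avro.schema")
        write_bytes(f, SCHEMA.encode())
        write_bytes(f, b"avro.codec")
        write_bytes(f, b"null")
        write_long(f, 0)  # end of metadata map
        f.write(sync)
        body = io.BytesIO()
        for name, rec_id in [("r1", 1), ("r2", 2)]:
            write_bytes(body, name.encode())  # string: length-prefixed UTF-8
            write_long(body, rec_id)          # int: zig-zag varint
        data = body.getvalue()
        write_long(f, 2)          # records in this block
        write_long(f, len(data))  # serialized size of this block
        f.write(data)
        f.write(sync)
    return path

def count_records(path):
    """Sum the per-block record counts without decoding any records."""
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC, "not an Avro container file"
        while True:  # skip the metadata map (positive entry counts only)
            n = read_long(f)
            if not n:
                break
            for _ in range(n):
                f.seek(read_long(f), 1)  # skip key
                f.seek(read_long(f), 1)  # skip value
        f.read(16)  # sync marker
        total = 0
        while True:
            n = read_long(f)
            if n is None:
                return total
            f.seek(read_long(f), 1)  # skip the block body
            f.read(16)               # skip the trailing sync marker
            total += n
```

Run `count_records` against the real `test.v1.avro`: if it returns 2, the file genuinely holds two records and the duplication is happening on the Pig side; if it returns 4, the problem is at write time.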

I'm sure I'm doing something wrong (again)!

Many thanks,
Alex

Re: Pig duplicate records

Posted by Jeff Zhang <zj...@gmail.com>.
Seems this is a Pig bug, maybe caused by AvroStorage.
According to the log, Pig read 4 records and output 4 records.



On Wed, Sep 21, 2011 at 1:55 PM, Scott Carey <sc...@apache.org> wrote:

> You will want to ask the pig user mailing list this question.
>
> org.apache.pig.piggybank.storage.avro.AvroStorage is maintained by the Pig
> project and you will get more help from there.
>
> On 9/21/11 4:34 AM, "Alex Holmes" <gr...@gmail.com> wrote:
>
> >[original message snipped]
>


-- 
Best Regards

Jeff Zhang


Re: Pig duplicate records

Posted by Scott Carey <sc...@apache.org>.
You will want to ask the pig user mailing list this question.

org.apache.pig.piggybank.storage.avro.AvroStorage is maintained by the Pig
project and you will get more help from there.

On 9/21/11 4:34 AM, "Alex Holmes" <gr...@gmail.com> wrote:

>[original message snipped]