Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/04/21 01:22:59 UTC

Are records tuple

I am writing a unit test and have a question. My understanding is that a
complete record is a tuple. So is the record "a b
{(ST:NC),(ZIP:28613),(CITY:Xxxxxxx),(NAM2:Xxxxx X &xxx; Xxxxx X
Xxxxxx)}        {(OCCUP:xxxxxxx xxxxx),(AGE:55    ),(MARITAL:Married)}",
which is one line in a file, a tuple? Somehow that does not feel right.
Could someone please clarify?

Below is the code. My test is incomplete; I am pasting it only to show how
I am constructing this tuple.


  TupleFactory mTupleFactory = TupleFactory.getInstance();
  BagFactory mBagFactory = BagFactory.getInstance();

  @Test
  public void evalFuncTest() throws IOException {
    String record = "a b "
        + "{(ST:NC),(ZIP:28613),(CITY:Xxxxxxx),(NAM2:Xxxxx X &xxx; Xxxxx X "
        + "Xxxxxx)}        {(OCCUP:xxxxxxx xxxxx),(AGE:55    ),(MARITAL:Married)}";
    Tuple t = mTupleFactory.newTuple();
    DataInput in = new DataInputStream(
        new ByteArrayInputStream(record.getBytes()));
    t.readFields(in);
  }

PIG - Best way to write to NFS

Posted by "Meyer, Dennis" <de...@adtech.com>.
Hi,

I am wondering what the best way is to write files out to NFS. Do I need to
go through Hadoop first, or is there a nice way to avoid writing
intermediate files to HDFS and exporting them afterwards?

Thanks,
Dennis


Re: Are records tuple

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks for the response, that helps. I was thinking the same but, not
knowing too much about Pig, I wanted to clarify. I'll look at how to use
PigStorage in my unit test.

On Sun, Apr 22, 2012 at 3:47 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> You are trying to read a string that represents a tuple using binary
> deserialization.

Re: Are records tuple

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You are trying to read a string that represents a tuple using binary
deserialization.

Pig has an abstraction called LoadFunc that knows how to read data off
disk and turn it into tuples (yes, records are tuples).  PigStorage is
one such LoadFunc, and it reads data represented as strings, such as
what you are trying to feed in.  There are other LoadFuncs that know
how to read other serializations and interpret the data in very
different ways (JSON, Avro, Thrift, records from a database, XML...).
There is no way for Tuple.readFields to know what format you are
trying to feed into it. Tuple serialization is used for intermediate
serialization between MR jobs and is not intended for the end user.
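
As a rough illustration of the text-to-fields step described above, the
sketch below splits one line into fields and pulls the items out of a
bag-like field. It is not PigStorage's actual implementation. It assumes
tab-delimited fields and no nested parentheses, the class and method names
are hypothetical, the sample line is a shortened made-up version of the
record in the question, and plain Java lists stand in for Pig's Tuple and
DataBag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, dependency-free sketch; no Pig classes are used.
class LineToFieldsSketch {

    // Splits one text line into top-level fields on a tab (assumed delimiter).
    public static List<String> splitFields(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    // Extracts the items of bag text such as "{(ST:NC),(ZIP:28613)}".
    // Assumes the items themselves contain no parentheses.
    public static List<String> bagItems(String bagText) {
        List<String> items = new ArrayList<String>();
        int open = bagText.indexOf('(');
        while (open >= 0) {
            int close = bagText.indexOf(')', open);
            if (close < 0) {
                break; // unbalanced input: stop instead of guessing
            }
            items.add(bagText.substring(open + 1, close));
            open = bagText.indexOf('(', close);
        }
        return items;
    }

    public static void main(String[] args) {
        String line = "a\tb\t{(ST:NC),(ZIP:28613)}\t{(AGE:55),(MARITAL:Married)}";
        List<String> fields = splitFields(line);
        System.out.println(fields.size());            // 4
        System.out.println(bagItems(fields.get(2)));  // [ST:NC, ZIP:28613]
    }
}
```

In a real test the parsed values would go into a Tuple obtained from
TupleFactory rather than staying in plain lists.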

You should be using the appropriate LoadFunc to create tuples
(PigStorage in this case?), or create them in code as I demonstrated
earlier.

You might find ReadToEndLoader helpful. It wraps a real LoadFunc and helps
with some details of instantiating input formats, getting splits, etc.:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/impl/io/ReadToEndLoader.html

But really, you should just create the tuples you want in code rather
than involve all of this machinery.

D


On Sun, Apr 22, 2012 at 9:56 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> Could someone help me answer this question: are records (each line) == tuples?

Re: Are records tuple

Posted by Mohit Anchlia <mo...@gmail.com>.
Could someone help me answer this question: are records (each line) == tuples?


Re: Are records tuple

Posted by Mohit Anchlia <mo...@gmail.com>.
Additional details: when I try to build a tuple from a line in a file using
the code from my previous email, I get the exception below. It looks like I
need to define a schema somehow. I wonder how others test this. I am trying
to test a UDF, and I need to read a line from a file, build a tuple, and
pass it to the eval func.

org.apache.pig.backend.executionengine.ExecException: ERROR 2112: Unexpected datatype 97 while reading tuplefrom binary file.
 at org.apache.pig.data.BinInterSedes.getTupleSize(BinInterSedes.java:132)
 at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:553)
 at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
 at com.intuit.cg.services.dp.analytics.pig.udf.TAXOUTPUTTest.evalFuncTest(TAXOUTPUTTest.java:30)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
 at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
...
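
A note on the "datatype 97" in the message above: 97 is the character code
of 'a', which is the first character of the record string. That is
consistent with the binary reader taking the first byte of the text as a
type marker, although this reading is an inference from the stack trace and
has not been checked against the Pig source. The dependency-free sketch
below (hypothetical class and method names, no Pig classes) shows the byte
that a DataInput sees first for such a string.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; it only inspects the raw bytes of the text.
class FirstByteDemo {

    // Returns the first byte a binary reader would see for the given text.
    public static int firstByte(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readByte();
        } catch (IOException e) {
            // Reached for empty input (EOFException), among other I/O errors.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The record in the failing test starts with the letter 'a'.
        System.out.println(firstByte("a b {(ST:NC)}")); // prints 97
    }
}
```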
