Posted to user@pig.apache.org by zaki rahaman <za...@gmail.com> on 2009/11/19 20:03:22 UTC

Is Pig dropping records?

Hi All,

I have the following mini-script running as part of a larger set of
scripts/workflow. However, it seems like Pig is dropping records: when I
try running the same thing as a simple grep | wc -l, I get a completely
different result (2500 with Pig vs. 3300). The Pig script is as follows:

A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
(timestamp:chararray,
ip:chararray,
userid:chararray,
dist:chararray,
clickid:chararray,
usra:chararray,
campaign:chararray,
clickurl:chararray,
plugin:chararray,
tab:chararray,
feature:chararray);

B = FILTER A BY clickurl matches '.*http://www.amazon.*';

dump B produces the following output:
2009-11-19 18:50:46,013 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
2009-11-19 18:50:46,058 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Records written : 2502
2009-11-19 18:50:46,058 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Bytes written : 0
2009-11-19 18:50:46,058 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!


The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
http://www.amazon | wc -l

Both sets of inputs are the same files... and I'm not sure where the
discrepancy is coming from. Any help would be greatly appreciated.
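One subtle difference between the two counts is worth ruling out: Pig's `matches` uses Java regex semantics, where the pattern must match the entire field, while `fgrep` counts any line containing the substring anywhere. A minimal sketch of the difference (the URL value below is made up for illustration):

```java
// Sketch of the two match semantics being compared in this thread.
public class MatchSemantics {
    // fgrep-style: true if the text appears anywhere in the field
    static boolean substringHit(String url) {
        return url.contains("http://www.amazon");
    }

    // Pig `matches`-style: the Java regex must match the ENTIRE field,
    // which is why the script needs the leading and trailing .*
    static boolean fullMatch(String url) {
        return url.matches(".*http://www\\.amazon.*");
    }

    // The same pattern without the .* anchors fails on a longer field
    static boolean unanchored(String url) {
        return url.matches("http://www\\.amazon");
    }

    public static void main(String[] args) {
        String url = "ref=abc http://www.amazon.com/dp/123"; // hypothetical field value
        System.out.println(substringHit(url)); // true
        System.out.println(fullMatch(url));    // true
        System.out.println(unanchored(url));   // false
    }
}
```

So a count mismatch between the two can also come from match semantics, not only from dropped records.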

-- 
Zaki Rahaman

Re: Is Pig dropping records?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Sam,
Can you post your changes to a Jira?
-D

On Fri, Nov 20, 2009 at 1:28 PM, Sam Rash <sa...@ning.com> wrote:
> Hi,
>
> This reminds me of something else, though, that I took the latest patch for
> PIG-911 (sequence file reader) and found it skipped records
>
> https://issues.apache.org/jira/browse/PIG-911
>
> What I found is that the condition in getNext() would miss records:
>
> if (reader != null && (reader.getPosition() < end || !reader.syncSeen()) &&
> reader.next(key, value)) {
> ...
> }
>
> I had to change it to:
>
> if (reader != null && reader.next(key,value) && (reader.getPosition() < end
> || !reader.syncSeen())) {
> ...
> }
>
> (also ended up breaking out to read(key) and get the below to support
> reading other types than Writable)
>
> This only happened when the files Pig read were more than one block; i.e.,
> the records dropped were around block boundaries.
>
> has anyone noticed this?
>
> thx,
> -sr
>
> Sam Rash
> samr@ning.com
>
>
>
> On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:
>
>> Zaki,
>> Glad to hear it wasn't Pig's fault!
>> Can you post a description of what was going on with S3, or at least
>> how you fixed it?
>>
>> -D
>>
>> On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <za...@gmail.com>
>> wrote:
>> > Okay fixed some problem with corrupted file transfers from S3... now wc
>> > -l
>> > produces the same 143710 records... so yea its not a problem... and now
>> > I am
>> > getting the correct result from both methods... not sure what went
>> > wrong...
>> > thanks for the help though guys.
>> >
>> > On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com>
>> > wrote:
>> >
>> >> Another thing to verify is that clickurl's position in the schema is
>> >> correct.
>> >> -Thejas
>> >>
>> >>
>> >>
>> >> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <as...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hmm... Are you sure that your records are separated by \n (newline)
>> >> > and fields by \t (tab)?  If so, would it be possible for you to upload
>> >> > your dataset (possibly a smaller one) somewhere so that someone can
>> >> > take a look at it?
>> >> >
>> >> > Ashutosh
>> >> >
>> >> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <za...@gmail.com>
>> >> wrote:
>> >> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
>> >> >> ashutosh.chauhan@gmail.com> wrote:
>> >> >>
>> >> >>> Hi Zaki,
>> >> >>>
>> >> >>> Just to narrow down the problem, can you do:
>> >> >>>
>> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> >> >>> dump A;
>> >> >>>
>> >> >>
>> >> >> This produced 143710 records;
>> >> >>
>> >> >>
>> >> >>> and
>> >> >>>
>> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> >> >>> timestamp:chararray,
>> >> >>> ip:chararray,
>> >> >>> userid:chararray,
>> >> >>> dist:chararray,
>> >> >>> clickid:chararray,
>> >> >>> usra:chararray,
>> >> >>> campaign:chararray,
>> >> >>> clickurl:chararray,
>> >> >>> plugin:chararray,
>> >> >>> tab:chararray,
>> >> >>> feature:chararray);
>> >> >>> dump A;
>> >> >>>
>> >> >>
>> >> >>
>> >> >> This produced 143710 records (so no problem there);
>> >> >>
>> >> >>
>> >> >>> and
>> >> >>>
>> >> >>> cut -f8 *week.46*clickLog.2009* | wc -l
>> >> >>>
>> >> >>
>> >> >>
>> >> >> This produced...
>> >> >> 175572
>> >> >>
>> >> >> Clearly, something is wrong...
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >>> Ashutosh
>> >> >>>
>> >> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman
>> >> >>> <za...@gmail.com>
>> >> >>> wrote:
>> >> >>>> Hi All,
>> >> >>>>
>> >> >>>> I have the following mini-script running as part of a larger set
>> >> >>>> of
>> >> >>>> scripts/workflow... however it seems like pig is dropping records
>> >> >>>> as
>> >> when
>> >> >>> I
>> >> >>>> tried running the same thing as a simple grep | wc -l I get a
>> >> completely
>> >> >>>> different result (2500 with Pig vs. 3300). The Pig script is as
>> >> follows:
>> >> >>>>
>> >> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> >> >>>> (timestamp:chararray,
>> >> >>>> ip:chararray,
>> >> >>>> userid:chararray,
>> >> >>>> dist:chararray,
>> >> >>>> clickid:chararray,
>> >> >>>> usra:chararray,
>> >> >>>> campaign:chararray,
>> >> >>>> clickurl:chararray,
>> >> >>>> plugin:chararray,
>> >> >>>> tab:chararray,
>> >> >>>> feature:chararray);
>> >> >>>>
>> >> >>>> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>> >> >>>>
>> >> >>>> dump B produces the following output:
>> >> >>>> 2009-11-19 18:50:46,013 [main] INFO
>> >> >>>>
>> >> >>>
>> >>
>> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >> >>> er
>> >> >>>> - Successfully stored result in:
>> >> >>>> "s3://kikin-pig-test/amazonoutput2"
>> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >> >>>>
>> >> >>>
>> >>
>> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >> >>> er
>> >> >>>> - Records written : 2502
>> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >> >>>>
>> >> >>>
>> >>
>> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >> >>> er
>> >> >>>> - Bytes written : 0
>> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >> >>>>
>> >> >>>
>> >>
>> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >> >>> er
>> >> >>>> - Success!
>> >> >>>>
>> >> >>>>
>> >> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> >> >>>> http://www.amazon | wc -l
>> >> >>>>
>> >> >>>> Both sets of inputs are the same files... and I'm not sure where
>> >> >>>> the
>> >> >>>> discrepency is coming from. Any help would be greatly appreciated.
>> >> >>>>
>> >> >>>> --
>> >> >>>> Zaki Rahaman
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Zaki Rahaman
>> >> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Zaki Rahaman
>> >
>>
>
>

Re: Is Pig dropping records?

Posted by Sam Rash <sa...@ning.com>.
Hi,

This reminds me of something else, though, that I took the latest  
patch for PIG-911 (sequence file reader) and found it skipped records

https://issues.apache.org/jira/browse/PIG-911

What I found is that the condition in getNext() would miss records:

if (reader != null && (reader.getPosition() < end || !reader.syncSeen()) && reader.next(key, value)) {
...
}

I had to change it to:

if (reader != null && reader.next(key, value) && (reader.getPosition() < end || !reader.syncSeen())) {
...
}

(I also ended up breaking it out into read(key) plus a separate value fetch,
to support reading types other than Writable)

This only happened when the files Pig read were more than one block; i.e.,
the records dropped were around block boundaries.
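The reordering matters because && short-circuits left to right: with the boundary check first, the read is never attempted once the check fails, even when one more record (the one straddling the split boundary) should still be consumed. A toy sketch of the evaluation-order difference (not the actual PIG-911 reader):

```java
// Toy model of why operand order changes which records get consumed.
// next() stands in for reader.next(key, value): a read with a side effect.
public class ShortCircuitOrder {
    static int reads;

    static boolean next() {
        reads++;          // consuming a record advances the reader
        return true;
    }

    // Boundary check first: when the check is false, next() never runs,
    // so the record at the split edge is silently skipped.
    static int readsWithCheckFirst(boolean withinSplit) {
        reads = 0;
        boolean unused = withinSplit && next();
        return reads;
    }

    // Read first: the record is consumed before the boundary check,
    // so nothing at the split edge is dropped.
    static int readsWithReadFirst(boolean withinSplit) {
        reads = 0;
        boolean unused = next() && withinSplit;
        return reads;
    }

    public static void main(String[] args) {
        System.out.println(readsWithCheckFirst(false)); // 0 records consumed
        System.out.println(readsWithReadFirst(false));  // 1 record consumed
    }
}
```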

has anyone noticed this?

thx,
-sr

Sam Rash
samr@ning.com



On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:

> Zaki,
> Glad to hear it wasn't Pig's fault!
> Can you post a description of what was going on with S3, or at least
> how you fixed it?
>
> -D
>
> On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman  
> <za...@gmail.com> wrote:
> > Okay fixed some problem with corrupted file transfers from S3...  
> now wc -l
> > produces the same 143710 records... so yea its not a problem...  
> and now I am
> > getting the correct result from both methods... not sure what went  
> wrong...
> > thanks for the help though guys.
> >
> > On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com>  
> wrote:
> >
> >> Another thing to verify is that clickurl's position in the schema  
> is
> >> correct.
> >> -Thejas
> >>
> >>
> >>
> >> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <ashutosh.chauhan@gmail.com 
> >
> >> wrote:
> >>
> >> > Hmm... Are you sure that your records are separated by \n (newline)
> >> > and fields by \t (tab)?  If so, would it be possible for you to upload
> >> > your dataset (possibly a smaller one) somewhere so that someone can
> >> > take a look at it?
> >> >
> >> > Ashutosh
> >> >
> >> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <zaki.rahaman@gmail.com 
> >
> >> wrote:
> >> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
> >> >> ashutosh.chauhan@gmail.com> wrote:
> >> >>
> >> >>> Hi Zaki,
> >> >>>
> >> >>> Just to narrow down the problem, can you do:
> >> >>>
> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> >> >>> dump A;
> >> >>>
> >> >>
> >> >> This produced 143710 records;
> >> >>
> >> >>
> >> >>> and
> >> >>>
> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> >> >>> timestamp:chararray,
> >> >>> ip:chararray,
> >> >>> userid:chararray,
> >> >>> dist:chararray,
> >> >>> clickid:chararray,
> >> >>> usra:chararray,
> >> >>> campaign:chararray,
> >> >>> clickurl:chararray,
> >> >>> plugin:chararray,
> >> >>> tab:chararray,
> >> >>> feature:chararray);
> >> >>> dump A;
> >> >>>
> >> >>
> >> >>
> >> >> This produced 143710 records (so no problem there);
> >> >>
> >> >>
> >> >>> and
> >> >>>
> >> >>> cut -f8 *week.46*clickLog.2009* | wc -l
> >> >>>
> >> >>
> >> >>
> >> >> This produced...
> >> >> 175572
> >> >>
> >> >> Clearly, something is wrong...
> >> >>
> >> >>
> >> >> Thanks,
> >> >>> Ashutosh
> >> >>>
> >> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <zaki.rahaman@gmail.com 
> >
> >> >>> wrote:
> >> >>>> Hi All,
> >> >>>>
> >> >>>> I have the following mini-script running as part of a larger  
> set of
> >> >>>> scripts/workflow... however it seems like pig is dropping  
> records as
> >> when
> >> >>> I
> >> >>>> tried running the same thing as a simple grep | wc -l I get a
> >> completely
> >> >>>> different result (2500 with Pig vs. 3300). The Pig script is  
> as
> >> follows:
> >> >>>>
> >> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> >> >>>> (timestamp:chararray,
> >> >>>> ip:chararray,
> >> >>>> userid:chararray,
> >> >>>> dist:chararray,
> >> >>>> clickid:chararray,
> >> >>>> usra:chararray,
> >> >>>> campaign:chararray,
> >> >>>> clickurl:chararray,
> >> >>>> plugin:chararray,
> >> >>>> tab:chararray,
> >> >>>> feature:chararray);
> >> >>>>
> >> >>>> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
> >> >>>>
> >> >>>> dump B produces the following output:
> >> >>>> 2009-11-19 18:50:46,013 [main] INFO
> >> >>>>
> >> >>>
> >>  
> org 
> .apache 
> .pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >> >>> er
> >> >>>> - Successfully stored result in: "s3://kikin-pig-test/ 
> amazonoutput2"
> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >> >>>>
> >> >>>
> >>  
> org 
> .apache 
> .pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >> >>> er
> >> >>>> - Records written : 2502
> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >> >>>>
> >> >>>
> >>  
> org 
> .apache 
> .pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >> >>> er
> >> >>>> - Bytes written : 0
> >> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >> >>>>
> >> >>>
> >>  
> org 
> .apache 
> .pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >> >>> er
> >> >>>> - Success!
> >> >>>>
> >> >>>>
> >> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* |  
> fgrep
> >> >>>> http://www.amazon | wc -l
> >> >>>>
> >> >>>> Both sets of inputs are the same files... and I'm not sure  
> where the
> >> >>>> discrepency is coming from. Any help would be greatly  
> appreciated.
> >> >>>>
> >> >>>> --
> >> >>>> Zaki Rahaman
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Zaki Rahaman
> >> >>
> >>
> >>
> >
> >
> > --
> > Zaki Rahaman
> >
>



Re: Is Pig dropping records?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Zaki,
Glad to hear it wasn't Pig's fault!
Can you post a description of what was going on with S3, or at least
how you fixed it?

-D

On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <za...@gmail.com> wrote:
> Okay fixed some problem with corrupted file transfers from S3... now wc -l
> produces the same 143710 records... so yea its not a problem... and now I am
> getting the correct result from both methods... not sure what went wrong...
> thanks for the help though guys.
>
> On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
>> Another thing to verify is that clickurl's position in the schema is
>> correct.
>> -Thejas
>>
>>
>>
>> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <as...@gmail.com>
>> wrote:
>>
>> > Hmm... Are you sure that your records are separated by \n (newline)
>> > and fields by \t (tab)?  If so, would it be possible for you to upload your
>> > dataset (possibly a smaller one) somewhere so that someone can take a look
>> > at it?
>> >
>> > Ashutosh
>> >
>> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <za...@gmail.com>
>> wrote:
>> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
>> >> ashutosh.chauhan@gmail.com> wrote:
>> >>
>> >>> Hi Zaki,
>> >>>
>> >>> Just to narrow down the problem, can you do:
>> >>>
>> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> >>> dump A;
>> >>>
>> >>
>> >> This produced 143710 records;
>> >>
>> >>
>> >>> and
>> >>>
>> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> >>> timestamp:chararray,
>> >>> ip:chararray,
>> >>> userid:chararray,
>> >>> dist:chararray,
>> >>> clickid:chararray,
>> >>> usra:chararray,
>> >>> campaign:chararray,
>> >>> clickurl:chararray,
>> >>> plugin:chararray,
>> >>> tab:chararray,
>> >>> feature:chararray);
>> >>> dump A;
>> >>>
>> >>
>> >>
>> >> This produced 143710 records (so no problem there);
>> >>
>> >>
>> >>> and
>> >>>
>> >>> cut -f8 *week.46*clickLog.2009* | wc -l
>> >>>
>> >>
>> >>
>> >> This produced...
>> >> 175572
>> >>
>> >> Clearly, something is wrong...
>> >>
>> >>
>> >> Thanks,
>> >>> Ashutosh
>> >>>
>> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <za...@gmail.com>
>> >>> wrote:
>> >>>> Hi All,
>> >>>>
>> >>>> I have the following mini-script running as part of a larger set of
>> >>>> scripts/workflow... however it seems like pig is dropping records as
>> when
>> >>> I
>> >>>> tried running the same thing as a simple grep | wc -l I get a
>> completely
>> >>>> different result (2500 with Pig vs. 3300). The Pig script is as
>> follows:
>> >>>>
>> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> >>>> (timestamp:chararray,
>> >>>> ip:chararray,
>> >>>> userid:chararray,
>> >>>> dist:chararray,
>> >>>> clickid:chararray,
>> >>>> usra:chararray,
>> >>>> campaign:chararray,
>> >>>> clickurl:chararray,
>> >>>> plugin:chararray,
>> >>>> tab:chararray,
>> >>>> feature:chararray);
>> >>>>
>> >>>> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>> >>>>
>> >>>> dump B produces the following output:
>> >>>> 2009-11-19 18:50:46,013 [main] INFO
>> >>>>
>> >>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >>> er
>> >>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >>>>
>> >>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >>> er
>> >>>> - Records written : 2502
>> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >>>>
>> >>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >>> er
>> >>>> - Bytes written : 0
>> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >>>>
>> >>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>> >>> er
>> >>>> - Success!
>> >>>>
>> >>>>
>> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> >>>> http://www.amazon | wc -l
>> >>>>
>> >>>> Both sets of inputs are the same files... and I'm not sure where the
>> >>>> discrepency is coming from. Any help would be greatly appreciated.
>> >>>>
>> >>>> --
>> >>>> Zaki Rahaman
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Zaki Rahaman
>> >>
>>
>>
>
>
> --
> Zaki Rahaman
>

Re: Is Pig dropping records?

Posted by zaki rahaman <za...@gmail.com>.
Okay, fixed some problems with corrupted file transfers from S3... now wc -l
produces the same 143710 records, so yeah, it's not a Pig problem. I'm now
getting the correct result from both methods, though I'm not sure what went
wrong. Thanks for the help, guys.
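For anyone hitting the same thing: a cheap sanity check after pulling files from S3 is to verify that every line still has the expected number of tab-separated fields (11 in the click-log schema above) before trusting any counts. A hedged sketch; the class name and sample data are made up:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class RecordCheck {
    // Returns {wellFormed, malformed} counts for tab-delimited input.
    static int[] countRecords(BufferedReader in, int expectedFields) throws IOException {
        int good = 0, bad = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // limit -1 keeps trailing empty fields, so truncated lines are caught
            if (line.split("\t", -1).length == expectedFields) {
                good++;
            } else {
                bad++;
            }
        }
        return new int[] { good, bad };
    }

    public static void main(String[] args) throws IOException {
        String sample = "a\tb\tc\na\tb\n"; // one intact record, one truncated
        int[] counts = countRecords(new BufferedReader(new StringReader(sample)), 3);
        System.out.println(counts[0] + " good, " + counts[1] + " bad");
    }
}
```

A nonzero malformed count after a transfer is a strong hint the files, not the query, are the problem.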

On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com> wrote:

> Another thing to verify is that clickurl's position in the schema is
> correct.
> -Thejas
>
>
>
> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <as...@gmail.com>
> wrote:
>
> > Hmm... Are you sure that your records are separated by \n (newline)
> > and fields by \t (tab)?  If so, would it be possible for you to upload your
> > dataset (possibly a smaller one) somewhere so that someone can take a look
> > at it?
> >
> > Ashutosh
> >
> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <za...@gmail.com>
> wrote:
> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
> >> ashutosh.chauhan@gmail.com> wrote:
> >>
> >>> Hi Zaki,
> >>>
> >>> Just to narrow down the problem, can you do:
> >>>
> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> >>> dump A;
> >>>
> >>
> >> This produced 143710 records;
> >>
> >>
> >>> and
> >>>
> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> >>> timestamp:chararray,
> >>> ip:chararray,
> >>> userid:chararray,
> >>> dist:chararray,
> >>> clickid:chararray,
> >>> usra:chararray,
> >>> campaign:chararray,
> >>> clickurl:chararray,
> >>> plugin:chararray,
> >>> tab:chararray,
> >>> feature:chararray);
> >>> dump A;
> >>>
> >>
> >>
> >> This produced 143710 records (so no problem there);
> >>
> >>
> >>> and
> >>>
> >>> cut -f8 *week.46*clickLog.2009* | wc -l
> >>>
> >>
> >>
> >> This produced...
> >> 175572
> >>
> >> Clearly, something is wrong...
> >>
> >>
> >> Thanks,
> >>> Ashutosh
> >>>
> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <za...@gmail.com>
> >>> wrote:
> >>>> Hi All,
> >>>>
> >>>> I have the following mini-script running as part of a larger set of
> >>>> scripts/workflow... however it seems like pig is dropping records as
> when
> >>> I
> >>>> tried running the same thing as a simple grep | wc -l I get a
> completely
> >>>> different result (2500 with Pig vs. 3300). The Pig script is as
> follows:
> >>>>
> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> >>>> (timestamp:chararray,
> >>>> ip:chararray,
> >>>> userid:chararray,
> >>>> dist:chararray,
> >>>> clickid:chararray,
> >>>> usra:chararray,
> >>>> campaign:chararray,
> >>>> clickurl:chararray,
> >>>> plugin:chararray,
> >>>> tab:chararray,
> >>>> feature:chararray);
> >>>>
> >>>> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
> >>>>
> >>>> dump B produces the following output:
> >>>> 2009-11-19 18:50:46,013 [main] INFO
> >>>>
> >>>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >>> er
> >>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >>>>
> >>>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >>> er
> >>>> - Records written : 2502
> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >>>>
> >>>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >>> er
> >>>> - Bytes written : 0
> >>>> 2009-11-19 18:50:46,058 [main] INFO
> >>>>
> >>>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
> >>> er
> >>>> - Success!
> >>>>
> >>>>
> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
> >>>> http://www.amazon | wc -l
> >>>>
> >>>> Both sets of inputs are the same files... and I'm not sure where the
> >>>> discrepancy is coming from. Any help would be greatly appreciated.
> >>>>
> >>>> --
> >>>> Zaki Rahaman
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Zaki Rahaman
> >>
>
>


-- 
Zaki Rahaman

Re: Is Pig dropping records?

Posted by Thejas Nair <te...@yahoo-inc.com>.
Another thing to verify is that clickurl's position in the schema is
correct.
-Thejas
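
[Editorial sketch: one quick way to check the field position, using a made-up
sample record laid out per the schema in the LOAD statement (values are
hypothetical):]

```shell
# Build one sample record with the 11 schema fields, tab-separated,
# with the clickurl in position 8 as the LOAD statement assumes.
printf 'ts\tip\tuid\tdist\tcid\tusra\tcamp\thttp://www.amazon.com/x\tplug\ttab\tfeat\n' > sample.tsv

# Field 8 should print the clickurl; anything else means the schema
# order in the script does not match the data layout.
cut -f8 sample.tsv

# Every well-formed record should report 11 fields.
awk -F'\t' '{print NF}' sample.tsv
```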



On 11/19/09 11:43 AM, "Ashutosh Chauhan" <as...@gmail.com> wrote:

> Hmm... Are you sure that your records are separated by \n (newline)
> and fields by \t (tab)? If so, would it be possible for you to upload your
> dataset (possibly a smaller one) somewhere so that someone can take a look
> at it?
> 
> Ashutosh
> 
> On Thu, Nov 19, 2009 at 14:35, zaki rahaman <za...@gmail.com> wrote:
>> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
>> ashutosh.chauhan@gmail.com> wrote:
>> 
>>> Hi Zaki,
>>> 
>>> Just to narrow down the problem, can you do:
>>> 
>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>>> dump A;
>>> 
>> 
>> This produced 143710 records;
>> 
>> 
>>> and
>>> 
>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>> timestamp:chararray,
>>> ip:chararray,
>>> userid:chararray,
>>> dist:chararray,
>>> clickid:chararray,
>>> usra:chararray,
>>> campaign:chararray,
>>> clickurl:chararray,
>>> plugin:chararray,
>>> tab:chararray,
>>> feature:chararray);
>>> dump A;
>>> 
>> 
>> 
>> This produced 143710 records (so no problem there);
>> 
>> 
>>> and
>>> 
>>> cut -f8 *week.46*clickLog.2009* | wc -l
>>> 
>> 
>> 
>> This produced...
>> 175572
>> 
>> Clearly, something is wrong...
>> 
>> 
>> Thanks,
>>> Ashutosh
>>> 
>>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <za...@gmail.com>
>>> wrote:
>>>> Hi All,
>>>> 
>>>> I have the following mini-script running as part of a larger set of
>>>> scripts/workflow... however it seems like Pig is dropping records as when
>>> I
>>>> tried running the same thing as a simple grep | wc -l I get a completely
>>>> different result (2500 with Pig vs. 3300). The Pig script is as follows:
>>>> 
>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>>>> (timestamp:chararray,
>>>> ip:chararray,
>>>> userid:chararray,
>>>> dist:chararray,
>>>> clickid:chararray,
>>>> usra:chararray,
>>>> campaign:chararray,
>>>> clickurl:chararray,
>>>> plugin:chararray,
>>>> tab:chararray,
>>>> feature:chararray);
>>>> 
>>>> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>>>> 
>>>> dump B produces the following output:
>>>> 2009-11-19 18:50:46,013 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Records written : 2502
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Bytes written : 0
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Success!
>>>> 
>>>> 
>>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>>>> http://www.amazon | wc -l
>>>> 
>>>> Both sets of inputs are the same files... and I'm not sure where the
>>>> discrepancy is coming from. Any help would be greatly appreciated.
>>>> 
>>>> --
>>>> Zaki Rahaman
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Zaki Rahaman
>> 


Re: Is Pig dropping records?

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hmm... Are you sure that your records are separated by \n (newline)
and fields by \t (tab)? If so, would it be possible for you to upload your
dataset (possibly a smaller one) somewhere so that someone can take a look
at it?

Ashutosh
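
[Editorial sketch: if separators are the issue, a field-count histogram will
expose it. Records whose tab-separated field count differs from the 11 the
LOAD schema expects point at embedded tabs or newlines, which would also
explain Pig and `cut | wc -l` disagreeing. The file name and data below are
made up:]

```shell
# Tiny sample: one record with the expected 11 fields and one short
# record that lost a field.
printf 'f1\tf2\tf3\tf4\tf5\tf6\tf7\tf8\tf9\tf10\tf11\n' > clicklog.tsv
printf 'f1\tf2\tf3\tf4\tf5\tf6\tf7\tf8\tf9\tf10\n' >> clicklog.tsv

# Histogram of fields-per-record; every count should land on "11".
awk -F'\t' '{print NF}' clicklog.tsv | sort | uniq -c

# Count the suspect records directly.
awk -F'\t' 'NF != 11' clicklog.tsv | wc -l
```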

On Thu, Nov 19, 2009 at 14:35, zaki rahaman <za...@gmail.com> wrote:
> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
> ashutosh.chauhan@gmail.com> wrote:
>
>> Hi Zaki,
>>
>> Just to narrow down the problem, can you do:
>>
>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> dump A;
>>
>
> This produced 143710 records;
>
>
>> and
>>
>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> timestamp:chararray,
>> ip:chararray,
>> userid:chararray,
>> dist:chararray,
>> clickid:chararray,
>> usra:chararray,
>> campaign:chararray,
>> clickurl:chararray,
>> plugin:chararray,
>> tab:chararray,
>> feature:chararray);
>> dump A;
>>
>
>
> This produced 143710 records (so no problem there);
>
>
>> and
>>
>> cut -f8 *week.46*clickLog.2009* | wc -l
>>
>
>
> This produced...
> 175572
>
> Clearly, something is wrong...
>
>
> Thanks,
>> Ashutosh
>>
>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <za...@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > I have the following mini-script running as part of a larger set of
>> > scripts/workflow... however it seems like Pig is dropping records as when
>> I
>> > tried running the same thing as a simple grep | wc -l I get a completely
>> > different result (2500 with Pig vs. 3300). The Pig script is as follows:
>> >
>> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> > (timestamp:chararray,
>> > ip:chararray,
>> > userid:chararray,
>> > dist:chararray,
>> > clickid:chararray,
>> > usra:chararray,
>> > campaign:chararray,
>> > clickurl:chararray,
>> > plugin:chararray,
>> > tab:chararray,
>> > feature:chararray);
>> >
>> > B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>> >
>> > dump B produces the following output:
>> > 2009-11-19 18:50:46,013 [main] INFO
>> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> > 2009-11-19 18:50:46,058 [main] INFO
>> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - Records written : 2502
>> > 2009-11-19 18:50:46,058 [main] INFO
>> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - Bytes written : 0
>> > 2009-11-19 18:50:46,058 [main] INFO
>> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - Success!
>> >
>> >
>> > The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> > http://www.amazon | wc -l
>> >
>> > Both sets of inputs are the same files... and I'm not sure where the
>> > discrepancy is coming from. Any help would be greatly appreciated.
>> >
>> > --
>> > Zaki Rahaman
>> >
>>
>
>
>
> --
> Zaki Rahaman
>

Re: Is Pig dropping records?

Posted by zaki rahaman <za...@gmail.com>.
On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
ashutosh.chauhan@gmail.com> wrote:

> Hi Zaki,
>
> Just to narrow down the problem, can you do:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> dump A;
>

This produced 143710 records;


> and
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> timestamp:chararray,
> ip:chararray,
> userid:chararray,
> dist:chararray,
> clickid:chararray,
> usra:chararray,
> campaign:chararray,
> clickurl:chararray,
> plugin:chararray,
> tab:chararray,
> feature:chararray);
> dump A;
>


This produced 143710 records (so no problem there);


> and
>
> cut -f8 *week.46*clickLog.2009* | wc -l
>


This produced...
175572

Clearly, something is wrong...


Thanks,
> Ashutosh
>
> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <za...@gmail.com>
> wrote:
> > Hi All,
> >
> > I have the following mini-script running as part of a larger set of
> > scripts/workflow... however it seems like Pig is dropping records as when
> I
> > tried running the same thing as a simple grep | wc -l I get a completely
> > different result (2500 with Pig vs. 3300). The Pig script is as follows:
> >
> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> > (timestamp:chararray,
> > ip:chararray,
> > userid:chararray,
> > dist:chararray,
> > clickid:chararray,
> > usra:chararray,
> > campaign:chararray,
> > clickurl:chararray,
> > plugin:chararray,
> > tab:chararray,
> > feature:chararray);
> >
> > B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
> >
> > dump B produces the following output:
> > 2009-11-19 18:50:46,013 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> > 2009-11-19 18:50:46,058 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Records written : 2502
> > 2009-11-19 18:50:46,058 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Bytes written : 0
> > 2009-11-19 18:50:46,058 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Success!
> >
> >
> > The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
> > http://www.amazon | wc -l
> >
> > Both sets of inputs are the same files... and I'm not sure where the
> > discrepancy is coming from. Any help would be greatly appreciated.
> >
> > --
> > Zaki Rahaman
> >
>



-- 
Zaki Rahaman

Re: Is Pig dropping records?

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hi Zaki,

Just to narrow down the problem, can you do:

A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
dump A;

and

A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
timestamp:chararray,
ip:chararray,
userid:chararray,
dist:chararray,
clickid:chararray,
usra:chararray,
campaign:chararray,
clickurl:chararray,
plugin:chararray,
tab:chararray,
feature:chararray);
dump A;

and

cut -f8 *week.46*clickLog.2009* | wc -l

Thanks,
Ashutosh
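
[Editorial note: one thing worth knowing when comparing against that
pipeline is that by default `cut` passes through, unchanged, any line that
contains no delimiter at all, so malformed records can still reach the
`fgrep` and inflate the count. A small demonstration with a made-up file:]

```shell
# One well-formed record plus one record with no tabs at all.
printf 'a\tb\tc\td\te\tf\tg\thttp://www.amazon.com\ti\tj\tk\n' > demo.tsv
printf 'no tabs here http://www.amazon.com\n' >> demo.tsv

# Without -s, cut emits the delimiter-free line whole, so it matches too.
cut -f8 demo.tsv | grep -Fc 'http://www.amazon'

# With -s (--only-delimited), only the real field 8 is counted.
cut -s -f8 demo.tsv | grep -Fc 'http://www.amazon'
```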

On Thu, Nov 19, 2009 at 14:03, zaki rahaman <za...@gmail.com> wrote:
> Hi All,
>
> I have the following mini-script running as part of a larger set of
> scripts/workflow... however it seems like Pig is dropping records as when I
> tried running the same thing as a simple grep | wc -l I get a completely
> different result (2500 with Pig vs. 3300). The Pig script is as follows:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> (timestamp:chararray,
> ip:chararray,
> userid:chararray,
> dist:chararray,
> clickid:chararray,
> usra:chararray,
> campaign:chararray,
> clickurl:chararray,
> plugin:chararray,
> tab:chararray,
> feature:chararray);
>
> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>
> dump B produces the following output:
> 2009-11-19 18:50:46,013 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Records written : 2502
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Bytes written : 0
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!
>
>
> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
> http://www.amazon | wc -l
>
> Both sets of inputs are the same files... and I'm not sure where the
> discrepancy is coming from. Any help would be greatly appreciated.
>
> --
> Zaki Rahaman
>
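
[Editorial note: for what it's worth, Pig's `matches` follows Java's
String.matches() semantics, which require the pattern to cover the entire
value; that is why the posted filter wraps the URL in `.*`. `grep -x` gives
the same full-line semantics, so the two styles being compared in this
thread can be sketched as follows (the dots are escaped here, unlike in the
posted pattern, but that does not change the comparison):]

```shell
# A URL embedded mid-string, as a clickurl field might carry it.
echo 'redirect?to=http://www.amazon.com/item' > url.txt

# fgrep-style: any line merely containing the substring counts.
grep -Fc 'http://www.amazon' url.txt

# matches-style without the .* wrapping: the whole line must match,
# so the embedded URL is missed (grep prints 0 and exits nonzero).
grep -xc 'http://www\.amazon.*' url.txt || true

# matches-style with the posted .* wrapping: the embedded URL is caught.
grep -xc '.*http://www\.amazon.*' url.txt
```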