Posted to dev@pig.apache.org by Paul O'Leary <po...@quantivo.com> on 2008/09/25 19:49:39 UTC

FW: DISTINCT Problem

Trying to compile types branch to verify/check this problem.

I get the following compile error and have checked all the obvious
stuff:

$ ant compile
Buildfile: build.xml

init:

cc-compile:
   [javacc] Java Compiler Compiler Version 4.0 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file C:\dev\pig\test\org\apache\pig\test\utils\dotGraph\parser\Dot.jj . . .
   [javacc] Exception in thread "main" java.lang.Error: Invalid escape character at line 1 column 97.
   [javacc]     at org.javacc.parser.JavaCharStream.readChar(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParserTokenManager.getNextToken(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParser.jj_ntk(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParser.javacc_options(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParser.javacc_input(Unknown Source)
   [javacc]     at org.javacc.parser.Main.mainProgram(Unknown Source)
   [javacc]     at org.javacc.parser.Main.main(Unknown Source)

BUILD FAILED
C:\dev\pig\build.xml:151: C:\Program Files\Java\jdk1.5.0_06\jre\bin\java.exe failed with return code 1

Total time: 5 seconds

I am compiling on Windows, but I get the same error under Cygwin.

Any ideas?  Thanks for the help.
PaulO.

-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Wednesday, September 24, 2008 4:58 PM
To: pig-user@incubator.apache.org
Subject: RE: DISTINCT Problem

This could be a bug. Can you try it with a pig.jar built from the types
branch and see if you get the expected results?

Note that the types branch is still on Hadoop 17 but will move to Hadoop 18
later today.

Olga

> -----Original Message-----
> From: Paul O'Leary [mailto:poleary@quantivo.com] 
> Sent: Wednesday, September 24, 2008 3:57 PM
> To: pig-user@incubator.apache.org
> Subject: DISTINCT Problem
> 
> Hi All,
> 
>  
> 
> I seem to be seeing a problem with the DISTINCT operator.  I 
> have a script that looks like this:
> 
> raw_tran_hdr = load 'tran_hdr/tran_header' using PigStorage( '|' ) as ( ... many fields ... );
> tran_hdr_dist = DISTINCT raw_tran_hdr;
> b = GROUP tran_hdr_dist ALL;
> c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);
> 
> The data set 'tran_hdr/tran_header' has about 7M rows of 
> which I know for certain 14 are exact duplicates.  When I 
> execute the Pig script above I get the total row count; that 
> is, the number returned doesn't correctly drop out the duplicate rows.
> 
>  
> 
> There is a thread in the user group about previous DISTINCT 
> problems that sound just like this but JIRA says they're all 
> resolved.  The code I'm using is up-to-date with the trunk (@ 
> revision 698759) so I'm assuming I've picked up any fixes.
> 
>  
> 
> When (in a different script) I move the DISTINCT into a 
> nested FOREACH it fixes (or at least works around) the problem; e.g.:
> 
> (after COGROUP)
> 
> Z = FOREACH X
> {
>     thd = DISTINCT raw_tran_hdr;
>     GENERATE
>         FLATTEN( thd.(... many fields .... ) ),
>         FLATTEN( sale_line_calc.(... many fields ...) );
> }
> 
> I will continue to try to dig into the problem but any 
> guidance anyone can provide would be appreciated.  Maybe I'm 
> misunderstanding something.
> 
> As mentioned, I am successfully working around the issue 
> right now but - as a data junkie like I know you all are - 
> answers that look incorrect make me nervous.
> 
>  
> 
> BTW, I don't think this is just a counting issue with 
> DISTINCT (as the previous issues seem to allude to); when I 
> tried to use tran_hdr_dist to do a COGROUP (without counting) 
> I got wrong results.
> 
>  
> 
> Thanks,
> 
> PaulO.
> 
> 
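
A small reference check can help confirm what DISTINCT should return before comparing against Pig. The Python sketch below is only an illustration, not Pig's implementation: it treats each pipe-delimited line as one tuple of strings, and the helper names (parse_row, distinct_rows, duplicate_surplus) and the sample rows are hypothetical. Run against a sample of the input, the size of distinct_rows is the number the GROUP ... ALL / COUNT script quoted above would be expected to report if DISTINCT is working.

```python
from collections import Counter


def parse_row(line, delimiter="|"):
    """Split one delimited line into a tuple of fields (all kept as strings)."""
    return tuple(line.rstrip("\n").split(delimiter))


def distinct_rows(lines, delimiter="|"):
    """Return the set of distinct tuples, i.e. what DISTINCT is expected to keep."""
    return {parse_row(line, delimiter) for line in lines}


def duplicate_surplus(lines, delimiter="|"):
    """Count rows that are extra copies of another row (total minus distinct)."""
    counts = Counter(parse_row(line, delimiter) for line in lines)
    return sum(n - 1 for n in counts.values())


if __name__ == "__main__":
    # Hypothetical sample: 5 rows, one of which repeats an earlier row exactly.
    sample = [
        "1001|2008-09-01|store_a",
        "1002|2008-09-01|store_b",
        "1001|2008-09-01|store_a",
        "1003|2008-09-02|store_a",
        "1004|2008-09-02|store_c",
    ]
    total = len(sample)
    distinct = len(distinct_rows(sample))
    print(total, distinct, duplicate_surplus(sample))
```

If the count coming back from Pig equals the total row count rather than the total minus the surplus, the duplicate rows were not dropped.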


RE: DISTINCT Problem

Posted by Paul O'Leary <po...@quantivo.com>.
Hadoop 18.

Thanks, Olga.  If you send me the build I'll try to reproduce the
DISTINCT problem asap.

PaulO.

-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Thursday, September 25, 2008 12:28 PM
To: pig-dev@incubator.apache.org
Subject: RE: DISTINCT Problem

We are not seeing it on our builds, but none of us runs on Windows. While
we figure out what is going on, I can send you a jar file built from the
types branch. Are you running with a Hadoop 17 or a Hadoop 18 cluster?

Olga 



RE: DISTINCT Problem

Posted by Paul O'Leary <po...@quantivo.com>.
Olga,

Never saw any JAR file come through... lost in the mail?

Could you please try again?  Happy to test this on my end.

Cheers,
PaulO.

-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Tuesday, October 07, 2008 8:34 AM
To: pig-dev@incubator.apache.org
Subject: RE: DISTINCT Problem

Hi Paul,

Did you try the jar file that I sent you?

Olga 




RE: DISTINCT Problem

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Hi Paul,

Did you try the jar file that I sent you?

Olga 


RE: DISTINCT Problem

Posted by Paul O'Leary <po...@quantivo.com>.
Hi Olga et al,

I have made a little progress in figuring out what's going on for me
with the types branch.

When 'test\utils\dotGraph\parser\Dot.jj' is generated the first line of
the generated file looks like this:

/*@bgen(jjtree) Generated By:JJTree: Do not edit this line. C:\dev\pig\test\org\apache\pig\test\utils\dotGraph\parser\Dot.jj */
/*@egen*/options {

Long story short, it chokes (believe it or not) on the '\u' of
'...test\utils...' because it treats it as the start of a Unicode
escape.  I believe this is a known issue with Java compilation, and it
explains why it seems to be a Windows-only problem.

Don't know what the 'fix' is for this (other than getting off Windows)
but I can work around the problem by whacking the JJ file directly.
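
To see why only the 'utils' directory trips the parser, the following Python sketch scans a line for a backslash followed by 'u' that is not followed by four hexadecimal digits, which is the shape a Java-style character reader would reject as an invalid escape. It is an illustration, not JavaCC's code: the function names are hypothetical, the generated header is assumed to be a single line joined with single spaces, and rewriting the path with forward slashes is only one possible way to edit the file.

```python
import re

# Backslash + "u" NOT followed by four hex digits. Simplified on purpose:
# it ignores doubled backslashes and repeated "u" characters.
BAD_UNICODE_ESCAPE = re.compile(r"\\u(?![0-9A-Fa-f]{4})")


def bad_escape_columns(line):
    """Return the 1-based column of each 'u' that starts a malformed escape."""
    # m.start() is the 0-based index of the backslash, so the 'u' after it
    # sits at 1-based column m.start() + 2.
    return [m.start() + 2 for m in BAD_UNICODE_ESCAPE.finditer(line)]


def with_forward_slashes(line):
    """Rewrite Windows separators so no backslash-u sequence remains."""
    return line.replace("\\", "/")


if __name__ == "__main__":
    # Header line as quoted in this message, assumed to be one line.
    header = (
        "/*@bgen(jjtree) Generated By:JJTree: Do not edit this line. "
        r"C:\dev\pig\test\org\apache\pig\test\utils\dotGraph\parser\Dot.jj */"
    )
    print(bad_escape_columns(header))
    print(bad_escape_columns(with_forward_slashes(header)))
```

Under the single-line assumption, the first call reports column 97, the same position as in the javacc error, and the second reports nothing once the backslashes are gone.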

However, when I do this (on the types branch) I seem to see a problem
where I can't execute any commands from the grunt shell.  Every command
fails immediately with a parse error:

2008-10-06 17:51:31,421 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt> ls
org.apache.pig.tools.pigscript.parser.ParseException: Encountered "l" at line 1, column 1.
Was expecting one of:
    <EOF>
    "cat" ...
    "cd" ...
<snip>

Don't know if this is related to the same issue above or caused by my
having to whack the file or what...?

Anyway, this all goes back to the DISTINCT problem I was seeing.  I'd be
happy to try to recreate the problem on the types branch if I can get it
built or someone can provide me with a JAR to test with.

Thanks,
PaulO.

-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Thursday, September 25, 2008 12:28 PM
To: pig-dev@incubator.apache.org
Subject: RE: DISTINCT Problem

We are not seeing it on our builds, but none of us runs on Windows. While
we figure out what is going on, I can send you a jar file built from the
types branch. Are you running with a Hadoop 17 or a Hadoop 18 cluster?

Olga 




RE: DISTINCT Problem

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
We are not seeing it on our builds, but none of us runs on Windows. While
we figure out what is going on, I can send you a jar file built from the
types branch. Are you running with a Hadoop 17 or a Hadoop 18 cluster?

Olga 


RE: DISTINCT Problem

Posted by Paul O'Leary <po...@quantivo.com>.
Yes. I cannot currently compile the 'types' branch.

-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Thursday, September 25, 2008 11:16 AM
To: pig-dev@incubator.apache.org
Subject: RE: DISTINCT Problem

Is this repeatable? 



RE: DISTINCT Problem

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Is this repeatable? 
