Posted to dev@pig.apache.org by "Daniel Lescohier (JIRA)" <ji...@apache.org> on 2009/02/12 20:50:59 UTC

[jira] Created: (PIG-672) bad data output from STREAM operator in trunk (regression from 0.1.1)

bad data output from STREAM operator in trunk (regression from 0.1.1)
---------------------------------------------------------------------

                 Key: PIG-672
                 URL: https://issues.apache.org/jira/browse/PIG-672
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
         Environment: Red Hat Enterprise Linux 4 & 5
Hadoop 0.18.2
            Reporter: Daniel Lescohier
            Priority: Critical


In the 0.1.1 release of pig, all of the following works fine; the problem is in the trunk version.  Here's a brief intro to the workflow (details below):

 * I have 174856784 lines of input data; each line is a unique title string.
 * I stream the data through `sha1.py`, which outputs a sha1 hash of each input line: a string of 40 hexadecimal digits.
 * I group on the hash, generating a count for each group, then filter on rows having a count > 1.
 * With pig 0.1.1, this outputs all 0-byte part-* files, because all the hashes are unique.
 * I've also verified, entirely outside of Hadoop using sort and uniq, that the hashes are unique (a minimal equivalent check is sketched after this list).
 * A pig trunk checkout with "last changed rev 737863" returns non-empty results; the 7 part-* files are 1.5MB each.
 * I've tracked it down to the STREAM operation (details below).
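
A minimal local sketch of that uniqueness check, assuming the hash output has been copied to a local directory (the directory name is hypothetical), would mirror the group / COUNT / filter cnt > 1 logic of the Pig jobs below:

#!/usr/bin/env python
# Count how often each hash line appears across the part-* files and
# report anything seen more than once (the Pig jobs below do the same
# with group / COUNT / filter cnt > 1).  'title_hash_local' is a
# hypothetical local copy of the HDFS output directory.
import glob

counts = {}
for path in glob.glob('title_hash_local/part-*'):
    f = open(path)
    for line in f:
        h = line.rstrip('\n')
        counts[h] = counts.get(h, 0) + 1
    f.close()

dupes = [(h, c) for h, c in counts.items() if c > 1]
print("%d duplicate hashes" % len(dupes))
for h, c in sorted(dupes)[:10]:  # show a small sample
    print("%s\t%d" % (h, c))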

Here's the pig-svn-trunk job that produces the hashes:

set job.name 'title hash';
DEFINE Cmd `sha1.py` ship('sha1.py');
row = load '/home/danl/url_title/unique_titles';
hashes = stream row through Cmd;
store hashes into '/home/danl/url_title/title_hash';

Here's the pig-0.1.1 job that produces the hashes:

set job.name 'title hash 011';
DEFINE Cmd `sha1.py` ship('sha1.py');
row = load '/home/danl/url_title/unique_titles';
hashes = stream row through Cmd;
store hashes into '/home/danl/url_title/title_hash.011';

Here's sha1.py:

#!/opt/cnet-python/default-2.5/bin/python
from sys import stdin, stdout
from hashlib import sha1

for line in stdin:
    h = sha1()
    h.update(line[:-1])
    stdout.write("%s\n" % h.hexdigest())
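
The streaming contract here is simple: sha1.py writes exactly one 40-character hash line per input line, so the streamed output should contain exactly as many lines as the input, with nothing repeated. A minimal local sanity check of that contract, assuming a small sample of titles in a file (sample_titles.txt is a hypothetical name), is:

#!/usr/bin/env python
# Pipe a sample of titles through sha1.py and confirm that the number of
# hash lines matches the number of input lines.  'sample_titles.txt' is a
# hypothetical sample file.
import subprocess

n_in = sum(1 for _ in open('sample_titles.txt'))

proc = subprocess.Popen(['python', 'sha1.py'],
                        stdin=open('sample_titles.txt'),
                        stdout=subprocess.PIPE,
                        universal_newlines=True)
out, _ = proc.communicate()
n_out = len(out.splitlines())

assert n_in == n_out, "expected %d hashes, got %d" % (n_in, n_out)
print("OK: %d lines in, %d hashes out" % (n_in, n_out))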

Here's the pig-svn-trunk job for finding duplicate hashes from the hashes data generated by pig-svn-trunk:

set job.name 'h40';
hash = load '/home/danl/url_title/title_hash';
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40 are 1.5MB each.

Here's the pig-0.1.1 job for finding duplicate hashes from the hashes data generated by pig-0.1.1:

set job.name 'h40.011.nh';
hash = load '/home/danl/url_title/title_hash.011';
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40.011.nh';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011.nh are 0KB each.

Here's the pig-0.1.1 job for finding duplicate hashes from the hashes data generated by pig-svn-trunk:

set job.name 'h40.011';
hash = load '/home/danl/url_title/title_hash';
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40.011';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011 are 1.5MB each.

Therefore, it's the hash data generated by pig-svn-trunk (/home/danl/url_title/title_hash) which has duplicates in it.

Here are the first six lines of /home/danl/url_title/title_hash/part-00064.  You can see that lines five and six are duplicates.  It looks like the stream operator read the same line twice from the Python program.  The job that produces the hashes is a map-only job, with no reduces.

8f3513136b1c8b87b8b73b9d39d96555095e9cdd
2edb20c5a3862cc5f545ae649f1e26430a38bda4
ca9c216629fce16b4c113c0d9fcf65f906ab5e04
03fe80633822215a6935bcf95305bb14adf23f18
03fe80633822215a6935bcf95305bb14adf23f18
6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c
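
A quick way to spot this kind of adjacent repeat, assuming the part file has been copied locally, is to compare each line with the previous one:

#!/usr/bin/env python
# Report adjacent duplicate lines in a locally copied part file.
# 'part-00064' is assumed to have been fetched from HDFS first.
prev = None
for lineno, line in enumerate(open('part-00064'), 1):
    if line == prev:
        print("adjacent duplicate at line %d: %s" % (lineno, line.rstrip('\n')))
    prev = line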

After narrowing it down to the stream operator in pig-svn-trunk, I decided to run the find-dupes job again using pig-svn-trunk, but to first pipe the data through cat.  Cat shouldn't change the data at all; it's an identity operation.  Here's the job:

set job.name 'h40.cat';
DEFINE Cmd `cat`;
row = load '/home/danl/url_title/title_hash';
hash = stream row through Cmd;
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40.cat';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.cat are 7.2MB each.  This 'h40.cat' job should produce the same results as the 'h40' job, yet the 'h40' job produced part-* files of 1.5MB each while this job produced 7.2MB each; piping the data through `cat` introduced even more duplicates, even though `cat` is not supposed to change the results at all.

Under pig-svn-trunk I also ran a 'title hash.r2' job that re-created the hashes in another directory, just to make sure it wasn't a fluke run that produced the duplicate hashes.  The second run also produced duplicates.  Running the dupe-detection pig job under pig-0.1.1 against the hashes produced by the second pig-svn-trunk run, I again got 1.5MB output files.

For a final test, I ran the dupe-detection code in pig-svn-trunk on the hash data generated by pig-0.1.1:

set job.name 'h40.trk.nh';
hash = load '/home/danl/url_title/title_hash.011';
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40.trk.nh';

The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.trk.nh are 0 bytes.  It's clear that it's the stream operation running in pig-svn-trunk which is producing the duplicates.

Here is the complete svn info of the checkout I built pig from:

Path: .
URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 737873
Node Kind: directory
Schedule: normal
Last Changed Author: pradeepkth
Last Changed Rev: 737863
Last Changed Date: 2009-01-26 13:27:16 -0800 (Mon, 26 Jan 2009)

When I built it, I also ran all the unit tests.

This was all run on Hadoop 0.18.2.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PIG-672) bad data output from STREAM operator in trunk (regression from 0.1.1)

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath resolved PIG-672.
--------------------------------

    Resolution: Invalid

Closing the issue based on the last comment. Please reopen if there is still an issue. 



[jira] Commented: (PIG-672) bad data output from STREAM operator in trunk (regression from 0.1.1)

Posted by "Daniel Lescohier (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673069#action_12673069 ] 

Daniel Lescohier commented on PIG-672:
--------------------------------------

I tested with trunk 743881, and the problem is solved.  I recreated the hashes by streaming through sha1.py, then ran the dupe-checking query.  The output was 0 bytes.

set job.name 'title hash 743881';
DEFINE Cmd `sha1.py` ship('sha1.py');
row = load '/home/danl/url_title/unique_titles';
hashes = stream row through Cmd;
store hashes into '/home/danl/url_title/title_hash.743881';

set job.name 'h40.011.nh';
hash = load '/home/danl/url_title/title_hash.743881';
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40.743881';


I thought of a potential unit test: run an input file through the following and check whether the output file differs at all from the input file:

I = load 'seq.in.txt';
CAT = stream I through `cat`;
store CAT into 'seq.out.txt';

I did this in pig -x local mode and it never showed a problem, but I realize now from PIG-645 that the bug only occurred in mapreduce mode.
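
A rough driver for that test, assuming the file names above and that the stored output is fetched back locally before checking (per PIG-645 the bug only appeared in mapreduce mode, so the Pig script would have to run against the cluster, with seq.in.txt copied into HDFS):

#!/usr/bin/env python
# Driver for the identity test above.  File names and the way results are
# collected are assumptions: generate seq.in.txt, run the three-line Pig
# script (in mapreduce mode, per PIG-645, with seq.in.txt copied to HDFS),
# fetch the output part files into a local seq.out.txt directory, then run
# this again to verify.
import glob
import os

if not os.path.exists('seq.in.txt'):
    f = open('seq.in.txt', 'w')
    for i in range(1000000):
        f.write('line-%09d\n' % i)
    f.close()
    print("wrote seq.in.txt; now run the Pig script and fetch seq.out.txt")
else:
    # The streamed output must be exactly a permutation of the input:
    # same line count, nothing duplicated, nothing missing.
    expected = sorted(open('seq.in.txt'))
    actual = []
    for path in sorted(glob.glob('seq.out.txt/part-*')):
        actual.extend(open(path))
    assert sorted(actual) == expected, "stream through `cat` changed the data"
    print("identity check passed: %d lines" % len(actual))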




[jira] Commented: (PIG-672) bad data output from STREAM operator in trunk (regression from 0.1.1)

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673039#action_12673039 ] 

Pradeep Kamath commented on PIG-672:
------------------------------------

Can you try the latest svn trunk?  There was an issue with streaming which was fixed on 1/29:

== from svn log of CHANGES.txt==
r739142 | pradeepkth | 2009-01-29 18:12:18 -0800 (Thu, 29 Jan 2009) | 1 line

PIG-645: Streaming is broken with the latest trunk
===========================

