Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2009/02/12 23:26:59 UTC

[jira] Resolved: (PIG-672) bad data output from STREAM operator in trunk (regression from 0.1.1)

     [ https://issues.apache.org/jira/browse/PIG-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath resolved PIG-672.
--------------------------------

    Resolution: Invalid

Closing the issue based on the last comment. Please reopen if there is still an issue. 

> bad data output from STREAM operator in trunk (regression from 0.1.1)
> ---------------------------------------------------------------------
>
>                 Key: PIG-672
>                 URL: https://issues.apache.org/jira/browse/PIG-672
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>         Environment: Red Hat Enterprise Linux 4 & 5
> Hadoop 0.18.2
>            Reporter: Daniel Lescohier
>            Priority: Critical
>
> In the 0.1.1 release of pig, everything below works fine; the problem appears only in the trunk version.  Here's a brief overview of the workflow (details below):
>  * I have 174856784 lines of input data; each line is a unique title string.
>  * I stream the data through `sha1.py`, which outputs a sha1 hash of each input line: a string of 40 hexadecimal digits.
>  * I group on the hash, generating a count of each group, then filter on rows having a count > 1.
>  * With pig 0.1.1, the duplicate-finding job outputs all 0-byte part-* files, because all the hashes are unique.
>  * I've also verified, entirely outside of Hadoop using sort and uniq, that the hashes are unique (a rough Python equivalent of that check is sketched right after this list).
>  * A pig trunk checkout with "last changed rev 737863" returns non-empty results; the 7 part-* files are 1.5MB each.
>  * I've tracked it down to the STREAM operation (details below).
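> That sort/uniq uniqueness check boils down to roughly the following Python sketch (the local file name is only a placeholder for wherever the hash output has been copied):
> from sys import stdout
> from collections import defaultdict
> # Count how many hash values occur more than once in a local copy of the
> # output -- roughly what sort | uniq -d reports.
> counts = defaultdict(int)
> for line in open('title_hash.local'):
>     counts[line.rstrip('\n')] += 1
> dupes = sum(1 for c in counts.values() if c > 1)
> stdout.write("%d duplicate hashes\n" % dupes)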
> Here's the pig-svn-trunk job that produces the hashes:
> set job.name 'title hash';
> DEFINE Cmd `sha1.py` ship('sha1.py');
> row = load '/home/danl/url_title/unique_titles';
> hashes = stream row through Cmd;
> store hashes into '/home/danl/url_title/title_hash';
> Here's the pig-0.1.1 job that produces the hashes:
> set job.name 'title hash 011';
> DEFINE Cmd `sha1.py` ship('sha1.py');
> row = load '/home/danl/url_title/unique_titles';
> hashes = stream row through Cmd;
> store hashes into '/home/danl/url_title/title_hash.011';
> Here's sha1.py:
> #!/opt/cnet-python/default-2.5/bin/python
> # Emit the SHA-1 hex digest of each input line, excluding its trailing newline.
> from sys import stdin, stdout
> from hashlib import sha1
> for line in stdin:
>     h = sha1()
>     h.update(line[:-1])
>     stdout.write("%s\n" % h.hexdigest())
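> For what it's worth, a quick local smoke test along the following lines (it assumes sha1.py is executable in the current directory, and the input lines are made up) confirms the script itself emits exactly one distinct 40-character hash per distinct input line:
> # Python 2 style, matching sha1.py's own interpreter.
> from subprocess import Popen, PIPE
> # Feed a few distinct lines through sha1.py and confirm we get back one
> # distinct 40-character hash per input line.
> p = Popen(['./sha1.py'], stdin=PIPE, stdout=PIPE)
> out, _ = p.communicate('first title\nsecond title\nthird title\n')
> hashes = out.splitlines()
> assert len(hashes) == 3
> assert len(set(hashes)) == 3
> assert all(len(h) == 40 for h in hashes)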
> Here's the pig-svn-trunk job for finding duplicate hashes from the hashes data generated by pig-svn-trunk:
> set job.name 'h40';
> hash = load '/home/danl/url_title/title_hash';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40 are 1.5MB each.
> Here's the pig-0.1.1 job for finding duplicate hashes from the hashes data generated by pig-0.1.1:
> set job.name 'h40.011.nh';
> hash = load '/home/danl/url_title/title_hash.011';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.011.nh';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011.nh are 0KB each.
> Here's the pig-0.1.1 job for finding duplicate hashes from the hashes data generated by pig-svn-trunk:
> set job.name 'h40.011';
> hash = load '/home/danl/url_title/title_hash';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.011';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011 are 1.5MB each.
> Therefore, it's the hash data generated by pig-svn-trunk (/home/danl/url_title/title_hash) which has duplicates in it.
> Here are the first six lines of /home/danl/url_title/title_hash/part-00064.  You can see that lines four and five are duplicates.  It looks as though the stream operator read the same line twice from the Python program.  The job that produces the hashes is a map-only job, with no reduces.
> 8f3513136b1c8b87b8b73b9d39d96555095e9cdd
> 2edb20c5a3862cc5f545ae649f1e26430a38bda4
> ca9c216629fce16b4c113c0d9fcf65f906ab5e04
> 03fe80633822215a6935bcf95305bb14adf23f18
> 03fe80633822215a6935bcf95305bb14adf23f18
> 6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c
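> Incidentally, spotting those back-to-back repeats doesn't need the full group/count job; a short Python scan over a local copy of one part file (the file name below is just a placeholder) is enough:
> from sys import stdout
> # Count lines that are identical to the line immediately before them.
> prev = None
> adjacent_dupes = 0
> for line in open('part-00064.local'):
>     if line == prev:
>         adjacent_dupes += 1
>     prev = line
> stdout.write("%d adjacent duplicate lines\n" % adjacent_dupes)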
> After narrowing it down to the stream operator in pig-svn-trunk, I decided to run the find-dupes job again using pig-svn-trunk, but to first pipe the data through cat.  `cat` shouldn't change the data at all; it's an identity operation.  Here's the job:
> set job.name 'h40.cat';
> DEFINE Cmd `cat`;
> row = load '/home/danl/url_title/title_hash';
> hash = stream row through Cmd;
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.cat';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.cat are 7.2MB each.  This 'h40.cat' job should produce the same results as the 'h40' job, but the 'h40' job's part-* files were 1.5MB each while this job's are 7.2MB each.  So piping the data through `cat` produced even more duplicates, even though `cat` should not change the results at all.
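> For completeness, the same identity test could be expressed as a trivial Python streaming script instead of `cat` (just an illustration; it was not part of the runs above):
> #!/usr/bin/env python
> # Identity filter: copy stdin to stdout unchanged, like `cat`.
> from sys import stdin, stdout
> for line in stdin:
>     stdout.write(line)
> It would plug into the same job via DEFINE Cmd `identity.py` ship('identity.py'); and would rule out anything specific to `cat` itself.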
> I also ran, under pig-svn-trunk, a 'title hash.r2' job that regenerated the hashes into another directory, just to make sure it wasn't a fluke run that produced the duplicate hashes.  The second run also produced duplicates.  Running the dupe-detection pig job under pig-0.1.1 against the hashes produced by pig-svn-trunk in that second run, I again got 1.5MB output files.
> For a final test, I ran in pig-svn-trunk the dupe-detection code on hash data generated from pig-0.1.1:
> set job.name 'h40.trk.nh';
> hash = load '/home/danl/url_title/title_hash.011';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.trk.nh';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.trk.nh are 0 bytes.  It's clear that the stream operation running in pig-svn-trunk is what's producing the duplicates.
> Here is the complete svn info of the checkout I built pig from:
> Path: .
> URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 737873
> Node Kind: directory
> Schedule: normal
> Last Changed Author: pradeepkth
> Last Changed Rev: 737863
> Last Changed Date: 2009-01-26 13:27:16 -0800 (Mon, 26 Jan 2009)
> When I built it, I also ran all the unit tests.
> This was all run on Hadoop 0.18.2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.