Posted to dev@pig.apache.org by "Ranjit Mathew (JIRA)" <ji...@apache.org> on 2011/01/07 04:17:45 UTC

[jira] Created: (PIG-1792) Skewed Join Taking Too Long and Producing Too Much Data

Skewed Join Taking Too Long and Producing Too Much Data
-------------------------------------------------------

                 Key: PIG-1792
                 URL: https://issues.apache.org/jira/browse/PIG-1792
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Ranjit Mathew


With Pig 0.8.0 and Hadoop 0.20, a skewed join takes too long and produces too much
data.

Using the data-generator from PIG-200, I generated two relations:
--------------------------------- 8< ---------------------------------
3881312410   page_views
4370223      queryterm
--------------------------------- 8< ---------------------------------
(The first column represents the size in bytes of the relation in HDFS. So "page_views"
was around 4,700 MiB and "queryterm" was around 4 MiB.)

"queryterm" was generated from "page_views" using this Pig snippet:
--------------------------------- 8< ---------------------------------
pig << @EOF
-- Load the full page_views relation produced by the PIG-200 data generator.
A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue,
        page_info, page_links);
-- Keep only the join key, then take a 20% random sample of it.
B = foreach A generate query_term;
C = sample B 0.2;
store C into 'queryterm';
@EOF
--------------------------------- 8< ---------------------------------

To test skewed join, I used the following script:
--------------------------------- 8< ---------------------------------
-- Large side of the join.
A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue,
        page_info, page_links);
-- Small side: the sampled query terms.
B = load 'queryterm' as (query_term);
-- Skewed join on query_term, using 40 reducers.
C = join A by query_term, B by query_term using 'skewed' parallel 40;
store C into 'L18out';
--------------------------------- 8< ---------------------------------

I had to abort this script after it had run for about 18.5 hours and had generated
about 7 TiB of data. :-(
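
For context on output volume: an inner equi-join emits, for each key, the product of that key's row counts on the two sides, whichever join strategy is used. The sketch below illustrates this on made-up toy data. It is written in Python rather than Pig so that it runs without a cluster, the function name and the key lists are hypothetical, and it is not a diagnosis of this issue.

```python
from collections import Counter


def equijoin_row_count(left_keys, right_keys):
    """Rows emitted by an inner equi-join: sum over keys of left_count * right_count."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(n * right[key] for key, n in left.items() if key in right)


# Hypothetical toy data: a single "hot" key dominates both sides.
page_views_keys = ["hot"] * 1000 + ["a", "b", "c"]
queryterm_keys = ["hot"] * 200 + ["a"]

rows = equijoin_row_count(page_views_keys, queryterm_keys)
print(f"input rows: {len(page_views_keys) + len(queryterm_keys)}, join rows: {rows}")
```

Here roughly 1,200 input rows yield about 200,000 joined rows, almost all from the one repeated key. Running the same kind of count on the real query_term values would indicate how much output the join is expected to produce.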

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] [Updated] (PIG-1792) Skewed Join Taking Too Long and Producing Too Much Data

Posted by "Olga Natkovich (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1792:
--------------------------------

    Fix Version/s:     (was: 0.10)
    

[jira] Updated: (PIG-1792) Skewed Join Taking Too Long and Producing Too Much Data

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1792:
--------------------------------

    Fix Version/s: 0.10
         Assignee: Thejas M Nair


[jira] Commented: (PIG-1792) Skewed Join Taking Too Long and Producing Too Much Data

Posted by "Ranjit Mathew (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978624#action_12978624 ] 

Ranjit Mathew commented on PIG-1792:
------------------------------------

> So "page_views" was around 4,700 MiB

Err...I meant 3,700 MiB.
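
The corrected figure can be checked from the byte counts in the report. A minimal sketch, in Python so it runs standalone; the helper name is an illustration only, and 1 MiB is taken as 1024 * 1024 bytes:

```python
MIB = 1024 ** 2  # bytes per MiB


def bytes_to_mib(n_bytes):
    """Convert a size in bytes to MiB."""
    return n_bytes / MIB


# Sizes in bytes as listed in the issue description.
sizes = {"page_views": 3881312410, "queryterm": 4370223}

for name, n_bytes in sizes.items():
    print(f"{name}: {bytes_to_mib(n_bytes):,.1f} MiB")
# page_views: 3,701.5 MiB
# queryterm: 4.2 MiB
```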

One thing I noticed was that almost all the time was being spent in reducers
and the reducers were finishing their tasks very slowly.
