Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2010/06/18 00:55:24 UTC

[jira] Created: (PIG-1458) aggregate files for replicated join

aggregate files for replicated join
-----------------------------------

                 Key: PIG-1458
                 URL: https://issues.apache.org/jira/browse/PIG-1458
             Project: Pig
          Issue Type: Improvement
            Reporter: Olga Natkovich


We have noticed that if the smaller input of a replicated join consists of many files, this puts an unneeded burden on the name node. Pre-aggregating the files can improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903895#action_12903895 ] 

Thejas M Nair commented on PIG-1458:
------------------------------------

Another comment about the patch:
- The test testUnknownNumMaps2 is the same as testUnknownNumMaps; it should be removed.


A note about the 2nd case described in the first comment:
bq. 2.  The right input is a map-only job and input files do not exist at the compile time.

When the input of the map-only job does not exist at compile time, in most (perhaps all) cases it should be possible to determine the number of files by looking at the previous MR operator (or the ones before it).
Also, with the current implementation, the checks on the number of files are done before the MR jobs are merged together, so there will be cases where the final plan has only one MR job whose replicated input already exists, yet Pig still treats it as case 2.

The example used in testUnknownNumMaps() has only one input MR job, with inputs that exist at compile time, but if pig.frjoin.merge.files.optimistic=false, Pig will create an additional MR job that combines the input:
{code}
A = LOAD '" + INPUT_FILE + "' as (x:int,y:int);
B = Filter A by x < 50;
C = join A by $0, B by $0 using 'repl';
{code}


> aggregate files for replicated join
> -----------------------------------
>
>                 Key: PIG-1458
>                 URL: https://issues.apache.org/jira/browse/PIG-1458
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1458.patch
>
>
> We have noticed that if the smaller data in replicated join has many files, this puts  unneeded burden on the name node. pre-aggregating the files can improve the situation


[jira] Updated: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1458:
------------------------------

    Attachment: PIG-1458.patch

This patch uses the new multi-file-combiner (PIG-1518) to concatenate many small files for replicated join. This is based on the assumption that the total size of the replicated files should be small enough to fit into main memory. 


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904346#action_12904346 ] 

Koji Noguchi commented on PIG-1458:
-----------------------------------

Can we increase the replication to 10 for the aggregated file (if not already done)?


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897484#action_12897484 ] 

Richard Ding commented on PIG-1458:
-----------------------------------

For 1. and 2. above, another approach is to do nothing and rely on MultiFileInputFormat (PIG-1518) to merge small files. 


[jira] Updated: (PIG-1458) aggregate files for replicated join

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1458:
--------------------------------

    Fix Version/s: 0.8.0


[jira] Assigned: (PIG-1458) aggregate files for replicated join

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1458:
-----------------------------------

    Assignee: Richard Ding


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904451#action_12904451 ] 

Richard Ding commented on PIG-1458:
-----------------------------------

Patch committed to trunk.


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904358#action_12904358 ] 

Thejas M Nair commented on PIG-1458:
------------------------------------

+1


[jira] Updated: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1458:
------------------------------

    Attachment: PIG-1458_1.patch

New patch addressing review comments.


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903892#action_12903892 ] 

Thejas M Nair commented on PIG-1458:
------------------------------------

+1
Looks good. Some minor comments:

- If the preceding op is a native MR job (for the native mapreduce operator), we don't know how many reducers will be run, so Pig should use the pig.frjoin.merge.files.optimistic property in that case. For a native MR job, the map plan will be empty, so currently the check for the number of roots will return false.

- If one input has been found to have several files, we can stop there and avoid checking the other inputs. That is, the first snippet below could be changed to the second:
{code}
      } else if (!frJoinOptimisticFileMerge) {
          // file doesn't exist yet. Treat it as having too many
          // files
          numFiles = frJoinFileMergeThreshold;
      }
{code}
{code}
      } else if (!frJoinOptimisticFileMerge) {
          // file doesn't exist yet. Treat it as having too many
          // files
          return true;
      }
{code}
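
To make the early-exit suggestion concrete, here is a minimal, self-contained sketch. It is not the code from the patch: the class name, the method name and the use of OptionalInt to model "input does not exist yet" are all hypothetical, and the inclusive comparison (>=) is an assumption, chosen so that the quoted assignment numFiles = frJoinFileMergeThreshold would count as "too many".

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical sketch of an early-exit file-count check; not the patch code. */
final class TooManyFilesCheck {

    /**
     * @param fileCounts one entry per replicated input; an empty value models
     *                   an input that does not exist yet at compile time
     * @param threshold  value of pig.frjoin.merge.files.threshold
     * @param optimistic value of pig.frjoin.merge.files.optimistic
     */
    static boolean hasTooManyFiles(List<OptionalInt> fileCounts,
                                   int threshold, boolean optimistic) {
        int numFiles = 0;
        for (OptionalInt count : fileCounts) {
            if (count.isPresent()) {
                numFiles += count.getAsInt();
                if (numFiles >= threshold) { // assumed inclusive comparison
                    return true; // stop here; remaining inputs are not checked
                }
            } else if (!optimistic) {
                // Input doesn't exist yet. Treat it as having too many files.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The threshold of 100 is an illustrative value only.
        List<OptionalInt> inputs = Arrays.asList(
                OptionalInt.of(3), OptionalInt.empty(), OptionalInt.of(500));
        System.out.println(hasTooManyFiles(inputs, 100, false)); // true
        System.out.println(hasTooManyFiles(inputs, 100, true));  // true
    }
}
```

In the first call the loop returns at the missing input without looking at the third one; in the second call the missing input is skipped and the running total crosses the threshold at the third input.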




[jira] Resolved: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding resolved PIG-1458.
-------------------------------

    Hadoop Flags: [Reviewed]
      Resolution: Fixed


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904385#action_12904385 ] 

Richard Ding commented on PIG-1458:
-----------------------------------

Koji,

Please open a jira on increasing the replication factor of the replicated files. Currently it uses the default replication factor.

Thanks,
-Richard 


[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903898#action_12903898 ] 

Thejas M Nair commented on PIG-1458:
------------------------------------

What I described under 'A note about the 2nd case described in first comment -' in the previous comment is a change that can be done as part of a separate jira.



[jira] Commented: (PIG-1458) aggregate files for replicated join

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897451#action_12897451 ] 

Richard Ding commented on PIG-1458:
-----------------------------------

The proposal is to run another map-reduce job to merge the small files before the replicated join. This additional job will be added to the MR plan at compile time.

We consider three cases of a replicated join:

# The right input is a map-only job and input files exist at compile time.
# The right input is a map-only job and input files do not exist at compile time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 2., if the flag specified in the property file (_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between the right input job and the FR join job. The default value of this flag is false.
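
The three cases above can be sketched as a single decision function. This is only an illustration, not Pig's compiler code: the class, enum and method names are hypothetical, and the strict comparison (>) is an assumption taken from the word "exceeds"; whether the boundary is inclusive is not stated here.

```java
/** Hypothetical sketch of the merge decision; not Pig's actual compiler code. */
final class FrJoinMergeDecision {

    /** The three shapes of the right (replicated) input listed above. */
    enum RightInput { MAP_ONLY_EXISTING_FILES, MAP_ONLY_MISSING_FILES, MAP_REDUCE }

    /**
     * @param kind       which of the three cases applies
     * @param count      number of input files (case 1) or reducers (case 3);
     *                   ignored in case 2
     * @param threshold  value of pig.frjoin.merge.files.threshold
     * @param optimistic value of pig.frjoin.merge.files.optimistic
     * @return true if a merge job should be added before the FR join job
     */
    static boolean needsMergeJob(RightInput kind, int count,
                                 int threshold, boolean optimistic) {
        switch (kind) {
            case MAP_ONLY_EXISTING_FILES: // case 1: count the input files
            case MAP_REDUCE:              // case 3: count the reducers
                return count > threshold; // assumed strict comparison
            case MAP_ONLY_MISSING_FILES:  // case 2: nothing to count yet
                return !optimistic;
            default:
                throw new IllegalArgumentException("unknown input kind: " + kind);
        }
    }

    public static void main(String[] args) {
        int threshold = 100; // illustrative value only
        System.out.println(needsMergeJob(RightInput.MAP_ONLY_EXISTING_FILES, 250, threshold, false)); // true
        System.out.println(needsMergeJob(RightInput.MAP_REDUCE, 20, threshold, false));               // false
        System.out.println(needsMergeJob(RightInput.MAP_ONLY_MISSING_FILES, 0, threshold, false));    // true
    }
}
```

With the flag left at its default of false, case 2 always adds the merge job; setting it to true trades that extra job for the risk of reading many small files in the join.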


