Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2009/01/21 01:49:59 UTC

[jira] Resolved: (PIG-614) reduce io during sharing scans of the same input datasets

     [ https://issues.apache.org/jira/browse/PIG-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-614.
--------------------------------

    Resolution: Duplicate

This issue will be addressed by https://issues.apache.org/jira/browse/PIG-627


> reduce io during sharing scans of the same input datasets 
> ----------------------------------------------------------
>
>                 Key: PIG-614
>                 URL: https://issues.apache.org/jira/browse/PIG-614
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Samuel Guo
>            Priority: Minor
>             Fix For: types_branch
>
>
> To store several results derived from the same input dataset, we currently have to write two or more *STORE* clauses. Each *STORE* clause is translated into a separate MR job, even though these jobs scan the same input datasets and could share those scans.
> For example:
> The dataset 'weather' contains weather records. Each record has three parts: wind, air, and temperature. We need to process each part of the records separately.
> We might write a Pig script like this:
> weather = load 'weather.txt' as (wind, air, tempreture);
> wind_results = ... wind ...;
> air_results = ...air...;
> temp_results = ...tempreture...;
> store wind_results into 'wind.results';
> store air_results into 'air.results';
> store temp_results into 'temp.results';
> Pig currently translates this script into three different MR jobs that run sequentially: scan 'weather.txt', process the wind data, store the wind results; scan 'weather.txt' again, process the air data, store the air results; ...
> If the input dataset is large, this is inefficient.
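
The difference the report describes can be sketched outside Pig. The Python below is only an illustration under stated assumptions, not how Pig or the fix tracked in PIG-627 works: the sample records, the CountingSource class, and both function names are hypothetical, and each elided "..." step from the script is replaced by simply selecting the field. It contrasts three independent scans with one shared scan that feeds all three outputs.

```python
import io

# Hypothetical stand-in for 'weather.txt': one record per line,
# with the fields wind, air, tempreture (spelled as in the script above).
WEATHER = "3,1012,18\n7,1008,21\n5,1010,19\n"


class CountingSource:
    """Wraps the input text and counts how many full scans are started."""

    def __init__(self, text):
        self.text = text
        self.scans = 0

    def scan(self):
        self.scans += 1
        for line in io.StringIO(self.text):
            wind, air, temp = line.strip().split(",")
            yield int(wind), int(air), int(temp)


def separate_jobs(source):
    """One scan per output, like three independent jobs run in sequence."""
    wind = [record[0] for record in source.scan()]
    air = [record[1] for record in source.scan()]
    temp = [record[2] for record in source.scan()]
    return {"wind.results": wind, "air.results": air, "temp.results": temp}


def shared_scan(source):
    """A single scan that routes each record's fields to all three outputs."""
    outputs = {"wind.results": [], "air.results": [], "temp.results": []}
    for wind, air, temp in source.scan():
        outputs["wind.results"].append(wind)
        outputs["air.results"].append(air)
        outputs["temp.results"].append(temp)
    return outputs
```

Both functions return the same three outputs; what differs is the scans counter on the source, which ends at three for separate_jobs and at one for shared_scan. That gap is the I/O saving the report asks for when the input is large.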

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.