Posted to dev@pig.apache.org by "David Ciemiewicz (JIRA)" <ji...@apache.org> on 2009/05/07 18:41:30 UTC

[jira] Created: (PIG-801) Pig needs to handle scalar aliases to programmer and code execution efficiency

Pig needs to handle scalar aliases to programmer and code execution efficiency
------------------------------------------------------------------------------

                 Key: PIG-801
                 URL: https://issues.apache.org/jira/browse/PIG-801
             Project: Pig
          Issue Type: New Feature
            Reporter: David Ciemiewicz


In Pig, it is often the case that the result of an operation is a scalar value that needs to be applied to the next step of processing.

For example:
* FILTER by MAX of group -- See: PIG-772
* Compute proportions by dividing by total (SUM) of grouped alias

Today, Pig programmers must go through distasteful and slow contortions, using FLATTEN or CROSS to propagate the scalar computation to EVERY row of data, which creates needless copies of the data.  Alternatively, the user must write the global sum to a file and then read it back in to gain the efficiency.

If the language were simply extended to have the notion of scalar aliases, then coding would be simplified without contortions for the programmer and, I believe, execution of the code would be faster too.

For instance, to compute global proportions, I want to do the following:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );
AllCountryPopulations = group CountryPopulations all;
Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;
PopulationProportions = foreach CountryPopulations generate
    country, population, (double)population / (double)Total.population as global_proportion;
{code}

One of the very distasteful workarounds for this is to do something like:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );
AllCountryPopulations = group CountryPopulations all;
Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;
CountryPopulationsTotal = cross CountryPopulations, Total;
PopulationProportions = foreach CountryPopulationsTotal generate
    CountryPopulations::country,
    CountryPopulations::population,
    (double)CountryPopulations::population / (double)Total::population as global_proportion;
{code}

This just makes me cringe every time I have to do it.  Constructing new rows of data simply to apply
the same scalar value row after row after row for potentially billions of rows of data just feels horribly wrong
and inefficient both from the coding standpoint and from the execution standpoint.
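
A minimal sketch, in Python rather than Pig, of why the two approaches give the same answer but differ in what they materialize. The sample rows and function names are hypothetical, and this is only a toy model of the semantics, not of how Pig actually executes either plan:

```python
# Toy model (Python, not Pig). Data and names are illustrative assumptions.
sample_rows = [("A", 10), ("B", 30), ("C", 60)]  # (country, population)


def proportions_via_cross(rows):
    """Mimic the CROSS workaround: copy the total onto every row first."""
    total = sum(pop for _, pop in rows)
    # Intermediate relation: every input row widened with the same total.
    crossed = [(country, pop, total) for country, pop in rows]
    return [(country, pop, pop / t) for country, pop, t in crossed]


def proportions_via_scalar(rows):
    """Mimic a scalar alias: compute the total once and reuse it."""
    total = sum(pop for _, pop in rows)
    return [(country, pop, pop / total) for country, pop in rows]


if __name__ == "__main__":
    for row in proportions_via_scalar(sample_rows):
        print(row)
```

Both functions return identical rows; the only difference is the widened intermediate list built by the cross version, which is the per-row copy being complained about here.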

In SQL, I'd just code this as:

{code}
select
     country,
     population,
     population / SUM(population) OVER ()
from
     CountryPopulations;
{code}

In writing a SQL-to-Pig translator, it would seem that this construct or idiom would need to be supported, so why not create a higher level of Pig that supports the notion of scalars efficiently?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714972#action_12714972 ] 

David Ciemiewicz commented on PIG-801:
--------------------------------------

I'm very much beginning to like the idea of introducing some "syntactic sugar" in Pig: a "forall" or "overall" statement that would allow one to write the "high level pig" for this case as:

{code}
Total = forall CountryPopulations generate SUM(CountryPopulations.population) as population;
{code}

or as:

{code}
Total = overall CountryPopulations generate SUM(CountryPopulations.population) as population;
{code}

Yeah, I know I could use the construct:

{code}
Total = foreach (group CountryPopulations all) generate SUM(CountryPopulations.population) as population;
{code}

But I like syntactic sugar.

Then again, it would be really good if Pig just supported the following directly.  Since this would need to be done for SQL, it could be done for Pig as well:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );
PopulationProportions = foreach CountryPopulations generate
    country, population, (double)population / (double)SUM(population) as global_proportion;
{code}








> Pig needs to handle scalar aliases to improve programmer and code execution efficiency
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-801
>                 URL: https://issues.apache.org/jira/browse/PIG-801
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> In Pig, it is often the case that the result of an operation is a scalar value that needs to be applied to the next step of processing.
> For example:
> * FILTER by MAX of group -- See: PIG-772
> * Compute proportions by dividing by total (SUM) of grouped alias
> Today Pig programmers need to go through distasteful and slow contortions of using FLATTEN or CROSS to propagate the scalar computation to EVERY row of data to perform these operations creating needless copies of data.  Or, the user must write the global sum to a file, then read it back in to gain the efficiency.
> If the language were simply extended to have the notion of scalar aliases, then coding would be simplified without contortions for the programmer and, I believe, execution of the code would be faster too.
> For instance, to compute global proportions, I want to do the following:
> {code}
> CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );
> AllCountryPopulations= group CountryPopulations all;
> Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;
> PopulationProportions = foreach CountryPopulations generate
>     country, population, (double)population / (double)Total.population as global_proportion;
> {code}
> One of the very distasteful workarounds for this is to do something like:
> {code}
> CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );
> AllCountryPopulations= group CountryPopulations all;
> Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;
> CountryPopulationsTotal = cross CountryPopulations, Total;
> PopulationProportions = foreach CountryPopulations generate
>     CountryPopulations::country,
>     CountryPopulations::population,
>     (double)CountryPopulations::population / (double)Total::population as global_proportion;
> {code}
> This just makes me cringe every time I have to do it.  Constructing new rows of data simply to apply
> the same scalar value row after row after row for potentially billions of rows of data just feels horribly wrong
> and inefficient both from the coding standpoint and from the execution standpoint.
> In SQL, I'd just code this as:
> {code}
> select
>      country,
>      population,
>      population / SUM(population)
> from
>      CountryPopulations;
> {code}
> In writing a SQL to Pig translator, it would seem that this construct or idiom would need to be supported, so why not create a higher level of Pig which would support the notion of scalars efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-801:
---------------------------------

    Summary: Pig needs to handle scalar aliases to improve programmer and code execution efficiency  (was: Pig needs to handle scalar aliases to programmer and code execution efficiency)


[jira] Updated: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-801:
-------------------------------

         Assignee: Aniket Mokashi
    Fix Version/s: 0.8.0


[jira] Resolved: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-801.
--------------------------------

    Resolution: Fixed

This is a subset of PIG-1434.


[jira] Commented: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794474#action_12794474 ] 

David Ciemiewicz commented on PIG-801:
--------------------------------------

This is really still an issue.

I currently have 5GB (yes GB) of data that I am trying to process and normalize (take count and divide by total) and it is taking 6 HOURS to process.

However, this is in a cogroup because each partition of the data is 5GB.

Currently my code looks like:

{code}
A = load ... as ( partition, count );
TotalGroup = group A by ( partition );
Total = foreach TotalGroup generate group as partition, SUM(A.count) as total;

ATotal = cogroup A by (partition), Total by (partition);
ATotal = foreach ATotal generate FLATTEN(Total.total) as total, FLATTEN(A);

ATotal = foreach ATotal generate partition, count, total, (double)count / total as proportion;
{code}

This is nuts when dealing with 5GB of data.  It won't fit in memory.

This is the ultimate in skewed joins.  I don't see any point in scanning the data to determine its skewness and then processing the data again.

I know this is simply a scalar I want to project on every row of data.

This should be combinable in some way.  How do I make that happen?

Like I said, my preference is to use a simpler syntax than cogroup or join.
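
A rough sketch, in Python rather than Pig, of the behavior being asked for in the per-partition case: build a small table of totals (one entry per partition), then look the total up for each row instead of grouping the rows back together. The function name and sample data are hypothetical, and it assumes every partition has a non-zero total. Pig's fragment-replicate join (`join ... using 'replicated'`) works in a similar spirit when the small relation fits in memory, though whether it applies depends on the Pig version in use:

```python
from collections import defaultdict


def normalize_by_partition(rows):
    """rows: iterable of (partition, count) pairs.

    Returns (partition, count, total, proportion) tuples in input order.
    Assumes each partition's total is non-zero.
    """
    rows = list(rows)

    # Pass 1: one small total per partition (the "scalar" per partition).
    totals = defaultdict(int)
    for partition, count in rows:
        totals[partition] += count

    # Pass 2: stream over the rows and look the total up; no regrouping.
    return [
        (partition, count, totals[partition], count / totals[partition])
        for partition, count in rows
    ]


if __name__ == "__main__":
    for row in normalize_by_partition([("p1", 1), ("p1", 3), ("p2", 5)]):
        print(row)
```

Only the table of totals has to be held in memory, and it has one entry per partition regardless of how many rows each partition contains.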
