Posted to dev@pig.apache.org by "David Ciemiewicz (JIRA)" <ji...@apache.org> on 2009/06/01 06:56:07 UTC
[jira] Created: (PIG-826) DISTINCT as "Function" rather than
statement - High Level Pig
DISTINCT as "Function" rather than statement - High Level Pig
-------------------------------------------------------------
Key: PIG-826
URL: https://issues.apache.org/jira/browse/PIG-826
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz
In SQL, a user would think nothing of doing something like:
{code}
select
COUNT(DISTINCT(user)) as user_count,
COUNT(DISTINCT(country)) as country_count,
COUNT(DISTINCT(url)) as url_count
from
server_logs;
{code}
But in Pig, we'd need to do something like the following, which is about the most
compact version I could come up with.
{code}
Logs = load 'log' using PigStorage()
as ( user: chararray, country: chararray, url: chararray);
DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);
DistinctUsersCount = foreach (group DistinctUsers all) generate
group, COUNT(DistinctUsers) as user_count;
DistinctCountriesCount = foreach (group DistinctCountries all) generate
group, COUNT(DistinctCountries) as country_count;
DistinctUrlCount = foreach (group DistinctUrls all) generate
group, COUNT(DistinctUrls) as url_count;
AllDistinctCounts = cross
DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
Report = foreach AllDistinctCounts generate
DistinctUsersCount::user_count,
DistinctCountriesCount::country_count,
DistinctUrlCount::url_count;
store Report into 'log_report' using PigStorage();
{code}
It would be good if there were a higher-level version of Pig that permitted code to be written as:
{code}
Logs = load 'log' using PigStorage()
as ( user: chararray, country: chararray, url: chararray);
Report = overall Logs generate
COUNT(DISTINCT(user)) as user_count,
COUNT(DISTINCT(country)) as country_count,
COUNT(DISTINCT(url)) as url_count;
store Report into 'log_report' using PigStorage();
{code}
I do want this in Pig and not as SQL. I'd expect High Level Pig to generate Lower Level Pig.
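To make the intended result concrete, here is a minimal Python sketch of what the proposed "overall ... COUNT(DISTINCT(...))" statement would compute. It is only a model of the semantics, not Pig code and not a proposed implementation; the function name distinct_counts and the sample rows are made up for illustration, and null handling is not modeled.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

# (user, country, url), mirroring the Logs schema in the issue
Row = Tuple[str, str, str]


def distinct_counts(rows: Iterable[Row], names: Sequence[str]) -> Dict[str, int]:
    """Count the distinct values of every column in a single pass over the rows."""
    seen: Dict[str, Set[str]] = {name: set() for name in names}
    for row in rows:
        for name, value in zip(names, row):
            seen[name].add(value)
    return {f"{name}_count": len(values) for name, values in seen.items()}


# Hypothetical sample data, not taken from the issue.
logs = [
    ("alice", "US", "/index"),
    ("bob", "US", "/index"),
    ("alice", "DE", "/about"),
]

report = distinct_counts(logs, ("user", "country", "url"))
print(report)  # {'user_count': 2, 'country_count': 2, 'url_count': 2}
```

The point of the sketch is that all three counts come out of one logical pass over Logs, which is what the three separate distinct/group/cross steps above are emulating.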
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather
than statement/operator - High Level Pig
Posted by "Mridul Muralidharan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715745#action_12715745 ]
Mridul Muralidharan commented on PIG-826:
-----------------------------------------
This would be a welcome change!
Another use case this would enable (which, IMO, can't be done easily now) is using DISTINCT in a filter.
Like:
B = FILTER A by COUNT(DISTINCT($1)) > 1;
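A rough Python model of what that filter would mean, assuming A is the result of an earlier GROUP so that $1 is a bag of values per key (the comment does not say this, so treat it as an assumption). The names filter_multi_valued and the sample groups are hypothetical.

```python
from typing import Dict, List


def filter_multi_valued(groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the groups whose bag holds more than one distinct value."""
    return {
        key: values
        for key, values in groups.items()
        if len(set(values)) > 1
    }


# Hypothetical grouped relation: key -> bag of values ($1 in the Pig snippet).
a = {
    "alice": ["US", "US"],
    "bob": ["US", "DE"],
}

b = filter_multi_valued(a)
print(b)  # {'bob': ['US', 'DE']}
```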
[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather
than statement/operator - High Level Pig
Posted by "Amr Awadallah (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715680#action_12715680 ]
Amr Awadallah commented on PIG-826:
-----------------------------------
neat.
[jira] Updated: (PIG-826) DISTINCT as "Function/Operator" rather
than statement/operator - High Level Pig
Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Ciemiewicz updated PIG-826:
---------------------------------
Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig (was: DISTINCT as "Function" rather than statement - High Level Pig)
[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather
than statement/operator - High Level Pig
Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715726#action_12715726 ]
David Ciemiewicz commented on PIG-826:
--------------------------------------
Alan, thanks! But what if I want to do the following:
{code}
foreach Grouped {
dcountryurl = distinct Logs.(country,url);
generate COUNT(dcountryurl);
};
{code}
Projecting multiple aliases doesn't seem to work. I also tried the following and it doesn't work either.
{code}
foreach Grouped {
dcountryurl = distinct Logs.country, Logs.url;
generate COUNT(dcountryurl);
};
{code}
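For clarity, the result being asked for is the number of distinct (country, url) pairs, not two separate per-column counts. A small Python sketch of that semantics, with a made-up function name and sample rows; it says nothing about how Pig would or should implement it.

```python
from typing import Iterable, Tuple

# (user, country, url), mirroring the Logs schema in the issue
Row = Tuple[str, str, str]


def count_distinct_pairs(rows: Iterable[Row]) -> int:
    """Count distinct (country, url) combinations across all rows."""
    return len({(country, url) for _user, country, url in rows})


# Hypothetical sample data, not taken from the issue.
logs = [
    ("alice", "US", "/index"),
    ("bob", "US", "/index"),
    ("alice", "DE", "/about"),
]

print(count_distinct_pairs(logs))  # 2
```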
[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather
than statement/operator - High Level Pig
Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715639#action_12715639 ]
Alan Gates commented on PIG-826:
--------------------------------
It can be done like this:
{code}
Logs = load 'log' using PigStorage()
as ( user: chararray, country: chararray, url: chararray);
Grouped = group Logs all;
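-- Assign the result to an alias (Report, as in the issue) so the statement is complete.
Report = foreach Grouped {
duser = distinct Logs.user;
dcountry = distinct Logs.country;
durl = distinct Logs.url;
generate COUNT(duser) as user_count, COUNT(dcountry) as country_count, COUNT(durl) as url_count;
};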
{code}