Posted to dev@pig.apache.org by "Dhruv M (JIRA)" <ji...@apache.org> on 2009/03/18 12:01:50 UTC
[jira] Created: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.
Pig generates incorrect schema for generated bags after FOREACH.
----------------------------------------------------------------
Key: PIG-723
URL: https://issues.apache.org/jira/browse/PIG-723
Project: Pig
Issue Type: Bug
Affects Versions: 0.1.0
Environment: Linux
$pig --version
Apache Pig version 0.1.0-dev (r750430)
compiled Mar 07 2009, 09:20:13
Reporter: Dhruv M
Priority: Critical
grunt> rf_src = LOAD 'rf_test.txt' USING PigStorage(',') AS (lhs:chararray, rhs:chararray, r:float, p:float, c:float);
grunt> rf_grouped = GROUP rf_src BY rhs;
grunt> lhs_grouped = FOREACH rf_grouped GENERATE group as rhs, rf_src.(lhs, r) as lhs, MAX(rf_src.p) as p, MAX(rf_src.c) AS c;
grunt> describe lhs_grouped;
lhs_grouped: {rhs: chararray,lhs: {lhs: chararray,r: float},p: float,c: float}
I think it should be:
lhs_grouped: {rhs: chararray,lhs: {(lhs: chararray,r: float)},p: float,c: float}
Because of this, we cannot perform a UNION on the two relations: a union over incompatible schemas loses all schema information, which makes further processing impossible.
This is what we want to UNION with:
grunt> asrc = LOAD 'atest.txt' USING PigStorage(',') AS (rhs:chararray, a:int);
grunt> aa = FOREACH asrc GENERATE rhs, (bag{tuple(chararray,float)}) null as lhs, -10F as p, -10F as c;
grunt> describe aa;
aa: {rhs: chararray,lhs: {(chararray,float)},p: float,c: float}
If there is something wrong with what I am trying to do, please let me know.
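For reference, the UNION this report is working toward would look roughly like the sketch below. It reuses the relation names from the session above; the alias `combined` is hypothetical and does not appear in the original report. No DESCRIBE output is shown, since it depends on the Pig version in use.

```
-- Sketch only: 'combined' is an assumed alias, not part of the original report.
combined = UNION lhs_grouped, aa;
-- Per the report, the mismatch in the bag schema of 'lhs' leaves the result without usable schema information.
DESCRIBE combined;
```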
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Created: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Hi,
I think the schema generated is right.
It might be a problem with declaring complex types (bags only?) in terms
of scalars in the script, which you may be hitting (IIRC there is already
a bug filed for it).
From what I recall, I have also faced issues with declarations like "foreach
A generate $0 as f1, {(1.0, 'str')} as f2;" - Pig does not seem to be able
to handle them currently.
If this is the issue, the rest of this mail might help you.
The workaround we use is a dummy UDF that returns this value.
An example snippet that we use:
--- A custom UDF which generates a bag with tuples (0, '')
define GENERATE_EMPTY_BAG1 myudf.GenerateEmptyBag('long, chararray');
--- A custom UDF which generates a bag with tuples (0, '', 0.0f)
define GENERATE_EMPTY_BAG2 myudf.GenerateEmptyBag('long, chararray, float');
grp_op = COGROUP inp1 by eid, inp2 by eid6;
res = FOREACH grp_op { GENERATE
    FLATTEN((COUNT(inp1) != 0 ? inp1 : GENERATE_EMPTY_BAG1(0L))) AS (eid:long, query:chararray),
    FLATTEN((COUNT(inp2) != 0 ? inp2 : GENERATE_EMPTY_BAG2(0L))) AS (eid6:long, candidate_url6:chararray, rank:float); };
---
The idea above is to have some default values when the bags are empty
(that is, no data for a particular eid/eid6).
Note the exact syntax in the foreach (it defends against parser bugs in Pig)
and the use of the UDF to generate the bag.
Regards,
Mridul
[jira] Commented: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.
Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683098#action_12683098 ]
Santhosh Srinivasan commented on PIG-723:
-----------------------------------------
This is a duplicate of PIG-694.
[jira] Updated: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-723:
-------------------------------
Fix Version/s: 0.9.0
[jira] Assigned: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.
Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates reassigned PIG-723:
------------------------------
Assignee: Alan Gates
[jira] Updated: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-723:
-------------------------------
Description: (the text of the original report; this notification repeats it verbatim as both the new and the previous value)
Priority: Major (was: Critical)
Not sure why this issue was marked as critical.