Posted to dev@datafu.apache.org by Russell Jurney <ru...@gmail.com> on 2014/09/12 02:20:04 UTC

Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

Review request for DataFu.


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray);
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE
    group_fields::groupField AS groupField,
    hour_rounded::sourceNameOrIp AS sourceNameOrIp,
    hour_rounded::destinationNameOrIp AS destinationNameOrIp,
    ...;
with_value_substitution = FOREACH with_group GENERATE
    ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray),
    *;
with_value_substitution = FOREACH with_value_substitution GENERATE
    FLATTEN(groupValue) AS groupValue:chararray,
    groupField,
    foo,
    bar,
    ...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
    FLATTEN(group) AS (seriesType, groupValue, day),
    (int)COUNT_STAR(with_value_substitution) AS connections:int;
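
For readers who want the lookup itself spelled out, the following is a minimal plain-Java sketch of the idea behind the UDF: one field holds the name of another field, and the value of that other field is returned. It is not the code from the patch. The class FieldLookupSketch, the method selectByName, and the sample row values are all hypothetical, and the real UDF works on Pig tuples and schemas rather than Java lists.

```java
import java.util.Arrays;
import java.util.List;

public class FieldLookupSketch {
    /**
     * Returns the value of the field whose name equals fieldName,
     * or null when no field has that name.
     * names and values are parallel lists (hypothetical stand-ins for a Pig schema and tuple).
     */
    public static Object selectByName(String fieldName, List<String> names, List<Object> values) {
        if (names.size() != values.size()) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        int index = names.indexOf(fieldName);
        return index < 0 ? null : values.get(index);
    }

    public static void main(String[] args) {
        // Assumed sample row: groupField says which of the other fields to pick.
        List<String> names = Arrays.asList("groupField", "sourceNameOrIp", "destinationNameOrIp");
        List<Object> row = Arrays.<Object>asList("sourceNameOrIp", "10.0.0.1", "10.0.0.2");

        // The first field's value is used as the name of the field to return.
        Object selected = selectByName((String) row.get(0), names, row);
        System.out.println(selected);
    }
}
```

Returning null for an unknown name is an assumption made for this sketch; the thread does not say how the patch handles that case.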


Diffs
-----

  datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/ChooseFieldByValueTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF replaced a very inefficient Pig script in which macros that did many individual GROUP BYs took many minutes to plan.

Testing: unit tests, plus runs on real data on a cluster.


Thanks,

Russell Jurney


Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.

> On Nov. 3, 2014, 10:08 p.m., Matthew Hayes wrote:
> >

Awesome, thanks!


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review59650
-----------------------------------------------------------




Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review59650
-----------------------------------------------------------

Ship it!


- Matthew Hayes




Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

(Updated Oct. 30, 2014, 9:13 p.m.)


Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.


Changes
-------

Greatly simplified the implementation and tests; the UDF now assumes string input and returns string output.


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray);
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE
    group_fields::groupField AS groupField,
    hour_rounded::sourceNameOrIp AS sourceNameOrIp,
    hour_rounded::destinationNameOrIp AS destinationNameOrIp,
    ...;
with_value_substitution = FOREACH with_group GENERATE
    ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray),
    *;
with_value_substitution = FOREACH with_value_substitution GENERATE
    FLATTEN(groupValue) AS groupValue:chararray,
    groupField,
    foo,
    bar,
    ...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
    FLATTEN(group) AS (seriesType, groupValue, day),
    (int)COUNT_STAR(with_value_substitution) AS connections:int;


Diffs (updated)
-----

  datafu-pig/src/main/java/datafu/pig/util/SelectStringFieldByName.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/SelectStringFieldByNameTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF replaced a very inefficient Pig script in which macros that did many individual GROUP BYs took many minutes to plan.

Testing: unit tests, plus runs on real data on a cluster.


Thanks,

Russell Jurney


Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

(Updated Oct. 28, 2014, 7:28 p.m.)


Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.


Changes
-------

Updated the patch with the new name, SelectStringFieldByName.


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray);
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE
    group_fields::groupField AS groupField,
    hour_rounded::sourceNameOrIp AS sourceNameOrIp,
    hour_rounded::destinationNameOrIp AS destinationNameOrIp,
    ...;
with_value_substitution = FOREACH with_group GENERATE
    ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray),
    *;
with_value_substitution = FOREACH with_value_substitution GENERATE
    FLATTEN(groupValue) AS groupValue:chararray,
    groupField,
    foo,
    bar,
    ...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
    FLATTEN(group) AS (seriesType, groupValue, day),
    (int)COUNT_STAR(with_value_substitution) AS connections:int;


Diffs (updated)
-----

  datafu-pig/src/main/java/datafu/pig/util/SelectStringFieldByName.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/SelectStringFieldByNameTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF replaced a very inefficient Pig script in which macros that did many individual GROUP BYs took many minutes to plan.

Testing: unit tests, plus runs on real data on a cluster.


Thanks,

Russell Jurney


Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

(Updated Oct. 2, 2014, 4:19 p.m.)


Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.


Summary (updated)
-----------------

DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray);
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE
    group_fields::groupField AS groupField,
    hour_rounded::sourceNameOrIp AS sourceNameOrIp,
    hour_rounded::destinationNameOrIp AS destinationNameOrIp,
    ...;
with_value_substitution = FOREACH with_group GENERATE
    ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray),
    *;
with_value_substitution = FOREACH with_value_substitution GENERATE
    FLATTEN(groupValue) AS groupValue:chararray,
    groupField,
    foo,
    bar,
    ...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
    FLATTEN(group) AS (seriesType, groupValue, day),
    (int)COUNT_STAR(with_value_substitution) AS connections:int;


Diffs
-----

  datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF replaced a very inefficient Pig script in which macros that did many individual GROUP BYs took many minutes to plan.

Testing: unit tests, plus runs on real data on a cluster.


Thanks,

Russell Jurney


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java, line 49
> > <https://reviews.apache.org/r/25564/diff/2/?file=707974#file707974line49>
> >
> >     Hmm, something just occurred to me.  This does not currently provide the output schema.  So this is one problem.  But, how do we determine the output schema?  If the output value is decided dynamically, then it can vary.  One way to address this is to require that all the other values of the tuple are of the same type.  Then you just take the schema from the first value.  In your example they are all chararray.  But this does limit the uses of this UDF.
> 
> Russell Jurney wrote:
>     In practice, this is not an issue. The UDF is used this way, and you can cast it to what you want.
>     
>     with_value_substitution = FOREACH with_group GENERATE 
>         FLATTEN(ChooseFieldByValue(groupField, *)) AS groupValue:chararray,
>         *, 
>         (int)$period AS periodSeconds:int;
>     
>     However, I don't see why I can't detect the schema of the field selected and return that?

The schema can't be dynamic like that.  I'll have to think about this some more.  I don't like that we have to cast it like this.  One way we can make this better is to have the UDF pick the schema that best fits the types provided.  For example, if all the fields are of the same type, like chararray, then the resulting type is chararray.  Otherwise make the type bytearray and you can cast however you want.  I'd like to hear what other people think about this.  How about emailing datafu dev?
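
The best-fit rule described above can be illustrated with a small plain-Java sketch. This is only an assumption-laden illustration and not code from the patch: the FieldType enum is a hypothetical stand-in for Pig's data types, bestFit is an invented name, and treating an empty list as bytearray is a choice made here, not something stated in the thread.

```java
import java.util.Arrays;
import java.util.List;

public class BestFitSchemaSketch {
    /** Hypothetical stand-in for Pig's data types; not the real Pig API. */
    public enum FieldType { CHARARRAY, INT, LONG, DOUBLE, BYTEARRAY }

    /**
     * If every candidate field has the same type, that type is the output type.
     * Otherwise fall back to BYTEARRAY so the caller can cast as needed.
     */
    public static FieldType bestFit(List<FieldType> candidateTypes) {
        if (candidateTypes.isEmpty()) {
            return FieldType.BYTEARRAY; // assumption: nothing to infer from
        }
        FieldType first = candidateTypes.get(0);
        for (FieldType type : candidateTypes) {
            if (type != first) {
                return FieldType.BYTEARRAY; // mixed types: no single best fit
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // All fields share one type, so that type is chosen.
        System.out.println(bestFit(Arrays.asList(FieldType.CHARARRAY, FieldType.CHARARRAY)));
        // Mixed types fall back to BYTEARRAY.
        System.out.println(bestFit(Arrays.asList(FieldType.CHARARRAY, FieldType.INT)));
    }
}
```

The point of the rule is that the output type is fixed when the script is planned, from the declared input types, rather than depending on which field happens to be selected for a given row.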


- Matthew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------


On Sept. 29, 2014, 12:20 a.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Sept. 29, 2014, 12:20 a.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray);
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE
>     group_fields::groupField AS groupField,
>     hour_rounded::sourceNameOrIp AS sourceNameOrIp,
>     hour_rounded::destinationNameOrIp AS destinationNameOrIp,
>     ...;
> with_value_substitution = FOREACH with_group GENERATE
>     ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray),
>     *;
> with_value_substitution = FOREACH with_value_substitution GENERATE
>     FLATTEN(groupValue) AS groupValue:chararray,
>     groupField,
>     foo,
>     bar,
>     ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
>     FLATTEN(group) AS (seriesType, groupValue, day),
>     (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF replaced a very inefficient Pig script in which macros that did many individual GROUP BYs took many minutes to plan.
> 
> Testing: unit tests, plus runs on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>


Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> >
> 
> Russell Jurney wrote:
>     Actually, I think renaming this to SelectStringFieldByName might make more sense. The way this UDF is used, you are looking for a tag that identifies a group. These are almost always chararrays. This doesn't limit the usefulness of the UDF at all.
> 
> Matthew Hayes wrote:
>     I like the new name ;)  I still think we need to have a schema for the output.  For example, you could set the output schema to consist of a single chararray value and require that all inputs be chararrays.  This should be simple to do for now.  If we find that this is needed for other non-chararray cases we could relax the requirements.
> 
> Russell Jurney wrote:
>     Yeah, I agree. I'll have it return a chararray inside a tuple, in addition to changing the name.

Okay, sounds good. A chararray in a tuple as the schema seems fine.  Also validate in getSchema that the inputs are chararrays as well.
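
The agreed behaviour (a fixed output schema of one chararray inside a tuple, with the input types checked up front) might look roughly like the plain-Java sketch below. It does not use the Pig API and is not the code from the patch: the FieldType enum, the outputSchema method name, and the schema string it returns are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ChararrayOnlySchemaSketch {
    /** Hypothetical stand-in for Pig's data types; not the real Pig API. */
    public enum FieldType { CHARARRAY, INT, LONG, DOUBLE, BYTEARRAY }

    /**
     * Rejects any input that is not a chararray, then returns the fixed
     * output schema, written here as a plain string for illustration.
     */
    public static String outputSchema(List<FieldType> inputTypes) {
        for (int i = 0; i < inputTypes.size(); i++) {
            if (inputTypes.get(i) != FieldType.CHARARRAY) {
                throw new IllegalArgumentException(
                    "Expected chararray at position " + i + " but found " + inputTypes.get(i));
            }
        }
        return "tuple(value:chararray)";
    }

    public static void main(String[] args) {
        // All inputs are chararrays, so the fixed schema is returned.
        System.out.println(outputSchema(Arrays.asList(FieldType.CHARARRAY, FieldType.CHARARRAY)));

        // A non-chararray input is rejected before any data is processed.
        try {
            outputSchema(Arrays.asList(FieldType.CHARARRAY, FieldType.INT));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing during schema resolution, instead of at run time, means a script with a non-chararray input would be rejected before any job is launched.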


- Matthew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------




Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> >
> 
> Russell Jurney wrote:
>     Actually, I think renaming this to SelectStringFieldByName might make more sense. The way this UDF is used, you are looking for a tag that identifies a group. These are almost always chararrays. This doesn't limit the usefulness of the UDF at all.

I like the new name ;)  I still think we need to have a schema for the output.  For example, you could set the output schema to consist of a single chararray value and require that all inputs be chararrays.  This should be simple to do for now.  If we find that this is needed for other non-chararray cases we could relax the requirements.


- Matthew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------




Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> >

Actually, I think renaming this to SelectStringFieldByName might make more sense. The way this UDF is used, you are looking for a tag that identifies a group. These are almost always chararrays. This doesn't limit the usefulness of the UDF at all.


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------




Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java, line 49
> > <https://reviews.apache.org/r/25564/diff/2/?file=707974#file707974line49>
> >
> >     Hmm, something just occurred to me.  This does not currently provide the output schema.  So this is one problem.  But, how do we determine the output schema?  If the output value is decided dynamically, then it can vary.  One way to address this is to require that all the other values of the tuple are of the same type.  Then you just take the schema from the first value.  In your example they are all chararray.  But this does limit the uses of this UDF.

In practice, this is not an issue. The UDF is used this way, and you can cast it to what you want.

with_value_substitution = FOREACH with_group GENERATE 
    FLATTEN(ChooseFieldByValue(groupField, *)) AS groupValue:chararray,
    *, 
    (int)$period AS periodSeconds:int;

However, I don't see why I can't detect the schema of the field selected and return that?


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------




Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field whose value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java, line 49
> > <https://reviews.apache.org/r/25564/diff/2/?file=707974#file707974line49>
> >
> >     Hmm, something just occurred to me.  This does not currently provide the output schema.  So this is one problem.  But, how do we determine the output schema?  If the output value is decided dynamically, then it can vary.  One way to address this is to require that all the other values of the tuple are of the same type.  Then you just take the schema from the first value.  In your example they are all chararray.  But this does limit the uses of this UDF.
> 
> Russell Jurney wrote:
>     In practice, this is not an issue. The UDF is used this way, and you can cast it to what you want.
>     
>     with_value_substitution = FOREACH with_group GENERATE 
>         FLATTEN(ChooseFieldByValue(groupField, *)) AS groupValue:chararray,
>         *, 
>         (int)$period AS periodSeconds:int;
>     
>     However, I don't see why I can't detect the schema of the field selected and return that?
> 
> Matthew Hayes wrote:
>     The schema can't be dynamic like that.  I'll have to think about this some more.  I don't like that we have to cast it like this.  One way we can make this better is to have the UDF pick the schema that best fits the types provided.  For example, if all the fields are of the same type, like chararray, then the resulting type is chararray.  Otherwise make the type bytearray and you can cast however you want.  I'd like to hear what other people think about this.  How about emailing datafu dev?

I will bring it up on the list, but I don't think returning a tuple is weird at all. It is highly convenient, and 'just works.'


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------




Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java, line 49
> > <https://reviews.apache.org/r/25564/diff/2/?file=707974#file707974line49>
> >
> >     Hmm, something just occurred to me.  This does not currently provide the output schema.  So this is one problem.  But, how do we determine the output schema?  If the output value is decided dynamically, then it can vary.  One way to address this is to require that all the other values of the tuple are of the same type.  Then you just take the schema from the first value.  In your example they are all chararray.  But this does limit the uses of this UDF.
> 
> Russell Jurney wrote:
>     In practice, this is not an issue. The UDF is used this way, and you can cast it to what you want.
>     
>     with_value_substitution = FOREACH with_group GENERATE 
>         FLATTEN(ChooseFieldByValue(groupField, *)) AS groupValue:chararray,
>         *, 
>         (int)$period AS periodSeconds:int;
>     
>     However, I don't see why I can't detect the schema of the field selected and return that?
> 
> Matthew Hayes wrote:
>     The schema can't be dynamic like that.  I'll have to think about this some more.  I don't like that we have to cast it like this.  One way we can make this better is to have the UDF pick the schema that is best fit for the types provided.  For example, if all the fields are of the same type, like chararray, then the resulting type is chararray.  Otherwise make the type bytearray and you can cast however you want.  I'd like to hear what other people think about this.  How about emailing datafu dev?
> 
> Russell Jurney wrote:
>     I will bring it up on the list, but I don't think returning a tuple is weird at all. It is highly convenient, and 'just works.'

I'm not saying that returning a tuple is weird.  What is weird to me is not defining the schema of the tuple being returned by the UDF.
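
For illustration, the "best fit" rule proposed in the quoted comment above could be sketched roughly as follows. This is plain Java with no Pig dependency; the class name, the enum, and the method are hypothetical labels and not part of the patch under review. In a real Pig EvalFunc, a decision like this would presumably belong in outputSchema(Schema input), which is evaluated at plan time.

```java
import java.util.List;

// Hypothetical sketch of the "best fit" output type rule; not the patch itself.
final class BestFitTypeSketch {
    // Stand-in labels for a few Pig types, used only for this illustration.
    enum FieldType { CHARARRAY, INT, LONG, DOUBLE, BYTEARRAY }

    /**
     * candidateTypes holds the types of every input field except the selector.
     * If they all agree, that type is the output type; otherwise fall back to
     * BYTEARRAY so the caller can cast the result as needed.
     */
    static FieldType bestFit(List<FieldType> candidateTypes) {
        if (candidateTypes.isEmpty()) {
            throw new IllegalArgumentException("At least one candidate field is required");
        }
        FieldType first = candidateTypes.get(0);
        for (FieldType type : candidateTypes) {
            if (type != first) {
                return FieldType.BYTEARRAY;
            }
        }
        return first;
    }
}
```

With this rule the example in the review description (all chararray fields) would need no cast, while mixed inputs would still require one.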


- Matthew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------


On Oct. 2, 2014, 4:19 p.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Oct. 2, 2014, 4:19 p.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>


Re: Review Request 25564: DATAFU-69: Create SelectFieldByName UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.

> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> >
> 
> Russell Jurney wrote:
>     Actually, I think renaming this to SelectStringFieldByName might make more sense. The way this UDF is used, you are looking for a tag that identifies a group. These are almost always chararrays. This doesn't limit the usefulness of the UDF at all.
> 
> Matthew Hayes wrote:
>     I like the new name ;)  I still think we need to have a schema for the output.  For example, you could set the output schema to consist of a single chararray value and require that all inputs be chararrays.  This should be simple to do for now.  If we find that this is needed for other non-chararray cases we could relax the requirements.

Yeah, I agree. I'll have it return a chararray inside a tuple, in addition to changing the name.
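
As a rough sketch of what that agreement implies (assuming field types are available as Pig type names at plan time), the check could look like the following. The class, method, and map-based input are hypothetical and only illustrate the rule; the real UDF would work against Pig's Schema objects instead.

```java
import java.util.Map;

// Hypothetical sketch of the chararray-only rule; not the patch itself.
final class ChararrayOnlySketch {
    // Fixed output schema discussed in the thread, in Pig's schema notation.
    static final String OUTPUT_SCHEMA = "(value:chararray)";

    /**
     * inputFieldTypes maps each input field name to its Pig type name.
     * Every field must be a chararray; the first one that is not is
     * reported by name so the script author knows what to fix.
     */
    static String validateAndDescribeOutput(Map<String, String> inputFieldTypes) {
        for (Map.Entry<String, String> field : inputFieldTypes.entrySet()) {
            if (!"chararray".equals(field.getValue())) {
                throw new IllegalArgumentException(
                    "Field '" + field.getKey() + "' has type " + field.getValue()
                    + "; only chararray inputs are supported");
            }
        }
        return OUTPUT_SCHEMA;
    }
}
```

If non-chararray cases turn out to be needed later, only this check would have to be relaxed.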


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------


On Oct. 2, 2014, 4:19 p.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Oct. 2, 2014, 4:19 p.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------



datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java
<https://reviews.apache.org/r/25564/#comment95058>

    Hmm, something just occurred to me.  This does not currently provide the output schema.  So this is one problem.  But, how do we determine the output schema?  If the output value is decided dynamically, then it can vary.  One way to address this is to require that all the other values of the tuple are of the same type.  Then you just take the schema from the first value.  In your example they are all chararray.  But this does limit the uses of this UDF.


- Matthew Hayes


On Sept. 29, 2014, 12:20 a.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Sept. 29, 2014, 12:20 a.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

(Updated Sept. 29, 2014, 12:20 a.m.)


Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.


Changes
-------

Updated to new patch.


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
hour_rounded::sourceNameOrIp AS sourceNameOrIp,
hour_rounded::destinationNameOrIp AS destinationNameOrIp,
...;
with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
with_value_substitution = FOREACH with_value_substitution GENERATE 
FLATTEN(groupValue) AS groupValue:chararray,
groupField,
foo,
bar,
...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
FLATTEN(group) AS (seriesType, groupValue, day),
(int)COUNT_STAR(with_value_substitution) AS connections:int;


Diffs (updated)
-----

  datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.

Testing: unit tests and used on real data on a cluster.


Thanks,

Russell Jurney


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54787
-----------------------------------------------------------

Ship it!

- Russell Jurney


On Sept. 15, 2014, 6:58 p.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Sept. 15, 2014, 6:58 p.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/ChooseFieldByValueTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54778
-----------------------------------------------------------



datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java
<https://reviews.apache.org/r/25564/#comment95054>

    Something like this seems more accurate and concise:
    
    Selects the value for a field within a tuple using that field's name.



datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java
<https://reviews.apache.org/r/25564/#comment95055>

    I'm not sure if I like the name ChooseFieldByValue .  What about SelectFieldByName?



datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java
<https://reviews.apache.org/r/25564/#comment95052>

    remove this comment



datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java
<https://reviews.apache.org/r/25564/#comment95053>

    Include a message in the exception.  Also, something like IllegalArgumentException is probably more appropriate.



datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java
<https://reviews.apache.org/r/25564/#comment95056>

    The loop should start at i=1, since it doesn't make sense for the selector field to select itself.
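
To make the last two comments concrete, here is a minimal sketch of a lookup that skips index 0 and fails with a descriptive IllegalArgumentException. It is plain Java over parallel lists rather than Pig tuples and schemas, and every name in it is hypothetical; it is not the code in the diff.

```java
import java.util.List;

// Hypothetical stand-in for the UDF's lookup logic; not the code under review.
final class FieldLookupSketch {
    /**
     * names.get(0) and values.get(0) describe the selector field, whose value
     * is the name of the field to return. The search starts at index 1 so the
     * selector can never match itself.
     */
    static Object selectByName(List<String> names, List<Object> values) {
        if (names.isEmpty() || names.size() != values.size()) {
            throw new IllegalArgumentException(
                "Expected a selector field and one name per value, got "
                + names.size() + " names and " + values.size() + " values");
        }
        Object wanted = values.get(0);
        for (int i = 1; i < names.size(); i++) {
            if (names.get(i).equals(wanted)) {
                return values.get(i);
            }
        }
        throw new IllegalArgumentException(
            "No field named '" + wanted + "' among " + names.subList(1, names.size()));
    }
}
```

Starting at 1 means a selector whose value happens to be its own field name is reported as an error instead of silently returning that name.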


Sorry it took a while for me to take a look at this.

- Matthew Hayes


On Sept. 15, 2014, 6:58 p.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
> 
> (Updated Sept. 15, 2014, 6:58 p.m.)
> 
> 
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Example use:
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
> hour_rounded::sourceNameOrIp AS sourceNameOrIp,
> hour_rounded::destinationNameOrIp AS destinationNameOrIp,
> ...;
> with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE 
> FLATTEN(groupValue) AS groupValue:chararray,
> groupField,
> foo,
> bar,
> ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
> FLATTEN(group) AS (seriesType, groupValue, day),
> (int)COUNT_STAR(with_value_substitution) AS connections:int;
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/ChooseFieldByValueTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25564/diff/
> 
> 
> Testing
> -------
> 
> This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.
> 
> Testing: unit tests and used on real data on a cluster.
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

(Updated Sept. 15, 2014, 6:58 p.m.)


Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and Sam Shah.


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
hour_rounded::sourceNameOrIp AS sourceNameOrIp,
hour_rounded::destinationNameOrIp AS destinationNameOrIp,
...;
with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
with_value_substitution = FOREACH with_value_substitution GENERATE 
FLATTEN(groupValue) AS groupValue:chararray,
groupField,
foo,
bar,
...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
FLATTEN(group) AS (seriesType, groupValue, day),
(int)COUNT_STAR(with_value_substitution) AS connections:int;


Diffs
-----

  datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/ChooseFieldByValueTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.

Testing: unit tests and used on real data on a cluster.


Thanks,

Russell Jurney


Re: Review Request 25564: DATAFU-69: Create ChooseFieldByValue UDF - which, given a field who's value contains a field name, and *, returns the value of the field referenced by the field name

Posted by Russell Jurney <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/
-----------------------------------------------------------

(Updated Sept. 14, 2014, 6:55 p.m.)


Review request for DataFu, Jakob Homan, Matthew Hayes, and Sam Shah.


Changes
-------

Added individual people to review request.


Repository: datafu


Description
-------

Example use:
group_fields = LOAD '/e8/smalldata/group_fields.txt' AS (groupField:chararray); 
with_group = CROSS group_fields, hour_rounded;
with_group = FOREACH with_group GENERATE group_fields::groupField AS groupField, 
hour_rounded::sourceNameOrIp AS sourceNameOrIp,
hour_rounded::destinationNameOrIp AS destinationNameOrIp,
...;
with_value_substitution = FOREACH with_group GENERATE ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
with_value_substitution = FOREACH with_value_substitution GENERATE 
FLATTEN(groupValue) AS groupValue:chararray,
groupField,
foo,
bar,
...;
all_success = FOREACH (GROUP with_value_substitution BY (groupField, groupValue, day)) GENERATE
FLATTEN(group) AS (seriesType, groupValue, day),
(int)COUNT_STAR(with_value_substitution) AS connections:int;


Diffs
-----

  datafu-pig/src/main/java/datafu/pig/util/ChooseFieldByValue.java PRE-CREATION 
  datafu-pig/src/test/java/datafu/test/pig/util/ChooseFieldByValueTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/25564/diff/


Testing
-------

This UDF was used to replace a very inefficient pig script where macros that did many individual GROUP BY's took many minutes to plan.

Testing: unit tests and used on real data on a cluster.


Thanks,

Russell Jurney