Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/01/31 02:32:59 UTC

[jira] Created: (PIG-653) Make fieldsToRead work in loader

Make fieldsToRead work in loader
--------------------------------

                 Key: PIG-653
                 URL: https://issues.apache.org/jira/browse/PIG-653
             Project: Pig
          Issue Type: New Feature
            Reporter: Alan Gates
            Assignee: Alan Gates


Currently Pig does not call the fieldsToRead function in LoadFunc, so load functions receive no information about which fields are needed.  We need to implement a visitor that determines (where possible) which fields in a file will be used and relays that information to the load function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785937#action_12785937 ] 

Yan Zhou commented on PIG-653:
------------------------------

A typo in my last comment: it should have read 27 release audit *warnings*, not *failures*.

> Make fieldsToRead work in loader
> --------------------------------
>
>                 Key: PIG-653
>                 URL: https://issues.apache.org/jira/browse/PIG-653
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>         Attachments: PIG-653-2.comment, PIG-653-3-proposal.txt, PIG-653.patch
>
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it does not provide information to load functions on what fields are needed.  We need to implement a visitor that determines (where possible) which fields in a file will be used and relays that information to the load function.



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671355#action_12671355 ] 

Pradeep Kamath commented on PIG-653:
------------------------------------

Interface for passing required fields information to the loader
Proposal
Two new classes will be introduced in the API call to the loader for passing information about required fields.
{code}
import java.util.List;

class RequiredField {
    // Name of the field; null if not supplied.
    String alias;
    // 0-based index (position) of the required field; -1 if not supplied.
    int index;
    // Sub fields required within this field (for example, a list of hash keys).
    // Null if the entire field is required. In the initial implementation only
    // one level of subfields will be populated.
    List<RequiredField> subFields;
    // Type of this field: any current Pig DataType (as specified by the constants
    // in the DataType class). A new type BAG_OF_MAP will be added to represent a
    // bag of maps field.
    byte type;

    // Constructor and getters and setters follow
    // getters are getAlias(), getIndex(), getSubFields(), getType()
    // setters are setAlias(), setIndex(), setSubFields(), setType()
}
{code}

NOTE: Both alias and index could be set. The index is the position of the field as Pig would see it if the loader returned all fields.

For performance, when a single key in a map is requested, the loader should ideally return a map with just that key. Likewise, when the required field is a key in a bag-of-maps field, the loader is expected to return a bag of maps in which the maps contain that key (preferably only that key, since this reduces the data handed back by the loader).
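
The pruning described above could look roughly like the following sketch. MapPruner and its method names are hypothetical, and a bag is modelled here as a plain List of maps rather than Pig's own bag type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Pig.
class MapPruner {
    // Returns a copy of 'map' holding only the requested keys that are present.
    static Map<String, Object> prune(Map<String, Object> map, List<String> requestedKeys) {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (String key : requestedKeys) {
            if (map.containsKey(key)) {
                pruned.put(key, map.get(key));
            }
        }
        return pruned;
    }

    // Applies the same pruning to every map in a bag (modelled as a List of maps).
    static List<Map<String, Object>> pruneBag(List<Map<String, Object>> bag,
                                              List<String> requestedKeys) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> m : bag) {
            result.add(prune(m, requestedKeys));
        }
        return result;
    }
}
```

A loader that cannot prune cheaply could still return the full map; the pruning is a performance preference, not a correctness requirement.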

{code}
class RequiredFieldResponse {
	boolean requiredFieldRequestHonored; // true if the loader will return a schema containing only the List of RequiredFields in that order. false if the loader will return all fields in the data
}
{code}

The boolean is wrapped in a RequiredFieldResponse class to allow for future extensibility. For example, a future loader may be able to honor all top-level field requests but not requests for subfields in hashes, so it may hand back whole top-level maps in response to subfield requests. The loader would then need to tell the caller which fields will be returned exactly as requested and which will be sent as top-level fields (even though the request was for subfields). For the first pass, though, it is all or nothing, conveyed through the boolean.

The API call in LoadFunc will change from 
{code}
void fieldsToRead(Schema schema);
{code}
to
{code}
RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields, boolean allFieldsRequired);
{code}

NOTE: 
1.	It is expected that the loader returns the required fields in exactly the same order as in the List provided in the above call.
2.	The boolean flag allFieldsRequired is set to true when all fields are required. The loader should first check this flag and use the List<RequiredField> ONLY if this flag is false.
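
As a rough sketch of the call contract in the notes above, a loader might handle the request as follows. PositionalLoaderSketch is hypothetical, the two proposed classes are trimmed to stand-ins, and returning false when all fields are required is an assumption (the proposal only says false means all fields are returned).

```java
import java.util.ArrayList;
import java.util.List;

// Trimmed stand-ins for the proposed classes; only what this sketch needs.
class RequiredField {
    String alias;   // null if not supplied
    int index;      // -1 if not supplied

    RequiredField(String alias, int index) {
        this.alias = alias;
        this.index = index;
    }
}

class RequiredFieldResponse {
    boolean requiredFieldRequestHonored;

    RequiredFieldResponse(boolean honored) {
        this.requiredFieldRequestHonored = honored;
    }
}

// Hypothetical loader that can only prune columns addressed by position.
class PositionalLoaderSketch {
    // Columns to return, in request order; null means "return every field".
    List<Integer> columnsToRead = null;

    RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields,
                                       boolean allFieldsRequired) {
        // Check the flag first; the list is consulted only when it is false.
        if (allFieldsRequired) {
            columnsToRead = null;
            return new RequiredFieldResponse(false); // assumption: all fields returned
        }
        List<Integer> columns = new ArrayList<>();
        for (RequiredField field : requiredFields) {
            if (field.index < 0) {
                // Alias-only request: this sketch cannot resolve it, so it falls
                // back to returning everything and says so in the response.
                columnsToRead = null;
                return new RequiredFieldResponse(false);
            }
            columns.add(field.index);
        }
        columnsToRead = columns; // same order as the request
        return new RequiredFieldResponse(true);
    }
}
```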

Use Cases
=========

Use Cases which only use aliases
================================
{noformat}
1.	Required fields are columns x (int), y (long)
[
{
	alias=>x,
	index => -1,
	subfields => null,
	type => DataType.INTEGER
},
{
	alias=>y,
	index => -1,
	subfields => null,
	type => DataType.LONG
}
]

2.	Required fields are m1#key1 (map subcolumn), b1#key2 (subcolumn from a bag of maps),
[
    {
        alias => m1,
        index => -1,
        subfields => [
            {
                alias => key1,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.MAP
    },
    {
        alias => b1,
        index => -1,
        subfields => [
            {
                alias => key2,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.BAG_OF_MAP
    }
]

3.	Required fields are   m2#(key3, key4)  (map subcolumns), b2#(key5, key6) (subcolumns from bag of maps)
[
    {
        alias => m2,
        index => -1,
        subfields => [
            {
                alias => key3,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            },
            {
                alias => key4,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.MAP
    },
    {
        alias => b2,
        index => -1,
        subfields => [
            {
                alias => key5,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            },
            {
                alias => key6,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.BAG_OF_MAP
    }
]
{noformat}
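
For illustration, alias use case 2 above (m1#key1 and b1#key2) could be built in Java roughly as follows. This is a sketch only: RequiredField is a stand-in for the proposed class, AliasUseCaseTwo is hypothetical, and the type codes are placeholders rather than the values in Pig's DataType.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the proposed RequiredField class.
class RequiredField {
    final String alias;
    final int index;
    final List<RequiredField> subFields;
    final byte type;

    RequiredField(String alias, int index, List<RequiredField> subFields, byte type) {
        this.alias = alias;
        this.index = index;
        this.subFields = subFields;
        this.type = type;
    }
}

class AliasUseCaseTwo {
    // Placeholder type codes for illustration; not the values in Pig's DataType.
    static final byte BYTEARRAY = 1;
    static final byte MAP = 2;
    static final byte BAG_OF_MAP = 3;

    static List<RequiredField> build() {
        // Sub fields carry null subFields: only one sublevel is populated.
        RequiredField key1 = new RequiredField("key1", -1, null, BYTEARRAY);
        RequiredField key2 = new RequiredField("key2", -1, null, BYTEARRAY);
        RequiredField m1 = new RequiredField("m1", -1, Arrays.asList(key1), MAP);
        RequiredField b1 = new RequiredField("b1", -1, Arrays.asList(key2), BAG_OF_MAP);
        return Arrays.asList(m1, b1);
    }
}
```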

Use Cases which use positional indices
======================================
{noformat}
1.	Required fields are columns $0 (int), $1 (long)
[
{
	alias=>null,
	index => 0,
	subfields => null,
	type => DataType.INTEGER
},
{
	alias=>null,
	index => 1,
	subfields => null,
	type => DataType.LONG
}
]

2.	Required fields are $0#key1 (map subcolumn), $2#key2 (subcolumn from a bag of maps),
[
    {
        alias => null,
        index => 0,
        subfields => [
            {
                alias => key1,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.MAP
    },
    {
        alias => null,
        index => 2,
        subfields => [
            {
                alias => key2,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.BAG_OF_MAP
    }
]

3.	Required fields are   $5#(key3, key4)  (map subcolumns), $3#(key5, key6) (subcolumns from bag of maps)
[
    {
        alias => null,
        index => 5,
        subfields => [
            {
                alias => key3,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            },
            {
                alias => key4,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.MAP
    },
    {
        alias => null,
        index => 3,
        subfields => [
            {
                alias => key5,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            },
            {
                alias => key6,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.BAG_OF_MAP
    }
]
{noformat}
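
Since positional requests such as $5 followed by $3 must come back in the requested order, a loader-side projection might look like this sketch. PositionalProjection is a hypothetical helper, and a row is modelled as a plain List rather than a Pig Tuple.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Pig.
class PositionalProjection {
    // Returns the values at 'indexes', in the order the indexes were requested.
    static List<Object> project(List<Object> fullRow, int[] indexes) {
        List<Object> projected = new ArrayList<>(indexes.length);
        for (int index : indexes) {
            projected.add(fullRow.get(index));
        }
        return projected;
    }
}
```

The indexes refer to positions in the full row, that is, the positions Pig would see if the loader returned every field.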





[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785936#action_12785936 ] 

Yan Zhou commented on PIG-653:
------------------------------

The 27 release audit failures come from 25 Pig test scripts and 2 test data files; none of them are source files, so they should be ignored.



[jira] Updated: (PIG-653) Make fieldsToRead work in loader

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-653:
-------------------------------

    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

PIG-922



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672176#action_12672176 ] 

Hong Tang commented on PIG-653:
-------------------------------

My quibble is that the interface uses null to indicate that all nested fields are required, but uses a concrete class for top-level fields. Is there any justification for why possible future extensions would apply only to top-level fields and not to nested fields?



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785926#action_12785926 ] 

Yan Zhou commented on PIG-653:
------------------------------

+1



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672015#action_12672015 ] 

Hong Tang commented on PIG-653:
-------------------------------

Should subFields also have the type RequiredFieldList?



[jira] Updated: (PIG-653) Make fieldsToRead work in loader

Posted by "Gaurav Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gaurav Jain updated PIG-653:
----------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785928#action_12785928 ] 

Hadoop QA commented on PIG-653:
-------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426879/PIG-653.patch
  against trunk revision 887049.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 97 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 395 release audit warnings (more than the trunk's current 368 warnings).

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/console

This message is automatically generated.



[jira] Updated: (PIG-653) Make fieldsToRead work in loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-653:
-------------------------------

    Attachment: PIG-653-2.comment

A new proposal has been attached as a revision of the proposal in comment 1.

The two main changes are:
1. A new class RequiredFieldList will be used to convey the list of required fields. A separate class was chosen (rather than passing the List<RequiredField> and the boolean separately) since it gives us the flexibility to extend it easily in the future.
2. The new type BAG_OF_MAP is no longer needed. Suppose a field is a bag (named "bg") containing a single column that is a map, and we need the data for only one key (say k1) from it. We can represent that with a RequiredField object of type BAG with alias "bg". This object will have one RequiredField object in its subFields list, of type MAP and with index 0 to indicate that it is the first subfield in the bag. That object in turn will have one RequiredField object in its subFields list, of type BYTEARRAY and with alias "k1". This illustrates how subcolumns of interest can be represented by the RequiredField class.
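
The nesting in point 2 might be written out as in the sketch below. The FieldType enum is a stand-in for Pig's byte type constants, RequiredField is reduced to the members the example needs, and describe() and BagOfMapExample are hypothetical helpers added only to make the chain visible.

```java
import java.util.Collections;
import java.util.List;

// Stand-in for Pig's byte type constants, used here only for readability.
enum FieldType { BAG, MAP, BYTEARRAY }

class RequiredField {
    final String alias;                  // null if not supplied
    final int index;                     // -1 if not supplied
    final FieldType type;
    final List<RequiredField> subFields; // null if the whole field is required

    RequiredField(String alias, int index, FieldType type, List<RequiredField> subFields) {
        this.alias = alias;
        this.index = index;
        this.type = type;
        this.subFields = subFields;
    }

    // Renders this field and the first sub field at each level below it.
    String describe() {
        String name = (alias != null) ? alias : "$" + index;
        String self = name + ":" + type;
        if (subFields == null || subFields.isEmpty()) {
            return self;
        }
        return self + " > " + subFields.get(0).describe();
    }
}

class BagOfMapExample {
    // Bag "bg" -> first subfield (a map) -> key "k1".
    static RequiredField build() {
        RequiredField k1 = new RequiredField("k1", -1, FieldType.BYTEARRAY, null);
        RequiredField map = new RequiredField(null, 0, FieldType.MAP,
                Collections.singletonList(k1));
        return new RequiredField("bg", -1, FieldType.BAG,
                Collections.singletonList(map));
    }
}
```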




[jira] Updated: (PIG-653) Make fieldsToRead work in loader

Posted by "Gaurav Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gaurav Jain updated PIG-653:
----------------------------

    Attachment: PIG-653.patch


Zebra changes for the proposed feature

Please review at your earliest convenience



[jira] Updated: (PIG-653) Make fieldsToRead work in loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-653:
-------------------------------

    Attachment: PIG-653-3-proposal.txt

Introduced a boolean into RequiredField to indicate whether all sub fields are required. The reason I feel we cannot use RequiredFieldList internally in RequiredField to represent the subfields is that RequiredFieldList is meant to be a class for communicating about the top-level required fields. In the future, the extensions added to it may only make sense at the top level and hence would not fit well for the sub fields.

I have attached the third version of the proposal with the above changes and have explicitly listed the getters in the classes since these will be used by loaders.
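
One way the revised shape could look, as a sketch only: the attached proposal is not reproduced here, so the name allSubFieldsRequired, the getter names, the needsSubField helper, and the treatment of a null subFields list are all assumptions.

```java
import java.util.List;

// Sketch of RequiredField with an explicit "all sub fields" flag.
class RequiredField {
    private final String alias;
    private final boolean allSubFieldsRequired; // name is an assumption
    private final List<RequiredField> subFields;

    RequiredField(String alias, boolean allSubFieldsRequired, List<RequiredField> subFields) {
        this.alias = alias;
        this.allSubFieldsRequired = allSubFieldsRequired;
        this.subFields = subFields;
    }

    String getAlias() { return alias; }
    boolean isAllSubFieldsRequired() { return allSubFieldsRequired; }
    List<RequiredField> getSubFields() { return subFields; }

    // True if the sub field named 'key' must be returned by the loader.
    boolean needsSubField(String key) {
        if (allSubFieldsRequired) {
            return true; // the list is not consulted when the flag is set
        }
        if (subFields == null) {
            return false; // assumption: flag false and no list means nothing is needed
        }
        for (RequiredField sub : subFields) {
            if (key.equals(sub.getAlias())) {
                return true;
            }
        }
        return false;
    }
}

// Top-level wrapper; extensions that only make sense at the top level live here.
class RequiredFieldList {
    private final List<RequiredField> fields;
    private final boolean allFieldsRequired;

    RequiredFieldList(List<RequiredField> fields, boolean allFieldsRequired) {
        this.fields = fields;
        this.allFieldsRequired = allFieldsRequired;
    }

    List<RequiredField> getFields() { return fields; }
    boolean isAllFieldsRequired() { return allFieldsRequired; }
}
```

With the flag in place, a null subFields list no longer has to double as the "everything is required" signal.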



[jira] Assigned: (PIG-653) Make fieldsToRead work in loader

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-653:
------------------------------

    Assignee: Pradeep Kamath  (was: Alan Gates)



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672020#action_12672020 ] 

Pradeep Kamath commented on PIG-653:
------------------------------------

Not so sure about using RequiredFieldList for subFields - this will mean we could ask for all subFields in two ways - 
1. By just asking for the main field (this will imply we need all sub fields)
2. By asking for the main field with a subField which has its allFieldsRequired flag set to true.

I think it would be better to keep the subFields as only THE required subfields represented as a list. RequiredFieldList is specifically being introduced to handle top level information to be given to the loader which may not be applicable at a field level. 

Thoughts?



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671371#action_12671371 ] 

Hong Tang commented on PIG-653:
-------------------------------

I don't like the idea of adding a BAG_OF_MAP type. It really is a composite of two existing types: a BAG of MAPs.

Here is another idea I came up with, and briefly discussed with Pradeep.

{code}
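import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.pig.data.DataType;

public interface Filter {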
  /**
   * Return the actual type of the filter. It can then be downcast to the
   * actual Filter.
   * 
   * @return one of the following constants defined in DataType: TUPLE, BAG, and
   *         MAP
   */
  byte getType();
}

class TupleFilter implements Filter {
  private static class TupleFilterEntry {
    String alias;
    Filter filter;
    TupleFilterEntry(String a, Filter f) {
      alias = a;
      filter = f;
    }
  }
  
  SortedMap<Integer, TupleFilterEntry> entries;

  public byte getType() { return DataType.TUPLE; }
  
  public TupleFilter() {
    entries = new TreeMap<Integer, TupleFilterEntry>();
  }

  /**
   * Convenience constructor for simple position-based filtering.
   * @param indices
   */
  public TupleFilter(int...indices) {
    entries = new TreeMap<Integer, TupleFilterEntry>();
    for (int i : indices) {
      entries.put(i, new TupleFilterEntry(null, null));
    }
  }
  
  /**
   * Adds an entry to the filter. (Building the filter.)
   * 
   * @param index
   *          The index of the field we are interested in
   * @param alias
   *          The alias name of the field, optional
   * @param filter
   *          Further filtering on the field; null means no nested filter.
   */
  public synchronized void add(int index, String alias, Filter filter) {
    entries.put(index, new TupleFilterEntry(alias, filter));
  }
  
  /**
   * Get the interested fields.
   * 
   * @return The indices to the interested fields, sorted in ascending order.
   */
  public synchronized int[] getFields() {
    int[] ret = new int[entries.size()];
    int i = 0;
    for (Iterator<Integer> it = entries.keySet().iterator(); it.hasNext(); ++i) {
      ret[i] = it.next();
    }
    return ret;
  }

  public synchronized String getAlias(int index) {
    TupleFilterEntry entry = entries.get(index);
    if (entry == null) {
      throw new IllegalArgumentException("Unrecognized field index");
    }
    return entry.alias;
  }

  public synchronized Filter getFilter(int index) {
    TupleFilterEntry entry = entries.get(index);
    if (entry == null) {
      throw new IllegalArgumentException("Unrecognized field index");
    }
    return entry.filter;
  }
}

class MapFilter implements Filter {
  Map<String, Filter> entries;
  
  public MapFilter() {
    entries = new TreeMap<String, Filter>();
  }
  
  /**
   * Convenience constructor for simple key matching filtering.
   * 
   * @param keys
   *          interested keys
   */
  public MapFilter(String... keys) {
    this();
    add(keys);
  }
  
  /**
   * Adds keys to the interested key set without further filtering.
   * 
   * @param keys
   *          interested keys.
   */
  public void add(String... keys) {
    add(null, keys);
  }

  /**
   * Adding keys to the interested key set with further filtering
   * 
   * @param f
   *          The filter
   * @param keys
   *          the keys
   */
  public synchronized void add(Filter f, String... keys) {
    for (String k : keys) {
      entries.put(k, f);
    }
  }
  
  @Override
  public byte getType() {
    return DataType.MAP;
  }
  
  public synchronized Map<String, Filter> getKeyFilterMapping() {
    return entries;
  }
}

class BagFilter implements Filter {
  Filter filter;

  public BagFilter(TupleFilter filter) {
    this.filter = filter;
  }

  @Override
  public byte getType() {
    return DataType.BAG;
  }

  public Filter getTupleFilter() {
    return filter;
  }
}
{code}
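
To show how the composite replaces a BAG_OF_MAP type, here is a usage sketch. The classes are trimmed stand-ins for the ones above (no aliases, no synchronization), the type codes are placeholders rather than Pig's DataType constants, and FilterUsage is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Trimmed stand-ins for the classes above; type codes are placeholders.
interface Filter {
    byte TUPLE = 1;
    byte BAG = 2;
    byte MAP = 3;

    byte getType();
}

class MapFilter implements Filter {
    final Map<String, Filter> entries = new TreeMap<>();

    MapFilter(String... keys) {
        for (String k : keys) {
            entries.put(k, null); // null: no nested filter for this key
        }
    }

    public byte getType() { return MAP; }
}

class TupleFilter implements Filter {
    final Map<Integer, Filter> entries = new TreeMap<>();

    void add(int index, Filter filter) { entries.put(index, filter); }

    public byte getType() { return TUPLE; }
}

class BagFilter implements Filter {
    final TupleFilter tupleFilter;

    BagFilter(TupleFilter tupleFilter) { this.tupleFilter = tupleFilter; }

    public byte getType() { return BAG; }
}

class FilterUsage {
    // Describes "the given key from the map at position 0 of each tuple in a bag".
    static BagFilter bagOfMapKey(String key) {
        TupleFilter tuple = new TupleFilter();
        tuple.add(0, new MapFilter(key));
        return new BagFilter(tuple);
    }
}
```

A loader would walk the tree by calling getType() and downcasting, as the Filter javadoc suggests.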



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786046#action_12786046 ] 

Yan Zhou commented on PIG-653:
------------------------------

Zebra changes committed to both trunk and the 6.0 branch.
