Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2009/02/06 23:11:04 UTC

[jira] Commented: (PIG-653) Make fieldsToRead work in loader

    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671355#action_12671355 ] 

Pradeep Kamath commented on PIG-653:
------------------------------------

Interface for passing required fields information to the loader
Proposal
Two new classes will be introduced for passing required-field information to the loader through its API.
{code}
class RequiredField {
    // Name of the field; null if not supplied.
    String alias;
    // 0-based index (position) of the required field; -1 if not supplied.
    int index;
    // Sub fields required within this field (for example, a list of hash keys).
    // Null if the entire field is required and no specific sub fields are.
    // In the initial implementation only one level of sub fields will be populated.
    List<RequiredField> subFields;
    // Type of this field: any current Pig DataType, as specified by the constants
    // in the DataType class. A new type, BAG_OF_MAP, will be added to represent a bag of maps.
    byte type;

    // Constructor, getters and setters follow.
    // Getters are getAlias(), getIndex(), getSubFields(), getType()
    // Setters are setAlias(), setIndex(), setSubFields(), setType()
}
{code}

NOTE: Both alias and index may be set. The index is the position of the field as Pig would see it if the loader returned all fields.

For performance, when a single key in a map is requested, the loader should ideally return a map containing just that key. Likewise, when the required field is a key in a bag-of-maps field, the loader is expected to return a bag of maps in which each map contains that key (preferably only that key, since this reduces the data handed back by the loader).
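To illustrate the pruning described above, here is a minimal sketch in plain Java. It is not Pig code: the class and method names (MapPruningSketch, pruneToKeys, pruneBag) are made up for this example, and a bag of maps is modeled as a simple List of Maps rather than Pig's own data types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pig: shows the kind of pruning a loader could do.
class MapPruningSketch {

    // Returns a new map holding only the requested keys that exist in the source map.
    static Map<String, Object> pruneToKeys(Map<String, Object> source, List<String> requiredKeys) {
        Map<String, Object> pruned = new HashMap<>();
        for (String key : requiredKeys) {
            if (source.containsKey(key)) {
                pruned.put(key, source.get(key));
            }
        }
        return pruned;
    }

    // Applies the same pruning to every map in a bag of maps (modeled here as a List).
    static List<Map<String, Object>> pruneBag(List<Map<String, Object>> bag, List<String> requiredKeys) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> map : bag) {
            result.add(pruneToKeys(map, requiredKeys));
        }
        return result;
    }
}
```

A real loader would more likely avoid materializing the unwanted keys in the first place; the sketch only shows the shape of the value handed back to Pig.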

{code}
class RequiredFieldResponse {
	// true if the loader will return a schema containing only the list of RequiredFields, in that order;
	// false if the loader will return all fields in the data
	boolean requiredFieldRequestHonored;
}
{code}

The RequiredFieldResponse class wraps the boolean to allow for future extensibility. For example, a future loader may be able to honor all top-level field requests but not requests for sub fields in hashes, so it may hand back whole top-level maps in response to sub field requests. The loader would then need to tell the caller which fields will be returned exactly as requested and which will be sent as top-level fields (even though the request was for sub fields). For the first pass, though, it is all or none, conveyed through the boolean.

The API call in LoadFunc will change from 
{code}
void fieldsToRead(Schema schema) 
{code}
to
{code}
RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields, boolean allFieldsRequired);
{code}

NOTE: 
1.	It is expected that the loader returns the required fields in exactly the same order as in the List provided in the above call.
2.	The boolean flag allFieldsRequired is set to true when all fields are required. The loader should first check this flag and use the List<RequiredField> ONLY if this flag is false.
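As a rough sketch of how a loader might follow the two notes above, consider the hypothetical loader below. RequiredField and RequiredFieldResponse are reduced stand-ins for the proposed classes, and PositionalLoader and project() are invented for this example. The response returned when allFieldsRequired is true is not spelled out in the proposal; the sketch assumes false, since all fields come back in that case.

```java
import java.util.ArrayList;
import java.util.List;

class FieldsToReadSketch {

    // Reduced stand-ins for the proposed classes; only what the sketch needs.
    static class RequiredField {
        final String alias;
        final int index;
        RequiredField(String alias, int index) {
            this.alias = alias;
            this.index = index;
        }
    }

    static class RequiredFieldResponse {
        final boolean requiredFieldRequestHonored;
        RequiredFieldResponse(boolean honored) {
            this.requiredFieldRequestHonored = honored;
        }
    }

    // Hypothetical loader that can project top-level columns by position only.
    static class PositionalLoader {
        private List<Integer> projection = null; // null means "return every column"

        RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields, boolean allFieldsRequired) {
            // Check the flag first; the list is consulted only when the flag is false.
            if (allFieldsRequired) {
                projection = null;
                return new RequiredFieldResponse(false); // assumption: all fields are returned
            }
            List<Integer> wanted = new ArrayList<>();
            for (RequiredField field : requiredFields) {
                if (field.index < 0) {
                    // This loader cannot resolve aliases, so it declines the whole request (all or none).
                    projection = null;
                    return new RequiredFieldResponse(false);
                }
                wanted.add(field.index);
            }
            projection = wanted;
            return new RequiredFieldResponse(true);
        }

        // Returns the columns of one row in exactly the order they were requested.
        List<Object> project(List<Object> fullRow) {
            if (projection == null) {
                return fullRow;
            }
            List<Object> out = new ArrayList<>();
            for (int position : projection) {
                out.add(fullRow.get(position));
            }
            return out;
        }
    }
}
```

Note that the projected row follows the order of the request, not the order of the columns in the data, which is what note 1 asks for.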

Use Cases
=========

Use Cases which only use aliases
================================
{noformat}
1.	Required fields are columns x (int), y (long)
[
{
	alias=>x,
	index => -1,
	subfields => null,
	type => DataType.INTEGER
},
{
	alias=>y,
	index => -1,
	subfields => null,
	type => DataType.LONG
}
]

2.	Required fields are m1#key1 (map subcolumn), b1#key2 (subcolumn from a bag of maps)
[
{
	alias=>m1,
	index => -1,
	subfields => [
		{
			alias => key1,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias=>b1,
	index => -1,
	subfields => [
		{
			alias => key2,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]

3.	Required fields are m2#(key3, key4) (map subcolumns), b2#(key5, key6) (subcolumns from bag of maps)
[
{
	alias=>m2,
	index => -1,
	subfields => [
		{
			alias => key3,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key4,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias=>b2,
	index => -1,
	subfields => [
		{
			alias => key5,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key6,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]
{noformat}
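For reference, alias use case 2 above (m1#key1, b1#key2) could be built on the caller side roughly as follows. This is only a sketch: RequiredField here is a stand-in with a four-argument constructor that the proposal does not specify, and the type constants are placeholder values, not the real values from Pig's DataType class (BAG_OF_MAP does not exist yet).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RequiredFieldBuilderSketch {

    // Placeholder type codes for illustration only; not Pig's actual DataType constants.
    static final byte BYTEARRAY = 1;
    static final byte MAP = 2;
    static final byte BAG_OF_MAP = 3;

    // Stand-in for the proposed class; the constructor signature is an assumption.
    static class RequiredField {
        final String alias;
        final int index;
        final List<RequiredField> subFields;
        final byte type;

        RequiredField(String alias, int index, List<RequiredField> subFields, byte type) {
            this.alias = alias;
            this.index = index;
            this.subFields = subFields;
            this.type = type;
        }
    }

    // A sub field naming a single map key: no index, and no further nesting.
    static RequiredField key(String name) {
        return new RequiredField(name, -1, null, BYTEARRAY);
    }

    // Builds the request for m1#key1 and b1#key2, in that order.
    static List<RequiredField> useCaseTwo() {
        RequiredField m1 = new RequiredField("m1", -1, Collections.singletonList(key("key1")), MAP);
        RequiredField b1 = new RequiredField("b1", -1, Collections.singletonList(key("key2")), BAG_OF_MAP);
        return Arrays.asList(m1, b1);
    }
}
```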

Use Cases which use positional indices
======================================
{noformat}
1.	Required fields are columns $0 (int), $1 (long)
[
{
	alias=>null,
	index => 0,
	subfields => null,
	type => DataType.INTEGER
},
{
	alias=>null,
	index => 1,
	subfields => null,
	type => DataType.LONG
}
]

2.	Required fields are $0#key1 (map subcolumn), $2#key2 (subcolumn from a bag of maps)
[
{
	alias=>null,
	index => 0,
	subfields => [
		{
			alias => key1,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias=>null,
	index => 2,
	subfields => [
		{
			alias => key2,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]

3.	Required fields are $5#(key3, key4) (map subcolumns), $3#(key5, key6) (subcolumns from bag of maps)
[
{
	alias=>null,
	index => 5,
	subfields => [
		{
			alias => key3,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key4,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias=>null,
	index => 3,
	subfields => [
		{
			alias => key5,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key6,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]
{noformat}
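Since a request may carry an alias, an index, or both, a loader needs some rule for turning a RequiredField into a column position. The sketch below shows one possible rule. It is an assumption, not part of the proposal: a non-negative index takes precedence, and the alias is looked up in the loader's own column names otherwise. The class and method names are invented for the example.

```java
import java.util.List;

class FieldResolutionSketch {

    // Returns the 0-based column position for a request, or -1 if it cannot be resolved.
    // Assumption: a non-negative index wins over the alias when both are set.
    static int resolve(String alias, int index, List<String> columnNames) {
        if (index >= 0) {
            // Reject positions that fall outside the loader's known columns.
            return index < columnNames.size() ? index : -1;
        }
        if (alias != null) {
            // List.indexOf returns -1 when the alias is unknown.
            return columnNames.indexOf(alias);
        }
        return -1;
    }
}
```

With column names (x, y), a request for $1 and a request for alias y would both resolve to position 1 under this rule.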



> Make fieldsToRead work in loader
> --------------------------------
>
>                 Key: PIG-653
>                 URL: https://issues.apache.org/jira/browse/PIG-653
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it does not provide information to load functions on what fields are needed.  We need to implement a visitor that determines (where possible) which fields in a file will be used and relays that information to the load function.
