Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2007/11/07 20:40:48 UTC
[Pig Wiki] Update of "PigDataTypeApis" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigDataTypeApis
New page:
[[Anchor(Data_type_APIs)]]
== Data type APIs ==
Every piece of data in Pig has the abstract type '''Datum'''. There are four concrete types, each of which is a special case of Datum:
* A '''Data Atom''' is a simple atomic data value. It is stored as a string but can be used as either a string or a number. Examples of data atoms are 'apache.org' and '1.0'.
* A '''Tuple''' is a data record consisting of a sequence of "fields". Each field is a Datum of any type (data atom, tuple or data bag).
* A '''Data Bag''' is a collection of tuples in which duplicates are allowed (a multiset). You may think of it as a "table", except that Pig does not require the tuple field types to match, or even that the tuples have the same number of fields. (It is up to you whether to enforce these properties.)
* A '''Data Map''' maps string keys to datums of any type.
The APIs for reading/creating/modifying each kind of Datum are given below.
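To make the "table without a fixed schema" idea concrete, here is a minimal sketch that models a bag with plain `java.util` collections. The class name `BagShapeSketch` and the use of `List<Object>` as a tuple are assumptions made for illustration only; they are not part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in model only: a "tuple" is a List<Object>, a "bag" is a List of tuples.
// These are plain java.util types, not Pig's Tuple and DataBag classes.
class BagShapeSketch {
    // Build a bag whose tuples differ in arity and in field types.
    static List<List<Object>> mixedBag() {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList("apache.org", "1.0"));   // two atoms
        bag.add(Arrays.<Object>asList("pig"));                 // one atom
        List<Object> nested = Arrays.<Object>asList("inner", "2");
        bag.add(Arrays.<Object>asList("outer", nested));       // atom plus nested tuple
        bag.add(Arrays.<Object>asList("apache.org", "1.0"));   // a duplicate is allowed
        return bag;
    }

    public static void main(String[] args) {
        for (List<Object> tuple : mixedBag()) {
            System.out.println(tuple.size() + " field(s): " + tuple);
        }
    }
}
```

Nothing in the model forces the four tuples to share a shape, which is the property the description of a Data Bag relies on.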
[[Anchor(Data_Atom)]]
=== Data Atom ===
The simplest kind of datum is a Data Atom. For now, a Data Atom is simply a string (in future releases, we may support binary data, numeric data types, etc.). Here is the API for Data Atom:
{{{
final public class DataAtom implements Datum {
    // create a new data atom with value ""
    public DataAtom();

    // create a new data atom with value equal to the given string
    public DataAtom(String valIn);

    // create a new data atom with value equal to the given integer (gets converted to a string)
    public DataAtom(int valIn);

    // create a new data atom with value equal to the given long integer (gets converted to a string)
    public DataAtom(long valIn);

    // create a new data atom with value equal to the given floating-point number (gets converted to a string)
    public DataAtom(double valIn);

    // set the value to the given string
    public void setValue(String valIn);

    // set the value to the given integer (gets converted to a string)
    public void setValue(int valIn);

    // set the value to the given long integer (gets converted to a string)
    public void setValue(long valIn);

    // set the value to the given floating-point number (gets converted to a string)
    public void setValue(double valIn);

    // return the string value of the data atom
    public String strval();

    // return the numeric value of the data atom (only call this if you know that the value is numerical)
    public Double numval();
}
}}}
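The "stored as a string, usable as a number" behaviour can be sketched as follows. `AtomSketch` is a hypothetical stand-in that copies a few signatures from the listing above so that the example runs on its own; it is not Pig's `DataAtom`, and its handling of non-numeric values is an assumption.

```java
// Hypothetical stand-in that mirrors a few DataAtom signatures from the listing.
// It is not Pig's implementation; it only illustrates the string-backed storage.
final class AtomSketch {
    private String value;

    AtomSketch() { this.value = ""; }
    AtomSketch(String valIn) { this.value = valIn; }
    AtomSketch(int valIn) { this.value = Integer.toString(valIn); }
    AtomSketch(double valIn) { this.value = Double.toString(valIn); }

    void setValue(String valIn) { this.value = valIn; }

    // The stored string, unchanged.
    String strval() { return value; }

    // Parse the stored string on demand. In this sketch a non-numeric value raises
    // NumberFormatException; what Pig does in that case is not stated in the listing.
    Double numval() { return Double.valueOf(value); }

    public static void main(String[] args) {
        AtomSketch host = new AtomSketch("apache.org");
        AtomSketch version = new AtomSketch("1.0");
        AtomSketch count = new AtomSketch(42);

        System.out.println(host.strval());           // apache.org
        System.out.println(version.numval() + 1.0);  // 2.0
        System.out.println(count.strval());          // 42
    }
}
```

The caveat on `numval()` in the listing matters here: calling it on an atom such as 'apache.org' cannot produce a meaningful number.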
[[Anchor(Tuple)]]
=== Tuple ===
A Tuple is a list of Datums. Each Datum in the list is called a "field". The number of fields a tuple has is called its "arity". The API for Tuple is:
{{{
public class Tuple implements Datum {
    // Create a new tuple with zero fields
    public Tuple();

    // Create a new tuple with the specified number of fields
    public Tuple(int numFields);

    // Create a new tuple with the set of fields given
    public Tuple(List<Datum> fieldsIn);

    // Create a new single-field tuple
    public Tuple(Datum fieldIn);

    /**
     * Create a tuple from a delimited line of text
     *
     * @param textLine
     *            the line containing fields of data
     * @param delimiter
     *            a regular expression of the form specified by String.split(). If null, the default
     *            delimiter "[,\t]" will be used.
     */
    public Tuple(String textLine, String delimiter);

    // Create a tuple from a delimited line of text (using default delimiters: comma and tab)
    public Tuple(String textLine);

    // Return the number of fields in this tuple
    public int arity();

    // Create a string representation of the tuple (useful in debugging)
    public String toString();

    // Set the ith field to the given value
    public void setField(int i, Datum val) throws IOException;

    // Set the ith field to a Data Atom created for the given integer value
    public void setField(int i, int val) throws IOException {
        setField(i, new DataAtom(val));
    }

    // Set the ith field to a Data Atom created for the given floating-point value
    public void setField(int i, double val) throws IOException {
        setField(i, new DataAtom(val));
    }

    // Set the ith field to a Data Atom created for the given string value
    public void setField(int i, String val) throws IOException {
        setField(i, new DataAtom(val));
    }

    // Retrieve the ith field
    public Datum getField(int i) throws IOException;

    // Get field i, if it is an Atom or can be coerced into an Atom (throws an exception if the ith field is not coercible into a DataAtom)
    public DataAtom getAtomField(int i) throws IOException;

    // Get field i, if it is a Tuple or can be coerced into a Tuple (throws an exception if the ith field is not coercible into a Tuple)
    public Tuple getTupleField(int i) throws IOException;

    // Get field i, if it is a Bag
    public DataBag getBagField(int i) throws IOException;

    // Append a new field to this tuple (increases the arity)
    public void appendField(Datum newField) throws IOException;

    // Append the fields of the given tuple to this tuple (increases the arity of this tuple)
    public void appendTuple(Tuple other) throws IOException;
}
}}}
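The text-line constructors split a line with a `String.split()` regular expression, defaulting to comma and tab. The sketch below shows how that splitting and the notion of arity fit together. `TupleSketch` is a hypothetical stand-in whose fields are plain strings, and the choice to keep trailing empty fields is an assumption of the sketch, not documented Pig behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: fields are plain Strings here, whereas Pig's Tuple holds Datums.
class TupleSketch {
    // Default delimiter quoted in the Javadoc of Tuple(String, String).
    static final String DEFAULT_DELIMITER = "[,\t]";

    private final List<String> fields;

    // Split a line with String.split(); a null delimiter selects the default.
    TupleSketch(String textLine, String delimiter) {
        String regex = (delimiter == null) ? DEFAULT_DELIMITER : delimiter;
        // Limit -1 keeps trailing empty fields; whether Pig keeps them is not stated.
        this.fields = new ArrayList<>(Arrays.asList(textLine.split(regex, -1)));
    }

    TupleSketch(String textLine) { this(textLine, null); }

    // Number of fields currently held.
    int arity() { return fields.size(); }

    String getField(int i) { return fields.get(i); }

    // Appending a field increases the arity by one.
    void appendField(String newField) { fields.add(newField); }

    public static void main(String[] args) {
        TupleSketch t = new TupleSketch("apache.org,1.0\tpig");
        System.out.println(t.arity());          // 3
        t.appendField("extra");
        System.out.println(t.arity());          // 4

        // The delimiter is a regular expression, so a literal '|' must be escaped.
        TupleSketch piped = new TupleSketch("a|b", "\\|");
        System.out.println(piped.getField(1));  // b
    }
}
```

Because the delimiter is a regular expression, characters such as `|` or `.` need escaping when they are meant literally.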
[[Anchor(Data_Bag)]]
=== Data Bag ===
A Bag is a collection of Tuples and may contain duplicates. The order of tuples in a bag is usually not significant, although a bag that has been ''sorted'' keeps its tuples in sorted order. The number of tuples in a bag is called its "cardinality". The API for Bag is:
{{{
public class DataBag extends DataCollector implements Datum {
    // Create an empty data bag
    public DataBag();

    // Create a data bag containing the given set of tuples
    public DataBag(List<Tuple> c);

    // Create a data bag containing one tuple, namely the one given
    public DataBag(Tuple t);

    // Return the size of the data bag, in terms of the number of tuples it contains
    public int cardinality();

    // Return true if and only if the bag contains no tuples
    public boolean isEmpty();

    // Return an iterator over the tuples in the bag
    public Iterator<Tuple> content();

    // Add a tuple to the bag
    public void add(Tuple t);

    // Add a set of tuples to the bag
    public void addAll(DataBag b);

    // Remove a tuple from the bag
    public void remove(Tuple d);

    /**
     * Returns the value of field i. Since there may be more than one tuple in the bag, this
     * function throws an exception if it is not the case that all tuples agree on this field
     */
    public DataAtom getField(int i) throws IOException;

    // Remove all tuples from the bag
    public void clear();

    // Create a string representation of the bag (useful for debugging)
    public String toString();
}
}}}
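The contract of `getField(int i)` on a bag is the least obvious part of the listing, so here is a minimal sketch of one way it could behave. `BagSketch` is a hypothetical stand-in that stores tuples as `List<String>`; throwing on an empty bag is an assumption of the sketch, since the listing does not say what happens in that case.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: tuples are List<String> here, not Pig's Tuple class.
class BagSketch {
    private final List<List<String>> tuples = new ArrayList<>();

    // Duplicates are kept, so the bag behaves as a multiset.
    void add(List<String> t) { tuples.add(t); }

    int cardinality() { return tuples.size(); }

    boolean isEmpty() { return tuples.isEmpty(); }

    // Return field i when every tuple holds the same value there; otherwise throw.
    // Throwing on an empty bag is a choice made for this sketch only.
    String getField(int i) throws IOException {
        if (tuples.isEmpty()) {
            throw new IOException("bag is empty");
        }
        String first = tuples.get(0).get(i);
        for (List<String> t : tuples) {
            if (!first.equals(t.get(i))) {
                throw new IOException("tuples disagree on field " + i);
            }
        }
        return first;
    }

    public static void main(String[] args) throws IOException {
        BagSketch bag = new BagSketch();
        bag.add(Arrays.asList("apache.org", "1.0"));
        bag.add(Arrays.asList("apache.org", "2.0"));
        bag.add(Arrays.asList("apache.org", "1.0")); // duplicate tuple is kept

        System.out.println(bag.cardinality()); // 3
        System.out.println(bag.getField(0));   // apache.org
        try {
            bag.getField(1);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // tuples disagree on field 1
        }
    }
}
```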
[[Anchor(Data_Map)]]
=== Data Map ===
A data map maps string keys to arbitrary datums. Values are looked up by their string key (see EvalFunctions). The API for a Data Map is:
{{{
// Return the cardinality of the data map
public int cardinality();

// Add the key/value pair to the map
public void put(String key, Datum value);

// Add the value as a data atom mapped to the given key
public void put(String key, String value);

// Add the value as a data atom mapped to the given key
public void put(String key, int value);

// Fetch the value corresponding to a given key; returns an empty data atom if the key does not exist
public Datum get(String key);
}}}
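The notable detail in this API is that `get` returns an empty data atom for a missing key instead of `null`. The sketch below illustrates that behaviour with a hypothetical stand-in, `MapSketch`, in which values are plain objects and an atom is represented by a `String`; none of these names come from Pig.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: values are plain Objects and atoms are Strings,
// whereas the map described above stores Datums.
class MapSketch {
    private final Map<String, Object> content = new HashMap<>();

    // Number of key/value pairs currently stored.
    int cardinality() { return content.size(); }

    void put(String key, Object value) { content.put(key, value); }

    // Integers are stored in their string form, like the atom-valued put overloads.
    void put(String key, int value) { content.put(key, Integer.toString(value)); }

    // A missing key yields an empty atom (here: the empty string) rather than null.
    Object get(String key) {
        Object value = content.get(key);
        return (value == null) ? "" : value;
    }

    public static void main(String[] args) {
        MapSketch m = new MapSketch();
        m.put("site", "apache.org");
        m.put("hits", 7);

        System.out.println(m.cardinality());            // 2
        System.out.println(m.get("hits"));              // 7
        System.out.println(m.get("missing").equals("")); // true
    }
}
```

One consequence of this design is that a caller cannot distinguish a missing key from a key that maps to an empty atom by inspecting the result of `get` alone.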