Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2007/11/07 20:40:48 UTC

[Pig Wiki] Update of "PigDataTypeApis" by OlgaN

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigDataTypeApis

New page:
[[Anchor(Data_type_APIs)]]
== Data type APIs ==

Every piece of data in Pig has the abstract type '''Datum'''. There are four concrete types, each of which is a special case of Datum:

   * A '''Data Atom''' is a simple atomic data value. It is stored as a string but can be used as either a string or a number. Examples of data atoms are 'apache.org' and '1.0'.
   * A '''Tuple''' is a data record consisting of a sequence of "fields". Each field is a Datum of any type (data atom, tuple or data bag).
   * A '''Data Bag''' is a collection of tuples in which duplicate tuples are allowed. You may think of it as a "table", except that Pig does not require the tuples' field types to match, or even that the tuples have the same number of fields. (It is up to you whether to enforce these properties.)
   * A '''Data Map''' maps string keys to datums of any type.
The APIs for reading/creating/modifying each kind of Datum are given below. 

[[Anchor(Data_Atom)]]
=== Data Atom ===

The simplest kind of datum is a Data Atom. For now, a Data Atom is simply a string (in future releases, we may support binary data, numeric data types, etc.). Here is the API for Data Atom:

{{{
final public class DataAtom implements Datum {

    // create a new data atom with value ""
    public DataAtom();

    // create a new data atom with value equal to the given string
    public DataAtom(String valIn);

    // create a new data atom with value equal to the given integer (gets converted to a string)
    public DataAtom(int valIn);

    // create a new data atom with value equal to the given long integer (gets converted to a string)
    public DataAtom(long valIn);

    // create a new data atom with value equal to the given floating-point number (gets converted to a string)
    public DataAtom(double valIn);

    // set the value to the given string
    public void setValue(String valIn);

    // set the value to the given integer (gets converted to a string)
    public void setValue(int valIn);

    // set the value to the given long integer (gets converted to a string)
    public void setValue(long valIn);

    // set the value to the given floating-point number (gets converted to a string)
    public void setValue(double valIn);

    // return the string value of the data atom
    public String strval();

    // return the numeric value of the data atom (only call this if you know that the value is numerical)
    public Double numval();

}
}}}
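
As a rough illustration of the "stored as a string, usable as a number" behaviour described above, the sketch below uses a hypothetical stand-in class, `AtomSketch`, rather than Pig's own `DataAtom`. It assumes that the numeric constructors store the result of `String.valueOf(...)` and that `numval()` parses the stored string on demand with `Double.valueOf`. Neither detail is confirmed by this page.

```java
// Hypothetical stand-in for DataAtom; this is NOT Pig's implementation.
final class AtomSketch {
    private String value;

    AtomSketch() { this.value = ""; }
    AtomSketch(String valIn) { this.value = valIn; }
    // Assumption: numbers are converted with String.valueOf.
    AtomSketch(int valIn) { this.value = String.valueOf(valIn); }
    AtomSketch(double valIn) { this.value = String.valueOf(valIn); }

    void setValue(String valIn) { this.value = valIn; }

    String strval() { return value; }

    // Assumption: parse on demand; Double.valueOf throws
    // NumberFormatException when the text is not numeric.
    Double numval() { return Double.valueOf(value); }

    public static void main(String[] args) {
        AtomSketch host = new AtomSketch("apache.org");
        AtomSketch count = new AtomSketch(42);
        System.out.println(host.strval());      // prints apache.org
        System.out.println(count.strval());     // prints 42
        System.out.println(count.numval() + 1); // prints 43.0
    }
}
```

Under these assumptions, calling `numval()` on an atom such as 'apache.org' fails, which is why the API comment says to call it only when the value is known to be numerical.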

[[Anchor(Tuple)]]
=== Tuple ===

A Tuple is a list of Datums. Each Datum in the list is called a "field". The number of fields a tuple has is called its "arity". The API for Tuple is:

{{{
public class Tuple implements Datum {

    // Create a new tuple with zero fields
    public Tuple();

    // Create a new tuple with the specified number of fields
    public Tuple(int numFields);

    // create a new tuple with the set of fields given
    public Tuple(List<Datum> fieldsIn);

    // create a new single-field tuple
    public Tuple(Datum fieldIn);

    /**
     * Create a tuple from a delimited line of text
     * 
     * @param textLine
     *            the line containing fields of data
     * @param delimiter
     *            a regular expression of the form specified by String.split(). If null, the default
     *            delimiter "[,\t]" will be used.
     */
    public Tuple(String textLine, String delimiter);

    // Create a tuple from a delimited line of text, using the default delimiters (comma and tab)
    public Tuple(String textLine);

    // Return the number of fields in this tuple
    public int arity();

    // Create a string representation of the tuple (useful in debugging)
    public String toString();

    // Set the ith field to the given value
    public void setField(int i, Datum val) throws IOException;

    // Set the ith field to a Data Atom created for the given integer value
    public void setField(int i, int val) throws IOException {
        setField(i, new DataAtom(val));
    }

    // Set the ith field to a Data Atom created for the given floating-point value
    public void setField(int i, double val) throws IOException {
        setField(i, new DataAtom(val));
    }

    // Set the ith field to a Data Atom created for the given string value
    public void setField(int i, String val) throws IOException {
        setField(i, new DataAtom(val));
    }

    // Retrieve the ith field
    public Datum getField(int i) throws IOException;

    // Get field i, if it is an Atom or can be coerced into an Atom (throws an exception if the ith field is not coercible into a DataAtom)
    public DataAtom getAtomField(int i) throws IOException;

    // Get field i, if it is a Tuple or can be coerced into a Tuple (throws an exception if the ith field is not coercible into a Tuple)
    public Tuple getTupleField(int i) throws IOException;

    // Get field i, if it is a Bag
    public DataBag getBagField(int i) throws IOException;

    // Append a new field to this tuple (increases the arity)
    public void appendField(Datum newField) throws IOException;

    // Append the fields of the given tuple to this tuple (increases the arity of this tuple)
    public void appendTuple(Tuple other) throws IOException;
}
}}}
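
The `Tuple(String textLine, String delimiter)` constructor documents its delimiter as a regular expression in the form accepted by `String.split()`. The sketch below shows only how such a delimiter divides a line into field strings, using plain `String.split()`. The class `DelimiterSketch` and helper `splitFields` are hypothetical, and the choice to keep trailing empty fields (limit `-1`) is an assumption, not something this page states about Pig.

```java
import java.util.Arrays;
import java.util.List;

class DelimiterSketch {
    // Default delimiter quoted in the Tuple API comment: a comma or a tab.
    static final String DEFAULT_DELIMITER = "[,\t]";

    // Hypothetical helper: split a line into field strings.
    static List<String> splitFields(String textLine, String delimiter) {
        String regex = (delimiter == null) ? DEFAULT_DELIMITER : delimiter;
        // Assumption: limit -1 keeps trailing empty fields.
        return Arrays.asList(textLine.split(regex, -1));
    }

    public static void main(String[] args) {
        // Default delimiters: comma and tab both separate fields.
        System.out.println(splitFields("apache.org,1.0\t42", null)); // prints [apache.org, 1.0, 42]
        // Custom delimiter: "|" is a regex metacharacter, so it must be escaped.
        System.out.println(splitFields("a|b|c", "\\|"));             // prints [a, b, c]
    }
}
```

Because the delimiter is a regular expression, characters such as `|` or `.` need escaping when they are meant literally.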

[[Anchor(Data_Bag)]]
=== Data Bag ===

A Bag is a collection of Tuples and may contain duplicate tuples. The order of tuples in a bag is usually not significant, although a bag that has been ''sorted'' keeps its tuples in sorted order. The number of tuples in a bag is called its "cardinality". The API for Bag is:

{{{
public class DataBag extends DataCollector implements Datum {

    // Create an empty data bag
    public DataBag();

    // Create a data bag containing the given set of tuples
    public DataBag(List<Tuple> c);

    // Create a data bag containing one tuple, namely the one given
    public DataBag(Tuple t);

    // Return the size of the data bag, in terms of the number of tuples it contains
    public int cardinality();

    // Return true if and only if the bag contains no tuples
    public boolean isEmpty();

    // Return an iterator over the tuples in the bag
    public Iterator<Tuple> content();
    
    // Add a tuple to the bag
    public void add(Tuple t);

    // Add a set of tuples to the bag
    public void addAll(DataBag b);

    // Remove a tuple from the bag
    public void remove(Tuple d);

    /**
     * Returns the value of field i. Since there may be more than one tuple in the bag, this
     * function throws an exception if it is not the case that all tuples agree on this field
     */
    public DataAtom getField(int i) throws IOException;

    // Remove all tuples from the bag
    public void clear();

    // Create a string representation of the bag (useful for debugging)
    public String toString();
}
}}}
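
Two points in the Bag API are easy to misread: duplicates count towards the cardinality, and `getField(i)` succeeds only when all tuples agree on field i. The sketch below illustrates both with a hypothetical stand-in, `BagSketch`, that models tuples as lists of strings. It is not Pig's `DataBag`, and its handling of an empty bag or a too-short tuple (both treated as errors) is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DataBag; tuples are modelled as lists of strings.
class BagSketch {
    private final List<List<String>> tuples = new ArrayList<>();

    void add(List<String> t) { tuples.add(t); }

    // Duplicates are kept, so each one counts.
    int cardinality() { return tuples.size(); }

    boolean isEmpty() { return tuples.isEmpty(); }

    // Return field i if every tuple holds the same value there; otherwise throw.
    // Assumption: an empty bag or a tuple without field i is also an error.
    String getField(int i) throws IOException {
        if (tuples.isEmpty()) {
            throw new IOException("bag is empty");
        }
        String first = null;
        for (List<String> t : tuples) {
            if (i < 0 || i >= t.size()) {
                throw new IOException("tuple has no field " + i);
            }
            if (first == null) {
                first = t.get(i);
            } else if (!first.equals(t.get(i))) {
                throw new IOException("tuples disagree on field " + i);
            }
        }
        return first;
    }

    public static void main(String[] args) throws IOException {
        BagSketch bag = new BagSketch();
        bag.add(Arrays.asList("apache.org", "1.0"));
        bag.add(Arrays.asList("apache.org", "2.0"));
        bag.add(Arrays.asList("apache.org", "1.0")); // duplicate tuple
        System.out.println(bag.cardinality()); // prints 3
        System.out.println(bag.getField(0));   // prints apache.org
        // bag.getField(1) would throw: the tuples disagree on field 1.
    }
}
```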

[[Anchor(Data_Map)]]
=== Data Map ===

A Data Map maps string keys to arbitrary datums; a value is looked up by its string key (see EvalFunctions). The API for a Data Map is:

{{{
    // Return the cardinality of the data map
    public int cardinality();

    // Add the key-value pair to the map
    public void put(String key, Datum value);

    // Add the string value, as a data atom, under the given key
    public void put(String key, String value);

    // Add the integer value, as a data atom, under the given key
    public void put(String key, int value);

    // Fetch the value for the given key; returns an empty data atom if the key does not exist
    public Datum get(String key);
}}}
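
The lookup rule in the Data Map API (a missing key yields an empty data atom, not an error) can be sketched as follows. `MapSketch` is a hypothetical stand-in that stores values as plain strings and represents the empty data atom as `""`. Pig's own class holds arbitrary datums, so treat this only as an illustration of the lookup behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Data Map; values are plain strings here.
class MapSketch {
    private final Map<String, String> content = new HashMap<>();

    int cardinality() { return content.size(); }

    void put(String key, String value) { content.put(key, value); }

    // Integers are stored in string form, like a data atom built from an int.
    void put(String key, int value) { content.put(key, String.valueOf(value)); }

    // A missing key yields the empty value instead of null or an exception.
    String get(String key) { return content.getOrDefault(key, ""); }

    public static void main(String[] args) {
        MapSketch m = new MapSketch();
        m.put("host", "apache.org");
        m.put("hits", 3);
        System.out.println(m.get("hits"));              // prints 3
        System.out.println(m.get("missing").isEmpty()); // prints true
        System.out.println(m.cardinality());            // prints 2
    }
}
```

One consequence of this rule is that a caller cannot tell a missing key from a key explicitly mapped to an empty atom by looking at the returned value alone.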