Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2007/11/20 23:31:58 UTC
[Pig Wiki] Update of "PigAbstractionLayer" by AntonioMagnaghi
The following page has been changed by AntonioMagnaghi:
http://wiki.apache.org/pig/PigAbstractionLayer
New page:
##master-page:FrontPage
#format wiki
#language en
#pragma section-numbers off
= Pig Abstraction Layer =
== Introduction and Rationale ==
Many of the activities that Pig carries out during the compilation and execution stages of Pig Latin queries are currently deeply tied to the Hadoop file system and the Hadoop Map-Reduce paradigm.
For instance, file management tasks, job submission and job tracking in the Pig client explicitly assume the availability of a Hadoop cluster to which the client connects.
It is possible, however, to envision an architecture where the front-end part of the system (i.e. Pig client) may have a more abstract notion of the back-end portion. In this context, a Hadoop cluster could be regarded as a particular instance amongst a family of different back-ends, all of which provide similar functionalities that can be accessed via the same API.
The main motivations behind this proposal can be summarized as follows:
- The availability of well-defined APIs that a back-end needs to support in order to run Pig Latin queries can facilitate porting such APIs to different platforms. Hence, this could foster wider adoption of Pig.
- Changes in various back-ends can be encapsulated within the actual implementation of the generic APIs. Hence, the front-end requires fewer modifications, resulting in a more stable code-base.
A proper API design should be general enough to easily support the various back-ends that Pig currently targets: Hadoop, Galago (see the section below), and the local back-end (i.e. the local file system and the local execution type).
== Relevant links ==
[http://www.galagosearch.org/ Galago] is a research project started by Trevor Strohman at the University of Massachusetts, Amherst. Galago is a search-engine with its own execution back-end.
Galago is able to execute Pig Latin queries by translating them into its own representation language (TupleFlow jobs).
== API Specification ==
The basic functionalities that a back-end may need to export to the Pig client could be categorized into two main abstractions:
- '''Data Storage''': provides functionalities that pertain to storing and retrieving data. It encapsulates the typical operations supported by file systems, such as creating a data object and opening it for reading or writing.
- '''Query Execution/Tracking''': provides functionalities to parse a Pig Latin program and submit a compiled Pig job to a back-end. This API should enable the front-end to track the current status of a job, its progress, diagnostic information and possibly to terminate it.
The sections below provide some initial suggestions for possible APIs for the Data Storage and Query Execution abstractions.
=== Back-End Configuration ===
This interface abstracts functionalities for management of configuration information for both the Data Storage and Query Execution portions of a back-end.
{{{
package org.apache.pig.backend;
import java.io.Serializable;
import java.util.Map;
import java.net.URI;
/** Abstraction for a generic property object that can be
* used to specify configuration information, stats...
* Information is represented in the form of (key, value)
* pairs.
*/
public interface PigBackEndProperties extends Serializable,
Iterable<String> {
/**
* Introduces a new (key, value) pair or updates one already
* associated to key.
*
* @param key - the key to insert/update
* @param value - the value for the given key
* @return - the value of the old key, if it exists, null otherwise
*/
public Object setValue(String key, Object value);
/**
* Given a resource, update configuration information.
*
* @param resource - the resource from which property values are read.
* @return the set of keys and corresponding values that have been
*         updated. If resource contains/updates the same key multiple
*         times, only the initial value of the key is returned.
*/
public Map<String, Object> addFromResource(URI resource);
/**
* Creates or updates (key, value) pairs with information
* from other.
*
* @param other - source of properties
* @return - keys that have been updated, if any, and the
* corresponding old values
*/
public Map <String, Object> merge(PigBackEndProperties other);
/**
* Removes (key, value) pair if present
* @param key - key to remove
* @return - value of key, if key was present, null otherwise
*/
public Object delete(String key);
/**
* Returns value of a key
* @param key
* @return value of key if present, null otherwise.
*/
public Object getValue(String key);
/**
* @return number of (key, value) pairs stored
*/
public long getCount();
}
}}}
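To make the (key, value) contract above concrete, here is a minimal in-memory sketch of the same behavior; the class name `InMemoryBackEndProperties` and the `HashMap` backing store are hypothetical illustrations, not part of the Pig code-base.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical in-memory sketch of the PigBackEndProperties contract;
// for brevity it implements only Iterable<String>, not Serializable.
public class InMemoryBackEndProperties implements Iterable<String> {
    private final Map<String, Object> props = new HashMap<>();

    // Insert or update a (key, value) pair; return the previous value, or null.
    public Object setValue(String key, Object value) {
        return props.put(key, value);
    }

    // Return the value bound to key, or null if absent.
    public Object getValue(String key) {
        return props.get(key);
    }

    // Remove the pair if present; return its old value, or null.
    public Object delete(String key) {
        return props.remove(key);
    }

    // Number of (key, value) pairs currently stored.
    public long getCount() {
        return props.size();
    }

    // Iterate over the stored keys, as required by Iterable<String>.
    @Override
    public Iterator<String> iterator() {
        return props.keySet().iterator();
    }

    public static void main(String[] args) {
        InMemoryBackEndProperties p = new InMemoryBackEndProperties();
        p.setValue("pig.exec.type", "mapreduce");
        Object old = p.setValue("pig.exec.type", "local");
        System.out.println(old);          // setValue returns the old value
        System.out.println(p.getCount()); // one key is stored
    }
}
```

Note how `setValue` returns the previous binding, matching the Javadoc above; `java.util.Map.put` already has exactly these semantics, which keeps the sketch trivial.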
=== Data Storage ===
This is a possible API for a generic interface that abstracts on the actual details used to store/persist collections of objects.
{{{
package org.apache.pig.datastorage;
import org.apache.pig.backend.PigBackEndProperties;
import java.io.Serializable;
import java.util.Map;
import java.net.URI;
/**
* Abstraction for a generic property object that can be
* used to specify configuration information, stats...
*/
public interface DataStorageProperties extends PigBackEndProperties {
...
}
}}}
{{{
package org.apache.pig.datastorage;
public interface DataStorage {
/**
* Placeholder for possible initialization activities.
*/
public void init();
/**
* Cleans up and releases resources.
*/
public void close();
/**
* Provides configuration information about the storage itself.
* For instance global data-replication policies if any, default
* values, ... Some of such values could be overridden at a finer
* granularity (e.g. on a specific object in the Data Storage)
*
* @return - configuration information
*/
public DataStorageProperties getConfiguration();
/**
* Provides a way to change configuration parameters
* at the Data Storage level. For instance, change the
* data replication policy.
*
* @param newConfiguration - the new configuration settings
* @throws DataStorageConfigurationException when configuration
*         conflicts are detected
*
*/
public void updateConfiguration(DataStorageProperties
newConfiguration)
throws DataStorageConfigurationException;
/**
* Provides statistics on the Storage: capacity values, how much
* storage is in use...
* @return statistics on the Data Storage
*/
public DataStorageProperties getStatistics();
/**
* Creates an entity handle for an object (no containment
* relation)
*
* @param name of the object
* @return an object descriptor
* @throws DataStorageException if name does not conform to naming
* convention enforced by the Data Storage.
*/
public DataStorageElementDescriptor asElement(String name)
throws DataStorageException;
/**
* Creates an entity handle for a container.
*
* @param name of the container
* @return a container descriptor
* @throws DataStorageException if name does not conform to naming
* convention enforced by the Data Storage.
*/
public DataStorageContainerDescriptor asContainer(String name)
throws DataStorageException;
}
}}}
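The two factory methods at the end of the interface can be illustrated with a hypothetical local-file-system back-end that resolves names into handles; `LocalDataStorage`, the `java.io.File` stand-in for a descriptor, and the naming check are all illustrative assumptions, not Pig code.

```java
import java.io.File;

// Hypothetical sketch of how a local back-end might implement the
// asElement / asContainer factory methods of DataStorage.
public class LocalDataStorage {
    private final File root;

    public LocalDataStorage(String rootPath) {
        this.root = new File(rootPath);
    }

    // Resolve a name into a handle for a plain element (no containment
    // relation); IllegalArgumentException stands in for DataStorageException
    // when the name violates the storage's naming convention.
    public File asElement(String name) {
        if (name == null || name.isEmpty() || name.contains("\0")) {
            throw new IllegalArgumentException("invalid element name: " + name);
        }
        return new File(root, name);
    }

    // Resolve a name into a container handle; same naming rules apply.
    public File asContainer(String name) {
        return asElement(name);
    }

    public static void main(String[] args) {
        LocalDataStorage ds = new LocalDataStorage("/tmp/pig");
        System.out.println(ds.asElement("part-00000").getPath());
    }
}
```

The point of the factory pattern is that the front-end never constructs descriptors directly; it only hands names to the `DataStorage`, which enforces its own naming convention before returning a handle.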
=== Data Storage Descriptors ===
{{{
package org.apache.pig.datastorage;
public interface DataStorageElementDescriptor
        extends Comparable<DataStorageElementDescriptor> {
/**
* Opens a stream to which an entity can be written.
*
* @param configuration information at the object level
* @return stream where to write
* @throws DataStorageException
*/
public DataStorageOutputStream create(
DataStorageProperties configuration)
throws DataStorageException;
/**
* Copy entity from an existing one, possibly residing in a
* different Data Storage.
*
* @param dstName name of entity to create
* @param dstConfiguration configuration for the new entity
* @param removeSrc if src entity needs to be removed after copying it
* @throws DataStorageException for instance, configuration
* information for new entity is not compatible with
* configuration information at the Data
* Storage level, user does not have privileges to read from
* source entity or write to destination storage...
*/
public void copy(DataStorageElementDescriptor dstName,
DataStorageProperties dstConfiguration,
boolean removeSrc)
throws DataStorageException;
/**
* Open for read a given entity
*
* @return entity to read from
* @throws DataStorageException e.g. entity does not exist...
*/
public DataStorageInputStream open() throws DataStorageException;
/**
* Open an element in the Data Storage with support for random access
* (seek operations).
*
* @return a seekable input stream
* @throws DataStorageException
*/
public DataStorageSeekableInputStream sopen()
throws DataStorageException;
/**
* Checks whether the entity exists or not
*
* @return true if entity exists, false otherwise.
*/
public boolean exists();
/**
* Changes the name of an entity in the Data Storage
*
* @param newName new name of entity
* @throws DataStorageException
*/
public void rename(DataStorageElementDescriptor newName)
throws DataStorageException;
/**
* Remove entity from the Data Storage.
*
* @throws DataStorageException
*/
public void delete() throws DataStorageException;
/**
* Retrieve configuration information for entity
* @return configuration
*/
public DataStorageProperties getConfiguration();
/**
* Update configuration information for this entity
*
* @param newConfig configuration
* @throws DataStorageException
*/
public void updateConfiguration(DataStorageProperties newConfig)
throws DataStorageException;
/**
* List entity statistics
* @return DataStorageProperties
*/
public DataStorageProperties getStatistics();
}
}}}
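The element-descriptor lifecycle above (create, open, exists, rename, delete) can be sketched against the local file system; the `LocalElement` class and its use of `java.nio.file` in place of the `DataStorage*Stream` types are hypothetical simplifications, not the actual Pig API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical local-file sketch of DataStorageElementDescriptor's lifecycle.
public class LocalElement {
    private Path path;

    public LocalElement(String name) {
        this.path = Paths.get(name);
    }

    // create(): write the entity's content (a byte write stands in for
    // opening a DataStorageOutputStream).
    public void create(byte[] data) throws IOException {
        Files.write(path, data);
    }

    // open(): read the entity back (stands in for DataStorageInputStream).
    public byte[] open() throws IOException {
        return Files.readAllBytes(path);
    }

    // exists(): true if the entity is present in the storage.
    public boolean exists() {
        return Files.exists(path);
    }

    // rename(): change the entity's name within the storage.
    public void rename(String newName) throws IOException {
        Path target = Paths.get(newName);
        Files.move(path, target);
        path = target;
    }

    // delete(): remove the entity from the storage.
    public void delete() throws IOException {
        Files.delete(path);
    }

    public static void main(String[] args) throws IOException {
        LocalElement e = new LocalElement("element.tmp");
        e.create("hello".getBytes());
        System.out.println(e.exists());
        e.rename("renamed.tmp");
        System.out.println(new String(e.open()));
        e.delete();
        System.out.println(e.exists());
    }
}
```

A real back-end would return stream objects rather than byte arrays, but the state transitions (absent, created, renamed, deleted) are the ones the interface above describes.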
{{{
package org.apache.pig.datastorage;
import org.apache.pig.datastorage.DataStorageElementDescriptor;
public interface DataStorageContainerDescriptor
extends DataStorageElementDescriptor,
Iterable<DataStorageElementDescriptor> {
}
}}}
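The container interface adds nothing beyond its two supertypes: a container is itself an element, and it is iterable over the elements it holds. A minimal sketch of that dual role follows; `SimpleContainer`, its `String` children, and the path-joining convention are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the DataStorageContainerDescriptor idea: an
// entity that is also Iterable over the entities it contains. Children
// are plain Strings here instead of element descriptors.
public class SimpleContainer implements Iterable<String> {
    private final String name;
    private final List<String> children = new ArrayList<>();

    public SimpleContainer(String name) {
        this.name = name;
    }

    // Register a child element under this container's name.
    public void add(String child) {
        children.add(name + "/" + child);
    }

    // Iterating the container yields its children, which is how a
    // front-end could enumerate, say, the parts of a map-reduce output.
    @Override
    public Iterator<String> iterator() {
        return children.iterator();
    }

    public static void main(String[] args) {
        SimpleContainer c = new SimpleContainer("/data/logs");
        c.add("part-00000");
        c.add("part-00001");
        for (String child : c) {
            System.out.println(child);
        }
    }
}
```

Modeling containment through `Iterable` keeps the front-end code generic: the same for-each loop works whether the children live in HDFS, the local file system, or another back-end entirely.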