Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2009/11/25 17:13:30 UTC

[Pig Wiki] Update of "LoadStoreRedesignProposal" by jeff zhang

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreRedesignProposal" page has been changed by jeff zhang.
The comment on this change is: I think here it should be hadoop rather than pig according to the context.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=37&rev2=38

--------------------------------------------------

  
   1. The Slicer interface is redundant.  Remove it and allow users to directly use Hadoop !InputFormats in Pig (a sketch of such a loader follows this list).
   1. It is not currently easy to use a separate !OutputFormat for a !StoreFunc.  This should be made easy to allow users to store data into locations other than HDFS.
-  1. Currently users who wish to operate on both Pig and Map-Reduce are required to write a Hadoop !InputFormat and !OutputFormat as well as Pig load and store functions.  While Pig load and store functions will always be necessary to take the most advantage of Pig, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost.
+  1. Currently users who wish to operate on both Pig and Map-Reduce are required to write a Hadoop !InputFormat and !OutputFormat as well as Pig load and store functions.  While Pig load and store functions will always be necessary to take the most advantage of Hadoop, it would be good for users to be able to use Hadoop !InputFormat and !OutputFormat classes directly to minimize the data interchange cost.
   1. New storage formats such as Zebra are being implemented for Hadoop that include metadata such as the schema.  The !LoadFunc interface needs to allow Pig to obtain this metadata.  There is a describeSchema call in the current interface.  More functions may be necessary.
   1. These new storage formats also plan to support pushing of, at least, projection and selection into the storage layer.  Pig needs to be able to query loaders to determine what, if any, pushdown capabilities they support and then make use of those capabilities (see the pushdown sketch after this list).
   1. There already exists one metadata system in Hadoop (Hive's metastore) and there is a proposal to add another (Owl).  Pig needs to be able to query these metadata systems for information about data to be read.  It also needs to be able to record information to these metadata systems when writing data.  The load and store functions are a reasonable place to do these operations since that is the point at which Pig is reading and writing data.  This will also allow Pig to read and write data from and to multiple metadata stores within a single Pig Latin script if that is desired.
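
As a rough illustration of the !InputFormat point above, the sketch below shows a load function that delegates all input handling to Hadoop's !TextInputFormat and only converts each record into a single-field tuple.  The class name !LineLoader is made up for this example, and the method names (getInputFormat, setLocation, prepareToRead, getNext) are assumptions about the shape of the revised !LoadFunc interface, not a final API.

{{{
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical loader: all splitting is left to the Hadoop InputFormat,
// so no Pig-specific Slicer is needed.
public class LineLoader extends LoadFunc {
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    private RecordReader<?, Text> reader;

    @Override
    public InputFormat getInputFormat() throws IOException {
        // The Hadoop InputFormat is used as-is.
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Pig hands over the location from the LOAD statement; pass it to Hadoop.
        FileInputFormat.setInputPaths(job, location);
    }

    @SuppressWarnings("unchecked")
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = (RecordReader<?, Text>) reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;  // end of this split
            }
            // Wrap the line of text in a one-field tuple.
            Tuple t = tupleFactory.newTuple(1);
            t.set(0, reader.getCurrentValue().toString());
            return t;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
}}}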
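
To make the pushdown point concrete, here is a similarly hypothetical sketch of how a loader might advertise and honor projection pushdown through a capability-query interface.  The !LoadPushDown name and its getFeatures/pushProjection methods are assumptions about what such an interface could look like; !ColumnAwareLoader simply extends the !LineLoader sketch and records which columns Pig actually needs (a real loader would consult that list in getNext).

{{{
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pig.LoadPushDown;
import org.apache.pig.impl.logicalLayer.FrontendException;

// Hypothetical loader that tells Pig it can handle projection itself.
public class ColumnAwareLoader extends LineLoader implements LoadPushDown {
    private List<Integer> requiredColumns;

    @Override
    public List<OperatorSet> getFeatures() {
        // Pig queries this to learn which operations the loader can absorb.
        return Collections.singletonList(OperatorSet.PROJECTION);
    }

    @Override
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList)
            throws FrontendException {
        // Remember the columns Pig needs; getNext would read only these.
        requiredColumns = new ArrayList<Integer>();
        for (RequiredField field : requiredFieldList.getFields()) {
            requiredColumns.add(field.getIndex());
        }
        // Returning true tells Pig the projection was honored, so it can drop
        // its own projection of these fields.
        return new RequiredFieldResponse(true);
    }
}
}}}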