Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2010/10/28 23:20:54 UTC

[Pig Wiki] Update of "HowlJournal" by AlanGates

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "HowlJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowlJournal?action=diff&rev1=2&rev2=3

--------------------------------------------------

  || Read from Hive                                                     || Not yet released  ||          ||
  || Support pushdown of columns to be projected into storage format    || Not yet released  ||          ||
  || Support for RCFile storage                                         || Not yet released  ||          ||
+ || Add a CLI                                                          || Not yet released  ||          ||
+ || Partition Pruning                                                  || Not yet released  ||          ||
+ || Support data read across partitions with different storage formats || Not yet released  ||          ||
  
  == Work in Progress ==
  
  || Feature           || Description ||
- || Add a CLI         || This will allow users to use Howl without installing all of Hive.  The syntax will match that of Hive's DDL. ||
- || Partition pruning || Currently, when asked to return information about a table Hive's metastore returns all partitions in the table.  This has a couple of issues.  One, for tables with large numbers of partitions it means the metadata operation of fetching information about the table is very expensive.  Two, it makes more sense to have the partition pruning logic in one place (Howl) rather than in Hive, Pig, and MR. ||
+ || Authentication    || Integrate Howl with security work done on Hadoop so that users can be properly authenticated. ||
+ || Authorization     || See HowlAuthorizationProposal ||
  
  
  == Proposed Work ==
- '''Authentication'''<<BR>> Integrate Howl with security work done on Hadoop so that users can be properly authenticated.
+ The following describes tasks proposed for future work on Howl.  They are ordered by what we currently believe to be their priority, with the most important tasks listed first.
  
+ '''Support for more file formats'''<<BR>> At least one row format and one text format need to be supported.
- '''Authorization'''<<BR>> The initial proposal is to use HDFS permissions to determine whether Howl operations can be executed.  For example, it would not be possible to drop a table unless the user had write permissions on the directory holding that table.  We need to determine how to extend this model to data not stored in HDFS (e.g. Hbase) and objects that do not exist in HDFS (e.g. views).  See HowlSecurity for more information.
- 
- '''Dynamic Partitioning'''<<BR>> Currently Howl can only store data into one partition at a time.  It needs to support
- spraying to multiple partitions in one write.
- 
- '''Non-partition Predicate Pushdown'''<<BR>> Since in the future storage formats (such as RCFile) should support predicate pushdown, Howl needs to be able to push predicates into the storage layer when appropriate.
  
  '''Notification'''<<BR>> Add the ability for systems such as workflow systems to be notified when new data arrives in Howl.  This will be designed around a few systems receiving notifications, not large numbers of users receiving notifications (i.e. we will not be building a general purpose publish/subscribe system).  One solution to this might be an RSS feed or similar simple service.
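+
+ A rough sketch of what a consuming system might do if the RSS/Atom route is taken.  The feed URL and entry layout below are made up for illustration; Howl does not expose such a feed today.
+
+ {{{
+ import java.io.InputStream;
+ import java.net.URL;
+ import javax.xml.parsers.DocumentBuilderFactory;
+ import org.w3c.dom.Document;
+ import org.w3c.dom.NodeList;
+
+ /** Hypothetical consumer of a Howl "new data" feed (URL and layout assumed). */
+ public class HowlFeedPoller {
+     public static void main(String[] args) throws Exception {
+         URL feed = new URL("http://howl.example.com/feeds/default/weblogs.atom");
+         InputStream in = feed.openStream();
+         Document doc = DocumentBuilderFactory.newInstance()
+                 .newDocumentBuilder().parse(in);
+         in.close();
+         // Each entry might name a newly committed partition, e.g. date=20101028/region=us.
+         NodeList entries = doc.getElementsByTagName("entry");
+         for (int i = 0; i < entries.getLength(); i++) {
+             System.out.println("new data: " + entries.item(i).getTextContent().trim());
+         }
+     }
+ }
+ }}}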
  
- '''Schema Evolution'''<<BR>>  Currently schema evolution in Hive is limited to adding columns at the end of the non-partition keys columns.  It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or making it so that new partitions for a table no longer contain a given column.
+ '''Allow specification of general storage type'''<<BR>> Currently Hive allows the user to specify specific storage formats for a table.  For example, the user can say `STORED AS RCFILE`.  We would like to enable users to select general storage types (columnar, row, or text) without needing to know the underlying format being used.  Thus it would be legal to say `STORED AS ROW` and let the administrators decide whether sequence file or tfile is used to store data in row format.
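+
+ As a rough sketch of the intent (the class and property names below are hypothetical, not an existing Howl API), the general type would be resolved to a concrete format from administrator-supplied defaults:
+
+ {{{
+ import java.util.Properties;
+
+ /** Illustrative resolution of a general storage type to a concrete format. */
+ public class StorageTypeResolver {
+     enum GeneralType { ROW, COLUMNAR, TEXT }
+
+     /** adminDefaults stands in for site configuration maintained by administrators. */
+     static String resolve(GeneralType type, Properties adminDefaults) {
+         switch (type) {
+             case ROW:      // e.g. sequence file or tfile, whichever the admin picked
+                 return adminDefaults.getProperty("howl.storage.row", "SequenceFile");
+             case COLUMNAR: // e.g. RCFile
+                 return adminDefaults.getProperty("howl.storage.columnar", "RCFile");
+             default:       // TEXT
+                 return "TextFile";
+         }
+     }
+
+     public static void main(String[] args) {
+         Properties defaults = new Properties();
+         defaults.setProperty("howl.storage.row", "TFile");
+         // A table declared STORED AS ROW would then be created as a tfile table.
+         System.out.println(resolve(GeneralType.ROW, defaults));
+     }
+ }
+ }}}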
  
- '''Support data read across partitions with different storage formats'''<<BR>> This work is done except that only one storage format is currently supported.
+ '''Mark a set of partitions done''' <<BR>> Often users create a collection of data sets all together, though different sets may be completed at different times.  For example, users might partition their web server logs by date and region.  Some users may wish to read only a particular region and are not interested in waiting until all of the regions are completed.  Others will want to wait until all regions are completed before beginning processing.  Since all partitions are committed individually, users have no way to know when all partitions for the day are present.  A way is needed for the writer to signal that all partitions with a given key value (such as date = today) are complete, so that users waiting for the entire collection can begin.  This will need to be propagated through to the notification system.
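+
+ A minimal sketch of the kind of signal the writer would send; the `HowlClient` interface and `markPartitionSetDone` method shown here do not exist and are only illustrative:
+
+ {{{
+ import java.util.HashMap;
+ import java.util.Map;
+
+ /** Hypothetical call declaring that all partitions sharing a key value are complete. */
+ public class MarkPartitionsDone {
+     interface HowlClient {
+         void markPartitionSetDone(String db, String table, Map<String, String> partialSpec);
+     }
+
+     static void signalDayComplete(HowlClient client) {
+         Map<String, String> spec = new HashMap<String, String>();
+         spec.put("date", "20101028");            // all regions for this date are now present
+         client.markPartitionSetDone("default", "weblogs", spec);
+         // Readers waiting on date=20101028, and the notification system, would
+         // be released once this call succeeds.
+     }
+ }
+ }}}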
  
- '''Support for more file formats'''<<BR>> Additional file formats such as sequence file, text, etc. need to be added.
+ '''Data Import/Export''' <<BR>> Howl currently provides a single input and output format (or, for Pig and Hive, a loader and a !SerDe) that can be used for any data in Howl.  However, users would like to be able to take this data out of Howl in preparation for moving it off the grid.  They would also like to be able to prepare data for import into Howl when they are running jobs that may not be able to interact with Howl.  An import/export format will be defined that allows data to be imported into, exported from, and replicated between Howl instances.  This format will provide an !InputFormat and !OutputFormat as well as a Pig load and store function and a Hive !SerDe.  The collections of data created by these tools will contain schema information, storage information (that is, what underlying format the data is in, how it is compressed, etc.), and sufficient metadata to recreate the table in another Howl instance.
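+
+ A sketch of the kind of manifest such an export bundle might carry so the table can be recreated on import.  The file name and keys are illustrative only, since the format has not yet been defined:
+
+ {{{
+ import java.io.FileOutputStream;
+ import java.util.Properties;
+
+ /** Writes an illustrative export manifest: schema, storage details, partition keys. */
+ public class ExportManifestSketch {
+     public static void main(String[] args) throws Exception {
+         Properties manifest = new Properties();
+         manifest.setProperty("table", "weblogs");
+         manifest.setProperty("schema", "ts:bigint,url:string,region:string");
+         manifest.setProperty("storage.format", "RCFile");
+         manifest.setProperty("storage.compression", "gzip");
+         manifest.setProperty("partition.keys", "date,region");
+         FileOutputStream out = new FileOutputStream("_howl_export_manifest");
+         manifest.store(out, "metadata needed to recreate the table in another Howl instance");
+         out.close();
+     }
+ }
+ }}}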
+ 
+ '''Data Compaction''' <<BR>> Very frequently users wish to store data in a very fine grained manner because their queries tend to access only specific partitions of the data.  Consider, for example, a user who downloads logs from the website for all twenty countries it operates in, every hour, keeps those logs for a year, and has one hundred part files for each hourly download.  That is 20 countries x 24 hours x 365 days x 100 files = 17,520,000 files for just this one input.  This places a significant burden on the namenode.  A way is needed to compact these into a larger file while preserving the ability to address individual partitions.  This compaction may be done when the file is being written, done soon after the data is written, or done at some later point.  As an example of the last case, consider the hourly data above.  For the first few days hourly data may have significant value.  After a week, it is less likely that users will be interested in any given hour of data, so the hourly data may be compacted into daily data after a week.  A small performance degradation will be acceptable to achieve this compaction.  Hadoop archives (har) will be evaluated for implementing this feature.  Whether this compaction is automatically initiated by Howl or requires user or administrator initiation is TBD.
+ 
+ '''Dynamic Partitioning'''<<BR>> Currently Howl can only store data into one partition at a time.  It needs to support spraying to multiple partitions in one write.
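+
+ The idea reduced to a toy example; a real implementation would live inside Howl's output format rather than in user code, and the directory layout below is only illustrative:
+
+ {{{
+ import java.io.BufferedWriter;
+ import java.io.File;
+ import java.io.FileWriter;
+ import java.util.HashMap;
+ import java.util.Map;
+
+ /** Sprays one record stream into multiple partitions in a single write. */
+ public class DynamicPartitionSpray {
+     public static void main(String[] args) throws Exception {
+         String[][] records = { {"20101028", "us", "rec1"}, {"20101028", "eu", "rec2"} };
+         Map<String, BufferedWriter> writers = new HashMap<String, BufferedWriter>();
+         for (String[] rec : records) {
+             String partition = "date=" + rec[0] + "/region=" + rec[1];
+             BufferedWriter out = writers.get(partition);
+             if (out == null) {                   // open one writer per partition seen
+                 new File(partition).mkdirs();
+                 out = new BufferedWriter(new FileWriter(partition + "/part-00000"));
+                 writers.put(partition, out);
+             }
+             out.write(rec[2]);
+             out.newLine();
+         }
+         for (BufferedWriter out : writers.values()) out.close();
+     }
+ }
+ }}}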
  
  '''Utility APIs'''<<BR>> Grid managers will want to build tools that use Howl to help manage their grids.  For example, one might build a tool to do replication between two grids.  Such tools will want to use Howl's metadata.  Howl needs to provide an appropriate API for these types of tools.
  
+ '''Pushing filters into storage formats''' <<BR>> In columnar storage formats, performance can be improved when a row selection predicate is evaluated against the relevant columns before the remaining columns are decompressed and deserialized and the row is constructed.  When the filter itself can be applied to a compressed and serialized version of the column, the performance boost is significant.  When the underlying storage format supports these optimizations, Howl needs to push the filters from Pig and Hive down into it.  Columnar storage formats that Howl commonly uses will also need to be modified to support these features.
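+
+ A toy illustration of the benefit: the filter is evaluated against one column, and the rest of the row is assembled only for rows that pass.  Real formats would additionally avoid decompressing and deserializing the filtered-out values.
+
+ {{{
+ /** Evaluates a predicate on a single column before materializing the row. */
+ public class ColumnFilterSketch {
+     public static void main(String[] args) {
+         int[] ages = {23, 41, 17, 65};              // the column the predicate needs
+         String[] names = {"a", "b", "c", "d"};      // other columns, touched only on a match
+         for (int row = 0; row < ages.length; row++) {
+             if (ages[row] < 21) continue;           // filter before building the row
+             System.out.println(names[row] + "\t" + ages[row]);
+         }
+     }
+ }
+ }}}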
+ 
+ '''Separate compression for separate columns''' <<BR>> One of the values of columnar compression is the ability to select compression formats that are optimal for different columns in the data.  Howl needs to support a variety of data specific compression formats and allow users to select different formats for different columns in a table.
+ 
+ '''Indices for sorted tables''' <<BR>> Providing the first record in each block for a sorted table enables a number of performance optimizations in the query engine accessing the data (such as Pig's merge join).  In Howl's standard formats we may need to provide this functionality.  It is also possible that the index functionality already being added to Hive could be used for this.
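+
+ For illustration, locating the block at which to begin a scan is a binary search over the first key of each block; the keys and block layout below are made up:
+
+ {{{
+ import java.util.Arrays;
+
+ /** Finds the block whose key range could contain a given key in a sorted table. */
+ public class SortedBlockIndex {
+     public static void main(String[] args) {
+         long[] firstKeyPerBlock = {0, 1000, 5000, 9000};    // one entry per block
+         long searchKey = 4200;
+         int pos = Arrays.binarySearch(firstKeyPerBlock, searchKey);
+         int block = pos >= 0 ? pos : -(pos + 2);            // last block whose first key <= searchKey
+         System.out.println("start scan at block " + block); // prints 1
+     }
+ }
+ }}}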
+ 
+ '''Schema Evolution'''<<BR>>  Currently schema evolution in Hive is limited to adding columns at the end of the non-partition-key columns.  It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or making it so that new partitions of a table no longer contain a given column.
+