Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/06 23:26:55 UTC

[Pig Wiki] Update of "HowlJournal" by AlanGates

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "HowlJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowlJournal?action=diff&rev1=4&rev2=5

--------------------------------------------------

  == Work in Progress ==
  
  || Feature           || Description ||
- || Authentication    || Integrate Howl with security work done on Hadoop so that users can be properly authenticated. ||
+ || Authentication    || See HowlAuthentication ||
  || Authorization     || See HowlAuthorizationProposal ||
+ || Data Import/Export || See HowlImportExport ||
  
  
  == Proposed Work ==
@@ -34, +35 @@

  '''Allow specification of general storage type'''<<BR>> Currently Hive allows the user to specify a particular storage format for a table.  For example, the user can say `STORED AS RCFILE`.  We would like to enable users to select a general storage type (columnar, row, or text) without needing to know the underlying format being used.  Thus it would be legal to say `STORED AS ROW` and let the administrators decide whether sequence file or tfile is used to store data in row format.
  
  '''Mark a set of partitions done''' <<BR>> Users often create a collection of related data sets, though the different sets may be completed at different times.  For example, users might partition their web server logs by date and region.  Some users may wish to read only a particular region and are not interested in waiting until all of the regions are complete.  Others will want to wait until all regions are complete before beginning processing.  Since partitions are committed individually, Howl gives users no way to know when all partitions for the day are present.  A way is needed for the writer to signal that all partitions with a given key value (such as date = today) are complete, so that users waiting for the entire collection can begin.  This signal will need to be propagated through to the notification system.
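  
  A minimal sketch of how the writer side might look, assuming a hypothetical `HowlClient` class with `addPartition` and `markPartitionsDone` methods (none of these names exist in Howl today; they are illustrative only):
  
  {{{
  // Hypothetical writer-side flow: HowlClient, addPartition, and
  // markPartitionsDone are made-up names, not existing Howl APIs.
  Configuration conf = new Configuration();
  HowlClient client = HowlClient.connect(conf);
  
  // Each region's partition is committed individually as it finishes.
  client.addPartition("weblogs", "date=20101206/region=us", "/data/us");
  client.addPartition("weblogs", "date=20101206/region=eu", "/data/eu");
  
  // Once every region for the day is present, mark the key value done;
  // this would also fire an event through the notification system so
  // readers waiting on the whole day can begin.
  client.markPartitionsDone("weblogs", "date=20101206");
  }}}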
- 
- '''Data Import/Export''' <<BR>> Howl currently provides a single input and output format (or loader or serde) that can be used for any data in Howl.  However, users would like to be able to take this data out of Howl in preparation for moving it off the grid.  They would also like to be able to prepare data for import into Howl when they are running jobs that may not be able to interact with Howl.  An import/export format will be defined that allows data to be imported into, exported from, and replicated between Howl instances.  This format will provide an !InputFormat and !OutputFormat as well as a Pig load and store function and a Hive !SerDe.  The collections of data created by these tools will contain schema information, storage information (that is, what underlying format is the data in, how is it compressed, etc.), and sufficient metadata to create it in another Howl instance.
  
  '''Data Compaction''' <<BR>> Users very frequently wish to store data in a fine grained manner because their queries tend to access only specific partitions of the data.  Consider, for example, a user who downloads logs from the website for all twenty countries it operates in, every hour, keeps those logs for a year, and has one hundred part files per hour.  That is 20 countries x 24 hours x 365 days x 100 part files = 17,520,000 files for just this one input.  This places a significant burden on the namenode.  A way is needed to compact these into larger files while preserving the ability to address individual partitions.  This compaction may be done while the data is being written, soon after it is written, or at some later point.  For the last case, consider the hourly data again: for the first few days hourly data may have significant value, but after a week it is less likely that users will be interested in any given hour, so the hourly data may be compacted into daily data.  A small performance degradation will be acceptable to achieve this compaction.  har (Hadoop archives) will be evaluated for implementing this feature; see the sketch below.  Whether this compaction is automatically initiated by Howl or requires user or administrator initiation is TBD.
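  
  To illustrate why har is attractive here: files inside a Hadoop archive remain individually addressable through the standard Hadoop !FileSystem API.  A minimal read sketch (the archive and paths are made-up examples):
  
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  
  public class HarReadExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // A day of hourly logs compacted into one archive (made-up path);
      // the har filesystem resolves files inside the archive, so each
      // partition's part files stay individually addressable.
      Path part = new Path(
          "har:///data/weblogs/2010-12-06.har/region=us/hour=00/part-00000");
      FileSystem fs = part.getFileSystem(conf);
      FSDataInputStream in = fs.open(part);
      // ... read the part file as usual, then close it ...
      in.close();
    }
  }
  }}}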
  
@@ -53, +52 @@

  
  '''Schema Evolution'''<<BR>>  Currently schema evolution in Hive is limited to adding columns at the end of the existing non-partition columns.  It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or allowing new partitions of a table to no longer contain a given column.
  
+ '''Support for streaming'''<<BR>>  Currently Howl cannot be used from Hadoop streaming jobs.  It should be.
+ 
+ '''Integration with HBase'''<<BR>>  Currently Howl does not support HBase tables.  It needs storage drivers so that !HowlInputFormat and !HowlLoader can do bulk reads and !HowlOutputFormat and !HowlStorage can do bulk writes.  We also need to understand what interface, if any, it makes sense for Howl to expose for point reads and writes on Howl tables that use HBase as a storage mechanism.
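+ 
+ A possible shape for such a driver, assuming a hypothetical Java interface (the interface and method names below are illustrative guesses, not existing Howl APIs):
+ 
+ {{{
+ import org.apache.hadoop.mapreduce.InputFormat;
+ import org.apache.hadoop.mapreduce.OutputFormat;
+ 
+ // Hypothetical sketch of what an HBase storage driver would need to
+ // supply; the names here are illustrative, not an existing Howl API.
+ public interface HBaseStorageDriver {
+   // Bulk reads: hand HowlInputFormat an HBase-backed InputFormat
+   // (for example, one wrapping HBase's TableInputFormat), plus a
+   // mapping from HBase column families/qualifiers to Howl columns.
+   InputFormat<?, ?> getInputFormat();
+ 
+   // Bulk writes: hand HowlOutputFormat an HBase-backed OutputFormat
+   // (for example, one wrapping HBase's TableOutputFormat).
+   OutputFormat<?, ?> getOutputFormat();
+ }
+ }}}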
+