Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2011/03/10 00:34:59 UTC

[jira] Commented: (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

    [ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004873#comment-13004873 ] 

Alan Gates commented on PIG-1891:
---------------------------------

It sounds like what you want is a way for the storage function to inject code into OutputCommitter.cleanupJob, the final task Hadoop runs after all reduces have finished.  (See http://hadoop.apache.org/common/docs/r0.20.2/api/index.html for details.)

Since Hadoop's OutputFormat already offers this hook, we have left it there rather than mimic the interface in Pig.  So the way to do this is to have the OutputFormat you are using return an OutputCommitter that does the commit (or whatever else you need) in cleanupJob.  You do not have to write a whole new OutputFormat for this.  You can extend whatever OutputFormat you are using, along with the OutputCommitter it returns.  Your extended OutputFormat should return your OutputCommitter from getOutputCommitter, and your OutputCommitter should only override cleanupJob, which should call super.cleanupJob and then do whatever you want to do.
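Here is a minimal sketch of that shape.  It assumes the new (org.apache.hadoop.mapreduce) API with TextOutputFormat as the base; the class names and the commitToDb() helper are made up for illustration, not part of any Pig or Hadoop API:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CommittingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException {
    // Hand back our extended committer instead of the stock FileOutputCommitter.
    return new CommittingCommitter(getOutputPath(context), context);
  }

  private static class CommittingCommitter extends FileOutputCommitter {

    CommittingCommitter(Path outputPath, TaskAttemptContext context)
        throws IOException {
      super(outputPath, context);
    }

    @Override
    public void cleanupJob(JobContext context) throws IOException {
      // Let the parent finish its normal cleanup first, then run the
      // job-level work; Hadoop calls this once, after all reduces finish.
      super.cleanupJob(context);
      commitToDb(context);
    }

    // Hypothetical helper: issue the final commit/upload against the
    // external system, driven by whatever is in the job configuration.
    private void commitToDb(JobContext context) throws IOException {
    }
  }
}

Your StoreFunc would then return this class from getOutputFormat(), and the job-level commit happens without any change to Pig itself.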


> Enable StoreFunc to make intelligent decision based on job success or failure
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1891
>                 URL: https://issues.apache.org/jira/browse/PIG-1891
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alex Rovner
>
> We are in the process of using Pig for various data processing and component integration. Here is where we feel Pig storage funcs fall short:
> They are not aware of whether the overall job has succeeded. This creates a problem for storage funcs that need to "upload" results into another system:
> a DB, FTP, another file system, etc.
> I looked at the DBStorage in the piggybank (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup), and what I see is essentially a mechanism that, for each task, does the following (sketched in code after this quoted description):
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes a commit or rollback depending on whether the task was successful.
> While this approach works great at the task level, it does not work at all at the job level.
> If certain tasks succeed but the overall job fails, partial records will get uploaded into the DB.
> Any ideas on a workaround?
> Our current workaround is fairly ugly: we created a Java wrapper that launches Pig jobs and then uploads to the DBs once the Pig job is successful. While the approach works, it is not really integrated into Pig.
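For reference, the per-task mechanism described in steps 1-4 of the quoted description looks roughly like the sketch below.  This is a simplified illustration assuming JDBC, not DBStorage's actual code; DbRecordWriter and the SQL wiring are hypothetical names:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.pig.data.Tuple;

public class DbRecordWriter extends RecordWriter<NullWritable, Tuple> {
  private final Connection connection;
  private final PreparedStatement insert;

  public DbRecordWriter(String jdbcUrl, String insertSql) throws SQLException {
    connection = DriverManager.getConnection(jdbcUrl);  // 1. open a connection
    connection.setAutoCommit(false);                    // 2. open a transaction
    insert = connection.prepareStatement(insertSql);
  }

  @Override
  public void write(NullWritable key, Tuple value) throws IOException {
    try {
      for (int i = 0; i < value.size(); i++) {
        insert.setObject(i + 1, value.get(i));
      }
      insert.addBatch();                                // 3. batch the record
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException {
    try {
      insert.executeBatch();
      connection.commit();                              // 4. commit on task success
    } catch (SQLException e) {
      try {
        connection.rollback();                          // 4. roll back on failure
      } catch (SQLException ignored) {
        // nothing more we can do here
      }
      throw new IOException(e);
    } finally {
      try {
        connection.close();
      } catch (SQLException ignored) {
      }
    }
  }
}

Each task commits or rolls back only its own batch, which is exactly why a job-level hook such as cleanupJob is needed on top of it.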

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira