Posted to dev@pig.apache.org by "Antonio Magnaghi (JIRA)" <ji...@apache.org> on 2007/11/30 17:40:43 UTC

[jira] Commented: (PIG-32) Abstraction Layer to decouple Pig from Back-End

    [ https://issues.apache.org/jira/browse/PIG-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547198 ] 

Antonio Magnaghi commented on PIG-32:
-------------------------------------

Attaching some feedback from Trevor (Galago project)

________________________________________
From: Trevor Strohman [mailto:strohman@cs.umass.edu] 
Sent: Wednesday, November 21, 2007 5:01 PM
To: Antonio Magnaghi
Subject: Re: galago


Antonio,

Wow, you've done a lot of work here.  This looks great.  I hope you end up with lots of other backends.

I'll just give you comments as I read the PigAbstractionLayer page.  Feel free to e-mail again if you want different (or more) information.

The DataStorage interface looks great.  I'd consider using this in Galago for file storage (I've always wanted to make the Hadoop DFS an option for data storage in Galago).  However, since Galago uses the native filesystem right now, I wouldn't have to implement this interface.

Should addFromResource be a part of the configuration interface?  This isn't something I want to implement myself (assuming there will be lots of these PigBackEndProperties objects around).  Maybe you could have a standard implementation that I could use.
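
One way to read this suggestion is a shared base class that back-ends extend, so that resource loading is written once. The Java below is only a sketch under assumptions: apart from the name addFromResource, every name and signature here (BackEndProperties, AbstractBackEndProperties, addFromStream, get, GalagoBackEndProperties) is hypothetical and is not taken from the PigAbstractionLayer page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical configuration interface; only the name addFromResource comes
// from the discussion above. The signatures are assumptions.
interface BackEndProperties {
    void addFromResource(String resourceName) throws IOException;
    String get(String key);
}

// Shared base class: back-ends extend it and inherit the resource loading.
abstract class AbstractBackEndProperties implements BackEndProperties {
    private final Properties props = new Properties();

    @Override
    public void addFromResource(String resourceName) throws IOException {
        ClassLoader loader = AbstractBackEndProperties.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourceName);
            }
            addFromStream(in);
        }
    }

    // Merges key=value pairs from a stream; later keys override earlier ones.
    public void addFromStream(InputStream in) throws IOException {
        props.load(in);
    }

    @Override
    public String get(String key) {
        return props.getProperty(key);
    }
}

// A back-end (hypothetical name) adds only what is specific to it.
class GalagoBackEndProperties extends AbstractBackEndProperties {
}
```

With something like this in place, a back-end author would implement nothing beyond an empty subclass unless the back-end needs extra settings.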

A suggestion for the getStatistics() method in ExecutionEngine: perhaps part of the statistics object could be a set of objects that can be tracked using Java Management Extensions (JMX).  At some point I plan to make Galago JMX-ready, which would give you a lot of information about current running jobs, etc.
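
As a rough illustration of the JMX idea, the statistics object could hand back ObjectNames of registered MBeans that a caller can then query. The Java sketch below uses only the standard javax.management API; the class, the attribute names, and the "pig.sketch" domain are hypothetical, and nothing here reflects the actual ExecutionEngine interface.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

class EngineStatisticsSketch {
    // Management view of one job (hypothetical attributes).
    // JMX requires the management interface to be public.
    public interface JobStatsMBean {
        String getJobName();
        long getTuplesProcessed();
    }

    public static class JobStats implements JobStatsMBean {
        private final String jobName;
        private final AtomicLong tuples = new AtomicLong();

        public JobStats(String jobName) { this.jobName = jobName; }

        // Called by the back-end as work progresses.
        public void recordTuples(long n) { tuples.addAndGet(n); }

        @Override public String getJobName() { return jobName; }
        @Override public long getTuplesProcessed() { return tuples.get(); }
    }

    // Registers the job's statistics and returns the name a caller can track.
    static ObjectName register(MBeanServer server, JobStats stats) throws Exception {
        ObjectName name = new ObjectName(
            "pig.sketch:type=JobStats,name=" + ObjectName.quote(stats.getJobName()));
        server.registerMBean(new StandardMBean(stats, JobStatsMBean.class), name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        JobStats stats = new JobStats("wordcount");
        ObjectName name = register(server, stats);
        stats.recordTuples(42);
        // Any JMX client can read the attribute by name while the job runs.
        System.out.println(server.getAttribute(name, "TuplesProcessed"));
        server.unregisterMBean(name);
    }
}
```

A statistics object built this way would only need to carry the set of ObjectNames; tools such as jconsole could then watch running jobs without any Pig-specific code.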

You might want a method on ExecutionEnginePhysicalPlan that allows the caller to block waiting for completion.
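
A minimal Java sketch of what such a blocking call might look like, assuming the plan runs asynchronously once started. The interface and method names (PhysicalPlanSketch, execute, awaitCompletion) are hypothetical and are not the proposed ExecutionEnginePhysicalPlan API; a CountDownLatch is just one simple way to implement the wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical plan interface with blocking variants for the caller.
interface PhysicalPlanSketch {
    void execute();                                   // starts the plan, returns at once
    void awaitCompletion() throws InterruptedException;
    boolean awaitCompletion(long timeout, TimeUnit unit) throws InterruptedException;
}

class LatchBackedPlan implements PhysicalPlanSketch {
    private final CountDownLatch done = new CountDownLatch(1);
    private final Runnable work;

    LatchBackedPlan(Runnable work) { this.work = work; }

    @Override
    public void execute() {
        Thread runner = new Thread(() -> {
            try {
                work.run();
            } finally {
                done.countDown();   // release waiters even if the work throws
            }
        }, "plan-runner");
        runner.setDaemon(true);
        runner.start();
    }

    // Blocks until the plan has finished.
    @Override
    public void awaitCompletion() throws InterruptedException {
        done.await();
    }

    // Blocks up to the timeout; returns false if the plan is still running.
    @Override
    public boolean awaitCompletion(long timeout, TimeUnit unit) throws InterruptedException {
        return done.await(timeout, unit);
    }
}
```

The timed variant lets a caller poll for progress or give up, while the untimed one matches the simple "run and wait" use.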

==
I think the API as specified seems like something I could implement for Galago.

It's not clear from the API how a new LogicalPlan can refer to results generated by previous LogicalPlans that have already been compiled and executed.  I never made this work in Galago with the current implementation.  Also, it seems like you might want to be able to ask a completed PhysicalPlan for a particular computed tuple stream.  Again, I never figured out how to do that in the current Pig (at least not in a way that would work with Galago).

Trevor

On Nov 21, 2007, at 5:57 PM, Antonio Magnaghi wrote:


Hi Trevor,
 
I would like to follow up on the email exchange we had a few weeks ago about Galago and Pig.
 
In particular, at YRL we have decided to suggest, inside the Apache Pig incubator, some extensions to Pig that could make it easier to integrate Pig with different back-ends. The main approach is outlined at: http://wiki.apache.org/pig/PigAbstractionLayer.
 
At this point, I'm collecting some initial feedback before starting the actual implementation. Do you have any requirements that would allow Pig to better support Galago? As you have direct experience with some of the issues involved, I'd appreciate it if you could share some of your thoughts on the proposed design.
 
Thanks,
Antonio
 
________________________________________
From: Trevor Strohman [mailto:strohman@cs.umass.edu] 
Sent: Tuesday, October 23, 2007 10:59 AM
To: Antonio Magnaghi
Subject: Re: galago
 
 
Antonio,
 
I'll do my best to answer your questions by e-mail, but you might also find it useful to download the Galago code and my version of Pig.  In the galago/java/pig-galago directory, you'll find a file called "pig-galago.patch" which contains all of the changes I made to the current Pig distribution to make it work with Galago.  The whole download is here:
            http://galagosearch.org/downloads
 
Before I start, I should mention that Galago is primarily meant to be a search engine toolkit, kind of like Lucene.  It happens to have its own MapReduce-like job execution engine called TupleFlow, and Pig can run on top of that.  TupleFlow has some similarities to the Pig model, in that strongly-typed tuples flow between computational steps to create an answer.



1.) the high-level language the user can utilize to specify the tuple-processing;
 
Users usually create TupleFlow jobs by creating an XML job specification.  The job specification allows the user to describe what Java objects will be used and how they should be connected together in an execution graph.  TupleFlow then schedules these components out onto computational nodes, sometimes with the help of a job execution system (like Grid Engine or Condor).  TupleFlow is probably most similar to Microsoft's Dryad system.
 
In the Pig/Galago port, I translate Pig jobs into TupleFlow jobs in code, so no XML files are made.
 
2.) how the tuple processing specification is mapped to a physical processing plan;
 
I know that Pig has both a high-level and a low-level specification.  Compared to Pig, TupleFlow really only has a low-level processing language.  Pig is TupleFlow's high-level language (when I want one).



3.) what type of platform/computational model is used.
 
I'm not exactly sure how to answer this.  It's all in Java, and objects are passed around using files on a shared file system.  Unlike Pig, Galago typically creates a different Java class for each type of tuple sent through the system.  When running Pig jobs on Galago, I've hacked Galago a little bit to allow it to use Pig's Tuple type.
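
The difference between one class per tuple type and a single generic tuple type can be sketched in Java as follows. This is purely illustrative: GenericTuple and WordCountTuple are hypothetical names, and neither is Pig's Tuple class nor an actual Galago class.

```java
import java.util.Arrays;
import java.util.List;

// Generic style: one tuple class for everything, fields accessed by position.
final class GenericTuple {
    private final List<Object> fields;

    GenericTuple(Object... fields) {
        this.fields = Arrays.asList(fields.clone());
    }

    Object get(int index) { return fields.get(index); }
    int size() { return fields.size(); }
}

// Per-type style: a dedicated class for each tuple shape, fields accessed by name.
final class WordCountTuple {
    final String word;
    final long count;

    WordCountTuple(String word, long count) {
        this.word = word;
        this.count = count;
    }

    // Bridging between the two styles is what an adapter layer would have to do.
    GenericTuple toGeneric() {
        return new GenericTuple(word, count);
    }

    static WordCountTuple fromGeneric(GenericTuple t) {
        return new WordCountTuple((String) t.get(0), (Long) t.get(1));
    }
}
```

The per-type style gives compile-time checking of field names and types, while the generic style lets a front-end build tuples of any shape at run time.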



I understand that the data/tuple processing is carried out by porting/extending the Pig (or a Pig-like) front-end to run on a back-end that is not Hadoop/map-reduce. Is this correct?
 
Yes, that's right.  It might be best to think of TupleFlow as something that implements most of the physical layer of Pig as well as a MapReduce execution engine.
 
Trevor
 

> Abstraction Layer to decouple Pig from Back-End
> -----------------------------------------------
>
>                 Key: PIG-32
>                 URL: https://issues.apache.org/jira/browse/PIG-32
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Antonio Magnaghi
>            Assignee: Antonio Magnaghi
>
> I'm opening a new issue to track the development work to support an abstraction layer for Pig as defined at http://wiki.apache.org/pig/PigAbstractionLayer

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.