Posted to user@spark.apache.org by Sameer Tilak <ss...@live.com> on 2014/03/06 22:11:28 UTC

Pig on Spark

Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can benefit significantly from using Spark, but we would like to keep using our existing Pig scripts.

Re: Pig on Spark

Posted by Julien Le Dem <ju...@twitter.com>.
Hi Mayur,
Are you going to the Pig meetup this afternoon?
http://www.meetup.com/PigUser/events/160604192/
Aniket and I will be there.
We would be happy to chat about Pig-on-Spark.




Re: Pig on Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
Hi Lin,
We are working on getting Pig on Spark functional with Spark 0.8.0. Have you
got it working on any Spark version?
Also, what functionality works on it?
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Hi Sameer,
>
> Lin (cc'ed) could also give you some updates about Pig on Spark
> development on her side.
>
> Best,
> Xiangrui
>
> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ss...@live.com> wrote:
> > Hi Mayur,
> > We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the
> > goal is to get Spork set up next month. I will keep you posted. Can you
> > please keep me informed about your progress as well.
> >
> > ________________________________
> > From: mayur.rustagi@gmail.com
> > Date: Mon, 10 Mar 2014 11:47:56 -0700
> >
> > Subject: Re: Pig on Spark
> > To: user@spark.apache.org
> >
> >
> > Hi Sameer,
> > Did you make any progress on this? My team is also trying it out and would
> > love to know some details on your progress.
> >
> > Mayur Rustagi
> > Ph: +1 (760) 203 3257
> > http://www.sigmoidanalytics.com
> > @mayur_rustagi
> >
> >
> >
> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ss...@live.com> wrote:
> >
> > Hi Aniket,
> > Many thanks! I will check this out.
> >
> > ________________________________
> > Date: Thu, 6 Mar 2014 13:46:50 -0800
> > Subject: Re: Pig on Spark
> > From: aniket486@gmail.com
> > To: user@spark.apache.org; tgraves_cs@yahoo.com
> >
> >
> > There is some work to make this work on YARN at
> > https://github.com/aniket486/pig. (So, compile Pig with ant
> > -Dhadoopversion=23)
> >
> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
> > find out what sort of env variables you need (sorry, I haven't been able to
> > clean this up -- it's in progress). There are a few known issues with this;
> > I will work on fixing them soon.
> >
> > Known issues-
> > 1. Limit does not work (spork-fix)
> > 2. Foreach requires turning off the schema-tuple backend (should be a
> > Pig JIRA)
> > 3. Algebraic UDFs don't work (spork-fix in progress)
> > 4. Group by rework (to avoid OOMs)
> > 5. UDF Classloader issue (requires SPARK-1053, then you can put
> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
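[Editor's sketch] Item 4 above (the group-by rework to avoid OOMs) comes down to how values are combined per key. The contrast can be illustrated with plain Scala collections (the data here is made up); Spark's `reduceByKey` follows the second, incremental style:

```scala
// Two ways to aggregate per key. Grouping first materializes every value
// for a key in memory (the OOM risk mentioned in the thread); folding
// combines values incrementally, keeping only one running total per key.
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("a", 4))

// groupByKey-style: builds the full list of values per key before summing.
val grouped: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// reduceByKey-style: a running total per key, never the full value list.
val reduced: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) {
    case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v)
  }
```

Both produce the same result; only the peak memory per key differs.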
> >
> > ~Aniket
> >
> >
> >
> >
> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tg...@yahoo.com> wrote:
> >
> > I had asked a similar question on the dev mailing list a while back (Jan
> > 22nd).
> >
> > See the archives:
> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser and
> > look for spork.
> >
> > Basically Matei said:
> >
> > Yup, that was it, though I believe people at Twitter picked it up again
> > recently. I'd suggest
> > asking Dmitriy if you know him. I've seen interest in this from several
> > other groups, and
> > if there's enough of it, maybe we can start another open source repo to
> > track it. The work
> > in that repo you pointed to was done over one week, and already had most
> of
> > Pig's operators
> > working. (I helped out with this prototype over Twitter's hack week.)
> That
> > work also calls
> > the Scala API directly, because it was done before we had a Java API; it
> > should be easier
> > with the Java one.
> >
> >
> > Tom
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > "...:::Aniket:::... Quetzalco@tl"
> >
> >
>

Re: Pig on Spark

Posted by suman bharadwaj <su...@gmail.com>.
Hey Mayur,

We use HiveColumnarLoader and XMLLoader. Are these working as well?

I will try a few things regarding porting the Java MR jobs.

Regards,
Suman Bharadwaj S


On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi <ma...@gmail.com> wrote:

> Right now UDFs are not working. They are at the top of the list, though; you
> should be able to use them soon :)
> Is there any other functionality of Pig you use often, apart from the usual
> suspects?
>
> Existing Java MR jobs would be an easier move. Are these Cascading jobs or
> single map-reduce jobs? If single, you should be able to write a Scala
> wrapper to call your map & reduce functions with some glue and leave your
> core code as-is. It would be interesting to see an actual example and get
> it to work.
>
> Regards
> Mayur
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj <su...@gmail.com> wrote:
>
>> We are currently in the process of converting Pig and Java map-reduce
>> jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence
>> I was checking whether we can leverage Spork without converting to Spark
>> jobs.
>>
>> And is there any way I can port my existing Java MR jobs to Spark?
>> I know this thread has a different subject; let me know if I need to ask
>> this question in a separate thread.
>>
>> Thanks in advance.
>>
>>
>> On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi <ma...@gmail.com> wrote:
>>
>>> UDF,
>>> Generate,
>>> & many, many more are not working :)
>>>
>>> Several of them do work: joins, filters, group by, etc.
>>> I am translating the ones we need and would be happy to get help on others.
>>> I will host a JIRA to track them if you are interested.
>>>
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <su...@gmail.com> wrote:
>>>
>>>> Are all the features available in Pig working in Spork? For example, UDFs?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>>
>>>>> There are two benefits I get as of now:
>>>>> 1. Most of the time a lot of customers don't want the full power; they
>>>>> want something dead simple with which they can write DSL-style data flows.
>>>>> They end up using Hive for a lot of ETL just because it's SQL and they
>>>>> understand it. Pig is close, and it wraps a lot of framework-level
>>>>> semantics away from the user and lets them focus on the data flow.
>>>>> 2. Some have codebases in Pig already and are just looking to run them
>>>>> faster. I have yet to benchmark that on Pig on Spark.
>>>>>
>>>>> I agree that Pig on Spark cannot solve a lot of problems, but it can solve
>>>>> some without forcing the end customer to do anything even close to coding.
>>>>> I believe there is quite some value in making Spark accessible to a larger
>>>>> audience.
>>>>> At the end of the day, to each his own :)
>>>>>
>>>>> Regards
>>>>> Mayur
>>>>>
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlapudi@gmail.com> wrote:
>>>>>
>>>>>> This seems like an interesting question.
>>>>>>
>>>>>> I love Apache Pig. It is so natural and the language flows with nice
>>>>>> syntax.
>>>>>>
>>>>>> While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot for
>>>>>> analytics and provided feedback to the Pig team to add much more
>>>>>> functionality when it was at version 0.7. Lots of new functionality has
>>>>>> been added since.
>>>>>> At the end of the day, Pig is a DSL for data flows, and there will always
>>>>>> be gaps and enhancements. I often wondered: is a DSL the right way to
>>>>>> solve data-flow problems? Maybe not; we need complete language constructs.
>>>>>> We may have found the answer in Scala. With Scala's dynamic compilation,
>>>>>> we can write much more powerful constructs than any DSL can provide.
>>>>>>
>>>>>> If I am a new organization and beginning to choose, I would go with
>>>>>> Scala.
>>>>>>
>>>>>> Here is the example:
>>>>>>
>>>>>> #!/bin/sh
>>>>>> # Shell header: re-executes this file with the scala script runner.
>>>>>> exec scala "$0" "$@"
>>>>>> !#
>>>>>> // YOUR DSL GOES HERE, BUT IN SCALA!
>>>>>>
>>>>>> You have DSL-like scripting plus functional and complete language power!
>>>>>> If we can improve the first 3 lines, there you go: you have a most
>>>>>> powerful DSL for solving data problems.
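[Editor's sketch] To illustrate Bharath's point, here is how a small Pig-style data flow reads when written directly against Scala collections; the records and field names are invented, and the same shape carries over to Spark's RDD API:

```scala
// A toy "Pig-like" data flow in plain Scala. The comments show the
// hypothetical Pig Latin each step corresponds to.
case class Visit(user: String, url: String, time: Long)

val visits = Seq(
  Visit("alice", "a.com", 1L),
  Visit("bob",   "b.com", 2L),
  Visit("alice", "b.com", 3L))

// recent = FILTER visits BY time > 1;
val recent = visits.filter(_.time > 1L)

// grouped = GROUP recent BY user;
// counts  = FOREACH grouped GENERATE group, COUNT(recent);
val counts: Map[String, Int] =
  recent.groupBy(_.user).map { case (user, vs) => user -> vs.size }
```

The data flow stays declarative, but you keep the whole language around it for UDF-style logic.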
>>>>>>
>>>>>> -Bharath

Re: Pig on Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
Right now UDFs are not working. They are at the top of the list, though; you
should be able to use them soon :)
Is there any other functionality of Pig you use often, apart from the usual
suspects?

Existing Java MR jobs would be an easier move. Are these Cascading jobs or
single map-reduce jobs? If single, you should be able to write a Scala
wrapper to call your map & reduce functions with some glue and leave your
core code as-is. It would be interesting to see an actual example and get
it to work.
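[Editor's sketch] A minimal version of that wrapper idea, with all names and data invented: factor the old job's map and reduce logic into plain Scala functions, then call them from Spark or, as here, from an ordinary collection to check the logic locally.

```scala
// Hypothetical core of an existing single-stage word-count MR job,
// factored into plain functions that a Spark job (or a test) can reuse.
object WordCountCore {
  // What the old Mapper.map emitted for each input line.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // What the old Reducer.reduce did with the values for one key.
  def reducer(values: Iterable[Int]): Int = values.sum
}

// In Spark this becomes roughly:
//   sc.textFile(path).flatMap(WordCountCore.mapper)
//     .groupByKey().mapValues(WordCountCore.reducer)
// Locally, the same functions run on an ordinary collection:
val lines = Seq("pig on spark", "spark on yarn")
val counts: Map[String, Int] = lines
  .flatMap(WordCountCore.mapper)
  .groupBy(_._1)
  .map { case (word, pairs) => word -> WordCountCore.reducer(pairs.map(_._2)) }
```

The core logic never mentions Hadoop or Spark, so the framework glue stays a few lines.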

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>




Re: Pig on Spark

Posted by suman bharadwaj <su...@gmail.com>.
We are currently in the process of converting Pig and Java map-reduce jobs
to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was
checking whether we can leverage Spork without converting to Spark jobs.

And is there any way I can port my existing Java MR jobs to Spark?
I know this thread has a different subject; let me know if I need to ask this
question in a separate thread.

Thanks in advance.


>>>>> >
>>>>> > We are using to Pig to build our data pipeline. I came across Spork
>>>>> -- Pig
>>>>> > on Spark at: https://github.com/dvryaboy/pig and not sure if it is
>>>>> still
>>>>> > active.
>>>>> >
>>>>> > Can someone please let me know the status of Spork or any other
>>>>> effort that
>>>>> > will let us run Pig on Spark? We can significantly benefit by using
>>>>> Spark,
>>>>> > but we would like to keep using the existing Pig scripts.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > "...:::Aniket:::... Quetzalco@tl"
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Pig on Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
UDF, Generate & many many more are not working yet :)

Several of them work: joins, filters, group by, etc.
I am translating the ones we need, and would be happy to get help on others.
I will host a JIRA to track them if you are interested.


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <su...@gmail.com>wrote:

> Are all the features available in PIG working in SPORK ?? Like for eg:
> UDFs ?
>
> Thanks.

Re: Pig on Spark

Posted by suman bharadwaj <su...@gmail.com>.
Are all the features available in Pig working in Spork? For example, UDFs?

Thanks.



Re: Pig on Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
There are two benefits I get as of now:
1. Most of the time a lot of customers don't want the full power; they
want something dead simple with which they can do DSL. They end up using
Hive for a lot of ETL just because it's SQL and they understand it. Pig is
close, and it wraps a lot of framework-level semantics away from the user,
letting him focus on data flow.
2. Some have codebases in Pig already and are just looking to do it
faster. I am yet to benchmark that for Pig on Spark.

I agree that Pig on Spark cannot solve a lot of problems, but it can solve
some without forcing the end customer to do anything even close to coding.
I believe there is quite some value in making Spark accessible to a larger
audience.
End of the day, to each his own :)

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>




Re: Pig on Spark

Posted by Eugen Cepoi <ce...@gmail.com>.
It depends; personally, I have the opposite opinion.

IMO expressing pipelines in a functional language feels natural; you just
have to get used to the language (Scala).

Testing Spark jobs is easy, whereas testing a Pig script is much harder
and less natural.

If you want a higher-level language that deals with RDDs for you, you can
use Spark SQL:
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Of course you can express fewer things this way, but if you have some
complex logic, I think it makes sense to write a classic Spark job, which
will be more robust in the long term.
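One way to see why Spark jobs are easy to test: the transformation logic
can be factored into a plain function over ordinary Scala collections, so
a unit test needs no cluster. A minimal sketch (hypothetical names, not
code from this thread):

```scala
// Hypothetical example: keep the job's transformation logic cluster-free.
object WordCountLogic {
  // Mirrors rdd.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _),
  // but over a plain Seq, so it can be asserted on directly.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("pig on spark", "spark on yarn"))
    assert(counts("spark") == 2 && counts("on") == 2)
    println(counts)
  }
}
```

In a real job the same function would be applied inside an RDD
transformation, while the unit test exercises it on literal data.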


2014-04-25 15:30 GMT+02:00 Mark Baker <di...@acm.org>:

> I've only had a quick look at Pig, but it seems that a declarative
> layer on top of Spark couldn't be anything other than a big win, as it
> allows developers to declare *what* they want, permitting the compiler
> to determine how best poke at the RDD API to implement it.
>
> In my brief time with Spark, I've often thought that it feels very
> unnatural to use imperative code to declare a pipeline.
>

Re: Pig on Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
One core segment that frequently asks for systems like Pig & Hive is
analysts who want to deal with data. The key place I see Pig fitting in is
letting non-developers deal with data at scale, freeing up developers to
deal with code and UDFs rather than managing day-to-day dataflow changes
and updates.
A byproduct of this is that big data computation is made available to
folks beyond those who know what Maven & sbt are :)


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>




Re: Pig on Spark

Posted by Bharath Mundlapudi <mu...@gmail.com>.
>> I've only had a quick look at Pig, but it seems that a declarative
>> layer on top of Spark couldn't be anything other than a big win, as it
>> allows developers to declare *what* they want, permitting the compiler
>> to determine how best poke at the RDD API to implement it.

The devil is in the details - allowing developers to declare *what* they
want seems impractical in a declarative world, since we are bound by the
DSL constructs. The workaround, or rather hack, is to use UDFs to get full
language constructs. Some problems are hard; you will have to twist your
mind to solve them in a restrictive way. At those times, we wish we had
complete language power.

Having been in the Big Data world for a short time (7 years), I have seen
enough problems with Hive/Pig. All I am providing here is a thought to
spark the Spark community to think beyond declarative constructs.

I am sure there is a place for Pig and Hive.

-Bharath





Re: Pig on Spark

Posted by Michael Armbrust <mi...@databricks.com>.
On Fri, Apr 25, 2014 at 6:30 AM, Mark Baker <di...@acm.org> wrote:

> I've only had a quick look at Pig, but it seems that a declarative
> layer on top of Spark couldn't be anything other than a big win, as it
> allows developers to declare *what* they want, permitting the compiler
> to determine how best poke at the RDD API to implement it.
>

Having Pig too would certainly be a win, but Spark SQL
<http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html>
is also a declarative layer on top of Spark.  Since the optimization is
lazy, you can chain multiple SQL statements in a row and still optimize
them holistically (similar to a Pig job).  Alpha version coming soon to a
Spark 1.0 release near you!

Spark SQL also lets you drop back into functional Scala when that is more
natural for a particular task.
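To give a feel for mixing the two styles, here is a hedged sketch modeled
on the alpha-era Spark SQL guide. It assumes a Spark 1.0 installation, an
existing SparkContext `sc`, and a hypothetical comma-separated
`people.txt`; it is illustrative, not runnable standalone:

```scala
// Sketch against the Spark SQL alpha (Spark 1.0); `sc` and the input
// file are assumptions, not part of this thread.
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext._

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")   // alpha-era API name

// Declarative step: optimization is lazy, so chained SQL can be
// optimized holistically.
val teenagers = sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")

// Drop back into functional Scala when that is more natural:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```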

Re: Pig on Spark

Posted by Mark Baker <di...@acm.org>.
I've only had a quick look at Pig, but it seems that a declarative
layer on top of Spark couldn't be anything other than a big win, as it
allows developers to declare *what* they want, permitting the compiler
to determine how best to poke at the RDD API to implement it.

In my brief time with Spark, I've often thought that it feels very
unnatural to use imperative code to declare a pipeline.

Re: Pig on Spark

Posted by Bharath Mundlapudi <mu...@gmail.com>.
This seems like an interesting question.

I love Apache Pig. It is so natural and the language flows with nice syntax.

While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot for
analytics and gave the Pig team feedback asking for much more
functionality back when it was at version 0.7. Lots of new functionality
is offered now.

At the end of the day, Pig is a DSL for data flows. There will always be
gaps and enhancements. I often wondered: is a DSL the right way to solve
data flow problems? Maybe not; we need complete language constructs. We
may have found the answer - Scala. With Scala's dynamic compilation, we
can write much more powerful constructs than any DSL can provide.

If I am a new organization and beginning to choose, I would go with Scala.

Here is the example:

#!/bin/sh
exec scala "$0" "$@"
!#
YOUR DSL GOES HERE BUT IN SCALA!

You get DSL-like scripting plus functional and complete language power! If
we can improve the first three lines, there you go: you have the most
powerful DSL for solving data problems.
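Concretely, the three-line header turns any Scala file into a script. A
hedged sketch of what "your DSL in Scala" might look like: a tiny
Pig-style dataflow (LOAD / FILTER / GROUP / SUM) over made-up data, not
taken from this thread:

```scala
#!/bin/sh
exec scala "$0" "$@"
!#
// Hypothetical dataflow: LOAD; FILTER BY v > 1; GROUP BY k; SUM(v)
val records  = List("a,1", "b,2", "a,3", "c,4")            // LOAD
val parsed   = records.map { line =>
  val Array(k, v) = line.split(","); (k, v.toInt)          // ... AS (k, v)
}
val filtered = parsed.filter { case (_, v) => v > 1 }      // FILTER
val summed   = filtered
  .groupBy(_._1)                                           // GROUP BY k
  .map { case (k, kvs) => (k, kvs.map(_._2).sum) }         // SUM(v)
summed.toSeq.sorted.foreach(println)
```

Save it, mark it executable, and run it directly (with scala on the
PATH); the whole language is available where a DSL would need a UDF.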

-Bharath






Re: Pig on Spark

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Sameer,

Lin (cc'ed) could also give you some updates about Pig on Spark
development on her side.

Best,
Xiangrui

On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ss...@live.com> wrote:
> Hi Mayur,
> We are planning to upgrade our distribution MR1> MR2 (YARN) and the goal is
> to get SPROK set up next month. I will keep you posted. Can you please keep
> me informed about your progress as well.
>
> ________________________________
> From: mayur.rustagi@gmail.com
> Date: Mon, 10 Mar 2014 11:47:56 -0700
>
> Subject: Re: Pig on Spark
> To: user@spark.apache.org
>
>
> Hi Sameer,
> Did you make any progress on this. My team is also trying it out would love
> to know some detail so progress.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
>
> On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ss...@live.com> wrote:
>
> Hi Aniket,
> Many thanks! I will check this out.
>
> ________________________________
> Date: Thu, 6 Mar 2014 13:46:50 -0800
> Subject: Re: Pig on Spark
> From: aniket486@gmail.com
> To: user@spark.apache.org; tgraves_cs@yahoo.com
>
>
> There is some work to make this work on yarn at
> https://github.com/aniket486/pig. (So, compile pig with ant
> -Dhadoopversion=23)
>
> You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
> find out what sort of env variables you need (sorry, I haven't been able to
> clean this up- in-progress). There are few known issues with this, I will
> work on fixing them soon.
>
> Known issues-
> 1. Limit does not work (spork-fix)
> 2. Foreach requires to turn off schema-tuple-backend (should be a pig-jira)
> 3. Algebraic udfs dont work (spork-fix in-progress)
> 4. Group by rework (to avoid OOMs)
> 5. UDF Classloader issue (requires SPARK-1053, then you can put
> pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
>
> ~Aniket
>
>
>
>
> On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tg...@yahoo.com> wrote:
>
> I had asked a similar question on the dev mailing list a while back (Jan
> 22nd).
>
> See the archives:
> http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser ->
> look for spork.
>
> Basically Matei said:
>
> Yup, that was it, though I believe people at Twitter picked it up again
> recently. I'd suggest
> asking Dmitriy if you know him. I've seen interest in this from several
> other groups, and
> if there's enough of it, maybe we can start another open source repo to
> track it. The work
> in that repo you pointed to was done over one week, and already had most of
> Pig's operators
> working. (I helped out with this prototype over Twitter's hack week.) That
> work also calls
> the Scala API directly, because it was done before we had a Java API; it
> should be easier
> with the Java one.
>
>
> Tom
>
>
>
> On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ss...@live.com> wrote:
> Hi everyone,
>
> We are using to Pig to build our data pipeline. I came across Spork -- Pig
> on Spark at: https://github.com/dvryaboy/pig and not sure if it is still
> active.
>
> Can someone please let me know the status of Spork or any other effort that
> will let us run Pig on Spark? We can significantly benefit by using Spark,
> but we would like to keep using the existing Pig scripts.
>
>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>
>

RE: Pig on Spark

Posted by Sameer Tilak <ss...@live.com>.
Hi Mayur,
We are planning to upgrade our distribution MR1 -> MR2 (YARN), and the goal is to get Spork set up next month. I will keep you posted. Can you please keep me informed about your progress as well?

From: mayur.rustagi@gmail.com
Date: Mon, 10 Mar 2014 11:47:56 -0700
Subject: Re: Pig on Spark
To: user@spark.apache.org

Hi Sameer,
Did you make any progress on this? My team is also trying it out and would love to know some details of your progress.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ss...@live.com> wrote:

Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket486@gmail.com
To: user@spark.apache.org; tgraves_cs@yahoo.com

There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile Pig with ant -Dhadoopversion=23.)

You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up; it's in progress). There are a few known issues with this; I will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off schema-tuple-backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)

~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tg...@yahoo.com> wrote:

I had asked a similar question on the dev mailing list a while back (Jan 22nd).

See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -> look for spork.

Basically Matei said:

Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week, and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ss...@live.com> wrote:

Hi everyone,

We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active.

Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.

-- 
"...:::Aniket:::... Quetzalco@tl"

Re: Pig on Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
Hi Sameer,
Did you make any progress on this? My team is also trying it out and would love
to know some details of your progress.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ss...@live.com> wrote:

> Hi Aniket,
> Many thanks! I will check this out.
>
> ------------------------------
> Date: Thu, 6 Mar 2014 13:46:50 -0800
> Subject: Re: Pig on Spark
> From: aniket486@gmail.com
> To: user@spark.apache.org; tgraves_cs@yahoo.com
>
>
> There is some work to make this work on yarn at
> https://github.com/aniket486/pig. (So, compile pig with ant
> -Dhadoopversion=23)
>
> You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
> find out what sort of env variables you need (sorry, I haven't been able to
> clean this up- in-progress). There are few known issues with this, I will
> work on fixing them soon.
>
> Known issues-
> 1. Limit does not work (spork-fix)
> 2. Foreach requires to turn off schema-tuple-backend (should be a pig-jira)
> 3. Algebraic udfs dont work (spork-fix in-progress)
> 4. Group by rework (to avoid OOMs)
> 5. UDF Classloader issue (requires SPARK-1053, then you can put
> pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
>
> ~Aniket
>
>
>
>
> On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tg...@yahoo.com> wrote:
>
> I had asked a similar question on the dev mailing list a while back (Jan
> 22nd).
>
> See the archives:
> http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser ->
> look for spork.
>
> Basically Matei said:
>
> Yup, that was it, though I believe people at Twitter picked it up again recently. I’d suggest
> asking Dmitriy if you know him. I’ve seen interest in this from several other groups, and
> if there’s enough of it, maybe we can start another open source repo to track it. The work
> in that repo you pointed to was done over one week, and already had most of Pig’s operators
> working. (I helped out with this prototype over Twitter’s hack week.) That work also calls
> the Scala API directly, because it was done before we had a Java API; it should be easier
> with the Java one.
>
>
> Tom
>
>
>
>   On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ss...@live.com>
> wrote:
>   Hi everyone,
>
> We are using to Pig to build our data pipeline. I came across Spork -- Pig
> on Spark at: https://github.com/dvryaboy/pig and not sure if it is still
> active.
>
> Can someone please let me know the status of Spork or any other effort
> that will let us run Pig on Spark? We can significantly benefit by using
> Spark, but we would like to keep using the existing Pig scripts.
>
>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>

RE: Pig on Spark

Posted by Sameer Tilak <ss...@live.com>.
Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket486@gmail.com
To: user@spark.apache.org; tgraves_cs@yahoo.com

There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile Pig with ant -Dhadoopversion=23.)

You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up; it's in progress). There are a few known issues with this; I will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off schema-tuple-backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)

~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tg...@yahoo.com> wrote:

I had asked a similar question on the dev mailing list a while back (Jan 22nd).

See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -> look for spork.

Basically Matei said:

Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week, and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ss...@live.com> wrote:

Hi everyone,

We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active.

Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.

-- 
"...:::Aniket:::... Quetzalco@tl"

Re: Pig on Spark

Posted by Aniket Mokashi <an...@gmail.com>.
There is some work to make this work on YARN at
https://github.com/aniket486/pig. (So, compile Pig with ant
-Dhadoopversion=23.)

You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
find out what sort of env variables you need (sorry, I haven't been able to
clean this up; it's in progress). There are a few known issues with this; I
will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off schema-tuple-backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put
pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)
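
[Editorial aside, not part of the original message: issue 5 above amounts to assembling one colon-separated classpath from the Pig jar plus the UDF jars. A minimal sketch of that assembly is below; every path is a placeholder, not a real layout.]

```shell
# Hypothetical helper: join the jars mentioned above into a single
# colon-separated classpath string. All paths are placeholders.
build_pig_classpath() {
  local IFS=':'      # "$*" joins its arguments with the first char of IFS
  printf '%s\n' "$*"
}

build_pig_classpath \
  /path/to/spark-assembly.jar \
  /path/to/piggybank.jar \
  /path/to/pig-withouthadoop.jar
# -> /path/to/spark-assembly.jar:/path/to/piggybank.jar:/path/to/pig-withouthadoop.jar
```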

~Aniket




On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tg...@yahoo.com> wrote:

> I had asked a similar question on the dev mailing list a while back (Jan
> 22nd).
>
> See the archives:
> http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser ->
> look for spork.
>
> Basically Matei said:
>
> Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest
> asking Dmitriy if you know him. I've seen interest in this from several other groups, and
> if there's enough of it, maybe we can start another open source repo to track it. The work
> in that repo you pointed to was done over one week, and already had most of Pig's operators
> working. (I helped out with this prototype over Twitter's hack week.) That work also calls
> the Scala API directly, because it was done before we had a Java API; it should be easier
> with the Java one.
>
>
> Tom
>
>
>
>   On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ss...@live.com>
> wrote:
>   Hi everyone,
>
> We are using to Pig to build our data pipeline. I came across Spork -- Pig
> on Spark at: https://github.com/dvryaboy/pig and not sure if it is still
> active.
>
> Can someone please let me know the status of Spork or any other effort
> that will let us run Pig on Spark? We can significantly benefit by using
> Spark, but we would like to keep using the existing Pig scripts.
>
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"

Re: Pig on Spark

Posted by Tom Graves <tg...@yahoo.com>.
I had asked a similar question on the dev mailing list a while back (Jan 22nd). 

See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -> look for spork.

Basically Matei said:

Yup, that was it, though I believe people at Twitter picked it up again recently. I’d suggest
asking Dmitriy if you know him. I’ve seen interest in this from several other groups, and
if there’s enough of it, maybe we can start another open source repo to track it. The work
in that repo you pointed to was done over one week, and already had most of Pig’s operators
working. (I helped out with this prototype over Twitter’s hack week.) That work also calls
the Scala API directly, because it was done before we had a Java API; it should be easier
with the Java one.

Tom



On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ss...@live.com> wrote:
 
 
Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active.

Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.

Re: Pig on Spark

Posted by lalit1303 <la...@gmail.com>.
Hi,

I have been following Aniket's Spork GitHub repository:
https://github.com/aniket486/pig
I have made all the changes mentioned in the recently modified pig-spark file.

I am using:
Hadoop 2.0.5-alpha
Spark 0.8.1-incubating
Mesos 0.16.0

## Pig variables
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export SPARK_YARN_APP_JAR=/home/ubuntu/pig/pig-withouthadoop.jar
export SPARK_JAVA_OPTS=" -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.dump"
export SPARK_JAR=/home/ubuntu/spark/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
export SPARK_MASTER=yarn-client
export SPARK_HOME=/home/ubuntu/spark
export SPARK_JARS=/home/ubuntu/pig/contrib/piggybank/java/piggybank.jar
export PIG_CLASSPATH=${SPARK_JAR}:${SPARK_JARS}:/home/ubuntu/mesos/build/src/mesos-0.16.0.jar:/home/ubuntu/pig/pig-withouthadoop.jar
export SPARK_PIG_JAR=/home/ubuntu/pig/pig-withouthadoop.jar


This works fine in MapReduce and local mode, but while running in Spark mode
I am facing the following error. The error comes after the job is submitted
and runs on the YARN master. Can you please tell me how to proceed?

########################################### error message ############################################################

ERROR 2998: Unhandled internal error. class org.apache.spark.util.InnerClosureFinder has interface org.objectweb.asm.ClassVisitor as super class

java.lang.IncompatibleClassChangeError: class org.apache.spark.util.InnerClosureFinder has interface org.objectweb.asm.ClassVisitor as super class
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
	at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
	at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:87)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:970)
	at org.apache.spark.rdd.RDD.map(RDD.scala:246)
	at org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:68)
	at org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:38)
	at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:212)
	at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:201)
	at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:201)
	at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:125)
	at org.apache.pig.PigServer.launchPlan(PigServer.java:1328)
	at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1310)
	at org.apache.pig.PigServer.storeEx(PigServer.java:993)
	at org.apache.pig.PigServer.store(PigServer.java:957)
	at org.apache.pig.PigServer.openIterator(PigServer.java:870)
	at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:729)
	at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:370)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
	at org.apache.pig.Main.run(Main.java:609)
	at org.apache.pig.Main.main(Main.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:622)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
================================================================================
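
[Editorial aside, not part of the original message: an IncompatibleClassChangeError on org.objectweb.asm.ClassVisitor is the classic symptom of mixed ASM versions on the classpath. In ASM 3.x ClassVisitor is an interface, while from ASM 4 it is an abstract class, so classes compiled against one break when the other is loaded first (Hadoop jars often bundle the older ASM). A quick, hypothetical way to see which jars on the classpath bundle the conflicting class is sketched below; the path in the example is illustrative.]

```python
import zipfile


def jars_bundling(class_entry, jar_paths):
    """Return the jars (zip archives) that contain the given class entry.

    More than one hit for 'org/objectweb/asm/ClassVisitor.class' across a
    classpath usually signals an ASM version conflict like the error above.
    """
    hits = []
    for path in jar_paths:
        try:
            with zipfile.ZipFile(path) as jar:
                if class_entry in jar.namelist():
                    hits.append(path)
        except (OSError, zipfile.BadZipFile):
            pass  # skip missing or non-jar files
    return hits


if __name__ == "__main__":
    # Illustrative path, mirroring the env settings earlier in the thread.
    print(jars_bundling(
        "org/objectweb/asm/ClassVisitor.class",
        ["/home/ubuntu/pig/pig-withouthadoop.jar"],
    ))
```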



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p3187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Pig on Spark

Posted by lalit1303 <la...@sigmoidanalytics.com>.
Hi,

We got Spork working on Spark 0.9.0.
Repository available at:
https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix

Please suggest your feedback.



-----
Lalit Yadav
lalit@sigmoidanalytics.com
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p4668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.