Posted to user@pig.apache.org by Yang <te...@gmail.com> on 2015/07/19 23:02:50 UTC

PIG and Spark actually similar?

Spark is very hot right now, but after reading the paper, I found it
surprisingly similar to PIG in concept: an RDD is just a relation/set in
PIG's terminology.

I think a great strength of Spark is that it tries to merge multiple
"narrow dependency" stages together to avoid too much I/O. Does PIG do that
too? Otherwise, I can't figure out what other major design differences
would lead to a huge performance difference, if Spark also uses on-disk
storage. The overhead of starting an MR task should not be that big.
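
To make the stage-merging idea concrete, here is a minimal pure-Python
sketch. It is not Spark or Pig code: FakeStorage, run_materialized and
run_fused are hypothetical names invented for illustration, and the
"storage" is only a counter. It compares writing every intermediate result
out between stages with chaining the same per-record stages into one pass.

```python
from typing import Callable, Iterable, Iterator, List

# A stage consumes a stream of records and yields a new stream. Each output
# record depends on a single input record, which is the "narrow" case.
Stage = Callable[[Iterable], Iterator]


class FakeStorage:
    """Hypothetical stand-in for a distributed store; it only counts writes."""

    def __init__(self) -> None:
        self.records_written = 0

    def write(self, records: Iterable) -> List:
        data = list(records)  # force evaluation, as a real write would
        self.records_written += len(data)
        return data


def parse(records: Iterable) -> Iterator:
    return (int(r) for r in records)


def keep_even(records: Iterable) -> Iterator:
    return (r for r in records if r % 2 == 0)


def square(records: Iterable) -> Iterator:
    return (r * r for r in records)


def run_materialized(data: Iterable, stages: List[Stage], storage: FakeStorage) -> List:
    """Write the full output of every stage before starting the next one."""
    current = list(data)
    for stage in stages:
        current = storage.write(stage(current))
    return current


def run_fused(data: Iterable, stages: List[Stage], storage: FakeStorage) -> List:
    """Chain all stages lazily and write only the final result."""
    stream = iter(data)
    for stage in stages:
        stream = stage(stream)
    return storage.write(stream)


raw = ["1", "2", "3", "4"]
pipeline = [parse, keep_even, square]

materialized_store = FakeStorage()
materialized_result = run_materialized(raw, pipeline, materialized_store)

fused_store = FakeStorage()
fused_result = run_fused(raw, pipeline, fused_store)
```

Both runs produce the same result, but the fused run writes only the final
records, while the stage-by-stage run also writes every intermediate
record. This models only the I/O between stages; it says nothing about
task startup cost or shuffles.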

Re: PIG and Spark actually similar?

Posted by Paul Houle <on...@gmail.com>.

There is a port of Pig that runs over Spark.

http://blog.cloudera.com/blog/2014/09/pig-is-flying-apache-pig-on-apache-spark/

The way I understand it, Pig analyzes the pipeline you give it and then
compiles the job so it runs on some fabric. Apache Tez is a computational
fabric that sits somewhere between Spark and the old Map/Reduce, in the
sense that Tez eliminates many of the extreme inefficiencies of Map/Reduce
by allowing sequences other than
[storage] -> [map] -> [reduce] -> [storage]

but Tez is otherwise a lot like Map/Reduce, whereas Spark offers an
in-memory execution model (as well as on-disk) and is different in deeper
ways.

It could be that Pig-over-Spark is less compelling than Pig-over-something
else because a Spark program is a lot more like a Pig program than an M/R
program is.

-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
https://legalentityidentifier.info/lei/lookup/

Re: PIG and Spark actually similar?

Posted by Russell Jurney <ru...@gmail.com>.

For a mix of the two, check out Pig on Spark, Spork:
https://github.com/sigmoidanalytics/spork

Both Pig and Spark have the same data flow abstraction and operators, so
you have to think the same way. The actual syntax of Spark/PySpark and Pig
is different. Pig with Python mixes well with PySpark. Pig can run via
MapReduce, Spark or Tez.
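
As a rough illustration of the shared data flow style, the sketch below
runs a filter / group / count pipeline in plain Python, with an
approximately equivalent Pig Latin or PySpark statement noted in the
comments beside each step. The helper functions, the relation names and
the (user, bytes) schema are all made up for the example. The comments are
only meant to show that the steps line up one to one; they are not tested
against either system.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Record = Tuple[str, int]  # hypothetical schema: (user, bytes)


def filter_by(records: List[Record], predicate: Callable[[Record], bool]) -> List[Record]:
    """Keep only the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def group_by(records: List[Record], key: Callable[[Record], Hashable]) -> Dict[Hashable, List[Record]]:
    """Collect records into one list per key."""
    groups: Dict[Hashable, List[Record]] = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)


def count_per_group(groups: Dict[Hashable, List[Record]]) -> Dict[Hashable, int]:
    """Reduce each group to its size."""
    return {k: len(v) for k, v in groups.items()}


logs: List[Record] = [("alice", 120), ("bob", 5), ("alice", 300), ("bob", 250)]

# Pig Latin (roughly):  big = FILTER logs BY bytes > 100;
# PySpark (roughly):    big = logs.filter(lambda r: r[1] > 100)
big = filter_by(logs, lambda r: r[1] > 100)

# Pig Latin (roughly):  by_user = GROUP big BY user;
#                       counts  = FOREACH by_user GENERATE group, COUNT(big);
# PySpark (roughly):    counts = big.map(lambda r: (r[0], 1)).reduceByKey(lambda a, b: a + b)
counts = count_per_group(group_by(big, lambda r: r[0]))
```

In all three notations the program is a chain of named relations, each
derived from the previous one by an operator, which is the sense in which
you "think the same way".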

Check out these three books on Pig (two of which I wrote):

http://chimera.labs.oreilly.com/books/1234000001811/index.html
http://shop.oreilly.com/product/mobile/0636920025054.do
http://shop.oreilly.com/product/mobile/0636920039006.do

Programming Pig introduces Pig Latin. Agile Data Science uses Pig to build
applications. Big Data for Chimps teaches analytic patterns in Pig.

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com