Posted to user@pig.apache.org by Yang <te...@gmail.com> on 2015/07/19 23:02:50 UTC
PIG and Spark actually similar?
Spark is very hot now, but after reading the paper, I found it surprisingly
similar to Pig in concept: the RDD is just a relation/set in Pig's
terminology.
I think a great strength of Spark is that it merges multiple
"narrow dependency" stages together to avoid excess I/O. Does Pig do that
too? Otherwise, I can't figure out what other major design differences
would lead to a huge performance difference, given that Spark also uses
on-disk storage. The overhead of starting an MR task should not be that big.
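To make the "narrow dependency" question concrete, here is a toy sketch in plain Python. It is not Spark or Pig code, and every name in it (the stage functions, run_materialized, run_pipelined) is made up for illustration. It contrasts running each stage as a separate pass that builds its whole output before the next stage starts (standing in for a job that writes to storage between steps) with chaining the same stages lazily so each record flows through all of them in one pass.

```python
# Toy model only: lists stand in for storage, generators for pipelined stages.

def double(records):
    for r in records:
        yield r * 2

def keep_multiples_of_three(records):
    for r in records:
        if r % 3 == 0:
            yield r

def add_one(records):
    for r in records:
        yield r + 1

def run_materialized(data, stages):
    """Run stages one at a time, building each full output before the next.

    Returns (result, intermediate_records), where intermediate_records counts
    the records that had to be held between stages (final output excluded).
    """
    current = list(data)
    intermediate_records = 0
    for i, stage in enumerate(stages):
        current = list(stage(current))  # whole output exists before next stage
        if i < len(stages) - 1:
            intermediate_records += len(current)
    return current, intermediate_records

def run_pipelined(data, stages):
    """Chain the stages lazily; each record passes through all stages in turn."""
    stream = iter(data)
    for stage in stages:
        stream = stage(stream)  # nothing executes until the stream is consumed
    return list(stream)

stages = [double, keep_multiples_of_three, add_one]
result, held = run_materialized(range(6), stages)
print(result, held)                      # same result, with intermediates held
print(run_pipelined(range(6), stages))   # same result, no intermediate lists
```

Both functions return the same result; the difference is only in how much intermediate data exists between stages, which is the cost the question above is asking about.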
Re: PIG and Spark actually similar?
Posted by Paul Houle <on...@gmail.com>.
There is a port of Pig that runs over Spark.
http://blog.cloudera.com/blog/2014/09/pig-is-flying-apache-pig-on-apache-spark/
The way I understand it, Pig will analyze the pipeline you give it and
then compile the job so it runs on some fabric. Apache Tez is a
computational fabric which is somewhere in between Spark and the old
Map/Reduce in the sense that Tez eliminates many of the extreme
inefficiencies of Map/Reduce by allowing sequences other than
[storage] -> [map] -> [reduce] -> [storage]
but Tez is otherwise a lot like Map/Reduce, whereas Spark offers an
in-memory execution model (as well as on-disk) and is different in deeper
ways.
It could be that Pig-over-Spark is less compelling than Pig-over-something
else because a Spark program is a lot more like a Pig program than an M/R
program is.
Re: PIG and Spark actually similar?
Posted by Russell Jurney <ru...@gmail.com>.
For a mix of the two, check out Pig on Spark, called Spork.
https://github.com/sigmoidanalytics/spork
Both Pig and Spark have the same data flow abstraction and operators. You
have to think the same way. The actual syntax of Spark/PySpark and that of
Pig are different. Pig with Python mixes well with PySpark. Pig can run via
MapReduce, Spark or Tez.
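As a rough illustration of "same data flow, different syntax", here is a word count written as a toy in plain Python, with what I believe are the corresponding Pig Latin and PySpark forms shown only as comments. The helper names (flat_map, group_by, word_count) and the file path in the comments are assumptions for the example, and the plain Python runs on a single machine, so it only mirrors the shape of the pipeline.

```python
from collections import defaultdict
from itertools import chain

# Pig Latin (illustrative, not run here):
#   lines   = LOAD 'input.txt' AS (line:chararray);
#   words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
#   grouped = GROUP words BY word;
#   counts  = FOREACH grouped GENERATE group, COUNT(words);
#
# PySpark RDD API (illustrative, not run here):
#   counts = (sc.textFile('input.txt')
#               .flatMap(lambda line: line.split())
#               .map(lambda word: (word, 1))
#               .reduceByKey(lambda a, b: a + b))

def flat_map(fn, records):
    """Apply fn to each record and flatten the results into one stream."""
    return chain.from_iterable(fn(r) for r in records)

def group_by(key_fn, records):
    """Collect records into bags keyed by key_fn(record)."""
    groups = defaultdict(list)
    for r in records:
        groups[key_fn(r)].append(r)
    return groups

def word_count(lines):
    words = flat_map(str.split, lines)        # tokenize and flatten
    grouped = group_by(lambda w: w, words)    # group by the word itself
    return {word: len(bag) for word, bag in grouped.items()}  # count each bag

print(word_count(["pig spark", "spark tez spark"]))
```

In all three forms the steps are the same: tokenize and flatten, group by word, then count each group.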
Check out these three books on Pig (two of which I wrote):
http://chimera.labs.oreilly.com/books/1234000001811/index.html
http://shop.oreilly.com/product/mobile/0636920025054.do
http://shop.oreilly.com/product/mobile/0636920039006.do
Programming Pig introduces Pig Latin. Agile Data Science uses Pig to build
applications. Big Data for Chimps teaches analytic patterns in Pig.