Posted to users@zeppelin.apache.org by moon soo Lee <mo...@apache.org> on 2015/10/01 09:24:17 UTC

Re: Pig Interpreter

Hi,

As far as I know, there's no ongoing work on a Pig interpreter. But there's
no reason not to have one. How about filing an issue for it?

Thanks,
moon
On 2015년 9월 23일 (수) at 오후 11:23 Michael Parco <33...@cardinalmail.cua.edu>
wrote:

> Is there any current work or plans for a Pig interpreter in Zeppelin?
>

Re: Pig Interpreter

Posted by Michael Parco <33...@cardinalmail.cua.edu>.
The syntax of Pig and Spark SQL (SQL in general) does share similar
features, but in general Pig is a script-based dataflow language as opposed
to an ad-hoc query language.

"In comparison to SQL, Pig

   1. uses lazy evaluation <https://en.wikipedia.org/wiki/Lazy_evaluation>,
   2. uses extract, transform, load
   <https://en.wikipedia.org/wiki/Extract,_transform,_load> (ETL),
   3. is able to store data at any point during a pipeline
   <https://en.wikipedia.org/wiki/Pipeline_%28software%29>,
   4. declares execution plans
   <https://en.wikipedia.org/wiki/Execution_plan>,
   5. supports pipeline splits, thus allowing workflows to proceed along
   DAGs <https://en.wikipedia.org/wiki/Directed_acyclic_graph> instead of
   strictly sequential pipelines."
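
To make points 3 and 5 of that list concrete, here is a small illustrative
Pig Latin sketch (the paths and schema are invented for the example):

```pig
-- Load raw events (path and schema are hypothetical).
events = LOAD 'hdfs:///data/events' USING PigStorage('\t')
         AS (user:chararray, action:chararray, amount:double);

-- Nothing has executed yet: Pig builds its plan lazily.
valid = FILTER events BY amount IS NOT NULL;

-- The pipeline splits here, proceeding along a DAG rather than
-- a strictly sequential flow.
SPLIT valid INTO purchases IF action == 'buy',
                 views     IF action == 'view';

-- Data can be stored at any intermediate point of the pipeline.
STORE purchases INTO 'hdfs:///out/purchases';

by_user = GROUP views BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(views) AS n;
STORE counts INTO 'hdfs:///out/view_counts';
```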

Regardless of the similarities to Spark SQL, a Pig interpreter is enticing
for a few reasons. Many organizations still run Pig jobs in production
today, and Pig continues to advance. Pig's support for custom UDFs has made
it a language for ETL as well as for some machine learning over that same
data. There is also a lot of work on using Spark as an execution engine for
Pig. The Spork project, which came out of Sigmoid Analytics last year, is
now a development branch within Pig itself. With Pig executing on Spark
(there is also work for Pig to execute on Flink, Storm, and Apex), it would
be an enhancement to the suite of tools within Zeppelin.
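
For reference, using a custom UDF from a Pig script looks like this (the
jar name and UDF class are hypothetical):

```pig
-- Register a jar containing a user-defined function and use it in ETL.
REGISTER 'myudfs.jar';
DEFINE CLEAN com.example.pig.CleanText();   -- hypothetical UDF class

raw     = LOAD 'hdfs:///data/raw' AS (id:int, text:chararray);
cleaned = FOREACH raw GENERATE id, CLEAN(text) AS text;
STORE cleaned INTO 'hdfs:///data/cleaned';
```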

On Thu, Oct 1, 2015 at 8:20 AM, IT CTO <go...@gmail.com> wrote:

> The syntax might be similar but spark context can not execute pig script
> so you would need a pig interpreter to do that.
> Eran

Re: Pig Interpreter

Posted by IT CTO <go...@gmail.com>.
The syntax might be similar, but the Spark context cannot execute Pig
scripts, so you would need a Pig interpreter to do that.
Eran

--
Eran | "You don't need eyes to see, you need vision" (Faithless)

Re: Pig Interpreter

Posted by moon soo Lee <mo...@apache.org>.
Thanks, Nihal, for the explanation.

most of the task/processing which is possible thru PIG can be easily
achieved by using SPARK, in much lesser easy to understandable code and
since SPARK is in memory its 100x faster than any hadoop map-reduce tasks.

but I think this cannot be the reason not to have a Pig interpreter.
Pig's syntax and execution engine are different, and that's enough to have
an interpreter, I think.

Thanks,
moon


Re: Pig Interpreter

Posted by Nihal Bhagchandani <ni...@yahoo.com>.
Hi,
so as per my understanding:

*PIG*: Uses a scripting language called Pig Latin, which is more workflow
driven. It is an abstraction layer on top of MapReduce. Pig uses
batch-oriented frameworks, which means your analytic jobs will run for
minutes or maybe hours depending on the volume of data. Think of Pig as
step-by-step SQL execution.

*Spark SQL*: Allows us to do SQL-like operations on HDFS or the file
system, with up to 100x faster performance than MapReduce when the SQL runs
in memory; on disk it is about ten times faster.

Pig is a SQL-like language that gracefully tolerates inconsistent schemas,
and that runs on Hadoop.

The basic concepts in SQL map pretty well onto Pig. There are analogues for the major SQL keywords, and as a result you can write a query in your head as SQL and then translate it into Pig Latin without undue mental gymnastics.
WHERE → FILTER
The syntax is different, but conceptually this is still putting your data into a funnel to create a smaller dataset.
HAVING → FILTER
Because a FILTER is done in a separate step from a GROUP or an aggregation, the distinction between HAVING and WHERE doesn’t exist in Pig.
ORDER BY → ORDER
This keyword behaves pretty much the same in Pig as in SQL.
JOIN
In Pig, joins can have their execution specified, and they look a little
different, but in essence these are the same joins you know from SQL, and
you can think about them in the same way. There are INNER and OUTER joins,
RIGHT and LEFT specifications, and even CROSS for those rare moments that
you actually want a Cartesian product.

Because Pig is most appropriately used for data pipelines, there are often
fewer distinct relations or tables than you would expect to see in a
traditional normalized relational database.
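
As an illustration of the keyword mapping above, here is a small
hypothetical query in both forms (the table and field names are invented):

```pig
-- SQL:  SELECT user, SUM(amount) AS total
--       FROM orders
--       WHERE amount > 0
--       GROUP BY user
--       HAVING SUM(amount) > 100
--       ORDER BY total DESC;

orders   = LOAD 'orders' AS (user:chararray, amount:double);
positive = FILTER orders BY amount > 0;        -- WHERE  -> FILTER
grouped  = GROUP positive BY user;
totals   = FOREACH grouped GENERATE group AS user,
               SUM(positive.amount) AS total;
big      = FILTER totals BY total > 100;       -- HAVING -> FILTER after the GROUP
result   = ORDER big BY total DESC;            -- ORDER BY -> ORDER
```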
Control over Execution
SQL performance tuning generally involves some fiddling with indexes,
punctuated by the occasional yelling at an explain plan that has
inexplicably decided to join the two largest tables first. It can mean
getting a different plan the second time you run a query, or having the
plan suddenly change after several weeks of use because the statistics have
evolved, throwing your query’s performance into the proverbial toilet.

Various SQL implementations offer hints to combat this problem—you can use
a hint to tell your SQL optimizer that it should use an index, or to force
a given table to be first in the join order. Unfortunately, because hints
are dependent on the particular SQL implementation, what you actually have
at your disposal varies by platform.

Pig offers a few different ways to control the execution plan. The first
is just the explicit ordering of operations. You can write your FILTER
before your JOIN (the reverse of SQL’s order) and be clever about
eliminating unused fields along the way, and have confidence that the
executed order will not be worse.

Secondly, the philosophy of Pig is to allow users to choose
implementations where multiple ones are possible. As a result, there are
three specialized joins that can be used when the features of the data are
known, and are less appropriate for a regular join. For regular joins, the
order of the arguments dictates execution—the larger data set should appear
last in this type of join.

As with SQL, in Pig you can pretty much ignore the performance tweaks
until you can’t. Because of the explicit control of ordering, it can be
useful to have a general sense of the “good” order to do things in, though
Pig’s optimizer will also try to push up FILTERs and LIMITs, taking some of
the pressure off.
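
A sketch of those two control mechanisms, with invented relation names:

```pig
big   = LOAD 'big'   AS (id:int, v:chararray);
small = LOAD 'small' AS (id:int, active:int);

-- Explicit ordering: the FILTER is written before the JOIN, and with a
-- regular join the larger data set should appear last.
small_f = FILTER small BY active == 1;
j1 = JOIN small_f BY id, big BY id;

-- Choosing an implementation: a fragment-replicate join can be requested
-- when the last relation is known to fit in memory.
j2 = JOIN big BY id, small_f BY id USING 'replicated';
```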
Here is Denny Lee's link, where you can find Spark vs. Pig:
http://dennyglee.com/2013/08/19/why-all-this-interest-in-spark/
Most of the tasks/processing possible through Pig can be easily achieved
using Spark, in much less and easier-to-understand code, and since Spark
runs in memory it is up to 100x faster than Hadoop MapReduce tasks.
Regards,
Nihal




 


On Thursday, 1 October 2015 3:35 PM, moon soo Lee <mo...@apache.org> wrote:

 I dont know Pig very well, but It's little bit difficult to think how spark-sql can help pig users. Can you explain more?

Thanks,
moon
On 2015년 10월 1일 (목) at 오전 11:39 Nihal Bhagchandani <ni...@yahoo.com> wrote:

Is there is any extra advantage to have a PIG Interpreter when zeppelin already support SPARK-SQL?
Nihal

Sent from my iPhone
On 01-Oct-2015, at 12:54, moon soo Lee <mo...@apache.org> wrote:


Hi,

As far as i know, there're no ongoing work for a pig interpreter. But no reason to not having one. How about file an issue for it?

Thanks,
moon
On 2015년 9월 23일 (수) at 오후 11:23 Michael Parco <33...@cardinalmail.cua.edu> wrote:

Is there any current work or plans for a Pig interpreter in Zeppelin?





  

Re: Pig Interpreter

Posted by moon soo Lee <mo...@apache.org>.
I don't know Pig very well, but it's a little bit difficult to see how
Spark SQL can help Pig users. Can you explain more?

Thanks,
moon
On 2015년 10월 1일 (목) at 오전 11:39 Nihal Bhagchandani <
nihal_bhagchandani@yahoo.com> wrote:

> Is there is any extra advantage to have a PIG Interpreter when zeppelin
> already support SPARK-SQL?
>
> Nihal
>
> Sent from my iPhone
>
> On 01-Oct-2015, at 12:54, moon soo Lee <mo...@apache.org> wrote:
>
> Hi,
>
> As far as i know, there're no ongoing work for a pig interpreter. But no
> reason to not having one. How about file an issue for it?
>
> Thanks,
> moon
> On 2015년 9월 23일 (수) at 오후 11:23 Michael Parco <
> 33parco@cardinalmail.cua.edu> wrote:
>
>> Is there any current work or plans for a Pig interpreter in Zeppelin?
>>
>

Re: Pig Interpreter

Posted by Nihal Bhagchandani <ni...@yahoo.com>.
Is there any extra advantage to having a Pig interpreter when Zeppelin
already supports Spark SQL?

Nihal

Sent from my iPhone

> On 01-Oct-2015, at 12:54, moon soo Lee <mo...@apache.org> wrote:
> 
> Hi,
> 
> As far as i know, there're no ongoing work for a pig interpreter. But no reason to not having one. How about file an issue for it?
> 
> Thanks,
> moon
>> On 2015년 9월 23일 (수) at 오후 11:23 Michael Parco <33...@cardinalmail.cua.edu> wrote:
>> Is there any current work or plans for a Pig interpreter in Zeppelin?