Posted to user@pig.apache.org by Mário Sérgio Fujikawa Ferreira <li...@gmail.com> on 2019/02/04 08:05:19 UTC

Re: Re: Submitting multiple Pig Scripts on the same Session

We're wondering if there was something like Apache Hive LLAP: 
https://cwiki.apache.org/confluence/display/Hive/LLAP

We submit scripts asynchronously throughout the day: never more than 20 
at a time, up to a thousand a day. Input file sizes vary from less than 
a megabyte to a couple of terabytes.

1. Hadoop distribution is Hortonworks HDP 2.6.3

2. Apache Pig 0.16 using Tez.

3. SQL database is Pivotal HAWQ 2.3.0.0. Data is sent to the database 
for both inserts and joins using Pivotal HAWQ external tables (CSV 
files). Data is retrieved from the database using external tables as well.

    3.1.
    https://hdb.docs.pivotal.io/230/hawq/datamgmt/load/g-working-with-file-based-ext-tables.html

    3.2.
    https://hdb.docs.pivotal.io/230/hawq/pxf/PXFExternalTableandAPIReference.html

4. All processing is done on HDFS and all intermediate files are 
compressed with lzo.

We orchestrate everything using Python (not Jython); a minimal sketch 
of the driver follows the numbered steps below.

1. A Python script detects new input files.

2. Prepares a Pig script according to rules parameterized in a SQL database.

3. Submits the Pig script via the Pig command line client (-exectype tez).

4. Uses the output files (CSV generated by the Pig script in item 3) for 
join operations on the SQL database.

5. Prepares another Pig script against the result (CSV generated by 
Pivotal HAWQ in item 4) of the join operation on the SQL database.

6. Submits the Pig script via the Pig command line client (-exectype tez).

7. Finally, loads the table (CSV generated by the script in item 5) into 
the SQL database.
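
To make the loop concrete, here is a minimal sketch of the driver (every
path, template, and helper below is a placeholder, not our production code):

# driver.py - hedged sketch of steps 1-7 above.
import glob
import subprocess

def run_pig(script_path):
    # Steps 3 and 6: submit a generated Pig script through the Pig CLI on Tez.
    subprocess.run(['pig', '-exectype', 'tez', '-f', script_path], check=True)

def render_script(template_path, input_file, output_path):
    # Steps 2 and 5: fill a script template; the real rules come from the SQL database.
    with open(template_path) as f:
        script = f.read().replace('$INPUT', input_file)
    with open(output_path, 'w') as f:
        f.write(script)
    return output_path

for input_file in glob.glob('/data/incoming/*.csv'):  # step 1: detect new files
    run_pig(render_script('prepare.pig.tpl', input_file, 'prepare.pig'))
    # Steps 4 and 7 (the HAWQ external-table join and the final load) would
    # issue SQL here, e.g. via psycopg2 or the psql CLI; omitted for brevity.
    run_pig(render_script('postjoin.pig.tpl', input_file, 'postjoin.pig'))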

We're considering some optimizations:

1. Share AM/Tez sessions across different scripts using something 
similar to Hive LLAP: a continuously running YARN daemon that can share 
resources across different Pig scripts. I haven't found anything 
similar, and unfortunately I have no idea where to begin if we were to 
code this; it's just an out-there idea. Any pointers/suggestions would 
be appreciated.

2. Write a Pig UDF that submits arbitrary SQL statements to the database 
so that we don't have to run two separate Pig scripts with two SQL 
statements in between. It would be a single script, as follows (a rough 
UDF sketch follows item 2.1 below):

       1st_pig_script_statements;

       exec;

       sql_udf_run;

       exec;

       2nd_script_statements;

       exec;

       sql_udf_run;

    2.1. This would submit everything under a single AM, thus sharing
    resources and reducing overall run time (less script start/stop
    overhead). Is the sql_udf_run idea feasible? Should I just bite the
    bullet and use Jython instead, at least for the Pig scripts? Can I
    just write a standard UDF and run it against a fake one-line input
    file?
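
A rough sketch of what we imagine such a UDF could look like as a Jython
UDF (the names and JDBC URL are made up, and the HAWQ/Postgres JDBC driver
would need to be on Pig's classpath):

# sql_udfs.py - hypothetical Jython UDF; registered in the Pig script with:
#   REGISTER 'sql_udfs.py' USING jython AS sqludf;
from pig_util import outputSchema
from java.sql import DriverManager

@outputSchema('ok:int')
def run_sql(jdbc_url, statement):
    # Open a JDBC connection and execute one arbitrary SQL statement,
    # returning 1 so the call can sit inside a FOREACH ... GENERATE
    # against a one-line trigger file.
    conn = DriverManager.getConnection(jdbc_url)
    try:
        conn.createStatement().execute(statement)
        return 1
    finally:
        conn.close()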

3. Set pig.auto.local.enabled to true to reduce some overhead on small 
input files for faster processing. Unfortunately, I haven't seen much 
gain here on 100-megabyte input files when testing with exectype 
tez_local. Furthermore, the Pig script in tez_local mode wouldn't find 
the input files; I had to prefix file paths with hdfs:///

Any help is appreciated. We've been using Apache Pig for ETL purposes 
for more than a year and we're very satisfied with its performance and 
ease of use.

Best regards,
   Mário Sérgio

On 22/01/2019 16:49, Rohini Palaniswamy wrote:
> If you are using PigServer and submitting programmatically via the same JVM, it
> should automatically reuse the application if the requested AM resources
> are the same.
>
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java#L242-L245
>
> On Fri, Jan 18, 2019 at 12:20 PM Diego Pereira <di...@gmail.com>
> wrote:
>
>> Hi!
>>
>> We are developing an application that looks for new files in a folder,
>> runs a few Pig scripts to prepare those files and, finally, loads them
>> into our database.
>>
>> The problem is that, for small files, the time that Pig / Tez / YARN takes
>> to create a new application master and spawn new containers is way longer
>> than the time it takes to process them.
>>
>> Since Tez sessions already allow a single Pig script to run multiple DAGs
>> against the same application master, is there a way to reuse that
>> application master and its containers for multiple Pig script submissions?
>>
>> Regards,
>>
>> Diego
>>

Re: Re: Submitting multiple Pig Scripts on the same Session

Posted by Rohini Palaniswamy <ro...@apache.org>.
  To answer your different questions:
1) Unless you have a service (Hive LLAP or HiveServer2) which keeps
running and holds a handle to the Tez session (YARN application), it is
not possible to reuse it. Pig does not host any service of its own. So
unless you bring a JVM up, keep it running, and use the PigServer class to
submit Pig scripts periodically, there is no way to hold on to the Tez
sessions. It is certainly not possible when running through the Pig CLI.
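
For illustration, a minimal sketch of that long-lived JVM approach as a
Jython script run with Pig's jars on the classpath (the polling helper,
paths, and interval are made up):

# long_running_pig.py - hypothetical daemon holding one PigServer open so
# the Tez session can be reused across many script submissions.
import glob
import time
from org.apache.pig import PigServer

def poll_for_new_scripts():
    # Placeholder: in production this might watch HDFS or a drop directory.
    return glob.glob('/var/pig/queue/*.pig')

pig = PigServer('tez')  # one JVM-wide handle to the shared Tez session
while True:
    for script in poll_for_new_scripts():
        pig.setBatchOn()            # queue the script's STORE statements
        pig.registerScript(script)  # parse and register the script
        pig.executeBatch()          # run it inside the shared session
    time.sleep(30)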

In most cases, you don't really need to share the AM unless you are doing
something interactive. Keeping the AM and containers running idle until the
next script is submitted will actually waste resources and reduce your
queue capacity for other jobs. Also, it is only possible to run one DAG on
an AM at a time, so if two different Pig scripts run in parallel, they can't
share it anyway.

2) Merging everything into a single Pig script would definitely be
beneficial. You don't need Jython. You can use the embedded Python feature
of Pig (http://pig.apache.org/docs/r0.17.0/cont.html#embed-python), which
is basically calling Pig from inside Python and will fit your use case
perfectly. It is different from Python or Jython UDFs. Our users do some
complex coding with it. Apart from regular Python code and running Pig
scripts, you can import any Java class and use it. For example:

from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import Path

# Build a Hadoop Configuration from an arbitrary XML file and read an int from it.
actionConf = Configuration(False)
actionConf.addResource(Path('file:///test-conf.xml'))
testvalue = actionConf.getInt('testkeyname', 0)
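
To make that concrete, here is a rough sketch of your whole pipeline as a
single embedded Python script, run with "pig -x tez pipeline.py" (all paths,
the JDBC URL, and the SQL are placeholders, and the HAWQ/Postgres JDBC
driver would have to be on the classpath):

#!/usr/bin/python
# pipeline.py - one embedded-Python script replacing two Pig scripts
# with a SQL step in between; everything runs under one Tez session.
from org.apache.pig.scripting import Pig
from java.sql import DriverManager

# First Pig job: prepare the raw input (paths are placeholders).
first = Pig.compile("""
    raw = LOAD '$input' USING PigStorage(',');
    STORE raw INTO '$prepared' USING PigStorage(',');
""")
result = first.bind({'input': '/data/in.csv', 'prepared': '/data/prepared'}).runSingle()
if not result.isSuccessful():
    raise RuntimeError('first Pig job failed')

# SQL step: run the HAWQ external-table join directly over JDBC.
conn = DriverManager.getConnection('jdbc:postgresql://hawq-host:5432/db')
try:
    conn.createStatement().execute('INSERT INTO joined_table SELECT * FROM ext_prepared')
finally:
    conn.close()

# Second Pig job against the join result, still in the same session.
second = Pig.compile("""
    j = LOAD '$joined' USING PigStorage(',');
    STORE j INTO '$final' USING PigStorage(',');
""")
second.bind({'joined': '/data/joined', 'final': '/data/final'}).runSingle()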

3) pig.auto.local.enabled is actually not implemented in Tez yet. So you
will not see any benefits.

Regards,
Rohini
