Posted to user@pig.apache.org by Yang <te...@gmail.com> on 2012/06/12 08:21:12 UTC

pig generated 2 map-only jobs ?

This is what happened with my Pig script.
Why would it generate 2 map-only jobs?
Wouldn't the optimizer chain both mappers together and keep only
1 mapper stage?


Thanks
Yang

Re: pig generated 2 map-only jobs ?

Posted by Daniel Dai <da...@hortonworks.com>.
I feel it should be only one map job. Can you run explain on the script? (explain -script xxxx)
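
For reference, a minimal sketch of such an explain call from the Grunt shell. The script name and the parameter values are placeholders, not taken from the thread; the three -param entries correspond to the $input_suspects, $output and $min_split_size variables used by the script posted in this thread. In the output, the Map Reduce Plan section should list one MapReduce node per job, which is where two map-only jobs would show up.

```pig
-- Placeholder script name and parameter values; adjust to your environment.
explain -script myscript.pig -param input_suspects=/tmp/suspects -param output=/tmp/verdict -param min_split_size=134217728;
```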

On Sun, Jun 17, 2012 at 9:39 AM, Yang <te...@gmail.com> wrote:

> Thanks, Alan, here it is
>
> [...]

Re: pig generated 2 map-only jobs ?

Posted by Yang <te...@gmail.com>.
Thanks, Alan, here it is:




SET mapred.max.jobs.per.node 1;
SET mapred.max.maps.per.node  8;
SET mapred.tasktracker.map.tasks.maximum   8;
SET mapred.map.tasks 48;
SET mapred.min.split.size  $min_split_size;
SET pig.noSplitCombination true;
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;





REGISTER ./myjar.jar;

DEFINE search_index  com.mycompany.SearchUdf();
DEFINE verify_model  com.mycompany.VerifyDataUsingModelUdf();
DEFINE verify_model2  com.mycompany.VerifyDataUsingModelUdf();


suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
--__START_SCHEMA__
.....
--__END_SCHEMA__
);




similars = FOREACH suspects GENERATE
            *,
            FLATTEN (
            search_index(
                    name,
                    address,
                    city,
                    state,
                    zip,
                    phone
            )) ;


similars = FOREACH similars GENERATE
    *,

    top_10_similars::state AS candidate_state,
    top_10_similars::zip AS candidate_zip,
    top_10_similars::phone AS candidate_phone,
    top_10_similars::profNames AS candidate_profNames,
    top_10_similars::categories AS candidate_categories,
    top_10_similars::cgId AS candidate_cgId,
    top_10_similars::canonName AS candidate_canonName,
    top_10_similars::canonAddress AS candidate_canonAddress,
    top_10_similars::privateId AS candidate_id
;

similars = FILTER similars BY NOT (
                    (legacy_ids IS NOT NULL AND candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId)
                    OR
                    (legacy_ids IS NULL AND candidate_cgId IS NULL)
                    )
;


bad = FILTER similars BY ( categories IS NULL OR categories == '' OR categories == '6019') ;
good = FILTER similars BY NOT ( categories IS NULL OR categories == '' OR categories == '6019') ;

verdict1 = FOREACH good GENERATE
    *,

    verify_model( name,
    address,
    city,
    .....
    )

;

verdict2 = FOREACH bad GENERATE
    *,

    verify_model2(
    name,
    address,
    city,
    )
;



verdict = UNION verdict1, verdict2;
STORE verdict INTO '$output';
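
One possible way to avoid the FILTER / FILTER / UNION shape, sketched under assumptions: since good and bad partition similars on the same condition, the two branches could be expressed as a single FOREACH with a conditional (bincond). The sketch below is self-contained and uses the built-in LOWER and UPPER as stand-ins for verify_model2 and verify_model; the input file, field list and output path are hypothetical. Whether this changes the number of jobs for the real script is something explain would have to confirm.

```pig
-- Hypothetical input with two tab-separated fields.
recs = LOAD 'records.tsv' USING PigStorage('\t') AS (name:chararray, categories:chararray);

-- One pass over the data: the condition that defined "bad" picks the first
-- branch, everything else takes the second. LOWER/UPPER are stand-ins only.
scored = FOREACH recs GENERATE
    *,
    ((categories IS NULL OR categories == '' OR categories == '6019')
        ? LOWER(name)
        : UPPER(name)) AS verdict;

STORE scored INTO 'verdict_out';
```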


On Sat, Jun 16, 2012 at 11:51 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Apache mailing lists strip all attachments.  You'll have to inline the
> script in your message or post it somewhere and send a link.
>
> Alan.

Re: pig generated 2 map-only jobs ?

Posted by Alan Gates <ga...@hortonworks.com>.
Apache mailing lists strip all attachments.  You'll have to inline the script in your message or post it somewhere and send a link.

Alan.

On Jun 16, 2012, at 9:06 PM, Yang wrote:

> Thanks Alan.
> 
> 
> I attached the trimmed version of my script .
> 
> 
> basically the similars var generates a bag, explodes it, after that, each of the output record is filtered through a Udf.
> 
> I suspect that the 2 maps are due to the explosion. but it should be possible to put the above sequence into a single map.
> 
> 
> Yang


Re: pig generated 2 map-only jobs ?

Posted by Yang <te...@gmail.com>.
Thanks Alan.


I attached a trimmed version of my script.


Basically, the similars relation generates a bag and explodes it; after that,
each output record is filtered through a UDF.

I suspect that the 2 maps are due to the explosion, but it should be
possible to put the above sequence into a single map.
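
A small self-contained sketch of the pattern described above (produce a bag, explode it with FLATTEN, then filter each resulting record). It uses the built-in TOKENIZE in place of the search UDF; the file names, fields and filter condition are hypothetical. None of these operators needs a shuffle, so a pipeline of this shape would normally be expected to fit in one map-only job.

```pig
-- Hypothetical input: id <TAB> comma-separated candidate names.
recs = LOAD 'records.tsv' USING PigStorage('\t') AS (id:chararray, candidates:chararray);

-- TOKENIZE returns a bag of one-field tuples (stand-in for a UDF that returns a bag);
-- FLATTEN emits one output record per tuple in that bag.
exploded = FOREACH recs GENERATE id, FLATTEN(TOKENIZE(candidates)) AS candidate;

-- Per-record filter applied after the explosion.
kept = FILTER exploded BY candidate != 'unknown';

STORE kept INTO 'exploded_out';
```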


Yang

On Tue, Jun 12, 2012 at 2:14 PM, Alan Gates <ga...@hortonworks.com> wrote:

> There are cases where it would do this, such as unioning two inputs.  Can
> you send your script to the list?
>
> Alan.

Re: pig generated 2 map-only jobs ?

Posted by Alan Gates <ga...@hortonworks.com>.
There are cases where it would do this, such as unioning two inputs.  Can you send your script to the list?

Alan.

On Jun 11, 2012, at 11:21 PM, Yang wrote:

> this is what happened with my pig script.
> why would it generate 2 map-only jobs?
> wouldn't the optimization process chain together both mappers and keep only
> 1 mapper stage?
> 
> 
> thanks
> Yang