Posted to user@pig.apache.org by Yang <te...@gmail.com> on 2012/06/12 08:21:12 UTC
pig generated 2 map-only jobs ?
This is what happened with my Pig script.
Why would it generate 2 map-only jobs?
Wouldn't the optimizer chain both mappers together and keep only
1 map stage?
thanks
Yang
Re: pig generated 2 map-only jobs ?
Posted by Daniel Dai <da...@hortonworks.com>.
I feel it should be only one map job. Can you run explain on the script? (explain -script xxxx)
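A minimal sketch of such an explain call from the Grunt shell, assuming the script is saved as myscript.pig (a placeholder name) and that its three parameters are supplied with -param; the paths and the split size below are made-up values, not taken from the thread:

```pig
-- Grunt shell. myscript.pig and all parameter values are hypothetical placeholders.
-- Prints the logical, physical and MapReduce plans without running the script;
-- the MapReduce plan shows how many jobs Pig intends to launch.
explain -script myscript.pig -param input_suspects=/tmp/suspects -param output=/tmp/out -param min_split_size=134217728
```

The MapReduce plan section of the output is the part to check here, since it lists each job and the operators placed in its map stage.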
On Sun, Jun 17, 2012 at 9:39 AM, Yang <te...@gmail.com> wrote:
> Thanks, Alan, here it is
>
>
>
>
> SET mapred.max.jobs.per.node 1;
> SET mapred.max.maps.per.node 8;
> SET mapred.tasktracker.map.tasks.maximum 8;
> SET mapred.map.tasks 48;
> SET mapred.min.split.size $min_split_size;
> SET pig.noSplitCombination true;
> SET mapred.map.tasks.speculative.execution false;
> SET mapred.reduce.tasks.speculative.execution false;
>
>
>
>
>
> REGISTER ./myjar.jar;
>
> DEFINE search_index com.mycompany.SearchUdf();
> DEFINE verify_model com.mycompany.VerifyDataUsingModelUdf();
> DEFINE verify_model2 com.mycompany.VerifyDataUsingModelUdf();
>
>
> suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
> --__START_SCHEMA__
> .....
> --__END_SCHEMA__
> );
>
>
>
>
> similars = FOREACH suspects GENERATE
> *,
> FLATTEN (
> search_index(
> name,
> address,
> city,
> state,
> zip,
> phone
> )) ;
>
>
> similars = FOREACH similars GENERATE
> *,
>
> top_10_similars::state AS candidate_state,
> top_10_similars::zip AS candidate_zip,
> top_10_similars::phone AS candidate_phone,
> top_10_similars::profNames AS candidate_profNames,
> top_10_similars::categories AS candidate_categories,
> top_10_similars::cgId AS candidate_cgId,
> top_10_similars::canonName AS candidate_canonName,
> top_10_similars::canonAddress AS candidate_canonAddress,
> top_10_similars::privateId AS candidate_id
> ;
>
> similars = FILTER similars BY NOT (legacy_ids IS NOT NULL AND
> candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId
> OR
> legacy_ids IS NULL AND candidate_cgId IS NULL
> )
> ;
>
>
> bad = FILTER similars BY ( categories is NULL OR categories == '' OR
> categories == '6019') ;
> good = FILTER similars BY NOT ( categories is NULL OR categories == '' OR
> categories == '6019') ;
>
> verdict1 = FOREACH good GENERATE
> *,
>
> verify_model( name,
> address,
> city,
> .....
> )
>
> ;
>
> verdict2 = FOREACH bad GENERATE
> *,
>
> verify_model2(
> name,
> address,
> city,
> )
> ;
>
>
>
> verdict = UNION verdict1, verdict2;
> STORE verdict INTO '$output';
Re: pig generated 2 map-only jobs ?
Posted by Yang <te...@gmail.com>.
Thanks, Alan, here it is
SET mapred.max.jobs.per.node 1;
SET mapred.max.maps.per.node 8;
SET mapred.tasktracker.map.tasks.maximum 8;
SET mapred.map.tasks 48;
SET mapred.min.split.size $min_split_size;
SET pig.noSplitCombination true;
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;

REGISTER ./myjar.jar;

DEFINE search_index com.mycompany.SearchUdf();
DEFINE verify_model com.mycompany.VerifyDataUsingModelUdf();
DEFINE verify_model2 com.mycompany.VerifyDataUsingModelUdf();

suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
    --__START_SCHEMA__
    .....
    --__END_SCHEMA__
);

similars = FOREACH suspects GENERATE
    *,
    FLATTEN (
        search_index(
            name,
            address,
            city,
            state,
            zip,
            phone
        )) ;

similars = FOREACH similars GENERATE
    *,
    top_10_similars::state AS candidate_state,
    top_10_similars::zip AS candidate_zip,
    top_10_similars::phone AS candidate_phone,
    top_10_similars::profNames AS candidate_profNames,
    top_10_similars::categories AS candidate_categories,
    top_10_similars::cgId AS candidate_cgId,
    top_10_similars::canonName AS candidate_canonName,
    top_10_similars::canonAddress AS candidate_canonAddress,
    top_10_similars::privateId AS candidate_id
;

similars = FILTER similars BY NOT (legacy_ids IS NOT NULL AND
    candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId
    OR
    legacy_ids IS NULL AND candidate_cgId IS NULL
    )
;

bad = FILTER similars BY ( categories is NULL OR categories == '' OR
    categories == '6019') ;
good = FILTER similars BY NOT ( categories is NULL OR categories == '' OR
    categories == '6019') ;

verdict1 = FOREACH good GENERATE
    *,
    verify_model( name,
        address,
        city,
        .....
    )
;

verdict2 = FOREACH bad GENERATE
    *,
    verify_model2(
        name,
        address,
        city,
    )
;

verdict = UNION verdict1, verdict2;
STORE verdict INTO '$output';
On Sat, Jun 16, 2012 at 11:51 PM, Alan Gates <ga...@hortonworks.com> wrote:
> Apache mailing lists strip all attachments. You'll have to inline the
> script in your message or post it somewhere and send a link.
>
> Alan.
Re: pig generated 2 map-only jobs ?
Posted by Alan Gates <ga...@hortonworks.com>.
Apache mailing lists strip all attachments. You'll have to inline the script in your message or post it somewhere and send a link.
Alan.
On Jun 16, 2012, at 9:06 PM, Yang wrote:
> Thanks Alan.
>
>
> I attached the trimmed version of my script .
>
>
> basically the similars var generates a bag, explodes it, after that, each of the output record is filtered through a Udf.
>
> I suspect that the 2 maps are due to the explosion. but it should be possible to put the above sequence into a single map.
>
>
> Yang
Re: pig generated 2 map-only jobs ?
Posted by Yang <te...@gmail.com>.
Thanks Alan.
I attached the trimmed version of my script.
Basically, the similars step generates a bag and explodes it (via FLATTEN);
after that, each output record is filtered through a UDF.
I suspect that the 2 maps are due to the explosion, but it should be
possible to put the above sequence into a single map.
Yang
On Tue, Jun 12, 2012 at 2:14 PM, Alan Gates <ga...@hortonworks.com> wrote:
> There are cases where it would do this, such as unioning two inputs. Can
> you send your script to the list?
>
> Alan.
Re: pig generated 2 map-only jobs ?
Posted by Alan Gates <ga...@hortonworks.com>.
There are cases where it would do this, such as unioning two inputs. Can you send your script to the list?
Alan.
On Jun 11, 2012, at 11:21 PM, Yang wrote:
> this is what happened with my pig script.
> why would it generate 2 map-only jobs?
> wouldn't the optimization process chain together both mappers and keep only
> 1 mapper stage?
>
>
> thanks
> Yang