Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2013/12/04 22:21:51 UTC
CROSS/Self-Join Bug - Please Help :(
I have this bug that is killing me, where I can't self-join/cross a dataset
with itself. It's blocking my work :(
The script is like this:
businesses = LOAD 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json'
    using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
   business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback Rd
   Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty & Spas)},
   longitude=-112.0131927, latitude=33.5107772, type=business, city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
                                        $0#'longitude' AS longitude,
                                        $0#'latitude' AS latitude;
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
    AS (business_id:chararray, longitude:double, latitude:double);
location_comparisons = CROSS locations_2, locations;
distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
                                        locations_2.business_id AS business_id_2,
                                        udfs.haversine(locations.longitude,
                                                       locations.latitude,
                                                       locations_2.longitude,
                                                       locations_2.latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
I have also tried converting this to a self-join using JOIN BY '1', and
also locations_2 = locations, and I get the same error:
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)
    at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
This makes no sense! What am I to do? I can't self-join :(
--
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
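One possible reading of the error, offered as an assumption rather than something confirmed in this thread: inside a FOREACH, Pig reads relation.field (with a dot) as a scalar projection, which is only valid when that relation holds exactly one row, and ReadScalars raises this exception otherwise. The columns of a CROSS output are instead addressed with the :: prefix. A minimal sketch that iterates over the crossed relation, reusing the relation and UDF names from the script above:

```pig
-- Sketch only: iterate over the CROSS output rather than over businesses.
-- locations::x and locations_2::x name columns of location_comparisons;
-- locations.x would instead be read as a scalar from a one-row relation.
distances = FOREACH location_comparisons GENERATE
    locations::business_id   AS business_id_1,
    locations_2::business_id AS business_id_2,
    udfs.haversine(locations::longitude, locations::latitude,
                   locations_2::longitude, locations_2::latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

Whether this resolves the failure reported here is untested; it only shows the column-addressing form that a CROSS or JOIN output expects.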
Re: CROSS/Self-Join Bug - Please Help :(
Posted by Russell Jurney <ru...@gmail.com>.
If you store immediately after the CROSS, it works. If you do another
FOREACH/GENERATE, etc. it does not.
Re: CROSS/Self-Join Bug - Please Help :(
Posted by Pradeep Gollakota <pr...@gmail.com>.
I tried the following script (not exactly the same) and it worked correctly
for me.
businesses = LOAD 'dataset' using PigStorage(',')
    AS (a, b, c, business_id: chararray, lat: double, lng: double);
locations = FOREACH businesses GENERATE business_id, lat, lng;
STORE locations INTO 'locations.tsv';
locations2 = LOAD 'locations.tsv' AS (business_id, lat, lng);
loc_com = CROSS locations2, locations;
dump loc_com;
I'm wondering whether your problem has something to do with the way that
JsonLoader works. Another thing you can try is to load 'locations.tsv'
twice and do a self-cross on that.
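The "load it twice" suggestion could look roughly like the sketch below. The aliases locations_a, locations_b, and pairs are hypothetical; the path and schema are taken from the original script, and the file is assumed to already exist from an earlier run.

```pig
-- Sketch only: two independent loads of the same stored file,
-- so neither side of the CROSS depends on the JSON loader.
locations_a = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
    AS (business_id:chararray, longitude:double, latitude:double);
locations_b = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
    AS (business_id:chararray, longitude:double, latitude:double);

-- Every row of locations_a paired with every row of locations_b.
pairs = CROSS locations_a, locations_b;

-- Print the schema to see how the columns of each side are named.
DESCRIBE pairs;
```

Running DESCRIBE before writing the next FOREACH shows the exact column names to use, which helps when both sides share the same field names.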
Re: CROSS/Self-Join Bug - Please Help :(
Posted by Russell Jurney <ru...@gmail.com>.
There was a bug in the script on the second-to-last line. I fixed it, but I
still have the same issue.
I found a workaround: if I store the CROSSED relation immediately after the
CROSS, then load it... it works. Something about resetting the plan. This
is a bug. I'll file a JIRA.
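A rough sketch of that workaround, under stated assumptions: the intermediate path, the alias crossed, and the flat column names are hypothetical, and the column order assumes that CROSS emits the fields of locations_2 first and those of locations second, matching the order they are listed in the CROSS statement.

```pig
-- Sketch only: materialize the CROSS, then reload it with flat column names.
location_comparisons = CROSS locations_2, locations;
STORE location_comparisons
    INTO 'yelp_phoenix_academic_dataset/location_comparisons.tsv';  -- hypothetical path

crossed = LOAD 'yelp_phoenix_academic_dataset/location_comparisons.tsv'
    AS (business_id_2:chararray, longitude_2:double, latitude_2:double,
        business_id_1:chararray, longitude_1:double, latitude_1:double);

distances = FOREACH crossed GENERATE
    business_id_1,
    business_id_2,
    udfs.haversine(longitude_1, latitude_1, longitude_2, latitude_2) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

Why the reload helps is not established in this thread. One plausible factor is that the reloaded relation has flat column names, so the final FOREACH no longer refers to anything as relation.field.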