Posted to user@spark.apache.org by Bernard Jesop <be...@gmail.com> on 2017/07/03 09:55:54 UTC

Analysis Exception after join

Hello, I don't understand my error message.

Basically, all I am doing is:
- dfAgg = df.groupBy("S_ID")
- dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")

However I get this AnalysisException:
"Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved
attribute(s) S_ID#1903L missing from
Dummy_ID#740,sex#37L,PERSONAL_STATUS#726L,W_DEP_CODE#736,W_SIZE#739L,
POSTAL_CODE#735,COUNTRY_CODE#730,ID#724L,Dummy_ID_1#741,DEP_CODE#729,
HOUSEHOLD_TYPE#733L,HOUSEHOLD_SIZE#734L,AGE#727L,W_ID#738L,H_ID#732L,
AGE_TYPE#728,S_ID#57L,NATIONALITY#731
in operator !Project [ID#724L, sex#37L, PERSONAL_STATUS#726L, AGE#727L,
AGE_TYPE#728, DEP_CODE#729, COUNTRY_CODE#730, NATIONALITY#731 AS
Nationality#77, H_ID#732L, HOUSEHOLD_TYPE#733L, HOUSEHOLD_SIZE#734L,
POSTAL_CODE#735, W_DEP_CODE#736, S_ID#1903L, W_ID#738L, W_SIZE#739L,
Dummy_ID#740, Dummy_ID_1#741];;"

What I don't understand is that it says S_ID#1903L is missing,
but everything seems fine in the logical plan:
+- Join LeftOuter, (S_ID#57L = S_ID#1903L)

   :- Project [W_ID#14L, H_ID#8L, ID#0L, sex#37L, category#97L, AGE#3L,
AGE_TYPE#4, DEP_CODE#5, COUNTRY_CODE#6, Nationality#77, HOUSEHOLD_TYPE#9L,
familySize#117L, POSTAL_CODE#11, W_DEP_CODE#12, S_ID#57L,
workplaceSize#137L, Dummy_ID#16, Dummy_ID_1#17, Inc_period#157,
Time_inf#1064, Time_inc#200, Health#1014, Inf_period#1039,
infectedFamily#1355L, infectedWorker#1385L]

    +- Aggregate [S_ID#1903L], [S_ID#1903L, count(1) AS
infectedStreet#1415L]

Does anyone have a clue about this?
Thanks,

Re: Analysis Exception after join

Posted by Bernard Jesop <be...@gmail.com>.
It seems to be because of this issue:
https://issues.apache.org/jira/browse/SPARK-10925

I added a checkpoint, as suggested, to break the lineage and it worked.

Best regards,

2017-07-04 17:26 GMT+02:00 Bernard Jesop <be...@gmail.com>:


Re: Analysis Exception after join

Posted by Bernard Jesop <be...@gmail.com>.
Thanks Didac,

My bad, actually that code was incomplete; it should have been:
- dfAgg = df.groupBy("S_ID").agg(...)

I want to access the aggregated values (of dfAgg) for each row of 'df',
which is why I do a left outer join.
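To make the intent concrete: a plain-Python sketch (stdlib only, with a hypothetical 'infected' column; this illustrates the semantics, not the Spark API) of aggregating per key and joining the result back onto every row:

```python
from collections import Counter

# Rows of 'df', keyed by S_ID (hypothetical example data).
df = [
    {"S_ID": 1, "infected": True},
    {"S_ID": 1, "infected": False},
    {"S_ID": 2, "infected": True},
]

# Equivalent of dfAgg = df.groupBy("S_ID").agg(count(...)): one row per key.
counts = Counter(row["S_ID"] for row in df)
dfAgg = [{"S_ID": sid, "infectedStreet": n} for sid, n in counts.items()]

# Left outer join back: every row of df gains its group's aggregate value.
dfRes = [{**row, "infectedStreet": counts[row["S_ID"]]} for row in df]
```

Every input row survives the join, annotated with the aggregate computed over its S_ID group.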


Also, regarding the second parameter, I am using this signature of join:
(Dataset, Seq[String], String) => Dataset.

The Seq is the list of column names to use as join keys, some kind of
syntactic sugar for (df1("key1") === df2("key1") && df1("key2") ===
df2("key2") && ...),
except it will not duplicate the columns used as keys.
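To illustrate those semantics (a plain-Python sketch over lists of dicts, with hypothetical column names; not the Spark API itself): every left row is kept, matching right-side columns are appended, unmatched rows get None (null in Spark), and the key columns appear only once:

```python
def left_outer_join(left, right, keys):
    """Left outer join of two lists of dicts on the given key columns,
    mirroring df.join(other, Seq(keys...), "left_outer")."""
    # Index right rows by their key tuple.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    result = []
    for row in left:
        matches = index.get(tuple(row[k] for k in keys))
        if matches:
            for m in matches:
                merged = dict(row)
                # Append only non-key columns: keys are not duplicated.
                merged.update({k: v for k, v in m.items() if k not in keys})
                result.append(merged)
        else:
            # No match: right-side columns become None (Spark: null).
            right_cols = {k for r in right for k in r} - set(keys)
            result.append({**row, **{k: None for k in right_cols}})
    return result

df = [{"S_ID": 1, "name": "a"}, {"S_ID": 2, "name": "b"}]
dfAgg = [{"S_ID": 1, "infectedStreet": 3}]
rows = left_outer_join(df, dfAgg, ["S_ID"])
```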

2017-07-03 12:39 GMT+02:00 Didac Gil <di...@gmail.com>:


Re: Analysis Exception after join

Posted by Didac Gil <di...@gmail.com>.
With the left join, you are joining two tables.

In your case, df is the left table, dfAgg is the right table.
The second parameter should be the joining condition, right?
For instance:

 dfRes = df.join(dfAgg, $"userName" === $"name", "left_outer")

having a field in df called userName, and another in dfAgg called "name".

However, what kind of query do you want to make? dfAgg is already the df table that has been grouped by S_ID.

I guess that you are looking for something more like the following example:
dfAgg = df.groupBy("S_ID")
  .agg(
    org.apache.spark.sql.functions.count("userName").as("usersCount"),
    org.apache.spark.sql.functions.collect_set("city").as("ListofCities"),
    org.apache.spark.sql.functions.max("age").as("oldest")
  )
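What that multi-aggregation computes can be sketched in plain Python (stdlib only, hypothetical data; a sketch of the semantics, not the Spark API): one output row per S_ID carrying a count, a distinct set, and a max:

```python
# Hypothetical input rows with an S_ID grouping key.
rows = [
    {"S_ID": 1, "userName": "ann", "city": "Lund",  "age": 31},
    {"S_ID": 1, "userName": "bob", "city": "Lund",  "age": 45},
    {"S_ID": 2, "userName": "eva", "city": "Malmo", "age": 28},
]

# groupBy("S_ID"): bucket rows by key.
groups = {}
for r in rows:
    groups.setdefault(r["S_ID"], []).append(r)

# agg(count, collect_set, max): one result row per group.
dfAgg = [
    {
        "S_ID": sid,
        "usersCount": len(g),                            # count("userName")
        "ListofCities": sorted({r["city"] for r in g}),  # collect_set("city")
        "oldest": max(r["age"] for r in g),              # max("age")
    }
    for sid, g in groups.items()
]
```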

> On 3 Jul 2017, at 11:55, Bernard Jesop <be...@gmail.com> wrote:

Didac Gil de la Iglesia
PhD in Computer Science
didacgil9@gmail.com
Spain:     +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia