Posted to dev@spark.apache.org by ckhari4u <ck...@gmail.com> on 2018/01/13 03:02:03 UTC

Distinct on Map data type -- SPARK-19893

I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not see
a clear justification for why SPARK-19893 is important and needed. I have a
sample table which works fine with an earlier build of Spark 2.1.0. Now that
the latest build includes the backport of SPARK-19893, it fails with this
error:

Error in query: Cannot have map type columns in DataFrame which calls set
operations(intersect, except, etc.), but the type of column metrics is
map<string,int>;;
Distinct


*In Old Build of Spark 2.1.0, I tried the below:*


CREATE TABLE map_demo2 (
  country_id BIGINT,
  metrics MAP<STRING, INT>
);

INSERT INTO TABLE map_demo2 SELECT 2, map("chaka", 102);
INSERT INTO TABLE map_demo2 SELECT 3, map("chaka", 102);
INSERT INTO TABLE map_demo2 SELECT 4, map("mangaa", 103);


spark-sql> select distinct metrics from map_demo2;
[Stage 0:>                                                          (0 + 4)
/ 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
create the Initialization Vector used by CryptoStream
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
create the Initialization Vector used by CryptoStream
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
create the Initialization Vector used by CryptoStream
18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
create the Initialization Vector used by CryptoStream
[Stage 1:======================================================>(1
{"mangaa":103}
{"chaka":102}
{"chaka":103}
Time taken: 15.331 seconds, Fetched 3 row(s)

Here the simple distinct query works fine in Spark. Any thoughts on why the
DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types?
From the PR, it says:

// TODO: although map type is not orderable, technically map type should be
// able to be used in equality comparison, remove this type check once we
// support it.

I could not figure out what issue is caused by using the aforementioned
operators.
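[Editorial note] One way around the restriction is to deduplicate on an
order-independent representation of the map rather than on the map column
itself. A minimal Python sketch of the idea (plain dicts standing in for the
MAP column; this is an illustration of the approach, not Spark code):

```python
import json

# Rows from the example table above. Deduplicating the "metrics" maps via a
# canonical serialization (JSON with sorted keys) makes the comparison
# independent of entry order.
rows = [
    {"country_id": 2, "metrics": {"chaka": 102}},
    {"country_id": 3, "metrics": {"chaka": 102}},
    {"country_id": 4, "metrics": {"mangaa": 103}},
]

def distinct_metrics(rows):
    """Return the distinct map values, keyed by a canonical serialization."""
    seen = {}
    for row in rows:
        key = json.dumps(row["metrics"], sort_keys=True)  # order-independent key
        seen.setdefault(key, row["metrics"])
    return list(seen.values())

print(distinct_metrics(rows))  # [{'chaka': 102}, {'mangaa': 103}]
```

In Spark SQL terms this would correspond to projecting the map column through
a sortable representation (for example a JSON string) before applying
DISTINCT, at the cost of losing the native map type in the result.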





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Distinct on Map data type -- SPARK-19893

Posted by Tejas Patil <te...@gmail.com>.
There is a JIRA for making Map types orderable :
https://issues.apache.org/jira/browse/SPARK-18134 Given that this is a
non-trivial change, it will take time.
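[Editorial note] To make the scope of SPARK-18134 concrete: one natural total
ordering for maps is to sort each map's entries by key and compare the entry
sequences lexicographically. This sketch is an assumption about the general
approach, not the actual SPARK-18134 design:

```python
# Canonical sort key for a map: its entries sorted by key. Two logically-equal
# maps get identical keys regardless of entry insertion order, so this key
# also induces a total ordering across maps.
def map_ordering_key(m):
    return sorted(m.items())

maps = [{2: "b", 1: "a"}, {1: "a"}, {1: "a", 2: "b"}]
ordered = sorted(maps, key=map_ordering_key)

# maps[0] and maps[2] are the same map with different entry order:
assert map_ordering_key(maps[0]) == map_ordering_key(maps[2])
print(ordered[0])  # {1: 'a'} -- the shortest entry sequence sorts first
```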

On Sat, Jan 13, 2018 at 9:50 PM, ckhari4u <ck...@gmail.com> wrote:

> Wan, Thanks a lot,! I see the issue now.
>
> Do we have any JIRA's open for the future work to be done on this?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Distinct on Map data type -- SPARK-19893

Posted by ckhari4u <ck...@gmail.com>.
Wenchen, thanks a lot! I see the issue now.

Do we have any JIRAs open for the future work to be done on this?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Distinct on Map data type -- SPARK-19893

Posted by Wenchen Fan <cl...@gmail.com>.
A very simple example is
sql("select create_map(1, 'a', 2, 'b')")
  .union(sql("select create_map(2, 'b', 1, 'a')"))
  .distinct

By definition a map should not care about the order of its entries, so the
above query should return one record. However, it returned 2 records before
SPARK-19893.
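[Editorial note] The failure mode above can be modeled outside Spark. This
Python sketch is a hypothetical model, not Spark internals: pre-SPARK-19893,
DISTINCT effectively compared maps by their stored entry order, much like
comparing tuples of entries, so logically-equal maps could compare unequal:

```python
def naive_distinct(rows):
    """Deduplicate by stored entry order -- models the buggy behavior."""
    seen, out = set(), []
    for entries in rows:
        key = tuple(entries)          # order-sensitive comparison
        if key not in seen:
            seen.add(key)
            out.append(entries)
    return out

def canonical_distinct(rows):
    """Deduplicate by sorted entries -- the semantics a map should have."""
    seen, out = set(), []
    for entries in rows:
        key = tuple(sorted(entries))  # order-independent comparison
        if key not in seen:
            seen.add(key)
            out.append(entries)
    return out

# The two rows from the example: create_map(1,'a',2,'b') vs create_map(2,'b',1,'a'),
# represented as entry lists in their stored order.
rows = [[(1, "a"), (2, "b")], [(2, "b"), (1, "a")]]
print(len(naive_distinct(rows)))      # 2 -- the same map counted twice
print(len(canonical_distinct(rows)))  # 1 -- the correct answer
```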

On Sat, Jan 13, 2018 at 11:51 AM, HariKrishnan CK <ck...@gmail.com>
wrote:

> Hi Wan, could you please be more specific on the scenarios where it will
> give wrong results. I checked distinct and intersect operators in many use
> cases i have and could not figure out a failure scenario giving wrong
> results.
>
> Thanks
>
>
> On Jan 12, 2018 7:36 PM, "Wenchen Fan" <cl...@gmail.com> wrote:
>
> Actually Spark 2.1.0 doesn't work for your case, it may give you wrong
> result...
> We are still working on adding this feature, but before that, we should
> fail earlier instead of returning wrong result.
>
> On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u <ck...@gmail.com> wrote:
>
>> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
>> see
>> a clear justification for why SPARK 19893 is important and needed. I have
>> a
>> sample table which works fine with an earlier build of Spark 2.1.0. Now
>> that
>> the latest build is having the backport of SPARK-19893, its failing with
>> error:
>>
>> Error in query: Cannot have map type columns in DataFrame which calls set
>> operations(intersect, except, etc.), but the type of column metrics is
>> map<string,int>;;
>> Distinct
>>
>>
>> *In Old Build of Spark 2.1.0, I tried the below:*
>>
>>
>> create TABLE map_demo2
>> (
>> country_id BIGINT,
>> metrics MAP <STRING, int>
>> );
>>
>> insert into table map_demo2 select 2,map("chaka",102) ;
>> insert into table map_demo2 select 3,map("chaka",102) ;
>> insert into table map_demo2 select 4,map("mangaa",103) ;
>>
>>
>> spark-sql> select distinct metrics from map_demo2;
>> [Stage 0:>                                                          (0 +
>> 4)
>> / 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds
>> to
>> create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> [Stage 1:===============================>                       (1[Stage
>> 1:===========================================>           (1[Stage
>> 1:======================================================>(1
>> {"mangaa":103}
>> {"chaka":102}
>> {"chaka":103}
>> Time taken: 15.331 seconds, Fetched 3 row(s)
>>
>> Here the simple distinct query works fine in Spark. Any thoughts why
>> DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types.
>> From the PR, it says,
>> // TODO: although map type is not orderable, technically map type should
>> be
>> able to be
>>  +          // used inequality comparison, remove this type check once we
>> support it.
>>
>> Could not figure out the issue caused by using the aforementioned
>> operators?
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
>

Re: Distinct on Map data type -- SPARK-19893

Posted by HariKrishnan CK <ck...@gmail.com>.
Hi Wenchen, could you please be more specific about the scenarios where it
will give wrong results? I checked the distinct and intersect operators in
many use cases I have and could not find a failure scenario that gives wrong
results.

Thanks

On Jan 12, 2018 7:36 PM, "Wenchen Fan" <cl...@gmail.com> wrote:

Actually Spark 2.1.0 doesn't work for your case, it may give you wrong
result...
We are still working on adding this feature, but before that, we should
fail earlier instead of returning wrong result.

On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u <ck...@gmail.com> wrote:

> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
> see
> a clear justification for why SPARK 19893 is important and needed. I have a
> sample table which works fine with an earlier build of Spark 2.1.0. Now
> that
> the latest build is having the backport of SPARK-19893, its failing with
> error:
>
> Error in query: Cannot have map type columns in DataFrame which calls set
> operations(intersect, except, etc.), but the type of column metrics is
> map<string,int>;;
> Distinct
>
>
> *In Old Build of Spark 2.1.0, I tried the below:*
>
>
> create TABLE map_demo2
> (
> country_id BIGINT,
> metrics MAP <STRING, int>
> );
>
> insert into table map_demo2 select 2,map("chaka",102) ;
> insert into table map_demo2 select 3,map("chaka",102) ;
> insert into table map_demo2 select 4,map("mangaa",103) ;
>
>
> spark-sql> select distinct metrics from map_demo2;
> [Stage 0:>                                                          (0 + 4)
> / 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
> create the Initialization Vector used by CryptoStream
> [Stage 1:===============================>                       (1[Stage
> 1:===========================================>           (1[Stage
> 1:======================================================>(1
> {"mangaa":103}
> {"chaka":102}
> {"chaka":103}
> Time taken: 15.331 seconds, Fetched 3 row(s)
>
> Here the simple distinct query works fine in Spark. Any thoughts why
> DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types.
> From the PR, it says,
> // TODO: although map type is not orderable, technically map type should be
> able to be
>  +          // used inequality comparison, remove this type check once we
> support it.
>
> Could not figure out the issue caused by using the aforementioned
> operators?
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Distinct on Map data type -- SPARK-19893

Posted by Wenchen Fan <cl...@gmail.com>.
Actually, Spark 2.1.0 doesn't work for your case; it may give you a wrong
result.
We are still working on adding this feature, but until then we should fail
early instead of returning a wrong result.

On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u <ck...@gmail.com> wrote:

> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
> see
> a clear justification for why SPARK 19893 is important and needed. I have a
> sample table which works fine with an earlier build of Spark 2.1.0. Now
> that
> the latest build is having the backport of SPARK-19893, its failing with
> error:
>
> Error in query: Cannot have map type columns in DataFrame which calls set
> operations(intersect, except, etc.), but the type of column metrics is
> map<string,int>;;
> Distinct
>
>
> *In Old Build of Spark 2.1.0, I tried the below:*
>
>
> create TABLE map_demo2
> (
> country_id BIGINT,
> metrics MAP <STRING, int>
> );
>
> insert into table map_demo2 select 2,map("chaka",102) ;
> insert into table map_demo2 select 3,map("chaka",102) ;
> insert into table map_demo2 select 4,map("mangaa",103) ;
>
>
> spark-sql> select distinct metrics from map_demo2;
> [Stage 0:>                                                          (0 + 4)
> / 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
> create the Initialization Vector used by CryptoStream
> [Stage 1:===============================>                       (1[Stage
> 1:===========================================>           (1[Stage
> 1:======================================================>(1
> {"mangaa":103}
> {"chaka":102}
> {"chaka":103}
> Time taken: 15.331 seconds, Fetched 3 row(s)
>
> Here the simple distinct query works fine in Spark. Any thoughts why
> DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types.
> From the PR, it says,
> // TODO: although map type is not orderable, technically map type should be
> able to be
>  +          // used inequality comparison, remove this type check once we
> support it.
>
> Could not figure out the issue caused by using the aforementioned
> operators?
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>