You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Valentin Ursu <va...@gmail.com> on 2017/07/11 16:58:55 UTC

Query via Spark Thrift Server return wrong result.

Hello,

Short description: A SQL query sent via Thrift server returns an
inexplicable response. Running the same (exact same) query inside Apache
Zeppelin or submitting a job returns the correct result. Furthermore, a
similar table returns the correct response in both cases.

Details:
I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test in on
Spark 2.1.0 but if helpful I can install it and test.

I have 2 Hive metastore tables which are saved by a spark batch job.
orders_2years is processed and saveAsTable()
orders_30days is filtered from _2years based on a column containing date
and saveAsTable() (same batch job creates both)

I checked and both tables contain this record and it is the only record
that satisfies all the conditions in the query:

| customer_id | order_id | doc_id | category | vendor_id|
+-------------+----------+---------+----------+----------+
| 916339 | 25144502 | 5596579 | 1455 | 1 |

The exact same query is run using a PHP application connecting to Thrift
and via Zeppelin/Spark using sqlContext.sql("")

The query on table _30days, in Zeppelin > good result
The query on table _30days, via Thrift > bad result
The query on table _2years, via Thrift > good result

Furthermore, changing _30days in _2years in b and leaving _30days to create
a > good result.

The query is:

SELECT a.customer_id_custom,
collect_list(b.id_comanda) AS b__orderIds,
collect_list(b.doc_id) AS b__docIds,
collect_list(b.category_id) AS b__cat,
collect_list(b.vendor_id) AS b__vendor
FROM ((SELECT customer_id AS customer_id_custom
FROM orders_30days
WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND
(fullDate > '2017-06-08 12:07' AND
fullDate < '2017-07-07 12:07')
GROUP BY customer_id
HAVING count(1) >= 1 AND
SUM(total_price_with_vat) BETWEEN 1 AND 999) a )
INNER JOIN orders_30days AS b
ON b.customer_id = a.customer_id_custom AND
1 = 1 AND b.category_id IN (1455) AND
b.doc_id IN (5596579) AND b.vendor_id IN (1) AND
(b.fullDate > '2017-06-08 12:07' AND
b.fullDate < '2017-07-07 12:07')
WHERE 1 = 1
GROUP BY a.customer_id_custom

If you're wondering why I'm so specific with my query, the original is a
lot more complex. For example a is actually obtained by joining 5 tables
but I tried simplifying it as much as I could while obtaining the same
effect.

Good result is:

| customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
+--------------------+---------------+-----------+--------+-----------+
| 916339 | [25144502] | [5596579] | [1455] | [1] |

Bad result:

| customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
+--------------------+---------------+-----------+--------+-----------+
| 916339 | [null] | [1] | [null] | [1455] |

Notice how some columns appear to be misplaced while others just return null

Re: Query via Spark Thrift Server return wrong result.

Posted by Valentin Ursu <va...@gmail.com>.

Apologies, wrong shortcuts in gmail and I managed to send the mail before I
finished editing the query.
I edited it below.


On Tue, Jul 11, 2017 at 7:58 PM, Valentin Ursu <
valentindaniel.ursu@gmail.com> wrote:

> Hello,
>
> Short description: A SQL query sent via Thrift server returns an
> inexplicable response. Running the same (exact same) query inside Apache
> Zeppelin or submitting a job returns the correct result. Furthermore, a
> similar table returns the correct response in both cases.
>
> Details:
> I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test in on
> Spark 2.1.0 but if helpful I can install it and test.
>
> I have 2 Hive metastore tables which are saved by a spark batch job.
> orders_2years is processed and saveAsTable()
> orders_30days is filtered from _2years based on a column containing date
> and saveAsTable() (same batch job creates both)
>
> I checked and both tables contain this record and it is the only record
> that satisfies all the conditions in the query:
>
> | customer_id | order_id | doc_id | category | vendor_id|
> +-------------+----------+---------+----------+----------+
> | 916339 | 25144502 | 5596579 | 1455 | 1 |
>
> The exact same query is run using a PHP application connecting to Thrift
> and via Zeppelin/Spark using sqlContext.sql("")
>
> The query on table _30days, in Zeppelin > good result
> The query on table _30days, via Thrift > bad result
> The query on table _2years, via Thrift > good result
>
> Furthermore, changing _30days in _2years in b and leaving _30days to
> create a > good result.
>
> The query is:
>
> SELECT a.customer_id_custom,
> collect_list(b.order_id) AS b__orderIds,
> collect_list(b.doc_id) AS b__docIds,
> collect_list(b.category_id) AS b__cat,
> collect_list(b.vendor_id) AS b__vendor
> FROM ((SELECT customer_id AS customer_id_custom
> FROM orders_30days
> WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND
> order_id IN (25144502) AND
>
                         (fullDate > '2017-06-08 12:07' AND

> fullDate < '2017-07-07 12:07')
> GROUP BY customer_id
> HAVING count(1) >= 1 AND
> SUM(total_price_with_vat) BETWEEN 1 AND 999) a )
> INNER JOIN orders_30days AS b
> ON b.customer_id = a.customer_id_custom AND
> 1 = 1 AND b.category_id IN (1455) AND
> b.vendor_id IN (1) AND
> (b.fullDate > '2017-06-08 12:07' AND
> b.fullDate < '2017-07-07 12:07')
> WHERE 1 = 1
> GROUP BY a.customer_id_custom
>
> If you're wondering why I'm so specific with my query, the original is a
> lot more complex. For example a is actually obtained by joining 5 tables
> but I tried simplifying it as much as I could while obtaining the same
> effect.
>
> Good result is:
>
> | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
> +--------------------+---------------+-----------+--------+-----------+
> | 916339 | [25144502] | [5596579] | [1455] | [1] |
>
> Bad result:
>
> | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
> +--------------------+---------------+-----------+--------+-----------+
> | 916339 | [null] | [1] | [null] | [1455] |
>
> Notice how some columns appear to be misplaced (b__cat is actually in
> b__vendor) while others just return null
>
>
>