Posted to user@spark.apache.org by Сергей Романов <ro...@inbox.ru.INVALID> on 2016/09/03 11:54:10 UTC

Re[4]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Hi, Mich,
I don't think it is related to Hive or parquet partitioning. The same issue happens while working with a non-partitioned parquet file using the Python DataFrame API.
Please take a look at the following example:
$ hdfs dfs -ls /user/test   // I copied partition dt=2016-07-28 to a separate standalone path.
Found 1 items
-rw-r--r--   2 hdfs supergroup   33568823 2016-09-03 11:11 /user/test/part-00000

$ ./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 57) ).collect()
*response time over 3 runs, skipping the first run*
0.8370630741119385
0.22276782989501953
0.7722570896148682

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 58) ).collect()
*response time over 3 runs, skipping the first run*
4.40575098991394
4.320873022079468
3.2484099864959717
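
For reference, the timings above can be reproduced with a small loop like the following. This is only a sketch of the harness (it assumes a SparkSession named 'spark' and the same HDFS path), not the exact script used for the numbers above:

import timeit
df = spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test')
for n in (57, 58):
    # run 4 times and keep the last 3, so the first (warm-up) run is skipped
    runs = [timeit.timeit(df.groupBy().sum(*(['dd_convs'] * n)).collect, number=1) for _ in range(4)]
    print n, runs[1:]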

$ hdfs dfs -get /user/test/part-00000 .
$ parquet-meta part-00000 
file:                           file:/data/dump/part-00000 
creator:                        parquet-mr 

file schema:                    hive_schema 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
actual_dsp_fee:                 OPTIONAL FLOAT R:0 D:1
actual_pgm_fee:                 OPTIONAL FLOAT R:0 D:1
actual_ssp_fee:                 OPTIONAL FLOAT R:0 D:1
advertiser_id:                  OPTIONAL INT32 R:0 D:1
advertiser_spent:               OPTIONAL DOUBLE R:0 D:1
anomaly_clicks:                 OPTIONAL INT64 R:0 D:1
anomaly_conversions_filtered:   OPTIONAL INT64 R:0 D:1
anomaly_conversions_unfiltered: OPTIONAL INT64 R:0 D:1
anomaly_decisions:              OPTIONAL FLOAT R:0 D:1
bid_price:                      OPTIONAL FLOAT R:0 D:1
campaign_id:                    OPTIONAL INT32 R:0 D:1
click_prob:                     OPTIONAL FLOAT R:0 D:1
clicks:                         OPTIONAL INT64 R:0 D:1
clicks_static:                  OPTIONAL INT64 R:0 D:1
conv_prob:                      OPTIONAL FLOAT R:0 D:1
conversion_id:                  OPTIONAL INT64 R:0 D:1
conversions:                    OPTIONAL INT64 R:0 D:1
creative_id:                    OPTIONAL INT32 R:0 D:1
dd_convs:                       OPTIONAL INT64 R:0 D:1
decisions:                      OPTIONAL FLOAT R:0 D:1
dmp_liveramp_margin:            OPTIONAL FLOAT R:0 D:1
dmp_liveramp_payout:            OPTIONAL FLOAT R:0 D:1
dmp_nielsen_margin:             OPTIONAL FLOAT R:0 D:1
dmp_nielsen_payout:             OPTIONAL FLOAT R:0 D:1
dmp_rapleaf_margin:             OPTIONAL FLOAT R:0 D:1
dmp_rapleaf_payout:             OPTIONAL FLOAT R:0 D:1
e:                              OPTIONAL FLOAT R:0 D:1
expected_cpa:                   OPTIONAL FLOAT R:0 D:1
expected_cpc:                   OPTIONAL FLOAT R:0 D:1
expected_payout:                OPTIONAL FLOAT R:0 D:1
first_impressions:              OPTIONAL INT64 R:0 D:1
fraud_clicks:                   OPTIONAL INT64 R:0 D:1
fraud_impressions:              OPTIONAL INT64 R:0 D:1
g:                              OPTIONAL FLOAT R:0 D:1
impressions:                    OPTIONAL FLOAT R:0 D:1
line_item_id:                   OPTIONAL INT32 R:0 D:1
mail_type:                      OPTIONAL BINARY O:UTF8 R:0 D:1
noads:                          OPTIONAL FLOAT R:0 D:1
predict_version:                OPTIONAL INT64 R:0 D:1
publisher_id:                   OPTIONAL INT32 R:0 D:1
publisher_revenue:              OPTIONAL DOUBLE R:0 D:1
pvc:                            OPTIONAL INT64 R:0 D:1
second_price:                   OPTIONAL FLOAT R:0 D:1
thirdparty_margin:              OPTIONAL FLOAT R:0 D:1
thirdparty_payout:              OPTIONAL FLOAT R:0 D:1

row group 1:                    RC:769163 TS:40249546 OFFSET:4 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
actual_dsp_fee:                  FLOAT SNAPPY DO:0 FPO:4 SZ:1378278/1501551/1.09 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
actual_pgm_fee:                  FLOAT SNAPPY DO:0 FPO:1378282 SZ:39085/42374/1.08 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
actual_ssp_fee:                  FLOAT SNAPPY DO:0 FPO:1417367 SZ:1426888/1553337/1.09 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
advertiser_id:                   INT32 SNAPPY DO:0 FPO:2844255 SZ:339061/572962/1.69 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
advertiser_spent:                DOUBLE SNAPPY DO:0 FPO:3183316 SZ:2731185/3788429/1.39 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED
anomaly_clicks:                  INT64 SNAPPY DO:0 FPO:5914501 SZ:35/33/0.94 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
anomaly_conversions_filtered:    INT64 SNAPPY DO:0 FPO:5914536 SZ:365/351/0.96 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
anomaly_conversions_unfiltered:  INT64 SNAPPY DO:0 FPO:5914901 SZ:2734/3766/1.38 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
anomaly_decisions:               FLOAT SNAPPY DO:0 FPO:5917635 SZ:35/33/0.94 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
bid_price:                       FLOAT SNAPPY DO:0 FPO:5917670 SZ:2783971/3076795/1.11 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
campaign_id:                     INT32 SNAPPY DO:0 FPO:8701641 SZ:562105/881397/1.57 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
click_prob:                      FLOAT SNAPPY DO:0 FPO:9263746 SZ:2968859/3076795/1.04 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
clicks:                          INT64 SNAPPY DO:0 FPO:12232605 SZ:126877/281792/2.22 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
clicks_static:                   INT64 SNAPPY DO:0 FPO:12359482 SZ:5775/8704/1.51 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
conv_prob:                       FLOAT SNAPPY DO:0 FPO:12365257 SZ:2463487/3076795/1.25 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
conversion_id:                   INT64 SNAPPY DO:0 FPO:14828744 SZ:261396/438714/1.68 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
conversions:                     INT64 SNAPPY DO:0 FPO:15090140 SZ:27433/55397/2.02 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
creative_id:                     INT32 SNAPPY DO:0 FPO:15117573 SZ:1252055/1404477/1.12 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dd_convs:                        INT64 SNAPPY DO:0 FPO:16369628 SZ:27225/55243/2.03 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
decisions:                       FLOAT SNAPPY DO:0 FPO:16396853 SZ:1016802/1271368/1.25 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_liveramp_margin:             FLOAT SNAPPY DO:0 FPO:17413655 SZ:117830/176685/1.50 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_liveramp_payout:             FLOAT SNAPPY DO:0 FPO:17531485 SZ:117830/176685/1.50 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_nielsen_margin:              FLOAT SNAPPY DO:0 FPO:17649315 SZ:53361/92349/1.73 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_nielsen_payout:              FLOAT SNAPPY DO:0 FPO:17702676 SZ:53361/92349/1.73 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_rapleaf_margin:              FLOAT SNAPPY DO:0 FPO:17756037 SZ:41845/68403/1.63 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_rapleaf_payout:              FLOAT SNAPPY DO:0 FPO:17797882 SZ:41845/68403/1.63 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
e:                               FLOAT SNAPPY DO:0 FPO:17839727 SZ:107662/204625/1.90 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
expected_cpa:                    FLOAT SNAPPY DO:0 FPO:17947389 SZ:1576432/1679656/1.07 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
expected_cpc:                    FLOAT SNAPPY DO:0 FPO:19523821 SZ:85903/100362/1.17 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
expected_payout:                 FLOAT SNAPPY DO:0 FPO:19609724 SZ:2772807/3076795/1.11 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
first_impressions:               INT64 SNAPPY DO:0 FPO:22382531 SZ:1000773/1284812/1.28 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
fraud_clicks:                    INT64 SNAPPY DO:0 FPO:23383304 SZ:1906/2635/1.38 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
fraud_impressions:               INT64 SNAPPY DO:0 FPO:23385210 SZ:9825/17177/1.75 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
g:                               FLOAT SNAPPY DO:0 FPO:23395035 SZ:2749688/2884522/1.05 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED
impressions:                     FLOAT SNAPPY DO:0 FPO:26144723 SZ:1091228/1315488/1.21 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
line_item_id:                    INT32 SNAPPY DO:0 FPO:27235951 SZ:878489/1182066/1.35 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
mail_type:                       BINARY SNAPPY DO:0 FPO:28114440 SZ:109886/139143/1.27 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
noads:                           FLOAT SNAPPY DO:0 FPO:28224326 SZ:9588/13934/1.45 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
predict_version:                 INT64 SNAPPY DO:0 FPO:28233914 SZ:6222/7272/1.17 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
publisher_id:                    INT32 SNAPPY DO:0 FPO:28240136 SZ:12955/12932/1.00 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
publisher_revenue:               DOUBLE SNAPPY DO:0 FPO:28253091 SZ:2657357/3722991/1.40 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED
pvc:                             INT64 SNAPPY DO:0 FPO:30910448 SZ:5685/9008/1.58 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
second_price:                    FLOAT SNAPPY DO:0 FPO:30916133 SZ:2263764/2274841/1.00 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
thirdparty_margin:               FLOAT SNAPPY DO:0 FPO:33179897 SZ:192433/278050/1.44 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
thirdparty_payout:               FLOAT SNAPPY DO:0 FPO:33372330 SZ:192432/278050/1.44 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED


>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 57) ).explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), ... 33 more fields])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), ... 33 more fields])
      +- *BatchedScan parquet [dd_convs#3845L] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/test, PushedFilters: [], ReadSchema: struct<dd_convs:bigint>


>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 58) ).explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), ... 34 more fields])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), ... 34 more fields])
      +- *BatchedScan parquet [dd_convs#4268L] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/test, PushedFilters: [], ReadSchema: struct<dd_convs:bigint>
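
To see how big the generated whole-stage code gets for these two plans, the generated Java source can also be dumped. A minimal sketch (assuming the EXPLAIN CODEGEN syntax is available in this Spark 2.0 build and a SparkSession named 'spark'):

spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').createOrReplaceTempView('test')
sums = ', '.join(['SUM(dd_convs)'] * 58)
# prints the generated code for each whole-stage codegen subtree
print spark.sql('EXPLAIN CODEGEN SELECT %s FROM test' % sums).collect()[0][0]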



>Saturday, 3 September 2016, 0:22 +03:00 from Mich Talebzadeh <mi...@gmail.com>:
>
>Since you are using Spark Thrift Server (which in turn uses Hive Thrift Server), I suspect it uses the Hive optimiser, which would mean that stats do matter. However, that may be just an assumption.
>
>Have you partitioned these parquet tables? 
>
>Is it worth logging in to Hive and running the same queries there with EXPLAIN EXTENDED select ......? Can you see whether the relevant partition is picked up?
>
>HTH
>
>Dr Mich Talebzadeh
> 
>LinkedIn    https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
>http://talebzadehmich.wordpress.com
>
>Disclaimer:  Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction. 
> 
>On 2 September 2016 at 12:03, Сергей Романов  < romanovsa@inbox.ru > wrote:
>>Hi, Mich,
>>
>>Column x29 does not seem to be special in any way. It's a newly created table and I did not calculate stats for any columns. Actually, I can sum a single column several times in a query and hit a dramatic performance drop at some "magic" point. Setting "set spark.sql.codegen.wholeStage=false" makes all requests run in a similarly slow time (which is slower than the original time).
>>PS. Do stats even help Spark queries?
>>SELECT field, SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field
>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`,  SUM(`dd_convs`) AS `dd_convs`  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30;
>>30 rows selected (1.37 seconds)
>>30 rows selected (1.382 seconds)
>>30 rows selected (1.399 seconds)
>>
>>SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field
>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`,  SUM(`actual_dsp_fee`) AS `actual_dsp_fee`  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>30 rows selected (1.379 seconds)
>>30 rows selected (1.382 seconds)
>>30 rows selected (1.377 seconds)
>>SELECT field, SUM(x28) x repeat 40 times  FROM parquet_table WHERE partition = 1 GROUP BY field -> 1.774s
>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30;
>>30 rows selected (1.774 seconds)
>>30 rows selected (1.721 seconds)
>>30 rows selected (2.813 seconds)
>>SELECT field, SUM(x28) x repeat 41 times FROM parquet_table WHERE partition = 1 GROUP BY field -> 7.314s
>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30;
>>30 rows selected (7.314 seconds)
>>30 rows selected (7.27 seconds)
>>30 rows selected (7.279 seconds)
>>SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1 -> 1.378s
>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28';
>>plan  == Physical Plan ==
>>*HashAggregate(keys=[], functions=[sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), ... 33 more fields])
>>+- Exchange SinglePartition
>>   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), ... 33 more fields])
>>      +- *Project [dd_convs#159025L]
>>         +- *BatchedScan parquet slicer.573_slicer_rnd_13[dd_convs#159025L,dt#159005,etl_path#159006] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer..., PushedFilters: [], ReadSchema: struct<dd_convs:bigint>
>>
>>1 row selected (1.378 seconds)
>>1 row selected (1.849 seconds)
>>1 row selected (2.641 seconds)
>>SELECT SUM(x28) x repeat 58 times FROM parquet_table WHERE partition = 1 -> 5.733s
>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28';
>>plan  == Physical Plan ==
>>*HashAggregate(keys=[], functions=[sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), ... 34 more fields])
>>+- Exchange SinglePartition
>>   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), ... 34 more fields])
>>      +- *Project [dd_convs#158694L]
>>         +- *BatchedScan parquet slicer.573_slicer_rnd_13[dd_convs#158694L,dt#158674,etl_path#158675] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer..., PushedFilters: [], ReadSchema: struct<dd_convs:bigint>
>>
>>1 row selected (5.733 seconds)
>>1 row selected (5.693 seconds)
>>1 row selected (4.67 seconds)
>>
>>SET spark.sql.codegen.wholeStage=false;
>>Without CodeGen: SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1 -> 13.712s
>>Without CodeGen: SELECT SUM(x28) x repeat 58 times FROM parquet_table WHERE partition = 1 -> 13.405s
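>>
>>The same toggle can also be flipped from the DataFrame API to compare both modes in one pyspark session. A minimal sketch (assuming a SparkSession named 'spark' with access to the same Hive table; the 41-sum query is the slow case from above):
>>
>>import timeit
>>sums = ', '.join(['SUM(`dd_convs`)'] * 41)
>>q = 'SELECT `advertiser_id`, %s FROM `slicer`.`573_slicer_rnd_13` WHERE dt = "2016-07-28" GROUP BY `advertiser_id`' % sums
>>for enabled in ('true', 'false'):
>>    # spark.sql.codegen.wholeStage switches whole-stage code generation on/off
>>    spark.conf.set('spark.sql.codegen.wholeStage', enabled)
>>    print enabled, timeit.timeit(spark.sql(q).collect, number=1)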
>>
>>
>>>Thursday, 1 September 2016, 19:35 +03:00 from Mich Talebzadeh < mich.talebzadeh@gmail.com >:
>>>
>>>
>>>What happens if you run the following query on its own? How long does it take?
>>>
>>>SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field
>>>
>>>Have stats been updated for all columns in Hive? And what is the type of the x29 field?
>>>
>>>HTH
>>>
>>>
>>>Dr Mich Talebzadeh
>>> 
>>>LinkedIn    https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> 
>>>http://talebzadehmich.wordpress.com
>>>
>>>Disclaimer:  Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction. 
>>> 
>>>On 1 September 2016 at 16:55, Сергей Романов  < romanovsa@inbox.ru.invalid > wrote:
>>>>Hi, 
>>>>
>>>>When I run a query like "SELECT field, SUM(x1), SUM(x2)... SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field" it runs in under 2 seconds, but when I add just one more aggregate field to the query "SELECT field, SUM(x1), SUM(x2)... SUM(x28), SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field" it runs in about 12 seconds. 
>>>>
>>>>Why does this happen? Can I make the second query run as fast as the first one? I tried browsing logs in TRACE mode and comparing CODEGEN output, but everything looks pretty much the same except the execution time.
>>>>
>>>>Can this be related to SPARK-17115 ?
>>>>
>>>>I'm using Spark 2.0 Thrift Server over YARN/HDFS with partitioned parquet hive tables. 
>>>>
>>>>Complete example using beeline: 
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> DESCRIBE EXTENDED `slicer`.`573_slicer_rnd_13`; 
>>>>col_name,data_type,comment 
>>>>actual_dsp_fee,float,NULL 
>>>>actual_pgm_fee,float,NULL 
>>>>actual_ssp_fee,float,NULL 
>>>>advertiser_id,int,NULL 
>>>>advertiser_spent,double,NULL 
>>>>anomaly_clicks,bigint,NULL 
>>>>anomaly_conversions_filtered,bigint,NULL 
>>>>anomaly_conversions_unfiltered,bigint,NULL 
>>>>anomaly_decisions,float,NULL 
>>>>bid_price,float,NULL 
>>>>campaign_id,int,NULL 
>>>>click_prob,float,NULL 
>>>>clicks,bigint,NULL 
>>>>clicks_static,bigint,NULL 
>>>>conv_prob,float,NULL 
>>>>conversion_id,bigint,NULL 
>>>>conversions,bigint,NULL 
>>>>creative_id,int,NULL 
>>>>dd_convs,bigint,NULL 
>>>>decisions,float,NULL 
>>>>dmp_liveramp_margin,float,NULL 
>>>>dmp_liveramp_payout,float,NULL 
>>>>dmp_nielsen_margin,float,NULL 
>>>>dmp_nielsen_payout,float,NULL 
>>>>dmp_rapleaf_margin,float,NULL 
>>>>dmp_rapleaf_payout,float,NULL 
>>>>e,float,NULL 
>>>>expected_cpa,float,NULL 
>>>>expected_cpc,float,NULL 
>>>>expected_payout,float,NULL 
>>>>first_impressions,bigint,NULL 
>>>>fraud_clicks,bigint,NULL 
>>>>fraud_impressions,bigint,NULL 
>>>>g,float,NULL 
>>>>impressions,float,NULL 
>>>>line_item_id,int,NULL 
>>>>mail_type,string,NULL 
>>>>noads,float,NULL 
>>>>predict_version,bigint,NULL 
>>>>publisher_id,int,NULL 
>>>>publisher_revenue,double,NULL 
>>>>pvc,bigint,NULL 
>>>>second_price,float,NULL 
>>>>thirdparty_margin,float,NULL 
>>>>thirdparty_payout,float,NULL 
>>>>dt,string,NULL 
>>>>etl_path,string,NULL 
>>>># Partition Information,, 
>>>># col_name,data_type,comment 
>>>>dt,string,NULL 
>>>>etl_path,string,NULL 
>>>>
>>>>
>>>>data_type  CatalogTable( 
>>>>        Table: `slicer`.`573_slicer_rnd_13` 
>>>>        Owner: spark 
>>>>        Created: Fri Aug 12 12:30:20 UTC 2016 
>>>>        Last Access: Thu Jan 01 00:00:00 UTC 1970 
>>>>        Type: MANAGED 
>>>>        Schema: [`actual_dsp_fee` float, `actual_pgm_fee` float, `actual_ssp_fee` float, `advertiser_id` int, `advertiser_spent` double, `anomaly_clicks` bigint, `anomaly_conversions_filtered` bigint, `anomaly_conversions_unfiltered` bigint, `anomaly_decisions` float, `bid_price` float, `campaign_id` int, `click_prob` float, `clicks` bigint, `clicks_static` bigint, `conv_prob` float, `conversion_id` bigint, `conversions` bigint, `creative_id` int, `dd_convs` bigint, `decisions` float, `dmp_liveramp_margin` float, `dmp_liveramp_payout` float, `dmp_nielsen_margin` float, `dmp_nielsen_payout` float, `dmp_rapleaf_margin` float, `dmp_rapleaf_payout` float, `e` float, `expected_cpa` float, `expected_cpc` float, `expected_payout` float, `first_impressions` bigint, `fraud_clicks` bigint, `fraud_impressions` bigint, `g` float, `impressions` float, `line_item_id` int, `mail_type` string, `noads` float, `predict_version` bigint, `publisher_id` int, `publisher_revenue` double, `pvc` bigint, `second_price` float, `thirdparty_margin` float, `thirdparty_payout` float, `dt` string, `etl_path` string] 
>>>>        Partition Columns: [`dt`, `etl_path`] 
>>>>        Properties: [transient_lastDdlTime=1471005020] 
>>>>        Storage(Location: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer_rnd_13, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))
>>>>comment     
>>>>
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee` FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>>>plan  == Physical Plan == 
>>>>CollectLimit 30 
>>>>+- *HashAggregate(keys=[advertiser_id#13866], functions=[sum(conversions#13879L), sum(cast(dmp_rapleaf_margin#13887 as double)), sum(pvc#13904L), sum(cast(dmp_nielsen_payout#13886 as double)), sum(fraud_clicks#13894L), sum(cast(impressions#13897 as double)), sum(cast(conv_prob#13877 as double)), sum(cast(dmp_liveramp_payout#13884 as double)), sum(cast(decisions#13882 as double)), sum(fraud_impressions#13895L), sum(advertiser_spent#13867), sum(cast(actual_ssp_fee#13865 as double)), sum(cast(dmp_nielsen_margin#13885 as double)), sum(first_impressions#13893L), sum(clicks#13875L), sum(cast(second_price#13905 as double)), sum(cast(click_prob#13874 as double)), sum(clicks_static#13876L), sum(cast(expected_payout#13892 as double)), sum(cast(bid_price#13872 as double)), sum(cast(noads#13900 as double)), sum(cast(e#13889 as double)), sum(cast(g#13896 as double)), sum(publisher_revenue#13903), ... 5 more fields]) 
>>>>   +- Exchange hashpartitioning(advertiser_id#13866, 3) 
>>>>      +- *HashAggregate(keys=[advertiser_id#13866], functions=[partial_sum(conversions#13879L), partial_sum(cast(dmp_rapleaf_margin#13887 as double)), partial_sum(pvc#13904L), partial_sum(cast(dmp_nielsen_payout#13886 as double)), partial_sum(fraud_clicks#13894L), partial_sum(cast(impressions#13897 as double)), partial_sum(cast(conv_prob#13877 as double)), partial_sum(cast(dmp_liveramp_payout#13884 as double)), partial_sum(cast(decisions#13882 as double)), partial_sum(fraud_impressions#13895L), partial_sum(advertiser_spent#13867), partial_sum(cast(actual_ssp_fee#13865 as double)), partial_sum(cast(dmp_nielsen_margin#13885 as double)), partial_sum(first_impressions#13893L), partial_sum(clicks#13875L), partial_sum(cast(second_price#13905 as double)), partial_sum(cast(click_prob#13874 as double)), partial_sum(clicks_static#13876L), partial_sum(cast(expected_payout#13892 as double)), partial_sum(cast(bid_price#13872 as double)), partial_sum(cast(noads#13900 as double)), partial_sum(cast(e#13889 as double)), partial_sum(cast(g#13896 as double)), partial_sum(publisher_revenue#13903), ... 5 more fields]) 
>>>>         +- *Project [actual_dsp_fee#13863, actual_pgm_fee#13864, actual_ssp_fee#13865, advertiser_id#13866, advertiser_spent#13867, bid_price#13872, click_prob#13874, clicks#13875L, clicks_static#13876L, conv_prob#13877, conversions#13879L, dd_convs#13881L, decisions#13882, dmp_liveramp_margin#13883, dmp_liveramp_payout#13884, dmp_nielsen_margin#13885, dmp_nielsen_payout#13886, dmp_rapleaf_margin#13887, dmp_rapleaf_payout#13888, e#13889, expected_payout#13892, first_impressions#13893L, fraud_clicks#13894L, fraud_impressions#13895L, ... 6 more fields] 
>>>>            +- *BatchedScan parquet slicer.573_slicer_rnd_13[actual_dsp_fee#13863,actual_pgm_fee#13864,actual_ssp_fee#13865,advertiser_id#13866,advertiser_spent#13867,bid_price#13872,click_prob#13874,clicks#13875L,clicks_static#13876L,conv_prob#13877,conversions#13879L,dd_convs#13881L,decisions#13882,dmp_liveramp_margin#13883,dmp_liveramp_payout#13884,dmp_nielsen_margin#13885,dmp_nielsen_payout#13886,dmp_rapleaf_margin#13887,dmp_rapleaf_payout#13888,e#13889,expected_payout#13892,first_impressions#13893L,fraud_clicks#13894L,fraud_impressions#13895L,... 8 more fields] Format: ParquetFormat, InputPaths: hdfs:// spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer.. ., PushedFilters: [], ReadSchema: struct<actual_dsp_fee:float,actual_pgm_fee:float,actual_ssp_fee:float,advertiser_id:int,advertise... 
>>>>
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee` FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>>>
>>>>(results for three runs) 
>>>>30 rows selected (11.904 seconds) 
>>>>30 rows selected (11.703 seconds) 
>>>>30 rows selected (11.52 seconds) 
>>>>
>>>>XXX 
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>>>plan  == Physical Plan == 
>>>>CollectLimit 30 
>>>>+- *HashAggregate(keys=[advertiser_id#15269], functions=[sum(conversions#15282L), sum(cast(dmp_rapleaf_margin#15290 as double)), sum(pvc#15307L), sum(cast(dmp_nielsen_payout#15289 as double)), sum(fraud_clicks#15297L), sum(cast(impressions#15300 as double)), sum(cast(conv_prob#15280 as double)), sum(cast(dmp_liveramp_payout#15287 as double)), sum(cast(decisions#15285 as double)), sum(fraud_impressions#15298L), sum(advertiser_spent#15270), sum(cast(actual_ssp_fee#15268 as double)), sum(cast(dmp_nielsen_margin#15288 as double)), sum(first_impressions#15296L), sum(clicks#15278L), sum(cast(second_price#15308 as double)), sum(cast(click_prob#15277 as double)), sum(clicks_static#15279L), sum(cast(expected_payout#15295 as double)), sum(cast(bid_price#15275 as double)), sum(cast(noads#15303 as double)), sum(cast(e#15292 as double)), sum(cast(g#15299 as double)), sum(publisher_revenue#15306), ... 4 more fields]) 
>>>>   +- Exchange hashpartitioning(advertiser_id#15269, 3) 
>>>>      +- *HashAggregate(keys=[advertiser_id#15269], functions=[partial_sum(conversions#15282L), partial_sum(cast(dmp_rapleaf_margin#15290 as double)), partial_sum(pvc#15307L), partial_sum(cast(dmp_nielsen_payout#15289 as double)), partial_sum(fraud_clicks#15297L), partial_sum(cast(impressions#15300 as double)), partial_sum(cast(conv_prob#15280 as double)), partial_sum(cast(dmp_liveramp_payout#15287 as double)), partial_sum(cast(decisions#15285 as double)), partial_sum(fraud_impressions#15298L), partial_sum(advertiser_spent#15270), partial_sum(cast(actual_ssp_fee#15268 as double)), partial_sum(cast(dmp_nielsen_margin#15288 as double)), partial_sum(first_impressions#15296L), partial_sum(clicks#15278L), partial_sum(cast(second_price#15308 as double)), partial_sum(cast(click_prob#15277 as double)), partial_sum(clicks_static#15279L), partial_sum(cast(expected_payout#15295 as double)), partial_sum(cast(bid_price#15275 as double)), partial_sum(cast(noads#15303 as double)), partial_sum(cast(e#15292 as double)), partial_sum(cast(g#15299 as double)), partial_sum(publisher_revenue#15306), ... 4 more fields]) 
>>>>         +- *Project [actual_pgm_fee#15267, actual_ssp_fee#15268, advertiser_id#15269, advertiser_spent#15270, bid_price#15275, click_prob#15277, clicks#15278L, clicks_static#15279L, conv_prob#15280, conversions#15282L, dd_convs#15284L, decisions#15285, dmp_liveramp_margin#15286, dmp_liveramp_payout#15287, dmp_nielsen_margin#15288, dmp_nielsen_payout#15289, dmp_rapleaf_margin#15290, dmp_rapleaf_payout#15291, e#15292, expected_payout#15295, first_impressions#15296L, fraud_clicks#15297L, fraud_impressions#15298L, g#15299, ... 5 more fields] 
>>>>            +- *BatchedScan parquet slicer.573_slicer_rnd_13[actual_pgm_fee#15267,actual_ssp_fee#15268,advertiser_id#15269,advertiser_spent#15270,bid_price#15275,click_prob#15277,clicks#15278L,clicks_static#15279L,conv_prob#15280,conversions#15282L,dd_convs#15284L,decisions#15285,dmp_liveramp_margin#15286,dmp_liveramp_payout#15287,dmp_nielsen_margin#15288,dmp_nielsen_payout#15289,dmp_rapleaf_margin#15290,dmp_rapleaf_payout#15291,e#15292,expected_payout#15295,first_impressions#15296L,fraud_clicks#15297L,fraud_impressions#15298L,g#15299,... 7 more fields] Format: ParquetFormat, InputPaths: hdfs:// spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer.. ., PushedFilters: [], ReadSchema: struct<actual_pgm_fee:float,actual_ssp_fee:float,advertiser_id:int,advertiser_spent:double,bid_pr... 
>>>>
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>>>
>>>>(results for three runs) 
>>>>30 rows selected (2.158 seconds) 
>>>>30 rows selected (1.83 seconds) 
>>>>30 rows selected (1.979 seconds) 
>>>>
>>>>Sergei Romanov.
>>Sergei Romanov
>


Re[7]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Сергей Романов <ro...@inbox.ru.INVALID>.
And an even simpler case:

>>> df = sc.parallelize([1] for x in xrange(760857)).toDF()
>>> for x in range(50, 70): print x, timeit.timeit(df.groupBy().sum(*(['_1'] * x)).collect, number=1)
50 1.91226291656
51 1.50933384895
52 1.582903862
53 1.90537405014
54 1.84442877769
55 1.91788887978
56 1.50977802277
57 1.5907189846
// after 57 columns it's 2x slower

58 3.22199988365
59 2.96345090866
60 2.8297970295
61 2.87895679474
62 2.92077898979
63 2.95195293427
64 4.10550689697
65 4.14798402786
66 3.13437199593
67 3.11248207092
68 3.18963003159
69 3.18774986267
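
The same sweep can be wrapped in a small helper that reports the first column count where the runtime jumps. A minimal sketch (assuming a SparkSession named 'spark' and its SparkContext 'sc'):

import timeit

def find_slowdown(df, col='_1', start=50, stop=70, factor=1.8):
    # return the first number of summed columns that runs `factor` times
    # slower than the previous count, or None if no such jump is seen
    prev = None
    for n in range(start, stop):
        t = timeit.timeit(df.groupBy().sum(*([col] * n)).collect, number=1)
        if prev is not None and t > factor * prev:
            return n
        prev = t
    return None

df = sc.parallelize([1] for x in xrange(760857)).toDF()
print find_slowdown(df)  # with the timings above this would return 58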


>Saturday, 3 September 2016, 15:50 +03:00 from Сергей Романов <ro...@inbox.ru.INVALID>:
>
>The same problem happens with a CSV data file, so it's not parquet-related either.
>
>Welcome to
>      ____              __
>     / __/__  ___ _____/ /__
>    _\ \/ _ \/ _ `/ __/  '_/
>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>      /_/
>
>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>SparkSession available as 'spark'.
>>>> import timeit
>>>> from pyspark.sql.types import *
>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>50 0.372850894928
>51 0.376906871796
>52 0.381325960159
>53 0.385444164276
>54 0.386877775192
>55 0.388918161392
>56 0.397624969482
>57 0.391713142395
>58 2.62714004517
>59 2.68421196938
>60 2.74627685547
>61 2.81081581116
>62 3.43532109261
>63 3.07742786407
>64 3.03904604912
>65 3.01616096497
>66 3.06293702126
>67 3.09386610985
>68 3.27610206604
>69 3.2041969299
>
>Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов < romanovsa@inbox.ru.INVALID >:
>>
>>Hi,
>>I have narrowed my problem down to a very simple case. I'm sending a 27 KB parquet file as an attachment (file:///data/dump/test2 in the example).
>>Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
>>Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>      /_/
>>
>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>SparkSession available as 'spark'.
>>>>> import timeit
>>>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>>... 
>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>0 1.05591607094
>>1 0.200426101685
>>2 0.203800916672
>>3 0.176458120346
>>4 0.184863805771
>>5 0.232321023941
>>6 0.216032981873
>>7 0.201778173447
>>8 0.292424917221
>>9 0.228524923325
>>10 0.190534114838
>>11 0.197028160095
>>12 0.270443916321
>>13 0.429781913757
>>14 0.270851135254
>>15 0.776989936829
>>16 0.233337879181
>>17 0.227638959885
>>18 0.212944030762
>>19 0.2144780159
>>20 0.22200012207
>>21 0.262261152267
>>22 0.254227876663
>>23 0.275084018707
>>24 0.292124032974
>>25 0.280488014221
>>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
>>26 0.290093898773
>>27 0.238478899002
>>28 0.246420860291
>>29 0.241401195526
>>30 0.255286931992
>>31 0.42702794075
>>32 0.327946186066
>>33 0.434395074844
>>34 0.314198970795
>>35 0.34576010704
>>36 0.278323888779
>>37 0.289474964142
>>38 0.290827989578
>>39 0.376291036606
>>40 0.347742080688
>>41 0.363158941269
>>42 0.318687915802
>>43 0.376327991486
>>44 0.374994039536
>>45 0.362971067429
>>46 0.425967931747
>>47 0.370860099792
>>48 0.443903923035
>>49 0.374128103256
>>50 0.378985881805
>>51 0.476850986481
>>52 0.451028823853
>>53 0.432540893555
>>54 0.514838933945
>>55 0.53990483284
>>56 0.449142932892
>>57 0.465240001678 // 5x slower after 57 columns
>>58 2.40412116051
>>59 2.41632795334
>>60 2.41812801361
>>61 2.55726218224
>>62 2.55484509468
>>63 2.56128406525
>>64 2.54642391205
>>65 2.56381797791
>>66 2.56871509552
>>67 2.66187620163
>>68 2.63496208191
>>69 2.81545996666
>>
>>Sergei Romanov
>>
>Sergei Romanov
>


Re[10]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Сергей Романов <ro...@inbox.ru.INVALID>.
Thank you, Yong, it looks great.

I added the following lines to spark-defaults.conf and now my original SQL query runs much faster.
spark.executor.extraJavaOptions -XX:-DontCompileHugeMethods
spark.driver.extraJavaOptions -XX:-DontCompileHugeMethods
Can you recommend these configuration settings for production use? Will they have any side effects? Will they supersede SPARK-17115?
SQL:
SELECT `publisher_id` AS `publisher_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`, SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`, SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`, SUM(`decisions`) AS `decisions`, SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`, SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`e`) as `e2`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee` FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY `publisher_id` LIMIT 30;
Original:
30 rows selected (10.047 seconds)
30 rows selected (10.612 seconds)
30 rows selected (9.935 seconds)
With -XX:-DontCompileHugeMethods:
30 rows selected (1.086 seconds)
30 rows selected (1.051 seconds)
30 rows selected (1.073 seconds)
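
To double-check that JIT compilation is really what changes here, compilation logging can be switched on via the same extraJavaOptions mechanism. This is only a sketch (I have not run it), and whether skipped methods are reported at all depends on the JVM version:

./bin/pyspark --driver-java-options '-XX:+PrintCompilation' --conf spark.executor.extraJavaOptions=-XX:+PrintCompilation

With DontCompileHugeMethods left at its default, the oversized generated aggregate method (inside the GeneratedIterator class produced by whole-stage codegen) should not show up as compiled in that log, while with -XX:-DontCompileHugeMethods it should.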

>Wednesday, 7 September 2016, 0:35 +03:00 from Yong Zhang <ja...@hotmail.com>:
>
>This is an interesting point.
>
>I tested with the original data on the Spark 2.0 release, and I got the same statistics as in the original email:
>
>50 1.77695393562
>51 0.695149898529
>52 0.638142108917
>53 0.647341966629
>54 0.663456916809
>55 0.629166126251
>56 0.644149065018
>57 0.661190986633
>58 2.6616499424
>59 2.6137509346
>60 2.71165704727
>61 2.63473916054
>
>Then I tested with your suggestion:
>
>spark/bin/pyspark --driver-java-options '-XX:-DontCompileHugeMethods'
>
>Run the same test code, and here is the output:
>
>50 1.77180695534
>51 0.679394006729
>52 0.629493951797
>53 0.62108206749
>54 0.637018918991
>55 0.640591144562
>56 0.649922132492
>57 0.652480125427
>58 0.636356830597
>59 0.667215824127
>60 0.643863916397
>61 0.669810056686
>62 0.664624929428
>63 0.682888031006
>64 0.691393136978
>65 0.690823078156
>66 0.70525097847
>67 0.724694013596
>68 0.737638950348
>69 0.749594926834
>
>
>Yong
>
>----------------------------------------------------------------------
>From: Davies Liu < davies@databricks.com >
>Sent: Tuesday, September 6, 2016 2:27 PM
>To: Сергей Романов
>Cc: Gavin Yue; Mich Talebzadeh; user
>Subject: Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.
> 
>I think the slowness is caused by the generated aggregate method having more
>than 8K bytecodes, so it's not JIT compiled and becomes much slower.
>
>Could you try disabling DontCompileHugeMethods with:
>
>-XX:-DontCompileHugeMethods
>
>On Mon, Sep 5, 2016 at 4:21 AM, Сергей Романов
>< romanovsa@inbox.ru.invalid > wrote:
>> Hi, Gavin,
>>
>> Shuffling is exactly the same in both requests and is minimal. Both requests
>> produce one shuffle task. Running time is the only difference I can see in
>> metrics:
>>
>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
>> schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
>> 0.713730096817
>>  {
>>     "id" : 368,
>>     "name" : "duration total (min, med, max)",
>>     "value" : "524"
>>   }, {
>>     "id" : 375,
>>     "name" : "internal.metrics.executorRunTime",
>>     "value" : "527"
>>   }, {
>>     "id" : 391,
>>     "name" : "internal.metrics.shuffle.write.writeTime",
>>     "value" : "244495"
>>   }
>>
>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
>> schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
>> 2.97951102257
>>
>>   }, {
>>     "id" : 469,
>>     "name" : "duration total (min, med, max)",
>>     "value" : "2654"
>>   }, {
>>     "id" : 476,
>>     "name" : "internal.metrics.executorRunTime",
>>     "value" : "2661"
>>   }, {
>>     "id" : 492,
>>     "name" : "internal.metrics.shuffle.write.writeTime",
>>     "value" : "371883"
>>   }, {
>>
>> Full metrics in attachment.
>>
>> Saturday, 3 September 2016, 19:53 +03:00 from Gavin Yue
>> < yue.yuanyuan@gmail.com >:
>>
>>
>> Any shuffling?
>>
>>
>> On Sep 3, 2016, at 5:50 AM, Сергей Романов < romanovsa@inbox.ru.INVALID >
>> wrote:
>>
>> Same problem happens with CSV data file, so it's not parquet-related either.
>>
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>       /_/
>>
>> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>> SparkSession available as 'spark'.
>>>>> import timeit
>>>>> from pyspark.sql.types import *
>>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>>> for x in range(50, 70): print x,
>>>>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
>>>>> schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>> 50 0.372850894928
>> 51 0.376906871796
>> 52 0.381325960159
>> 53 0.385444164276
>> 54 0.386877775192
>> 55 0.388918161392
>> 56 0.397624969482
>> 57 0.391713142395
>> 58 2.62714004517
>> 59 2.68421196938
>> 60 2.74627685547
>> 61 2.81081581116
>> 62 3.43532109261
>> 63 3.07742786407
>> 64 3.03904604912
>> 65 3.01616096497
>> 66 3.06293702126
>> 67 3.09386610985
>> 68 3.27610206604
>> 69 3.2041969299
>>
>> Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов
>> < romanovsa@inbox.ru.INVALID >:
>>
>> Hi,
>>
>> I had narrowed down my problem to a very simple case. I'm sending 27kb
>> parquet in attachment. ( file:///data/dump/test2 in example)
>>
>> Please, can you take a look at it? Why there is performance drop after 57
>> sum columns?
>>
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>       /_/
>>
>> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>> SparkSession available as 'spark'.
>>>>> import timeit
>>>>> for x in range(70): print x,
>>>>> timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs']
>>>>> * x) ).collect, number=1)
>> ...
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further
>> details.
>> 0 1.05591607094
>> 1 0.200426101685
>> 2 0.203800916672
>> 3 0.176458120346
>> 4 0.184863805771
>> 5 0.232321023941
>> 6 0.216032981873
>> 7 0.201778173447
>> 8 0.292424917221
>> 9 0.228524923325
>> 10 0.190534114838
>> 11 0.197028160095
>> 12 0.270443916321
>> 13 0.429781913757
>> 14 0.270851135254
>> 15 0.776989936829
>> 16 0.233337879181
>> 17 0.227638959885
>> 18 0.212944030762
>> 19 0.2144780159
>> 20 0.22200012207
>> 21 0.262261152267
>> 22 0.254227876663
>> 23 0.275084018707
>> 24 0.292124032974
>> 25 0.280488014221
>> 16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan
>> since it was too large. This behavior can be adjusted by setting
>> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>> 26 0.290093898773
>> 27 0.238478899002
>> 28 0.246420860291
>> 29 0.241401195526
>> 30 0.255286931992
>> 31 0.42702794075
>> 32 0.327946186066
>> 33 0.434395074844
>> 34 0.314198970795
>> 35 0.34576010704
>> 36 0.278323888779
>> 37 0.289474964142
>> 38 0.290827989578
>> 39 0.376291036606
>> 40 0.347742080688
>> 41 0.363158941269
>> 42 0.318687915802
>> 43 0.376327991486
>> 44 0.374994039536
>> 45 0.362971067429
>> 46 0.425967931747
>> 47 0.370860099792
>> 48 0.443903923035
>> 49 0.374128103256
>> 50 0.378985881805
>> 51 0.476850986481
>> 52 0.451028823853
>> 53 0.432540893555
>> 54 0.514838933945
>> 55 0.53990483284
>> 56 0.449142932892
>> 57 0.465240001678 // 5x slower after 57 columns
>> 58 2.40412116051
>> 59 2.41632795334
>> 60 2.41812801361
>> 61 2.55726218224
>> 62 2.55484509468
>> 63 2.56128406525
>> 64 2.54642391205
>> 65 2.56381797791
>> 66 2.56871509552
>> 67 2.66187620163
>> 68 2.63496208191
>> 69 2.81545996666
>>
>>
>> Sergei Romanov
>>
>>
>>
>> Sergei Romanov
>>
>> <bad.csv.tgz>
>>
>>
>>
>>
>>
>>
>
>


Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Yong Zhang <ja...@hotmail.com>.
This is an interesting point.


I tested with the original data on the Spark 2.0 release, and I got the same timing output as in the original email, like the following:


50 1.77695393562
51 0.695149898529
52 0.638142108917
53 0.647341966629
54 0.663456916809
55 0.629166126251
56 0.644149065018
57 0.661190986633
58 2.6616499424
59 2.6137509346
60 2.71165704727
61 2.63473916054


Then I tested with your suggestion:


spark/bin/pyspark --driver-java-options '-XX:-DontCompileHugeMethods'


I ran the same test code, and here is the output:


50 1.77180695534
51 0.679394006729
52 0.629493951797
53 0.62108206749
54 0.637018918991
55 0.640591144562
56 0.649922132492
57 0.652480125427
58 0.636356830597
59 0.667215824127
60 0.643863916397
61 0.669810056686
62 0.664624929428
63 0.682888031006
64 0.691393136978
65 0.690823078156
66 0.70525097847
67 0.724694013596
68 0.737638950348
69 0.749594926834



Yong
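
For a cluster run, rather than a local shell, the same JVM option would presumably have to reach the executor JVMs as well; a minimal sketch using only standard Spark settings, untested here:

spark/bin/pyspark \
  --driver-java-options '-XX:-DontCompileHugeMethods' \
  --conf spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods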

________________________________
From: Davies Liu <da...@databricks.com>
Sent: Tuesday, September 6, 2016 2:27 PM
To: Сергей Романов
Cc: Gavin Yue; Mich Talebzadeh; user
Subject: Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

I think the slowness is caused by the generated aggregate method having more
than 8K bytecodes; above that limit it is not JIT compiled and becomes much slower.

Could you try disabling DontCompileHugeMethods with:

-XX:-DontCompileHugeMethods

On Mon, Sep 5, 2016 at 4:21 AM, Сергей Романов
<ro...@inbox.ru.invalid> wrote:
> Hi, Gavin,
>
> Shuffling is exactly the same in both requests and is minimal. Both requests
> produce one shuffle task. Running time is the only difference I can see in
> the metrics:
>
> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
> schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
> 0.713730096817
>  {
>     "id" : 368,
>     "name" : "duration total (min, med, max)",
>     "value" : "524"
>   }, {
>     "id" : 375,
>     "name" : "internal.metrics.executorRunTime",
>     "value" : "527"
>   }, {
>     "id" : 391,
>     "name" : "internal.metrics.shuffle.write.writeTime",
>     "value" : "244495"
>   }
>
> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
> schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
> 2.97951102257
>
>   }, {
>     "id" : 469,
>     "name" : "duration total (min, med, max)",
>     "value" : "2654"
>   }, {
>     "id" : 476,
>     "name" : "internal.metrics.executorRunTime",
>     "value" : "2661"
>   }, {
>     "id" : 492,
>     "name" : "internal.metrics.shuffle.write.writeTime",
>     "value" : "371883"
>   }, {
>
> Full metrics in attachment.
>
> Saturday, 3 September 2016, 19:53 +03:00 from Gavin Yue
> <yu...@gmail.com>:
>
>
> Any shuffling?
>
>
> On Sep 3, 2016, at 5:50 AM, Сергей Романов <ro...@inbox.ru.INVALID>
> wrote:
>
> The same problem happens with a CSV data file, so it's not parquet-related either.
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
>>>> import timeit
>>>> from pyspark.sql.types import *
>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>> for x in range(50, 70): print x,
>>>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
>>>> schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
> 50 0.372850894928
> 51 0.376906871796
> 52 0.381325960159
> 53 0.385444164276
> 54 0.386877775192
> 55 0.388918161392
> 56 0.397624969482
> 57 0.391713142395
> 58 2.62714004517
> 59 2.68421196938
> 60 2.74627685547
> 61 2.81081581116
> 62 3.43532109261
> 63 3.07742786407
> 64 3.03904604912
> 65 3.01616096497
> 66 3.06293702126
> 67 3.09386610985
> 68 3.27610206604
> 69 3.2041969299
>
> Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов
> <ro...@inbox.ru.INVALID>:
>
> Hi,
>
> I have narrowed my problem down to a very simple case. I'm sending a 27 KB
> parquet file as an attachment. (file:///data/dump/test2 in the example)
>
> Please, can you take a look at it? Why is there a performance drop after 57
> sum columns?
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
>>>> import timeit
>>>> for x in range(70): print x,
>>>> timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs']
>>>> * x) ).collect, number=1)
> ...
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> 0 1.05591607094
> 1 0.200426101685
> 2 0.203800916672
> 3 0.176458120346
> 4 0.184863805771
> 5 0.232321023941
> 6 0.216032981873
> 7 0.201778173447
> 8 0.292424917221
> 9 0.228524923325
> 10 0.190534114838
> 11 0.197028160095
> 12 0.270443916321
> 13 0.429781913757
> 14 0.270851135254
> 15 0.776989936829
> 16 0.233337879181
> 17 0.227638959885
> 18 0.212944030762
> 19 0.2144780159
> 20 0.22200012207
> 21 0.262261152267
> 22 0.254227876663
> 23 0.275084018707
> 24 0.292124032974
> 25 0.280488014221
> 16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan
> since it was too large. This behavior can be adjusted by setting
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
> 26 0.290093898773
> 27 0.238478899002
> 28 0.246420860291
> 29 0.241401195526
> 30 0.255286931992
> 31 0.42702794075
> 32 0.327946186066
> 33 0.434395074844
> 34 0.314198970795
> 35 0.34576010704
> 36 0.278323888779
> 37 0.289474964142
> 38 0.290827989578
> 39 0.376291036606
> 40 0.347742080688
> 41 0.363158941269
> 42 0.318687915802
> 43 0.376327991486
> 44 0.374994039536
> 45 0.362971067429
> 46 0.425967931747
> 47 0.370860099792
> 48 0.443903923035
> 49 0.374128103256
> 50 0.378985881805
> 51 0.476850986481
> 52 0.451028823853
> 53 0.432540893555
> 54 0.514838933945
> 55 0.53990483284
> 56 0.449142932892
> 57 0.465240001678 // 5x slower after 57 columns
> 58 2.40412116051
> 59 2.41632795334
> 60 2.41812801361
> 61 2.55726218224
> 62 2.55484509468
> 63 2.56128406525
> 64 2.54642391205
> 65 2.56381797791
> 66 2.56871509552
> 67 2.66187620163
> 68 2.63496208191
> 69 2.81545996666
>
>
> Sergei Romanov
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
> Sergei Romanov
>
> <bad.csv.tgz>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Davies Liu <da...@databricks.com>.
I think the slowness is caused by the generated aggregate method having more
than 8K bytecodes; above that limit it is not JIT compiled and becomes much slower.

Could you try disabling DontCompileHugeMethods with:

-XX:-DontCompileHugeMethods
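
A quick way to check that the whole-stage-generated aggregate method is responsible is to turn whole-stage code generation off and repeat the timing. A sketch only; spark.sql.codegen.wholeStage is a standard Spark 2.0 setting, and the figures will of course differ from the ones above:

>>> import timeit
>>> spark.conf.set("spark.sql.codegen.wholeStage", "false")
>>> for x in (57, 58): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)

If the 58-column cliff disappears with codegen off, or with -XX:-DontCompileHugeMethods as suggested, the size of the generated method is the likely culprit.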

On Mon, Sep 5, 2016 at 4:21 AM, Сергей Романов
<ro...@inbox.ru.invalid> wrote:
> Hi, Gavin,
>
> Shuffling is exactly the same in both requests and is minimal. Both requests
> produce one shuffle task. Running time is the only difference I can see in
> the metrics:
>
> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
> schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
> 0.713730096817
>  {
>     "id" : 368,
>     "name" : "duration total (min, med, max)",
>     "value" : "524"
>   }, {
>     "id" : 375,
>     "name" : "internal.metrics.executorRunTime",
>     "value" : "527"
>   }, {
>     "id" : 391,
>     "name" : "internal.metrics.shuffle.write.writeTime",
>     "value" : "244495"
>   }
>
> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
> schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
> 2.97951102257
>
>   }, {
>     "id" : 469,
>     "name" : "duration total (min, med, max)",
>     "value" : "2654"
>   }, {
>     "id" : 476,
>     "name" : "internal.metrics.executorRunTime",
>     "value" : "2661"
>   }, {
>     "id" : 492,
>     "name" : "internal.metrics.shuffle.write.writeTime",
>     "value" : "371883"
>   }, {
>
> Full metrics in attachment.
>
> Saturday, 3 September 2016, 19:53 +03:00 from Gavin Yue
> <yu...@gmail.com>:
>
>
> Any shuffling?
>
>
> On Sep 3, 2016, at 5:50 AM, Сергей Романов <ro...@inbox.ru.INVALID>
> wrote:
>
> The same problem happens with a CSV data file, so it's not parquet-related either.
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
>>>> import timeit
>>>> from pyspark.sql.types import *
>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>> for x in range(50, 70): print x,
>>>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
>>>> schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
> 50 0.372850894928
> 51 0.376906871796
> 52 0.381325960159
> 53 0.385444164276
> 54 0.386877775192
> 55 0.388918161392
> 56 0.397624969482
> 57 0.391713142395
> 58 2.62714004517
> 59 2.68421196938
> 60 2.74627685547
> 61 2.81081581116
> 62 3.43532109261
> 63 3.07742786407
> 64 3.03904604912
> 65 3.01616096497
> 66 3.06293702126
> 67 3.09386610985
> 68 3.27610206604
> 69 3.2041969299
>
> Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов
> <ro...@inbox.ru.INVALID>:
>
> Hi,
>
> I have narrowed my problem down to a very simple case. I'm sending a 27 KB
> parquet file as an attachment. (file:///data/dump/test2 in the example)
>
> Please, can you take a look at it? Why is there a performance drop after 57
> sum columns?
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
>>>> import timeit
>>>> for x in range(70): print x,
>>>> timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs']
>>>> * x) ).collect, number=1)
> ...
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> 0 1.05591607094
> 1 0.200426101685
> 2 0.203800916672
> 3 0.176458120346
> 4 0.184863805771
> 5 0.232321023941
> 6 0.216032981873
> 7 0.201778173447
> 8 0.292424917221
> 9 0.228524923325
> 10 0.190534114838
> 11 0.197028160095
> 12 0.270443916321
> 13 0.429781913757
> 14 0.270851135254
> 15 0.776989936829
> 16 0.233337879181
> 17 0.227638959885
> 18 0.212944030762
> 19 0.2144780159
> 20 0.22200012207
> 21 0.262261152267
> 22 0.254227876663
> 23 0.275084018707
> 24 0.292124032974
> 25 0.280488014221
> 16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan
> since it was too large. This behavior can be adjusted by setting
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
> 26 0.290093898773
> 27 0.238478899002
> 28 0.246420860291
> 29 0.241401195526
> 30 0.255286931992
> 31 0.42702794075
> 32 0.327946186066
> 33 0.434395074844
> 34 0.314198970795
> 35 0.34576010704
> 36 0.278323888779
> 37 0.289474964142
> 38 0.290827989578
> 39 0.376291036606
> 40 0.347742080688
> 41 0.363158941269
> 42 0.318687915802
> 43 0.376327991486
> 44 0.374994039536
> 45 0.362971067429
> 46 0.425967931747
> 47 0.370860099792
> 48 0.443903923035
> 49 0.374128103256
> 50 0.378985881805
> 51 0.476850986481
> 52 0.451028823853
> 53 0.432540893555
> 54 0.514838933945
> 55 0.53990483284
> 56 0.449142932892
> 57 0.465240001678 // 5x slower after 57 columns
> 58 2.40412116051
> 59 2.41632795334
> 60 2.41812801361
> 61 2.55726218224
> 62 2.55484509468
> 63 2.56128406525
> 64 2.54642391205
> 65 2.56381797791
> 66 2.56871509552
> 67 2.66187620163
> 68 2.63496208191
> 69 2.81545996666
>
>
> Sergei Romanov
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
> Sergei Romanov
>
> <bad.csv.tgz>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Сергей Романов <ro...@inbox.ru.INVALID>.
Hi, Gavin,

Shuffling is exactly the same in both requests and is minimal. Both requests produce one shuffle task. Running time is the only difference I can see in the metrics:

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
0.713730096817
 {
    "id" : 368,
    "name" : "duration total (min, med, max)",
    "value" : "524"
  }, {
    "id" : 375,
    "name" : "internal.metrics.executorRunTime",
    "value" : "527"
  }, {
    "id" : 391,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "244495"
  }

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
2.97951102257

  }, {
    "id" : 469,
    "name" : "duration total (min, med, max)",
    "value" : "2654"
  }, {
    "id" : 476,
    "name" : "internal.metrics.executorRunTime",
    "value" : "2661"
  }, {
    "id" : 492,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "371883"
  }, {
Full metrics in attachment.
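
For reference, the same per-stage task metrics can also be pulled live from the Spark monitoring REST API while the shell is still running; a sketch, with the host and application id as placeholders:

curl http://localhost:4040/api/v1/applications/<app-id>/stages
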
>Saturday, 3 September 2016, 19:53 +03:00 from Gavin Yue <yu...@gmail.com>:
>
>Any shuffling? 
>
>
>On Sep 3, 2016, at 5:50 AM, Сергей Романов < romanovsa@inbox.ru.INVALID > wrote:
>
>>The same problem happens with a CSV data file, so it's not parquet-related either.
>>
>>Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>      /_/
>>
>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>SparkSession available as 'spark'.
>>>>> import timeit
>>>>> from pyspark.sql.types import *
>>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>>50 0.372850894928
>>51 0.376906871796
>>52 0.381325960159
>>53 0.385444164276
>>54 0.386877775192
>>55 0.388918161392
>>56 0.397624969482
>>57 0.391713142395
>>58 2.62714004517
>>59 2.68421196938
>>60 2.74627685547
>>61 2.81081581116
>>62 3.43532109261
>>63 3.07742786407
>>64 3.03904604912
>>65 3.01616096497
>>66 3.06293702126
>>67 3.09386610985
>>68 3.27610206604
>>69 3.2041969299
>>Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов < romanovsa@inbox.ru.INVALID >:
>>>
>>>Hi,
>>>I have narrowed my problem down to a very simple case. I'm sending a 27 KB parquet file as an attachment. (file:///data/dump/test2 in the example)
>>>Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
>>>Welcome to
>>>      ____              __
>>>     / __/__  ___ _____/ /__
>>>    _\ \/ _ \/ _ `/ __/  '_/
>>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>>      /_/
>>>
>>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>>SparkSession available as 'spark'.
>>>>>> import timeit
>>>>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>>>... 
>>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>>0 1.05591607094
>>>1 0.200426101685
>>>2 0.203800916672
>>>3 0.176458120346
>>>4 0.184863805771
>>>5 0.232321023941
>>>6 0.216032981873
>>>7 0.201778173447
>>>8 0.292424917221
>>>9 0.228524923325
>>>10 0.190534114838
>>>11 0.197028160095
>>>12 0.270443916321
>>>13 0.429781913757
>>>14 0.270851135254
>>>15 0.776989936829
>>>16 0.233337879181
>>>17 0.227638959885
>>>18 0.212944030762
>>>19 0.2144780159
>>>20 0.22200012207
>>>21 0.262261152267
>>>22 0.254227876663
>>>23 0.275084018707
>>>24 0.292124032974
>>>25 0.280488014221
>>>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
>>>26 0.290093898773
>>>27 0.238478899002
>>>28 0.246420860291
>>>29 0.241401195526
>>>30 0.255286931992
>>>31 0.42702794075
>>>32 0.327946186066
>>>33 0.434395074844
>>>34 0.314198970795
>>>35 0.34576010704
>>>36 0.278323888779
>>>37 0.289474964142
>>>38 0.290827989578
>>>39 0.376291036606
>>>40 0.347742080688
>>>41 0.363158941269
>>>42 0.318687915802
>>>43 0.376327991486
>>>44 0.374994039536
>>>45 0.362971067429
>>>46 0.425967931747
>>>47 0.370860099792
>>>48 0.443903923035
>>>49 0.374128103256
>>>50 0.378985881805
>>>51 0.476850986481
>>>52 0.451028823853
>>>53 0.432540893555
>>>54 0.514838933945
>>>55 0.53990483284
>>>56 0.449142932892
>>>57 0.465240001678 // 5x slower after 57 columns
>>>58 2.40412116051
>>>59 2.41632795334
>>>60 2.41812801361
>>>61 2.55726218224
>>>62 2.55484509468
>>>63 2.56128406525
>>>64 2.54642391205
>>>65 2.56381797791
>>>66 2.56871509552
>>>67 2.66187620163
>>>68 2.63496208191
>>>69 2.81545996666
	
>>>
>>>Sergei Romanov
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe e-mail:  user-unsubscribe@spark.apache.org
>>Sergei Romanov
>><bad.csv.tgz>
>>
>>---------------------------------------------------------------------
>>To unsubscribe e-mail:  user-unsubscribe@spark.apache.org


Re: Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Gavin Yue <yu...@gmail.com>.
Any shuffling? 


> On Sep 3, 2016, at 5:50 AM, Сергей Романов <ro...@inbox.ru.INVALID> wrote:
> 
> The same problem happens with a CSV data file, so it's not parquet-related either.
> 
> 
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
> 
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
> >>> import timeit
> >>> from pyspark.sql.types import *
> >>> schema = StructType([StructField('dd_convs', FloatType(), True)])
> >>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
> 50 0.372850894928
> 51 0.376906871796
> 52 0.381325960159
> 53 0.385444164276
> 54 0.386877775192
> 55 0.388918161392
> 56 0.397624969482
> 57 0.391713142395
> 58 2.62714004517
> 59 2.68421196938
> 60 2.74627685547
> 61 2.81081581116
> 62 3.43532109261
> 63 3.07742786407
> 64 3.03904604912
> 65 3.01616096497
> 66 3.06293702126
> 67 3.09386610985
> 68 3.27610206604
> 69 3.2041969299
> 
> Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов <ro...@inbox.ru.INVALID>:
> 
> Hi,
> 
> I have narrowed my problem down to a very simple case. I'm sending a 27 KB parquet file as an attachment. (file:///data/dump/test2 in the example)
>
> Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
> 
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
> 
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
> >>> import timeit
> >>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
> ... 
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 0 1.05591607094
> 1 0.200426101685
> 2 0.203800916672
> 3 0.176458120346
> 4 0.184863805771
> 5 0.232321023941
> 6 0.216032981873
> 7 0.201778173447
> 8 0.292424917221
> 9 0.228524923325
> 10 0.190534114838
> 11 0.197028160095
> 12 0.270443916321
> 13 0.429781913757
> 14 0.270851135254
> 15 0.776989936829
> 16 0.233337879181
> 17 0.227638959885
> 18 0.212944030762
> 19 0.2144780159
> 20 0.22200012207
> 21 0.262261152267
> 22 0.254227876663
> 23 0.275084018707
> 24 0.292124032974
> 25 0.280488014221
> 16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
> 26 0.290093898773
> 27 0.238478899002
> 28 0.246420860291
> 29 0.241401195526
> 30 0.255286931992
> 31 0.42702794075
> 32 0.327946186066
> 33 0.434395074844
> 34 0.314198970795
> 35 0.34576010704
> 36 0.278323888779
> 37 0.289474964142
> 38 0.290827989578
> 39 0.376291036606
> 40 0.347742080688
> 41 0.363158941269
> 42 0.318687915802
> 43 0.376327991486
> 44 0.374994039536
> 45 0.362971067429
> 46 0.425967931747
> 47 0.370860099792
> 48 0.443903923035
> 49 0.374128103256
> 50 0.378985881805
> 51 0.476850986481
> 52 0.451028823853
> 53 0.432540893555
> 54 0.514838933945
> 55 0.53990483284
> 56 0.449142932892
> 57 0.465240001678 // 5x slower after 57 columns
> 58 2.40412116051
> 59 2.41632795334
> 60 2.41812801361
> 61 2.55726218224
> 62 2.55484509468
> 63 2.56128406525
> 64 2.54642391205
> 65 2.56381797791
> 66 2.56871509552
> 67 2.66187620163
> 68 2.63496208191
> 69 2.81545996666
> 
> 
> 
> Sergei Romanov
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> Sergei Romanov
> <bad.csv.tgz>
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Сергей Романов <ro...@inbox.ru.INVALID>.
The same problem happens with a CSV data file, so it's not parquet-related either.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
50 0.372850894928
51 0.376906871796
52 0.381325960159
53 0.385444164276
54 0.386877775192
55 0.388918161392
56 0.397624969482
57 0.391713142395
58 2.62714004517
59 2.68421196938
60 2.74627685547
61 2.81081581116
62 3.43532109261
63 3.07742786407
64 3.03904604912
65 3.01616096497
66 3.06293702126
67 3.09386610985
68 3.27610206604
69 3.2041969299
Saturday, 3 September 2016, 15:40 +03:00 from Сергей Романов <ro...@inbox.ru.INVALID>:
>
>Hi,
>I have narrowed my problem down to a very simple case. I'm sending a 27 KB parquet file as an attachment. (file:///data/dump/test2 in the example)
>Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
>Welcome to
>      ____              __
>     / __/__  ___ _____/ /__
>    _\ \/ _ \/ _ `/ __/  '_/
>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>      /_/
>
>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>SparkSession available as 'spark'.
>>>> import timeit
>>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>... 
>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>SLF4J: Defaulting to no-operation (NOP) logger implementation
>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>0 1.05591607094
>1 0.200426101685
>2 0.203800916672
>3 0.176458120346
>4 0.184863805771
>5 0.232321023941
>6 0.216032981873
>7 0.201778173447
>8 0.292424917221
>9 0.228524923325
>10 0.190534114838
>11 0.197028160095
>12 0.270443916321
>13 0.429781913757
>14 0.270851135254
>15 0.776989936829
>16 0.233337879181
>17 0.227638959885
>18 0.212944030762
>19 0.2144780159
>20 0.22200012207
>21 0.262261152267
>22 0.254227876663
>23 0.275084018707
>24 0.292124032974
>25 0.280488014221
>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
>26 0.290093898773
>27 0.238478899002
>28 0.246420860291
>29 0.241401195526
>30 0.255286931992
>31 0.42702794075
>32 0.327946186066
>33 0.434395074844
>34 0.314198970795
>35 0.34576010704
>36 0.278323888779
>37 0.289474964142
>38 0.290827989578
>39 0.376291036606
>40 0.347742080688
>41 0.363158941269
>42 0.318687915802
>43 0.376327991486
>44 0.374994039536
>45 0.362971067429
>46 0.425967931747
>47 0.370860099792
>48 0.443903923035
>49 0.374128103256
>50 0.378985881805
>51 0.476850986481
>52 0.451028823853
>53 0.432540893555
>54 0.514838933945
>55 0.53990483284
>56 0.449142932892
>57 0.465240001678 // 5x slower after 57 columns
>58 2.40412116051
>59 2.41632795334
>60 2.41812801361
>61 2.55726218224
>62 2.55484509468
>63 2.56128406525
>64 2.54642391205
>65 2.56381797791
>66 2.56871509552
>67 2.66187620163
>68 2.63496208191
>69 2.81545996666
	
>
>Sergei Romanov
>
>---------------------------------------------------------------------
>To unsubscribe e-mail:  user-unsubscribe@spark.apache.org
Sergei Romanov

Re[5]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

Posted by Сергей Романов <ro...@inbox.ru.INVALID>.
Hi,
I have narrowed my problem down to a very simple case. I'm sending a 27 KB parquet file as an attachment. (file:///data/dump/test2 in the example)
Please, can you take a look at it? Why is there a performance drop after 57 sum columns?
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
... 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
0 1.05591607094
1 0.200426101685
2 0.203800916672
3 0.176458120346
4 0.184863805771
5 0.232321023941
6 0.216032981873
7 0.201778173447
8 0.292424917221
9 0.228524923325
10 0.190534114838
11 0.197028160095
12 0.270443916321
13 0.429781913757
14 0.270851135254
15 0.776989936829
16 0.233337879181
17 0.227638959885
18 0.212944030762
19 0.2144780159
20 0.22200012207
21 0.262261152267
22 0.254227876663
23 0.275084018707
24 0.292124032974
25 0.280488014221
16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
26 0.290093898773
27 0.238478899002
28 0.246420860291
29 0.241401195526
30 0.255286931992
31 0.42702794075
32 0.327946186066
33 0.434395074844
34 0.314198970795
35 0.34576010704
36 0.278323888779
37 0.289474964142
38 0.290827989578
39 0.376291036606
40 0.347742080688
41 0.363158941269
42 0.318687915802
43 0.376327991486
44 0.374994039536
45 0.362971067429
46 0.425967931747
47 0.370860099792
48 0.443903923035
49 0.374128103256
50 0.378985881805
51 0.476850986481
52 0.451028823853
53 0.432540893555
54 0.514838933945
55 0.53990483284
56 0.449142932892
57 0.465240001678 // 5x slower after 57 columns
58 2.40412116051
59 2.41632795334
60 2.41812801361
61 2.55726218224
62 2.55484509468
63 2.56128406525
64 2.54642391205
65 2.56381797791
66 2.56871509552
67 2.66187620163
68 2.63496208191
69 2.81545996666
	

Sergei Romanov
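
For anyone without the attachment, roughly the same query shape can be tried against an in-memory DataFrame instead of the parquet file; a sketch only, and the exact column count at which the slowdown appears may shift, since the file scan also contributes to the generated code:

>>> import timeit
>>> df = spark.range(1000000).selectExpr("cast(id as double) as dd_convs")
>>> for x in (57, 58): print x, timeit.timeit(df.groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)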