Posted to jira@arrow.apache.org by "Dmitry Kravchuk (Jira)" <ji...@apache.org> on 2020/12/17 14:28:00 UTC

[jira] [Comment Edited] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

    [ https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251061#comment-17251061 ] 

Dmitry Kravchuk edited comment on ARROW-4890 at 12/17/20, 2:27 PM:
-------------------------------------------------------------------

[~fan_li_ya] okay, here we go.

Spark version - 2.4.4

Python env:
 * cycler (0.10.0)
 * glmnet-py (0.1.0b2)
 * joblib (1.0.0)
 * kiwisolver (1.3.1)
 * lightgbm (3.1.1)
 * matplotlib (3.0.3)
 * numpy (1.19.4)
 * pandas (1.1.5)
 * pip (9.0.3)
 * *pyarrow*
 * pyparsing (2.4.7)
 * python-dateutil (2.8.1)
 * pytz (2020.4)
 * scikit-learn (0.23.2)
 * scipy (1.5.4)
 * setuptools (51.0.0)
 * six (1.15.0)
 * sklearn (0.0)
 * threadpoolctl (2.1.0)
 * venv-pack (0.2.0)
 * wheel (0.36.2)

I've tested many pyarrow versions against two jobs. The first returns a 172 MB dataset:
{code:python|title=172 MB repro|borderStyle=solid}
import pandas as pd
from pyspark.sql import functions as F


def analyze(spark, job_args, configs):

    # one row, replicated 429 times; every row has df1_c1 == 1234567
    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(
        pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    # one row, replicated 4899 times; every row has df2_c1 == 1234567
    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(
        pd.concat([pdf2 for i in range(4899)]).reset_index()).drop('index')

    # inner join on the constant key: 429 * 4899 = 2,101,671 rows
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    # identity udf -- the point is only to force Arrow serialization per group
    def myudf(df):
        return df

    df4 = df3
    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
{code}
The second returns a 1.72 GB dataset through the same grouped pandas_udf; the only difference is the df2 replication factor (48993 instead of 4899):
{code:python|title=1.72 GB repro|borderStyle=solid}
import pandas as pd
from pyspark.sql import functions as F


def analyze(spark, job_args, configs):

    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(
        pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(
        pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')

    # inner join on the constant key: 429 * 48993 = 21,017,997 rows
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        return df

    df4 = df3
    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
{code}
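Both repros share one property worth calling out: every row carries the same key (1234567), so groupBy('df1_c1') produces a single group and the GROUPED_MAP udf has to serialize the entire join result as one Arrow record batch. A back-of-the-envelope estimate (my arithmetic, taking the reported 172 MB at face value, not a measurement):
{code:python}
# One group only: all rows share df1_c1 == 1234567.
rows_small = 429 * 4899    # 2,101,671 rows -- matches df5.count() in the table
rows_big = 429 * 48993     # 21,017,997 rows

per_row = 172 * 1024**2 / rows_small   # ~86 bytes/row implied by the small case
print(rows_big * per_row / 2**31)      # ~0.84 of the 2 GiB signed-int32 limit
{code}
Add Arrow's per-row overhead on top of that (the six string columns alone cost 4 bytes of offsets each, ~24 bytes/row, before validity bitmaps and padding) and the single batch plausibly crosses 2^31 bytes, which would turn 32-bit length fields negative through overflow.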
Detailed logs follow the table; the number in the last column identifies the log.
||Dataset size||pyarrow version||Result||Output / error||Detail log||
|172 MB|0.11.1|success|df5.count() 2101671|1|
|172 MB|0.12.0|success|df5.count() 2101671|1|
|172 MB|0.12.1|success|df5.count() 2101671|1|
|172 MB|0.13.0|success|df5.count() 2101671|1|
|172 MB|0.14.0|success|df5.count() 2101671|1|
|172 MB|0.14.1|success|df5.count() 2101671|1|
|172 MB|0.15.0|error|java.lang.IllegalArgumentException|2|
|172 MB|0.15.1|error|java.lang.IllegalArgumentException|2|
|172 MB|0.16.0|error|java.lang.IllegalArgumentException|2|
|172 MB|0.17.0|error|java.lang.IllegalArgumentException|2|
|172 MB|0.17.1|error|java.lang.IllegalArgumentException|2|
|172 MB|1.0.0|error|java.lang.IllegalArgumentException|2|
|172 MB|1.0.1|error|java.lang.IllegalArgumentException|2|
|172 MB|2.0.0|error|java.lang.IllegalArgumentException|2|
|1.72 GB|0.11.1|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.12.0|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.12.1|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.13.0|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.14.0|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.14.1|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.15.0|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.15.1|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.16.0|error|pyarrow.lib.ArrowIOError: read length must be positive or -1|3|
|1.72 GB|0.17.0|error|OSError: Invalid IPC message: negative bodyLength|4|
|1.72 GB|0.17.1|error|OSError: Invalid IPC message: negative bodyLength|4|
|1.72 GB|1.0.0|error|OSError: Invalid IPC message: negative bodyLength|4|
|1.72 GB|1.0.1|error|OSError: Invalid IPC message: negative bodyLength|4|
|1.72 GB|2.0.0|error|OSError: Invalid IPC message: negative bodyLength|4|
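
The 172 MB failures line up exactly with the IPC format change in pyarrow 0.15.0. If that's the cause, the compatibility setting documented for Spark 2.x should hide it; untested on this cluster, the snippet below is only a sketch of how I'd propagate the variable on YARN:
{code:python}
# ARROW_PRE_0_15_IPC_FORMAT=1 tells pyarrow >= 0.15 to keep writing the legacy
# Arrow IPC stream format that Spark 2.x expects (see the Spark 2.4 docs,
# "Compatibility Setting for PyArrow >= 0.15.0").
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("temp")
    # propagate the env var to the YARN application master...
    .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    # ...and to the executors, where the Python workers run
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)
{code}
The 1.72 GB failures, by contrast, are version-independent, which fits the single-oversized-batch estimate above rather than a format mismatch.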

 

Detail logs:

 

1:
{code:java}
20/12/17 12:41:42 INFO SparkContext: Running Spark version 2.4.4
20/12/17 12:41:42 INFO SparkContext: Submitted application: temp
20/12/17 12:41:42 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:41:42 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:41:42 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:41:42 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:41:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:41:43 INFO Utils: Successfully started service 'sparkDriver' on port 36190.
20/12/17 12:41:43 INFO SparkEnv: Registering MapOutputTracker
20/12/17 12:41:43 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 12:41:43 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 12:41:43 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 12:41:43 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b79cca55-c0c3-4afd-b0b7-3f88faab235c
20/12/17 12:41:43 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 12:41:43 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 12:41:43 INFO log: Logging initialized @2400ms
20/12/17 12:41:43 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 12:41:43 INFO Server: Started @2475ms
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
20/12/17 12:41:43 INFO AbstractConnector: Started ServerConnector@1bc70483{HTTP/1.1,[http/1.1]}{0.0.0.0:4046}
20/12/17 12:41:43 INFO Utils: Successfully started service 'SparkUI' on port 4046.
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@758cb21d{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@101c8ddc{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@71843b82{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f0c10c0{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3b8a1690{/stages,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b826248{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a2100a{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d3a8de8{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2c89e805{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3ecb4790{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6a74cc86{/storage,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3c859ed{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@60e24051{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f17c1a{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@482d80fe{/environment,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9ebbf27{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7b798683{/executors,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7467245a{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@50696058{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4afeff5{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@33d53c7d{/static,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@104969b0{/,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@385ffa54{/api,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@25e21a77{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@267d103d{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4046
20/12/17 12:41:44 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:41:44 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:41:44 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:41:44 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:41:44 INFO Client: Setting up container launch context for our AM
20/12/17 12:41:44 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:41:44 INFO Client: Preparing resources for our AM container
20/12/17 12:41:44 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:41:44 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/delta-core_2.11-0.5.0.jar
20/12/17 12:41:44 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/env3.tar.gz
20/12/17 12:41:44 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/pyspark.zip
20/12/17 12:41:44 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/py4j-0.10.7-src.zip
20/12/17 12:41:44 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/jobs.zip
20/12/17 12:41:44 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:41:44 INFO Client: Uploading resource file:/tmp/spark-69bef22d-3a04-4957-b3a1-e9c32d458350/__spark_conf__7457082691748150995.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/__spark_conf__.zip
20/12/17 12:41:45 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:41:45 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:41:45 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:41:45 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:41:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:41:45 INFO Client: Submitting application application_1605081684999_1413 to ResourceManager
20/12/17 12:41:46 INFO YarnClientImpl: Submitted application application_1605081684999_1413
20/12/17 12:41:46 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1413 and attemptId None
20/12/17 12:41:47 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:47 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198105987
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1413/
	 user: zeppelin
20/12/17 12:41:48 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:49 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:50 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:51 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:52 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:52 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1413), /proxy/application_1605081684999_1413
20/12/17 12:41:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:41:52 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:41:53 INFO Client: Application report for application_1605081684999_1413 (state: RUNNING)
20/12/17 12:41:53 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.31
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198105987
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1413/
	 user: zeppelin
20/12/17 12:41:53 INFO YarnClientSchedulerBackend: Application application_1605081684999_1413 has started running.
20/12/17 12:41:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32804.
20/12/17 12:41:53 INFO NettyBlockTransferService: Server created on master2.host:32804
20/12/17 12:41:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:41:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:32804 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:41:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@c75e0b3{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:41:53 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1413
20/12/17 12:41:56 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:45116) with ID 5
20/12/17 12:41:56 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:36177 with 8.4 GB RAM, BlockManagerId(5, node6.host, 36177, None)
20/12/17 12:41:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:55334) with ID 1
20/12/17 12:41:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:55332) with ID 8
20/12/17 12:41:58 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:44835 with 8.4 GB RAM, BlockManagerId(1, node2.host, 44835, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:44752 with 8.4 GB RAM, BlockManagerId(8, node2.host, 44752, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:33458) with ID 9
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:33460) with ID 2
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:37568 with 8.4 GB RAM, BlockManagerId(9, node3.host, 37568, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:39825 with 8.4 GB RAM, BlockManagerId(2, node3.host, 39825, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:54662) with ID 10
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:54660) with ID 3
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:33706 with 8.4 GB RAM, BlockManagerId(10, node7.host, 33706, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:44495 with 8.4 GB RAM, BlockManagerId(3, node7.host, 44495, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:51678) with ID 7
20/12/17 12:41:59 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:57828) with ID 6
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:41353 with 8.4 GB RAM, BlockManagerId(7, node5.host, 41353, None)
20/12/17 12:41:59 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:41:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:41:59 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:40037 with 8.4 GB RAM, BlockManagerId(6, node1.host, 40037, None)
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13a6a145{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@38b6cedb{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@124c532b{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b73efc3{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@388c6f85{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:42:00 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:58708) with ID 4
20/12/17 12:42:00 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:44212 with 8.4 GB RAM, BlockManagerId(4, node4.host, 44212, None)
20/12/17 12:42:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
df5.count() 2101671
{code}
 

2:
{code:java}
20/12/17 12:59:06 INFO SparkContext: Running Spark version 2.4.4
20/12/17 12:59:06 INFO SparkContext: Submitted application: temp
20/12/17 12:59:06 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:59:06 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:59:06 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:59:06 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:59:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:59:07 INFO Utils: Successfully started service 'sparkDriver' on port 45389.
20/12/17 12:59:07 INFO SparkEnv: Registering MapOutputTracker
20/12/17 12:59:07 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 12:59:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 12:59:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 12:59:07 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-17406a27-5051-4c88-95b0-c405a6410260
20/12/17 12:59:07 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 12:59:07 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 12:59:07 INFO log: Logging initialized @2420ms
20/12/17 12:59:07 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 12:59:07 INFO Server: Started @2495ms
20/12/17 12:59:07 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 12:59:07 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 12:59:07 INFO AbstractConnector: Started ServerConnector@24000415{HTTP/1.1,[http/1.1]}{0.0.0.0:4042}
20/12/17 12:59:07 INFO Utils: Successfully started service 'SparkUI' on port 4042.
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@57b776f5{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7b87d546{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@186596f1{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@725f2e43{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3737f8ab{/stages,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d122a2{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@32581ebb{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3b4ab601{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4995ca15{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3e0a99{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@139c7561{/storage,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7d8833ad{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a2a8b6b{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6a303075{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1af7672f{/environment,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@265e2a87{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@68e19cf4{/executors,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@339ba05{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@22566b52{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@750b678d{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@116943e4{/static,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@24e6b76{/,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13c067eb{/api,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@769076ae{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@59b615b3{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4042
20/12/17 12:59:07 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:59:07 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:59:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:59:07 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:59:07 INFO Client: Setting up container launch context for our AM
20/12/17 12:59:07 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:59:07 INFO Client: Preparing resources for our AM container
20/12/17 12:59:08 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:59:08 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/delta-core_2.11-0.5.0.jar
20/12/17 12:59:08 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/env3.tar.gz
20/12/17 12:59:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/pyspark.zip
20/12/17 12:59:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/py4j-0.10.7-src.zip
20/12/17 12:59:09 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/jobs.zip
20/12/17 12:59:09 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:59:09 INFO Client: Uploading resource file:/tmp/spark-46deff7e-62da-4303-88f7-8832b7e02e38/__spark_conf__3971450022345669465.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/__spark_conf__.zip
20/12/17 12:59:09 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:59:09 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:59:09 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:59:09 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:59:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:59:10 INFO Client: Submitting application application_1605081684999_1417 to ResourceManager
20/12/17 12:59:10 INFO YarnClientImpl: Submitted application application_1605081684999_1417
20/12/17 12:59:10 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1417 and attemptId None
20/12/17 12:59:11 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:11 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608199150192
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1417/
	 user: zeppelin
20/12/17 12:59:12 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:13 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:14 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:15 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:16 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1417), /proxy/application_1605081684999_1417
20/12/17 12:59:16 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:59:16 INFO Client: Application report for application_1605081684999_1417 (state: RUNNING)
20/12/17 12:59:16 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.26
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608199150192
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1417/
	 user: zeppelin
20/12/17 12:59:16 INFO YarnClientSchedulerBackend: Application application_1605081684999_1417 has started running.
20/12/17 12:59:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41162.
20/12/17 12:59:16 INFO NettyBlockTransferService: Server created on master2.host:41162
20/12/17 12:59:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:59:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:41162 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:59:16 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:59:16 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fccf348{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:59:16 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1417
20/12/17 12:59:19 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:40374) with ID 5
20/12/17 12:59:20 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:40686 with 8.4 GB RAM, BlockManagerId(5, node1.host, 40686, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:47210) with ID 2
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:47212) with ID 9
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:51686) with ID 10
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:57124) with ID 6
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:34035 with 8.4 GB RAM, BlockManagerId(2, node3.host, 34035, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:55960) with ID 7
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:51688) with ID 3
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:44172) with ID 4
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:36488 with 8.4 GB RAM, BlockManagerId(9, node3.host, 36488, None)
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:36671 with 8.4 GB RAM, BlockManagerId(10, node7.host, 36671, None)
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:40670 with 8.4 GB RAM, BlockManagerId(6, node5.host, 40670, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:46686) with ID 1
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:33506 with 8.4 GB RAM, BlockManagerId(7, node6.host, 33506, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:46688) with ID 8
20/12/17 12:59:23 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:41013 with 8.4 GB RAM, BlockManagerId(3, node7.host, 41013, None)
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:34615 with 8.4 GB RAM, BlockManagerId(4, node4.host, 34615, None)
20/12/17 12:59:24 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:39108 with 8.4 GB RAM, BlockManagerId(1, node2.host, 39108, None)
20/12/17 12:59:24 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:37151 with 8.4 GB RAM, BlockManagerId(8, node2.host, 37151, None)
20/12/17 12:59:24 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:59:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:59:24 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@78d74a59{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3941bf4b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13201296{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@441b0461{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3212cd96{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 13:00:14 ERROR TaskSetManager: Task 93 in stage 40.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 40.0 failed 4 times, most recent failure: Lost task 93.3 in stage 40.0 (TID 1138, node3.host, executor 9): java.lang.IllegalArgumentException
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
	at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
	at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
	at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
	at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
	at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
	at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
	at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
	at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
	at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
	at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
{code}
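The top frame of that trace (ByteBuffer.allocate inside MessageSerializer.readMessage) is consistent with the format-change theory: pyarrow >= 0.15 streams begin with the 0xFFFFFFFF continuation marker, and an Arrow Java reader that predates the new format (Spark 2.4 bundles Arrow Java 0.10.0, if I read its dependencies right) would take those four bytes as a signed 32-bit message length:
{code:python}
import struct

marker = b"\xff\xff\xff\xff"             # new-format IPC continuation marker
(length,) = struct.unpack("<i", marker)  # read as a signed little-endian int32
print(length)  # -1 -> ByteBuffer.allocate(-1) throws IllegalArgumentException
{code}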
 

3:
{code:java}
20/12/17 12:43:32 INFO SparkContext: Running Spark version 2.4.4
20/12/17 12:43:32 INFO SparkContext: Submitted application: temp
20/12/17 12:43:32 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:43:32 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:43:32 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:43:32 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:43:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:43:32 INFO Utils: Successfully started service 'sparkDriver' on port 33378.
20/12/17 12:43:32 INFO SparkEnv: Registering MapOutputTracker
20/12/17 12:43:32 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 12:43:32 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 12:43:32 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 12:43:32 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-8bba8740-cc56-418e-b759-8d2d0181db2c
20/12/17 12:43:32 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 12:43:32 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 12:43:33 INFO log: Logging initialized @2474ms
20/12/17 12:43:33 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 12:43:33 INFO Server: Started @2548ms
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
20/12/17 12:43:33 INFO AbstractConnector: Started ServerConnector@4128bb9{HTTP/1.1,[http/1.1]}{0.0.0.0:4046}
20/12/17 12:43:33 INFO Utils: Successfully started service 'SparkUI' on port 4046.
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@297aaa65{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4817a878{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4831e6cf{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a6bc2a5{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fcaebf8{/stages,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@180a7950{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7689d370{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4195df87{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@793f8696{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@61eb6c76{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1cb3ca39{/storage,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@73b557cf{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@763b7419{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4c892e74{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e69c7de{/environment,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@173beaf3{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3a0dcfb1{/executors,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3ff2ac0a{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3fa5db1d{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7701d268{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@23a5eb7e{/static,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@55529b1e{/,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6ad6fed6{/api,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42315ca4{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@797cae32{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4046
20/12/17 12:43:33 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:43:33 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:43:33 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:43:33 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:43:33 INFO Client: Setting up container launch context for our AM
20/12/17 12:43:33 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:43:33 INFO Client: Preparing resources for our AM container
20/12/17 12:43:33 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:43:33 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/delta-core_2.11-0.5.0.jar
20/12/17 12:43:34 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/env3.tar.gz
20/12/17 12:43:34 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/pyspark.zip
20/12/17 12:43:34 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/py4j-0.10.7-src.zip
20/12/17 12:43:34 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/jobs.zip
20/12/17 12:43:34 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:43:34 INFO Client: Uploading resource file:/tmp/spark-7fab6d5d-5cf9-441d-9f8b-4a99a13d5e85/__spark_conf__1923213682968598810.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/__spark_conf__.zip
20/12/17 12:43:34 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:43:34 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:43:34 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:43:34 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:43:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:43:35 INFO Client: Submitting application application_1605081684999_1414 to ResourceManager
20/12/17 12:43:35 INFO YarnClientImpl: Submitted application application_1605081684999_1414
20/12/17 12:43:35 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1414 and attemptId None
20/12/17 12:43:36 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:36 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198215763
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1414/
	 user: zeppelin
20/12/17 12:43:37 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:38 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:39 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:40 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:41 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:42 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:43 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:44 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1414), /proxy/application_1605081684999_1414
20/12/17 12:43:44 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:43:44 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:43:44 INFO Client: Application report for application_1605081684999_1414 (state: RUNNING)
20/12/17 12:43:44 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.27
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198215763
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1414/
	 user: zeppelin
20/12/17 12:43:44 INFO YarnClientSchedulerBackend: Application application_1605081684999_1414 has started running.
20/12/17 12:43:44 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39612.
20/12/17 12:43:44 INFO NettyBlockTransferService: Server created on master2.host:39612
20/12/17 12:43:44 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:43:44 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:39612 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:45 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:43:45 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4d7fcb59{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:43:45 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1414
20/12/17 12:43:47 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:37836) with ID 4
20/12/17 12:43:47 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:39627 with 8.4 GB RAM, BlockManagerId(4, node2.host, 39627, None)
20/12/17 12:43:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:51120) with ID 6
20/12/17 12:43:51 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:33184 with 8.4 GB RAM, BlockManagerId(6, node3.host, 33184, None)
20/12/17 12:43:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:43924) with ID 2
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:46720 with 8.4 GB RAM, BlockManagerId(2, node1.host, 46720, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:59464) with ID 5
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:36348) with ID 8
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:43926) with ID 9
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:36346) with ID 1
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:40610 with 8.4 GB RAM, BlockManagerId(5, node7.host, 40610, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:39682) with ID 10
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:46068 with 8.4 GB RAM, BlockManagerId(9, node1.host, 46068, None)
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:45165 with 8.4 GB RAM, BlockManagerId(1, node5.host, 45165, None)
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:37627 with 8.4 GB RAM, BlockManagerId(8, node5.host, 37627, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:39684) with ID 3
20/12/17 12:43:52 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:40597 with 8.4 GB RAM, BlockManagerId(10, node4.host, 40597, None)
20/12/17 12:43:52 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:46284 with 8.4 GB RAM, BlockManagerId(3, node4.host, 46284, None)
20/12/17 12:43:52 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:43:52 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f67d4c0{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2e1bff0b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9336837{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@829dc36{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e8262e4{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 12:46:39 ERROR TaskSetManager: Task 93 in stage 31.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 31.0 failed 4 times, most recent failure: Lost task 93.3 in stage 31.0 (TID 1030, node2.host, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
{code}
 

4:
{code:java}
-- all pyarrow versions from 0.17.0 onward (the rows marked "4" in the table above)

20/12/17 13:35:40 INFO SparkContext: Running Spark version 2.4.4
20/12/17 13:35:40 INFO SparkContext: Submitted application: temp
20/12/17 13:35:40 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 13:35:40 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 13:35:40 INFO SecurityManager: Changing view acls groups to: 
20/12/17 13:35:40 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 13:35:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 13:35:40 INFO Utils: Successfully started service 'sparkDriver' on port 33335.
20/12/17 13:35:40 INFO SparkEnv: Registering MapOutputTracker
20/12/17 13:35:40 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 13:35:40 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 13:35:40 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 13:35:40 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aea190e5-8c84-40e4-9f57-cf8af75f73f7
20/12/17 13:35:40 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 13:35:40 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 13:35:40 INFO log: Logging initialized @2440ms
20/12/17 13:35:40 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 13:35:40 INFO Server: Started @2511ms
20/12/17 13:35:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 13:35:40 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 13:35:40 INFO AbstractConnector: Started ServerConnector@594d8c7d{HTTP/1.1,[http/1.1]}{0.0.0.0:4042}
20/12/17 13:35:40 INFO Utils: Successfully started service 'SparkUI' on port 4042.
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2271a0eb{/jobs,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30aa0825{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42d2b91d{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@73f56489{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@137ea1f2{/stages,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@44b3e8d1{/stages/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@239dc14b{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@46ade13d{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@b77f73a{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@517df609{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@57c1c10d{/storage,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7ecc76c1{/storage/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4a369005{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7445eaf4{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2d5594d8{/environment,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@303295cd{/environment/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e7a9d76{/executors,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@15b3dc07{/executors/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@14c1793d{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30f90a95{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a72b0d1{/static,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@76865c63{/,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3bd4e264{/api,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b660222{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@16f643b2{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4042
20/12/17 13:35:41 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 13:35:41 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 13:35:41 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 13:35:41 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 13:35:41 INFO Client: Setting up container launch context for our AM
20/12/17 13:35:41 INFO Client: Setting up the launch environment for our AM container
20/12/17 13:35:41 INFO Client: Preparing resources for our AM container
20/12/17 13:35:41 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 13:35:41 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/delta-core_2.11-0.5.0.jar
20/12/17 13:35:41 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/env3.tar.gz
20/12/17 13:35:41 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/pyspark.zip
20/12/17 13:35:41 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/py4j-0.10.7-src.zip
20/12/17 13:35:41 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/jobs.zip
20/12/17 13:35:41 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 13:35:42 INFO Client: Uploading resource file:/tmp/spark-9804f727-4c1a-44f7-ae39-26ec382332a7/__spark_conf__1703784814635255608.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/__spark_conf__.zip
20/12/17 13:35:42 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 13:35:42 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 13:35:42 INFO SecurityManager: Changing view acls groups to: 
20/12/17 13:35:42 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 13:35:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 13:35:43 INFO Client: Submitting application application_1605081684999_1428 to ResourceManager
20/12/17 13:35:43 INFO YarnClientImpl: Submitted application application_1605081684999_1428
20/12/17 13:35:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1428 and attemptId None
20/12/17 13:35:44 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:44 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608201343091
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1428/
	 user: zeppelin
20/12/17 13:35:45 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:46 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:47 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:48 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:49 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:50 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:51 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:52 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:52 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1428), /proxy/application_1605081684999_1428
20/12/17 13:35:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 13:35:53 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 13:35:53 INFO Client: Application report for application_1605081684999_1428 (state: RUNNING)
20/12/17 13:35:53 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.29
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608201343091
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1428/
	 user: zeppelin
20/12/17 13:35:53 INFO YarnClientSchedulerBackend: Application application_1605081684999_1428 has started running.
20/12/17 13:35:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38210.
20/12/17 13:35:53 INFO NettyBlockTransferService: Server created on master2.host:38210
20/12/17 13:35:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 13:35:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:38210 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 13:35:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b00b26c{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 13:35:53 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1428
20/12/17 13:35:57 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:42340) with ID 1
20/12/17 13:35:57 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:42288 with 8.4 GB RAM, BlockManagerId(1, node4.host, 42288, None)
20/12/17 13:35:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:42346) with ID 8
20/12/17 13:35:58 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:46543 with 8.4 GB RAM, BlockManagerId(8, node4.host, 46543, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:55632) with ID 9
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:34973 with 8.4 GB RAM, BlockManagerId(9, node1.host, 34973, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:55634) with ID 2
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:35223 with 8.4 GB RAM, BlockManagerId(2, node1.host, 35223, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:49516) with ID 6
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:35440 with 8.4 GB RAM, BlockManagerId(6, node6.host, 35440, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:49014) with ID 4
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:46367 with 8.4 GB RAM, BlockManagerId(4, node3.host, 46367, None)
20/12/17 13:36:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:38700) with ID 7
20/12/17 13:36:03 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:44537 with 8.4 GB RAM, BlockManagerId(7, node7.host, 44537, None)
20/12/17 13:36:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:51310) with ID 10
20/12/17 13:36:03 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 13:36:03 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:35748 with 8.4 GB RAM, BlockManagerId(10, node2.host, 35748, None)
20/12/17 13:36:03 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 13:36:03 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 13:36:03 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@674d79fd{/SQL,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9d4247b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@76da7433{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d37b97{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@59de5714{/static/sql,null,AVAILABLE,@Spark}
20/12/17 13:36:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:51324) with ID 3
20/12/17 13:36:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:55732) with ID 5
20/12/17 13:36:04 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:40602 with 8.4 GB RAM, BlockManagerId(3, node2.host, 40602, None)
20/12/17 13:36:04 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:45001 with 8.4 GB RAM, BlockManagerId(5, node5.host, 45001, None)
20/12/17 13:36:04 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 13:39:34 ERROR TaskSetManager: Task 93 in stage 31.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 31.0 failed 4 times, most recent failure: Lost task 93.3 in stage 31.0 (TID 1030, node2.host, executor 10): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
{code}
 

So my question is: can I push more than 2 GB of data through the pandas_udf approach on the Python + Spark stack with the current stable versions of Spark and pyarrow?
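For reference, both failing logs die at the same {{read_next_batch}} call, which looks consistent with a signed 32-bit length overflowing once a single group's record batch exceeds 2 GiB (every row here shares {{df1_c1 = 1234567}}, so the grouped map has to ship the entire join result as one batch). Below is a minimal back-of-the-envelope sketch of that arithmetic -- an assumption on my part, not a confirmed diagnosis; {{approx_bytes_per_row}} is just a rough guess for the 10 joined columns:
{code:java|title=Bar.Python|borderStyle=solid}
import struct

rows_in_group = 429 * 48993            # every joined row shares df1_c1, so one group
approx_bytes_per_row = 110             # rough guess for the 10 joined columns (assumption)
body_len = rows_in_group * approx_bytes_per_row

# Reinterpret the batch size as a signed 32-bit integer, the way an int32
# length field would store it:
as_int32 = struct.unpack('<i', struct.pack('<I', body_len & 0xFFFFFFFF))[0]
print(body_len, as_int32)              # ~2.31e9 bytes -> negative value
{code}
A negative value here would explain both messages ("read length must be positive or -1" from the older reader and "Invalid IPC message: negative bodyLength" from the newer one), but that is only my reading of the traces, not something I have verified in the Arrow source.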


20/12/17 12:41:45 INFO Client: Submitting application application_1605081684999_1413 to ResourceManager
20/12/17 12:41:46 INFO YarnClientImpl: Submitted application application_1605081684999_1413
20/12/17 12:41:46 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1413 and attemptId None
20/12/17 12:41:47 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:47 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198105987
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1413/
	 user: zeppelin
20/12/17 12:41:48 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:49 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:50 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:51 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:52 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:52 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1413), /proxy/application_1605081684999_1413
20/12/17 12:41:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:41:52 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:41:53 INFO Client: Application report for application_1605081684999_1413 (state: RUNNING)
20/12/17 12:41:53 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.31
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198105987
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1413/
	 user: zeppelin
20/12/17 12:41:53 INFO YarnClientSchedulerBackend: Application application_1605081684999_1413 has started running.
20/12/17 12:41:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32804.
20/12/17 12:41:53 INFO NettyBlockTransferService: Server created on master2.host:32804
20/12/17 12:41:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:41:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:32804 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:41:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@c75e0b3{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:41:53 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1413
20/12/17 12:41:56 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:45116) with ID 5
20/12/17 12:41:56 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:36177 with 8.4 GB RAM, BlockManagerId(5, node6.host, 36177, None)
20/12/17 12:41:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:55334) with ID 1
20/12/17 12:41:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:55332) with ID 8
20/12/17 12:41:58 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:44835 with 8.4 GB RAM, BlockManagerId(1, node2.host, 44835, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:44752 with 8.4 GB RAM, BlockManagerId(8, node2.host, 44752, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:33458) with ID 9
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:33460) with ID 2
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:37568 with 8.4 GB RAM, BlockManagerId(9, node3.host, 37568, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:39825 with 8.4 GB RAM, BlockManagerId(2, node3.host, 39825, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:54662) with ID 10
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:54660) with ID 3
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:33706 with 8.4 GB RAM, BlockManagerId(10, node7.host, 33706, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:44495 with 8.4 GB RAM, BlockManagerId(3, node7.host, 44495, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:51678) with ID 7
20/12/17 12:41:59 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:57828) with ID 6
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:41353 with 8.4 GB RAM, BlockManagerId(7, node5.host, 41353, None)
20/12/17 12:41:59 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:41:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:41:59 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:40037 with 8.4 GB RAM, BlockManagerId(6, node1.host, 40037, None)
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13a6a145{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@38b6cedb{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@124c532b{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b73efc3{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@388c6f85{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:42:00 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:58708) with ID 4
20/12/17 12:42:00 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:44212 with 8.4 GB RAM, BlockManagerId(4, node4.host, 44212, None)
20/12/17 12:42:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
df5.count() 2101671
{code}
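Worth noting before logs 2 and 3: log 1's df5.count() of 2101671 is exactly 429 × 4899, i.e. every row carries the same join key, so groupBy('df1_c1').apply(udf) serializes the entire join result as a single Arrow record batch. A back-of-the-envelope sketch of why only the larger run would fail (the 100 bytes/row figure is an assumption for illustration, not a measurement):
{code:python}
# Rough sketch: both runs produce exactly one group, hence one Arrow batch.

INT32_MAX = 2**31 - 1      # signed 32-bit limit of the IPC length prefix

rows_small = 429 * 4899    # = 2101671, matches df5.count() in log 1
rows_large = 429 * 48993   # = 21017997 in the failing 1.72 GB run

bytes_per_row = 100        # assumed average serialized row width

print(rows_small * bytes_per_row / 2**30)   # ~0.20 GiB, far below the limit
print(rows_large * bytes_per_row / 2**30)   # ~1.96 GiB, close to the limit
print(INT32_MAX / rows_large)               # ~102 bytes/row is enough to cross it
{code}
Once the real serialized width (string offsets, validity buffers and so on) pushes that single batch past INT32_MAX bytes, the 4-byte length field wraps negative, which appears to be what logs 2 and 3 below show from the Java and Python sides respectively.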
 

2:
{code:java}
20/12/17 12:59:06 INFO SparkContext: Running Spark version 2.4.4
20/12/17 12:59:06 INFO SparkContext: Submitted application: temp
20/12/17 12:59:06 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:59:06 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:59:06 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:59:06 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:59:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:59:07 INFO Utils: Successfully started service 'sparkDriver' on port 45389.
20/12/17 12:59:07 INFO SparkEnv: Registering MapOutputTracker
20/12/17 12:59:07 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 12:59:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 12:59:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 12:59:07 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-17406a27-5051-4c88-95b0-c405a6410260
20/12/17 12:59:07 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 12:59:07 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 12:59:07 INFO log: Logging initialized @2420ms
20/12/17 12:59:07 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 12:59:07 INFO Server: Started @2495ms
20/12/17 12:59:07 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 12:59:07 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 12:59:07 INFO AbstractConnector: Started ServerConnector@24000415{HTTP/1.1,[http/1.1]}{0.0.0.0:4042}
20/12/17 12:59:07 INFO Utils: Successfully started service 'SparkUI' on port 4042.
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@57b776f5{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7b87d546{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@186596f1{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@725f2e43{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3737f8ab{/stages,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d122a2{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@32581ebb{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3b4ab601{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4995ca15{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3e0a99{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@139c7561{/storage,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7d8833ad{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a2a8b6b{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6a303075{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1af7672f{/environment,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@265e2a87{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@68e19cf4{/executors,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@339ba05{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@22566b52{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@750b678d{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@116943e4{/static,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@24e6b76{/,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13c067eb{/api,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@769076ae{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@59b615b3{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:59:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4042
20/12/17 12:59:07 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:59:07 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:59:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:59:07 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:59:07 INFO Client: Setting up container launch context for our AM
20/12/17 12:59:07 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:59:07 INFO Client: Preparing resources for our AM container
20/12/17 12:59:08 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:59:08 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/delta-core_2.11-0.5.0.jar
20/12/17 12:59:08 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/env3.tar.gz
20/12/17 12:59:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/pyspark.zip
20/12/17 12:59:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/py4j-0.10.7-src.zip
20/12/17 12:59:09 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/jobs.zip
20/12/17 12:59:09 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:59:09 INFO Client: Uploading resource file:/tmp/spark-46deff7e-62da-4303-88f7-8832b7e02e38/__spark_conf__3971450022345669465.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/__spark_conf__.zip
20/12/17 12:59:09 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:59:09 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:59:09 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:59:09 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:59:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:59:10 INFO Client: Submitting application application_1605081684999_1417 to ResourceManager
20/12/17 12:59:10 INFO YarnClientImpl: Submitted application application_1605081684999_1417
20/12/17 12:59:10 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1417 and attemptId None
20/12/17 12:59:11 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:11 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608199150192
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1417/
	 user: zeppelin
20/12/17 12:59:12 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:13 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:14 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:15 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED)
20/12/17 12:59:16 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1417), /proxy/application_1605081684999_1417
20/12/17 12:59:16 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:59:16 INFO Client: Application report for application_1605081684999_1417 (state: RUNNING)
20/12/17 12:59:16 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.26
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608199150192
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1417/
	 user: zeppelin
20/12/17 12:59:16 INFO YarnClientSchedulerBackend: Application application_1605081684999_1417 has started running.
20/12/17 12:59:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41162.
20/12/17 12:59:16 INFO NettyBlockTransferService: Server created on master2.host:41162
20/12/17 12:59:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:59:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:41162 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 41162, None)
20/12/17 12:59:16 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:59:16 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:59:16 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fccf348{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:59:16 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1417
20/12/17 12:59:19 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:40374) with ID 5
20/12/17 12:59:20 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:40686 with 8.4 GB RAM, BlockManagerId(5, node1.host, 40686, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:47210) with ID 2
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:47212) with ID 9
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:51686) with ID 10
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:57124) with ID 6
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:34035 with 8.4 GB RAM, BlockManagerId(2, node3.host, 34035, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:55960) with ID 7
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:51688) with ID 3
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:44172) with ID 4
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:36488 with 8.4 GB RAM, BlockManagerId(9, node3.host, 36488, None)
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:36671 with 8.4 GB RAM, BlockManagerId(10, node7.host, 36671, None)
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:40670 with 8.4 GB RAM, BlockManagerId(6, node5.host, 40670, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:46686) with ID 1
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:33506 with 8.4 GB RAM, BlockManagerId(7, node6.host, 33506, None)
20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:46688) with ID 8
20/12/17 12:59:23 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:41013 with 8.4 GB RAM, BlockManagerId(3, node7.host, 41013, None)
20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:34615 with 8.4 GB RAM, BlockManagerId(4, node4.host, 34615, None)
20/12/17 12:59:24 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:39108 with 8.4 GB RAM, BlockManagerId(1, node2.host, 39108, None)
20/12/17 12:59:24 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:37151 with 8.4 GB RAM, BlockManagerId(8, node2.host, 37151, None)
20/12/17 12:59:24 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:59:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:59:24 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@78d74a59{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3941bf4b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13201296{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@441b0461{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3212cd96{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:59:24 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 13:00:14 ERROR TaskSetManager: Task 93 in stage 40.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 40.0 failed 4 times, most recent failure: Lost task 93.3 in stage 40.0 (TID 1138, node3.host, executor 9): java.lang.IllegalArgumentException
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
	at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
	at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
	at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
	at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
	at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
	at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
	at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
	at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
	at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
	at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
{code}
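Reading log 2's trace: the failure is inside the Arrow IPC reader itself, at MessageSerializer.readMessage -> ByteBuffer.allocate, and ByteBuffer.allocate only throws IllegalArgumentException for a negative capacity, so the reader picked up a negative message length. Log 3 below looks like the same condition surfacing on the pyarrow side as "read length must be positive or -1". A minimal sketch of the suspected wrap-around (plain Python arithmetic, not Arrow code; the batch size is hypothetical):
{code:python}
import struct

# Hypothetical batch one byte past the signed 32-bit limit.
msg_len = 2**31 + 1

# A 4-byte little-endian length prefix for it...
prefix = struct.pack('<I', msg_len)

# ...read back as a signed int, the way the failing readers would see it:
print(struct.unpack('<i', prefix)[0])   # -2147483647
# Java side:   ByteBuffer.allocate(-2147483647) -> IllegalArgumentException (log 2)
# Python side: pyarrow reports "read length must be positive or -1"         (log 3)
{code}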
 

3:
{code:java}
20/12/17 12:43:32 INFO SparkContext: Running Spark version 2.4.4
20/12/17 12:43:32 INFO SparkContext: Submitted application: temp
20/12/17 12:43:32 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:43:32 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:43:32 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:43:32 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:43:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:43:32 INFO Utils: Successfully started service 'sparkDriver' on port 33378.
20/12/17 12:43:32 INFO SparkEnv: Registering MapOutputTracker
20/12/17 12:43:32 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 12:43:32 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 12:43:32 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 12:43:32 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-8bba8740-cc56-418e-b759-8d2d0181db2c
20/12/17 12:43:32 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 12:43:32 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 12:43:33 INFO log: Logging initialized @2474ms
20/12/17 12:43:33 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 12:43:33 INFO Server: Started @2548ms
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
20/12/17 12:43:33 INFO AbstractConnector: Started ServerConnector@4128bb9{HTTP/1.1,[http/1.1]}{0.0.0.0:4046}
20/12/17 12:43:33 INFO Utils: Successfully started service 'SparkUI' on port 4046.
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@297aaa65{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4817a878{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4831e6cf{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a6bc2a5{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fcaebf8{/stages,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@180a7950{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7689d370{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4195df87{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@793f8696{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@61eb6c76{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1cb3ca39{/storage,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@73b557cf{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@763b7419{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4c892e74{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e69c7de{/environment,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@173beaf3{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3a0dcfb1{/executors,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3ff2ac0a{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3fa5db1d{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7701d268{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@23a5eb7e{/static,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@55529b1e{/,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6ad6fed6{/api,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42315ca4{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@797cae32{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4046
20/12/17 12:43:33 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:43:33 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:43:33 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:43:33 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:43:33 INFO Client: Setting up container launch context for our AM
20/12/17 12:43:33 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:43:33 INFO Client: Preparing resources for our AM container
20/12/17 12:43:33 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:43:33 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/delta-core_2.11-0.5.0.jar
20/12/17 12:43:34 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/env3.tar.gz
20/12/17 12:43:34 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/pyspark.zip
20/12/17 12:43:34 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/py4j-0.10.7-src.zip
20/12/17 12:43:34 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/jobs.zip
20/12/17 12:43:34 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:43:34 INFO Client: Uploading resource file:/tmp/spark-7fab6d5d-5cf9-441d-9f8b-4a99a13d5e85/__spark_conf__1923213682968598810.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/__spark_conf__.zip
20/12/17 12:43:34 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:43:34 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:43:34 INFO SecurityManager: Changing view acls groups to: 
20/12/17 12:43:34 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 12:43:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:43:35 INFO Client: Submitting application application_1605081684999_1414 to ResourceManager
20/12/17 12:43:35 INFO YarnClientImpl: Submitted application application_1605081684999_1414
20/12/17 12:43:35 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1414 and attemptId None
20/12/17 12:43:36 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:36 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198215763
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1414/
	 user: zeppelin
20/12/17 12:43:37 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:38 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:39 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:40 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:41 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:42 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:43 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:44 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1414), /proxy/application_1605081684999_1414
20/12/17 12:43:44 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:43:44 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:43:44 INFO Client: Application report for application_1605081684999_1414 (state: RUNNING)
20/12/17 12:43:44 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.27
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198215763
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1414/
	 user: zeppelin
20/12/17 12:43:44 INFO YarnClientSchedulerBackend: Application application_1605081684999_1414 has started running.
20/12/17 12:43:44 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39612.
20/12/17 12:43:44 INFO NettyBlockTransferService: Server created on master2.host:39612
20/12/17 12:43:44 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:43:44 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:39612 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:45 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:43:45 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4d7fcb59{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:43:45 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1414
20/12/17 12:43:47 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:37836) with ID 4
20/12/17 12:43:47 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:39627 with 8.4 GB RAM, BlockManagerId(4, node2.host, 39627, None)
20/12/17 12:43:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:51120) with ID 6
20/12/17 12:43:51 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:33184 with 8.4 GB RAM, BlockManagerId(6, node3.host, 33184, None)
20/12/17 12:43:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:43924) with ID 2
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:46720 with 8.4 GB RAM, BlockManagerId(2, node1.host, 46720, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:59464) with ID 5
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:36348) with ID 8
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:43926) with ID 9
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:36346) with ID 1
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:40610 with 8.4 GB RAM, BlockManagerId(5, node7.host, 40610, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:39682) with ID 10
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:46068 with 8.4 GB RAM, BlockManagerId(9, node1.host, 46068, None)
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:45165 with 8.4 GB RAM, BlockManagerId(1, node5.host, 45165, None)
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:37627 with 8.4 GB RAM, BlockManagerId(8, node5.host, 37627, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:39684) with ID 3
20/12/17 12:43:52 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:40597 with 8.4 GB RAM, BlockManagerId(10, node4.host, 40597, None)
20/12/17 12:43:52 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:46284 with 8.4 GB RAM, BlockManagerId(3, node4.host, 46284, None)
20/12/17 12:43:52 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:43:52 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f67d4c0{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2e1bff0b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9336837{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@829dc36{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e8262e4{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 12:46:39 ERROR TaskSetManager: Task 93 in stage 31.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 31.0 failed 4 times, most recent failure: Lost task 93.3 in stage 31.0 (TID 1030, node2.host, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
{code}
4:
{code:java}
--all after pyarrow 0.

20/12/17 13:35:40 INFO SparkContext: Running Spark version 2.4.4
20/12/17 13:35:40 INFO SparkContext: Submitted application: temp
20/12/17 13:35:40 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 13:35:40 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 13:35:40 INFO SecurityManager: Changing view acls groups to: 
20/12/17 13:35:40 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 13:35:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 13:35:40 INFO Utils: Successfully started service 'sparkDriver' on port 33335.
20/12/17 13:35:40 INFO SparkEnv: Registering MapOutputTracker
20/12/17 13:35:40 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 13:35:40 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 13:35:40 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 13:35:40 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aea190e5-8c84-40e4-9f57-cf8af75f73f7
20/12/17 13:35:40 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 13:35:40 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 13:35:40 INFO log: Logging initialized @2440ms
20/12/17 13:35:40 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 13:35:40 INFO Server: Started @2511ms
20/12/17 13:35:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 13:35:40 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 13:35:40 INFO AbstractConnector: Started ServerConnector@594d8c7d{HTTP/1.1,[http/1.1]}{0.0.0.0:4042}
20/12/17 13:35:40 INFO Utils: Successfully started service 'SparkUI' on port 4042.
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2271a0eb{/jobs,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30aa0825{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42d2b91d{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@73f56489{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@137ea1f2{/stages,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@44b3e8d1{/stages/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@239dc14b{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@46ade13d{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@b77f73a{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@517df609{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@57c1c10d{/storage,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7ecc76c1{/storage/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4a369005{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7445eaf4{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2d5594d8{/environment,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@303295cd{/environment/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e7a9d76{/executors,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@15b3dc07{/executors/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@14c1793d{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30f90a95{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a72b0d1{/static,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@76865c63{/,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3bd4e264{/api,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b660222{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@16f643b2{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 13:35:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4042
20/12/17 13:35:41 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 13:35:41 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 13:35:41 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 13:35:41 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 13:35:41 INFO Client: Setting up container launch context for our AM
20/12/17 13:35:41 INFO Client: Setting up the launch environment for our AM container
20/12/17 13:35:41 INFO Client: Preparing resources for our AM container
20/12/17 13:35:41 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 13:35:41 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/delta-core_2.11-0.5.0.jar
20/12/17 13:35:41 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/env3.tar.gz
20/12/17 13:35:41 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/pyspark.zip
20/12/17 13:35:41 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/py4j-0.10.7-src.zip
20/12/17 13:35:41 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/jobs.zip
20/12/17 13:35:41 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 13:35:42 INFO Client: Uploading resource file:/tmp/spark-9804f727-4c1a-44f7-ae39-26ec382332a7/__spark_conf__1703784814635255608.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/__spark_conf__.zip
20/12/17 13:35:42 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 13:35:42 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 13:35:42 INFO SecurityManager: Changing view acls groups to: 
20/12/17 13:35:42 INFO SecurityManager: Changing modify acls groups to: 
20/12/17 13:35:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zeppelin); groups with view permissions: Set(); users  with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 13:35:43 INFO Client: Submitting application application_1605081684999_1428 to ResourceManager
20/12/17 13:35:43 INFO YarnClientImpl: Submitted application application_1605081684999_1428
20/12/17 13:35:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1428 and attemptId None
20/12/17 13:35:44 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:44 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608201343091
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1428/
	 user: zeppelin
20/12/17 13:35:45 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:46 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:47 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:48 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:49 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:50 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:51 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:52 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:52 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1428), /proxy/application_1605081684999_1428
20/12/17 13:35:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 13:35:53 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 13:35:53 INFO Client: Application report for application_1605081684999_1428 (state: RUNNING)
20/12/17 13:35:53 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.29
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608201343091
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1428/
	 user: zeppelin
20/12/17 13:35:53 INFO YarnClientSchedulerBackend: Application application_1605081684999_1428 has started running.
20/12/17 13:35:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38210.
20/12/17 13:35:53 INFO NettyBlockTransferService: Server created on master2.host:38210
20/12/17 13:35:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 13:35:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:38210 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 13:35:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b00b26c{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 13:35:53 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1428
20/12/17 13:35:57 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:42340) with ID 1
20/12/17 13:35:57 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:42288 with 8.4 GB RAM, BlockManagerId(1, node4.host, 42288, None)
20/12/17 13:35:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:42346) with ID 8
20/12/17 13:35:58 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:46543 with 8.4 GB RAM, BlockManagerId(8, node4.host, 46543, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:55632) with ID 9
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:34973 with 8.4 GB RAM, BlockManagerId(9, node1.host, 34973, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:55634) with ID 2
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:35223 with 8.4 GB RAM, BlockManagerId(2, node1.host, 35223, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:49516) with ID 6
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:35440 with 8.4 GB RAM, BlockManagerId(6, node6.host, 35440, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:49014) with ID 4
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:46367 with 8.4 GB RAM, BlockManagerId(4, node3.host, 46367, None)
20/12/17 13:36:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:38700) with ID 7
20/12/17 13:36:03 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:44537 with 8.4 GB RAM, BlockManagerId(7, node7.host, 44537, None)
20/12/17 13:36:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:51310) with ID 10
20/12/17 13:36:03 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 13:36:03 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:35748 with 8.4 GB RAM, BlockManagerId(10, node2.host, 35748, None)
20/12/17 13:36:03 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 13:36:03 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 13:36:03 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@674d79fd{/SQL,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9d4247b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@76da7433{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d37b97{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@59de5714{/static/sql,null,AVAILABLE,@Spark}
20/12/17 13:36:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:51324) with ID 3
20/12/17 13:36:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:55732) with ID 5
20/12/17 13:36:04 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:40602 with 8.4 GB RAM, BlockManagerId(3, node2.host, 40602, None)
20/12/17 13:36:04 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:45001 with 8.4 GB RAM, BlockManagerId(5, node5.host, 45001, None)
20/12/17 13:36:04 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 13:39:34 ERROR TaskSetManager: Task 93 in stage 31.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 31.0 failed 4 times, most recent failure: Lost task 93.3 in stage 31.0 (TID 1030, node2.host, executor 10): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
{code}
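
For context, my reading (not confirmed by the Arrow devs) is that both stack traces are the same failure surfacing through different pyarrow versions: the grouped-map UDF serializes each group as a single Arrow record batch, and a length field that ends up read as a signed 32-bit integer wraps negative once a batch passes 2**31 - 1 bytes (~2.1 GB). A toy illustration of that wraparound (plain Python, not Arrow internals):
{code:java}
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

print(as_int32(int(1.5 * 1024**3)))  # under 2**31 -> positive, reads fine
print(as_int32(int(2.5 * 1024**3)))  # over 2**31 -> negative length, like the errors above
{code}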
 

So my question is: can I push more than 2 GB per group through the pandas_udf approach on the Python + Spark stack with the current stable versions of Spark and PyArrow?
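
If that reading is right, one mitigation I can sketch (under two assumptions: the failure is driven purely by per-group batch size, and the UDF logic can tolerate seeing a group in pieces; N_BUCKETS and the column names below are illustrative only) is to salt the grouping key so that no sub-group approaches 2 GB:
{code:java}
import pandas as pd
import pyspark
from pyspark.sql import functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()

N_BUCKETS = 16  # illustrative; pick so max group size / N_BUCKETS is well under 2 GB

pdf = pd.DataFrame([[1234567, 0.0, "abcdefghij"]], columns=['c1', 'c2', 'c3'])
df = spark.createDataFrame(
    pd.concat([pdf for i in range(1000)]).reset_index()).drop('index')

# Split each (possibly huge) key group into N_BUCKETS random sub-groups so
# that no single Arrow record batch has to carry the whole group.
salted = df.withColumn('salt', (F.rand() * N_BUCKETS).cast('int'))

def myudf(pdf):
    return pdf  # per-sub-group pandas logic; must tolerate partial groups

udf = F.pandas_udf(salted.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

df_out = salted.groupBy('c1', 'salt').apply(udf).drop('salt')
print('df_out.count()', df_out.count())
{code}
This only keeps each batch under the limit; it does not answer whether Arrow itself can cross the 2 GB line.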

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-4890
>                 URL: https://issues.apache.org/jira/browse/ARROW-4890
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>            Reporter: Abdeali Kothari
>            Priority: Major
>         Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in the Arrow project as the traceback seems to suggest this is an issue in Arrow.
>  Continuation of the conversation on the mailing list: https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=P_1Nst5AjjCRg0MExO5Kby9i-g@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 279, in load_stream
>     for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as the size of the dataset I want to group on increases. Here is a code snippet with which I can reproduce this.
>  Note: My actual dataset is much larger, has many more unique IDs, and is a valid use case in which I cannot simplify this groupby in any way. I have stripped out all the logic to make the example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
> 	[[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
> 	columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
> 	[[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
> 	columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
>     return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per executor too.


