Posted to issues@impala.apache.org by "David Knupp (JIRA)" <ji...@apache.org> on 2018/05/29 17:12:00 UTC
[jira] [Created] (IMPALA-7088) Parallel data load breaks load-data.py if loading data on a real cluster
David Knupp created IMPALA-7088:
-----------------------------------
Summary: Parallel data load breaks load-data.py if loading data on a real cluster
Key: IMPALA-7088
URL: https://issues.apache.org/jira/browse/IMPALA-7088
Project: IMPALA
Issue Type: Bug
Components: Infrastructure
Affects Versions: Impala 3.0
Reporter: David Knupp
Impala/bin/load-data.py is most commonly used to load test data onto a simulated standalone cluster running on the local host. However, with the correct inputs, it can also be used to load data onto an actual remote cluster.
A recent enhancement to load-data.py that parallelizes parts of the data loading process (https://github.com/apache/impala/commit/d481cd48) has introduced a regression in the latter use case. Loading onto a remote cluster now fails with a TypeError that surfaces from the thread pool.
From *$IMPALA_HOME/logs/data_loading/data-load-functional-exhaustive.log*:
{noformat}
Created table functional_hbase.widetable_1000_cols
Took 0.7121 seconds
09:48:01 Beginning execution of hive SQL: /home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/logs/data_loading/sql/functional/load-functional-query-exhaustive-hive-generated-text-none-none.sql
Traceback (most recent call last):
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 494, in <module>
    if __name__ == "__main__": main()
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 468, in main
    hive_exec_query_files_parallel(thread_pool, hive_load_text_files)
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 299, in hive_exec_query_files_parallel
    exec_query_files_parallel(thread_pool, query_files, 'hive')
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 290, in exec_query_files_parallel
    for result in thread_pool.imap_unordered(execution_function, query_files):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
TypeError: coercing to Unicode: need string or buffer, NoneType found
{noformat}
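The traceback ends inside multiprocessing/pool.py. This is consistent with an exception raised in a worker and re-raised in the parent by imap_unordered, so on Python 2.7 the frame that actually failed is not shown. The sketch below reproduces that pattern. It assumes, without confirmation from the log, that a None value reached a call expecting a path string. The names exec_query_file and run_parallel are hypothetical and are not taken from load-data.py.

```python
from multiprocessing.pool import ThreadPool


def exec_query_file(path):
    """Hypothetical stand-in for the per-file worker; not the real script code."""
    # open(None) raises TypeError. On Python 2.7 the message reads
    # "coercing to Unicode: need string or buffer, NoneType found".
    with open(path) as f:
        return len(f.read())


def run_parallel(paths):
    """Run the worker over paths and return (results, first_error)."""
    pool = ThreadPool(processes=2)
    results = []
    error = None
    try:
        # An exception raised in a worker is re-raised here, in the parent,
        # when the iterator is advanced. On Python 2.7 the traceback then
        # shows only the parent-side frames, as in the log above.
        for result in pool.imap_unordered(exec_query_file, paths):
            results.append(result)
    except TypeError as e:
        error = e
    finally:
        pool.terminate()
        pool.join()
    return results, error
```

Calling run_parallel([None]) returns an empty result list and the captured TypeError. If the real cause is similar, logging the worker's arguments or formatting the traceback inside the worker before it returns to the pool would show which value was None.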