Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/06/02 01:54:00 UTC
[jira] [Commented] (IMPALA-11325) Impala-shell hits UnicodeDecodeError when outputting Unicode via --output_file
[ https://issues.apache.org/jira/browse/IMPALA-11325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545231#comment-17545231 ]
ASF subversion and git services commented on IMPALA-11325:
----------------------------------------------------------
Commit ed0d9341d3229b5857c8583d1817172d61b0f68c in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ed0d9341d ]
IMPALA-11325: Fix UnicodeDecodeError for shell file output
When using the --output_file commandline option for
impala-shell, the shell fails with UnicodeDecodeError
if the output contains Unicode characters.
For example, running this command:
impala-shell -B -q "select '引'" --output_file=output.txt
fails with:
UnicodeDecodeError : 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
This happens because OutputStream::write() calls encode('utf-8')
on a string that is already UTF-8 encoded. In Python 2, calling
encode() on a byte string first decodes it implicitly with the
'ascii' codec, which fails on any non-ASCII byte.
This changes the code to skip the encode('utf-8') call for Python 2.
Python 3 is using a string and still needs the encode call.
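
The following is a minimal sketch of the failure and of one way to make the write path tolerate both kinds of input. It is not the code from this commit: implicit_py2_encode() and to_output_bytes() are hypothetical names, and the first function only imitates, on any Python version, the implicit ASCII decode that Python 2 performs when encode() is called on a byte string.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only. The function names are hypothetical and are
# not part of impala-shell.


def implicit_py2_encode(byte_string):
    """Imitate Python 2's str.encode('utf-8') on a byte string.

    Python 2 first decodes the bytes with the default 'ascii' codec and
    only then encodes the result, so non-ASCII bytes raise an error.
    """
    return byte_string.decode('ascii').encode('utf-8')


def to_output_bytes(formatted_data):
    """Return bytes suitable for a file opened in binary mode.

    Byte strings (a Python 2 str that is already UTF-8) pass through
    unchanged; text (Python 3 str, Python 2 unicode) is encoded once.
    """
    if isinstance(formatted_data, bytes):
        return formatted_data
    return formatted_data.encode('utf-8')


utf8_bytes = u'引'.encode('utf-8')  # the three bytes e5 bc 95

try:
    implicit_py2_encode(utf8_bytes)
except UnicodeDecodeError as e:
    # Same message as the one reported by the shell.
    print(e)

# Both input types end up as the same bytes on disk.
print(to_output_bytes(u'引') == to_output_bytes(utf8_bytes))
```

Checking the input type rather than the interpreter version is a design choice of this sketch; the commit itself is described as skipping the call for Python 2.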
This is mostly a pragmatic fix to make the code a little bit
more functional, and there is more work to be done to have
clear contracts for the format() methods and clear points
of conversion to/from bytes.
Testing:
- Ran shell tests with Python 2 and Python 3 on Ubuntu 18
- Added a shell test that outputs a Unicode character
to an output file. Without the fix, this test fails.
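
The added shell test itself is not shown in this message. As a rough, standalone illustration of the kind of check it describes, the sketch below appends a Unicode value to a file opened in binary mode and reads it back. append_row() and check_unicode_round_trip() are hypothetical helpers, and nothing here uses the impala-shell test harness.

```python
# -*- coding: utf-8 -*-
# Standalone sketch; helper names are hypothetical, not impala-shell's.
import io
import os
import tempfile


def append_row(filename, formatted_data):
    """Append one row plus a newline to a file opened in binary mode."""
    if not isinstance(formatted_data, bytes):
        # Text must be encoded; bytes are assumed to be UTF-8 already.
        formatted_data = formatted_data.encode('utf-8')
    with open(filename, 'ab') as out_file:
        out_file.write(formatted_data)
        out_file.write(b'\n')


def check_unicode_round_trip():
    """Write the same character as text and as bytes, then read it back."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        append_row(path, u'引')                  # text input
        append_row(path, u'引'.encode('utf-8'))  # pre-encoded input
        with io.open(path, encoding='utf-8') as result:
            return result.read()
    finally:
        os.remove(path)


print(check_unicode_round_trip() == u'引\n引\n')
```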
Change-Id: Ic40be3d530c2694465f7bd2edb0e0586ff0e1fba
Reviewed-on: http://gerrit.cloudera.org:8080/18576
Reviewed-by: Michael Smith <mi...@cloudera.com>
Reviewed-by: Quanlong Huang <hu...@gmail.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
> Impala-shell hits UnicodeDecodeError when outputting Unicode via --output_file
> ------------------------------------------------------------------------------
>
> Key: IMPALA-11325
> URL: https://issues.apache.org/jira/browse/IMPALA-11325
> Project: IMPALA
> Issue Type: Bug
> Components: Clients
> Affects Versions: Impala 4.2.0
> Reporter: Joe McDonnell
> Priority: Blocker
>
> When running impala-shell and trying to output Unicode to a file via --output_file, it fails:
> {noformat}
> ishell -B -q "select '引'" --output_file=joetest3.txt
> /home/joe/view2/Impala/shell/option_parser.py:359: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
> if '--live_progress' in sys.argv and '--disable_live_progress' in sys.argv:
> /home/joe/view2/Impala/shell/option_parser.py:363: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
> if '--strict_hs2_protocol' in sys.argv:
> /home/joe/view2/Impala/shell/option_parser.py:369: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
> if '--verbose' in sys.argv and '--quiet' in sys.argv:
> Starting Impala Shell with no authentication using Python 2.7.16
> Warning: live_progress only applies to interactive shell sessions, and is being skipped for now.
> Opened TCP connection to localhost:21050
> Connected to localhost:21050
> Server version: impalad version 4.1.0-SNAPSHOT DEBUG (build 4236c307b971881a3b1d85068db5b053a9c34cfa)
> Query: select '引'
> Query submitted at: 2022-05-31 08:31:50 (Coordinator: http://joemcdonnell:25000)
> Query progress can be monitored at: http://joemcdonnell:25000/query_plan?query_id=2347462fe8a18544:bbeedc1800000000
> UnicodeDecodeError : 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
> Please check for columns containing binary data to find the possible source of the error.
> Could not execute command: select '引'{noformat}
> This is specific to file output. This same query works if outputting to the console.
> This line seems to be the problem:
> {noformat}
> with open(self.filename, 'ab') as out_file:
>   # Note that instances of this class do not persist, so it's fine to
>   # close the file handle after each write.
>   out_file.write(formatted_data.encode('utf-8')) # file opened in binary mode <--------
>   out_file.write(b'\n')
> {noformat}
> [https://github.com/apache/impala/blob/master/shell/shell_output.py#L115]
> It seems to work if we remove the .encode('utf-8').
--
This message was sent by Atlassian Jira
(v8.20.7#820007)