Posted to issues@hive.apache.org by "BELUGA BEHR (JIRA)" <ji...@apache.org> on 2017/10/31 19:06:00 UTC

[jira] [Commented] (HIVE-16826) Improvements for SeparatedValuesOutputFormat

    [ https://issues.apache.org/jira/browse/HIVE-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227337#comment-16227337 ] 

BELUGA BEHR commented on HIVE-16826:
------------------------------------

Interestingly, there seems to be an issue with the current code.  When I instruct beeline to use quotes (via {{disable.quoting.for.sv}}), my changes produce the same output as the current implementation.  However, when no quotes are specified, the two outputs differ.
\\
\\
* theFileWhereToStoreTheData.csv = current implementation
* theFileWhereToStoreTheData.csv.mod = with my changes

{code}
[root@host ~]# md5sum theFileWhereToStoreTheData.csv*
6bfb928df7d2a7d778930bb972bc23c5  theFileWhereToStoreTheData.csv
fb3972fe583a4e1565a4fddb81dc8d62  theFileWhereToStoreTheData.csv.mod
{code}

The first 20,000 lines of both files are identical, but somewhere within the next 1,000 lines the outputs diverge:

{code}
[root@host ~]# head -n 20000 theFileWhereToStoreTheData.csv | xxd | md5sum
280b418c87ed701b509f4cbbdfe8fa29  -
[root@host ~]# head -n 20000 theFileWhereToStoreTheData.csv.mod | xxd | md5sum
280b418c87ed701b509f4cbbdfe8fa29  -

[root@host ~]# head -n 21000 theFileWhereToStoreTheData.csv | xxd | md5sum
3b1eb5b7b63a5255c8e1539230d190a9  -
[root@host ~]# head -n 21000 theFileWhereToStoreTheData.csv.mod | xxd | md5sum
7de5ae6604e91a42a388c9826174ee30  -
{code}
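Rather than bisecting with {{head}}, the first differing line can be found directly. This is only a sketch: {{first_diff_line}} is a hypothetical helper name, and it assumes {{cmp}} reports the line number as the last field of its output, as the POSIX and GNU formats do.

```shell
# Hypothetical helper: print the number of the first line at which two
# files differ. Prints nothing if the files are identical or if one file
# is a prefix of the other (cmp reports that case on stderr).
first_diff_line() {
    # cmp prints e.g. "A B differ: byte 123, line 45"; keep the last field.
    cmp "$1" "$2" 2>/dev/null | awk '{ print $NF }'
}
```

For example, {{first_diff_line theFileWhereToStoreTheData.csv theFileWhereToStoreTheData.csv.mod}} would give a line number that can then be passed to {{sed -n}} and {{xxd}} for inspection.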

Everything in the file starts fine...

{code}
[root@host ~]# head -n 4 theFileWhereToStoreTheData.csv | tail -n 2 | xxd
0000000: 3030 2d30 3030 302c 416c 6c20 4f63 6375  00-0000,All Occu
0000010: 7061 7469 6f6e 732c 3133 3433 3534 3235  pations,13435425
0000020: 302c 3430 3639 300a 3030 2d30 3030 302c  0,40690.00-0000,
0000030: 416c 6c20 4f63 6375 7061 7469 6f6e 732c  All Occupations,
0000040: 3133 3433 3534 3235 302c 3430 3639 300a  134354250,40690.
{code}

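But later in the file the behavior changes.  In this sample, the string field (which contains a comma) is wrapped in NUL bytes (hex {{00}}, shown as {{.}} in the ASCII column) where quote characters would normally go: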

{code}
[root@nightly513-unsecure-1 ~]# head -n 100000 theFileWhereToStoreTheData.csv | tail -n 2 | xxd
0000000: 3135 2d31 3031 312c 0043 6f6d 7075 7465  15-1011,.Compute
0000010: 7220 616e 6420 696e 666f 726d 6174 696f  r and informatio
0000020: 6e20 7363 6965 6e74 6973 7473 2c20 7265  n scientists, re
0000030: 7365 6172 6368 002c 3238 3732 302c 3130  search.,28720,10
0000040: 3036 3430 0a31 352d 3130 3131 2c00 436f  0640.15-1011,.Co
0000050: 6d70 7574 6572 2061 6e64 2069 6e66 6f72  mputer and infor
0000060: 6d61 7469 6f6e 2073 6369 656e 7469 7374  mation scientist
0000070: 732c 2072 6573 6561 7263 6800 2c32 3837  s, research.,287
0000080: 3230 2c31 3030 3634 300a                 20,100640.
{code}
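To check a whole file for these bytes without reading hex dumps, something like the following could be used. It is a minimal sketch: the function names are hypothetical, and {{first_nul_line}} assumes the data contains no {{0x01}} bytes of its own, because it maps NUL to {{0x01}} before searching.

```shell
# Hypothetical helper: count the NUL (0x00) bytes in a file.
count_nul_bytes() {
    # Delete every byte except NUL, then count what is left.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: print the number of the first line that contains
# a NUL byte (prints nothing if there is none).
first_nul_line() {
    # A NUL cannot be passed in a shell argument, so map it to 0x01 first.
    tr '\000' '\001' < "$1" |
        awk -v c="$(printf '\001')" 'index($0, c) { print NR; exit }'
}
```

Running both helpers against {{theFileWhereToStoreTheData.csv}} and {{theFileWhereToStoreTheData.csv.mod}} would show whether the NUL bytes appear only in the current implementation's output and at which line they first occur.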

I can't figure out how these NUL bytes are being introduced in the current implementation, but my changes seem to address this issue and do not include these erroneous extra bytes.

> Improvements for SeparatedValuesOutputFormat
> --------------------------------------------
>
>                 Key: HIVE-16826
>                 URL: https://issues.apache.org/jira/browse/HIVE-16826
>             Project: Hive
>          Issue Type: Improvement
>          Components: Beeline
>    Affects Versions: 2.1.1, 3.0.0
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Minor
>         Attachments: HIVE-16826.1.patch, HIVE-16826.2.patch
>
>
> Proposing changes to class {{org.apache.hive.beeline.SeparatedValuesOutputFormat}}.
> # Simplify the code
> # Code currently creates and destroys {{CsvListWriter}}, which contains a buffer, for every line printed
> # Use Apache Commons libraries for certain actions
> # Prefer non-synchronized {{StringBuilderWriter}} to Java's synchronized {{StringWriter}}


