You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by arina-ielchiieva <gi...@git.apache.org> on 2017/09/07 16:02:32 UTC

[GitHub] drill pull request #936: DRILL-5772: Add unit tests to indicate how utf-8 su...

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/936

    DRILL-5772: Add unit tests to indicate how utf-8 support can be enabled / disabled in Drill

    Details in [DRILL-5772](https://issues.apache.org/jira/browse/DRILL-5772).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-5772

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/936.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #936
    
----
commit 6233a3c2c0338554095faf55483a1f78ed7a48b1
Author: Arina Ielchiieva <ar...@gmail.com>
Date:   2017-09-07T15:15:14Z

    DRILL-5772: Add unit tests to indicate how utf-8 support can be enabled / disabled in Drill

----


---

[GitHub] drill pull request #936: DRILL-5772: Enable UTF-8 support in query string by...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/936


---

[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/936
  
    @paul-rogers 
    agree with you that charsets used in saffron properties should be defaulted in Drill to `UTF-8` since Drill can read UTF-8 data and it's strange that it would fail by default when Calcite will attempt to parse string into literal used in query.
    
    I have looked into Calcite code and there is no option to hard-code charset values for Calcite but charset can be changed using properties.
    There are two options of setting saffron properties:
    1. as system property;
    2. using `saffron.properties` file.
    
    I don't really like passing them as `-D` when starting the drillbit 9since there are at least two), so I am more inclined to use `saffron.properties` file. Unfortunately, in Calcite code `saffron.properties` location is expected to be working folder [1], i.e. the place where java process was started. I have created Jira and pull request in Calcite to allow `saffron.properties` to be present in classpath since it's more convenient [2]. I'll keep you updated on Calcite community feedback.
    
    [1] https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/util/SaffronProperties.java#L113
    
    [2] https://issues.apache.org/jira/browse/CALCITE-2014


---

[GitHub] drill pull request #936: DRILL-5772: Enable UTF-8 support in query string by...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/936#discussion_r147552950
  
    --- Diff: exec/java-exec/src/test/resources/saffron.properties ---
    @@ -0,0 +1,23 @@
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to you under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +# http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +
    +# This properties file is used by Apache Calcite to define allowed charset in string literals,
    +# which is by default ISO-8859-1.
    +# Current configuration allows parsing UTF-8 by default, i.e. queries that contain utf-8 string literal.
    +# To take affect this file should be present in classpath.
    +
    +saffron.default.charset=UTF-16LE
    +saffron.default.nationalcharset=UTF-16LE
    +saffron.default.collation.name=UTF-16LE$en_US
    --- End diff --
    
    We need separate test properties file to be included into classpath during tests execution. Tests run in exec module, while production properties reside in distribution module.



---

[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/936
  
    @arina-ielchiieva, thanks for the explanation. Drill's runtime framework assumes that data is either:
    
    1. ASCII (or, at least, single-byte character set based on ASCII), or
    2. UTF-8 (data is converted to/from UTF-8 when converting VarChar to a Java String.)
    
    Since Drill's code seems to assume ASCII (when it cares about character format), then one could claim that Drill does not have an encoding: it just treats characters as bytes. But, things such as determining string length, doing pattern matching, and so on must be aware of the character set -- if only to know which bytes are continuations of a multi-byte character. (That is, a three-byte sequence in UTF-8 might be one, two or three characters, depending.)
    
    Now, if the planner assumes ISO-8859-1, but the Drill execution engine assumes UTF-8, then string constants passed from one to the other can become corrupted in the case where a particular byte sequence in ISO-8859-1 represents a different character than that same byte sequence in UTF-8.
    
    Where would this occur? Look at the [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) definition. ISO-8859 is a single-byte character set with meanings associated to the bytes in the range 0x40 to 0x7f. But, in UTF-8, the high bit indicates a prefix character. So, 0xF7 is a valid single-byte character in ISO-8859, but is a lead-in character in UTF-8.
    
    The point here is that setting the character set would seem to be a global setting. If the Saffron setting is purely for the parser (how to interpret incoming text), and the parser always produces Java strings in UTF-16 (which are then encoded into UTF-8 for execution), then we're fine.
    
    But, if the parser encoding is written as bytes sent to the execution engine, we're in trouble.
    
    Further, Drill has a web UI. The typical web character set is UTF-8, so queries coming from the web UI are encoded in UTF-8.
    
    All this suggests two things:
    
    1. Drill should either always accept UTF-8 (the Saffron property should always be set.) or
    2. The property is specified by the client and used to decode the bytes within a Protobuf message to produce a character stream given to the parser.
    
    It appears that UTF-8 is the default Protobuf String type encoding; sender and receiver would have to agree on another format. Does Drill have such an RPC property? I've not seen it, but I'm not an expert.
    
    In short, if this change ensures that the parser *always* uses UTF-8, then this is good. If the character encoding is an option, then we have to consider all the above issues to have a fully working, end-to-end solution.


---

[GitHub] drill issue #936: DRILL-5772: Enable UTF-8 support in query string by defaul...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/936
  
    @paul-rogers Calcite community has approved my changes and they have been already cherry-picked into Drill Calcite branch. I have updated pull request to reflect this recent changes. Now utf-8 support in query string will be enabled by default and controlled using saffron.properties file.


---

[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/936
  
    @arina-ielchiieva, my comments boil down to two points:
    
    1. What happens if UTF-8 support in Drill is disabled? Do we have consistent handling of character data? Do we revert to ASCII? The character set on the Drillbit? How is this reconciled with the character set on the Web and the client?
    2. On the other hand, if Drill handles only UTF-8 (with bugs), does it make sense to disable that support if the alternative is undefined or more broken than the UTF-8 support?
    
    In short: shouldn't UTF-8 be the hard-coded by Drill when working with Calcite to avoid these ambiguities? 


---

[GitHub] drill pull request #936: DRILL-5772: Enable UTF-8 support in query string by...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/936#discussion_r147515235
  
    --- Diff: exec/java-exec/src/test/resources/saffron.properties ---
    @@ -0,0 +1,23 @@
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to you under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +# http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +
    +# This properties file is used by Apache Calcite to define allowed charset in string literals,
    +# which is by default ISO-8859-1.
    +# Current configuration allows parsing UTF-8 by default, i.e. queries that contain utf-8 string literal.
    +# To take affect this file should be present in classpath.
    +
    +saffron.default.charset=UTF-16LE
    +saffron.default.nationalcharset=UTF-16LE
    +saffron.default.collation.name=UTF-16LE$en_US
    --- End diff --
    
    Do we need a separate test properties file if it is identical with the main one?


---

[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/936
  
    @paul-rogers, the unit tests just show which influence saffron property has on Drill (since people in the community where asking how to enable support UTF-8 in Drill), a long with this PR we'll also add description to Drill documentation. 
    
    So far Drill relies on Calcite saffron property to determine which charset it supports. By default, it's ISO-8859-1. So currently if customer wants to query UTF-8 data in Drill, he will get an error (one of the unit tests shows it) and needs to override saffron property to proceed. I guess, we don't support UTF-8 by default since there are some issues and Drill did not fully tested UTF-8 support.


---