Posted to user@drill.apache.org by Andries Engelbrecht <ae...@maprtech.com> on 2015/03/03 03:23:52 UTC

UTF coding in JSON docs

How can I convert JSON data with various characters in a text field to a more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
Or, if need be, filter out the records with characters not in UTF8?

CONVERT_TO is for binary, and casting as VARCHAR still passes these characters through.

Results in sqlline work fine, but when using tools through ODBC various failures occur (especially with tools not capable of handling characters outside of UTF8).
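
For reference, a plain cast along these lines (a minimal sketch; the dfs path and the msg field name are just placeholders) still returns the offending characters unchanged:

SELECT CAST(t.msg AS VARCHAR(1000)) AS msg
FROM dfs.`/data/tweets.json` t;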


Thanks
—Andries

Re: UTF coding in JSON docs

Posted by Andries Engelbrecht <ae...@maprtech.com>.
I was having some issues with the hex codes on large data sets.

Using the following gives better results for most Western languages:

regexp_replace(<column>, '[^ -~¡-ÿ]', '°')
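
As a usage sketch (the dfs path and column name are placeholders), the expression drops straight into a query; the class keeps printable ASCII (space through ~) plus the Latin-1 range ¡ through ÿ, and replaces everything else with °:

SELECT regexp_replace(t.`text`, '[^ -~¡-ÿ]', '°') AS clean_text
FROM dfs.`/data/tweets.json` t;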

—Andries


On Mar 12, 2015, at 8:24 AM, Andries Engelbrecht <ae...@maprtech.com> wrote:

> Just an update on this topic.
> 
> The regexp_replace function uses a single escape \ and then the hex code to function properly.
> The ^ is for the inverse: any character that does not fall within the specified range is replaced.
> 
> I.e.
> regexp_replace(<column_name>, '[^\x00-\xad]', '°')
> 
> 
> I used this UTF-8 table for the hex codes.
> http://www.utf8-chartable.de/unicode-utf8-table.pl
> 
> —Andries 
> 
> 
> 
> On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <ae...@maprtech.com> wrote:
> 
>> I should have clarified more. It turns out the encoding is UTF8, but various characters (since UTF8 supports all of Unicode and then some) within the encoding were causing issues for tools using Drill.
>> Using regexp_replace to replace the undesired characters resolved the issue, since those characters are not needed and only caused problems.
>> 
>> Thanks
>> —Andries
>> 
>> 
>> On Mar 9, 2015, at 12:42 PM, Ted Dunning <te...@gmail.com> wrote:
>> 
>>> This is dangerous if the source file does not contain UTF-8 but you treat
>>> it as if it does.  The problem is that bytes with the high bit set will be
>>> treated as the leading bytes of multi-byte UTF-8 sequences.
>>> 
>>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> The answer to my problem was to use regexp_replace.
>>>> 
>>>> In this case I just used an underscore to replace any characters outside of
>>>> the expected range, based on the UTF8 character set table.
>>>> 
>>>> Instead of just selecting the column, I used this:
>>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>>> 
>>>> The ^ at the front selects the inverse (i.e. any characters not
>>>> specified in the range inside the square brackets []).
>>>> 
>>>> (Make a note of this, as I'm sure many will run into this issue in the
>>>> future with various JSON data.)
>>>> 
>>>> —Andries
>>>> 
>>>> 
>>>> 
>>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <ae...@maprtech.com>
>>>> wrote:
>>>> 
>>>>> How can I convert JSON data with various characters in a text field to a
>>>> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
>>>>> Or, if need be, filter out the records with characters not in UTF8?
>>>>> 
>>>>> CONVERT_TO is for binary, and casting as VARCHAR still passes these
>>>> characters through.
>>>>> 
>>>>> Results in sqlline work fine, but when using tools through ODBC various
>>>> failures occur (especially with tools not capable of handling characters outside
>>>> of UTF8).
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>> 
>>>> 
>> 
> 


Re: UTF coding in JSON docs

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Just an update on this topic.

The regexp_replace function uses a single escape \ and then the hex code to function properly.
The ^ is for the inverse: any character that does not fall within the specified range is replaced.

I.e.
regexp_replace(<column_name>, '[^\x00-\xad]', '°')
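
A full query using this single-escape form might look like the following (a sketch; the dfs path and column name are placeholders):

SELECT regexp_replace(t.description, '[^\x00-\xad]', '°') AS clean_description
FROM dfs.`/data/source.json` t;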


I used this UTF-8 table for the hex codes.
http://www.utf8-chartable.de/unicode-utf8-table.pl

—Andries 



On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <ae...@maprtech.com> wrote:

> I should have clarified more. It turns out the encoding is UTF8, but various characters (since UTF8 supports all of Unicode and then some) within the encoding were causing issues for tools using Drill.
> Using regexp_replace to replace the undesired characters resolved the issue, since those characters are not needed and only caused problems.
> 
> Thanks
> —Andries
> 
> 
> On Mar 9, 2015, at 12:42 PM, Ted Dunning <te...@gmail.com> wrote:
> 
>> This is dangerous if the source file does not contain UTF-8 but you treat
>> it as if it does.  The problem is that bytes with the high bit set will be
>> treated as the leading bytes of multi-byte UTF-8 sequences.
>> 
>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
>> aengelbrecht@maprtech.com> wrote:
>> 
>>> The answer to my problem was to use regexp_replace.
>>> 
>>> In this case I just used an underscore to replace any characters outside of
>>> the expected range, based on the UTF8 character set table.
>>> 
>>> Instead of just selecting the column, I used this:
>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>> 
>>> The ^ at the front selects the inverse (i.e. any characters not
>>> specified in the range inside the square brackets []).
>>> 
>>> (Make a note of this, as I'm sure many will run into this issue in the
>>> future with various JSON data.)
>>> 
>>> —Andries
>>> 
>>> 
>>> 
>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <ae...@maprtech.com>
>>> wrote:
>>> 
>>>> How can I convert JSON data with various characters in a text field to a
>>> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
>>>> Or, if need be, filter out the records with characters not in UTF8?
>>>> 
>>>> CONVERT_TO is for binary, and casting as VARCHAR still passes these
>>> characters through.
>>>> 
>>>> Results in sqlline work fine, but when using tools through ODBC various
>>> failures occur (especially with tools not capable of handling characters outside
>>> of UTF8).
>>>> 
>>>> 
>>>> Thanks
>>>> —Andries
>>> 
>>> 
> 


Re: UTF coding in JSON docs

Posted by Andries Engelbrecht <ae...@maprtech.com>.
I should have clarified more. It turns out the encoding is UTF8, but various characters (since UTF8 supports all of Unicode and then some) within the encoding were causing issues for tools using Drill.
Using regexp_replace to replace the undesired characters resolved the issue, since those characters are not needed and only caused problems.

Thanks
—Andries


On Mar 9, 2015, at 12:42 PM, Ted Dunning <te...@gmail.com> wrote:

> This is dangerous if the source file does not contain UTF-8 but you treat
> it as if it does.  The problem is that bytes with the high bit set will be
> treated as the leading bytes of multi-byte UTF-8 sequences.
> 
> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> The answer to my problem was to use regexp_replace.
>> 
>> In this case I just used an underscore to replace any characters outside of
>> the expected range, based on the UTF8 character set table.
>> 
>> Instead of just selecting the column, I used this:
>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>> 
>> The ^ at the front selects the inverse (i.e. any characters not
>> specified in the range inside the square brackets []).
>> 
>> (Make a note of this, as I'm sure many will run into this issue in the
>> future with various JSON data.)
>> 
>> —Andries
>> 
>> 
>> 
>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <ae...@maprtech.com>
>> wrote:
>> 
>>> How can I convert JSON data with various characters in a text field to a
>> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
>>> Or, if need be, filter out the records with characters not in UTF8?
>>> 
>>> CONVERT_TO is for binary, and casting as VARCHAR still passes these
>> characters through.
>>> 
>>> Results in sqlline work fine, but when using tools through ODBC various
>> failures occur (especially with tools not capable of handling characters outside
>> of UTF8).
>>> 
>>> 
>>> Thanks
>>> —Andries
>> 
>> 


Re: UTF coding in JSON docs

Posted by Ted Dunning <te...@gmail.com>.
This is dangerous if the source file does not contain UTF-8 but you treat
it as if it does.  The problem is that bytes with the high bit set will be
treated as the leading bytes of multi-byte UTF-8 sequences.
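
(For instance, the Latin-1 encoding of "café" is the byte sequence 63 61 66 E9; read as UTF-8, the final byte 0xE9, binary 11101001, announces a three-byte sequence, so a UTF-8 decoder either fails or substitutes a replacement character.)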

On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> The answer to my problem was to use regexp_replace.
>
> In this case I just used an underscore to replace any characters outside of
> the expected range, based on the UTF8 character set table.
>
> Instead of just selecting the column, I used this:
> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>
> The ^ at the front selects the inverse (i.e. any characters not
> specified in the range inside the square brackets []).
>
> (Make a note of this, as I'm sure many will run into this issue in the
> future with various JSON data.)
>
> —Andries
>
>
>
> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <ae...@maprtech.com>
> wrote:
>
> > How can I convert JSON data with various characters in a text field to a
> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
> > Or, if need be, filter out the records with characters not in UTF8?
> >
> > CONVERT_TO is for binary, and casting as VARCHAR still passes these
> characters through.
> >
> > Results in sqlline work fine, but when using tools through ODBC various
> failures occur (especially with tools not capable of handling characters outside
> of UTF8).
> >
> >
> > Thanks
> > —Andries
>
>

Re: UTF coding in JSON docs

Posted by Andries Engelbrecht <ae...@maprtech.com>.
The answer to my problem was to use regexp_replace.

In this case I just used an underscore to replace any characters outside of the expected range, based on the UTF8 character set table.

Instead of just selecting the column, I used this:
regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')

The ^ at the front selects the inverse (i.e. any characters not specified in the range inside the square brackets []).
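
As a usage sketch (the dfs path and field name are placeholders), the expression goes straight into the select list; the same pattern can also filter out records containing out-of-range characters by keeping only the rows the replacement would leave unchanged:

SELECT regexp_replace(t.`text`, '[^\\x00-\\x7f-\\x80-\\xad]', '_') AS clean_text
FROM dfs.`/data/tweets.json` t;

SELECT t.`text`
FROM dfs.`/data/tweets.json` t
WHERE t.`text` = regexp_replace(t.`text`, '[^\\x00-\\x7f-\\x80-\\xad]', '_');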

(Make a note of this, as I'm sure many will run into this issue in the future with various JSON data.)

—Andries



On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <ae...@maprtech.com> wrote:

> How can I convert JSON data with various characters in a text field to a more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
> Or, if need be, filter out the records with characters not in UTF8?
> 
> CONVERT_TO is for binary, and casting as VARCHAR still passes these characters through.
> 
> Results in sqlline work fine, but when using tools through ODBC various failures occur (especially with tools not capable of handling characters outside of UTF8).
> 
> 
> Thanks
> —Andries


Re: UTF coding in JSON docs

Posted by Andries Engelbrecht <ae...@maprtech.com>.
I don't think this is an option, as this is for Twitter data being fed in.

Any other options?


On Mar 2, 2015, at 10:40 PM, Ted Dunning <te...@gmail.com> wrote:

> The right solution is to go into the JSON format and somehow let character
> encoding be defined there.
> 
> 
> 
> On Tue, Mar 3, 2015 at 3:23 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> How can I convert JSON data with various characters in a text field to a
>> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
>> Or, if need be, filter out the records with characters not in UTF8?
>> 
>> CONVERT_TO is for binary, and casting as VARCHAR still passes these characters
>> through.
>> 
>> Results in sqlline work fine, but when using tools through ODBC various
>> failures occur (especially with tools not capable of handling characters outside
>> of UTF8).
>> 
>> 
>> Thanks
>> —Andries


Re: UTF coding in JSON docs

Posted by Ted Dunning <te...@gmail.com>.
The right solution is to go into the JSON format and somehow let character
encoding be defined there.



On Tue, Mar 3, 2015 at 3:23 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> How can I convert JSON data with various characters in a text field to a
> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
> Or, if need be, filter out the records with characters not in UTF8?
>
> CONVERT_TO is for binary, and casting as VARCHAR still passes these characters
> through.
>
> Results in sqlline work fine, but when using tools through ODBC various
> failures occur (especially with tools not capable of handling characters outside
> of UTF8).
>
>
> Thanks
> —Andries