Posted to user@drill.apache.org by Edmon Begoli <eb...@gmail.com> on 2015/08/24 18:05:03 UTC

UTF conversion issue with gz files

We are unable to process files that OS X identifies as character set
UTF-16LE.  After unzipping one and converting it to UTF-8, we were able to
process it fine.  There are CONVERT_TO and CONVERT_FROM functions that appear
to address the issue, but we could not get a usable result from them on
either the gzipped or the unzipped version of the UTF-16 file.  CONVERT_FROM
on its own ran OK, but when we tried to wrap its result in a cast to a date,
or anything else, it failed.  Trying to work with the data natively exposed
its double-byte nature (a substring of 1,4 returns only the first two
characters).
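
For anyone hitting the same problem, the workaround described above
(decompress, then re-encode as UTF-8 before querying) can be sketched in
Python.  This is only a minimal sketch and not part of the original report:
the function name and the file paths are hypothetical, and it assumes the
input really is UTF-16LE, optionally starting with a byte order mark.

```python
import gzip


def gz_utf16le_to_utf8(src_path, dst_path):
    """Decompress a gzipped UTF-16LE text file and rewrite it as plain UTF-8.

    Hypothetical helper for illustration; paths are supplied by the caller.
    """
    # Text mode decodes while decompressing; newline="" keeps the original
    # line endings untouched on both the read and the write side.
    with gzip.open(src_path, "rt", encoding="utf-16-le", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        first = True
        for line in src:
            if first:
                # The utf-16-le codec does not strip a leading BOM, so drop it.
                line = line.lstrip("\ufeff")
                first = False
            dst.write(line)
```

The resulting file is ordinary UTF-8 text, which is the form we were able to
process.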

Is there a fix for this or should I file it as an issue?

I cannot post the data because it is proprietary, but I might be able to
re-create a releasable version of it for testing and development purposes.

Re: UTF conversion issue with gz files

Posted by Edmon Begoli <eb...@gmail.com>.
Done.

On Tue, Aug 25, 2015 at 10:23 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> Yes, please post an issue.  Right now, the text reader is based on utf8.
> It would need an enhancement to support alternative character sets.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio

Re: UTF conversion issue with gz files

Posted by Jacques Nadeau <ja...@dremio.com>.
Yes, please post an issue.  Right now, the text reader is based on utf8.
It would need an enhancement to support alternative character sets.
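
The substring symptom from the original report is consistent with UTF-16LE
bytes being read as if they were UTF-8.  The Python sketch below illustrates
that effect outside of Drill; it is an assumption about the cause, not a
trace of what the text reader does, and the sample value is hypothetical
since the real data was not posted.  It only holds for ASCII-range
characters, where every second byte is 0x00.

```python
# Hypothetical sample value; the real data from the report was not published.
text = "2015-08-24"

# Bytes as stored in a UTF-16LE file: '2', 0x00, '0', 0x00, ...
raw = text.encode("utf-16-le")

# A UTF-8-only reader accepts these bytes, because NUL (U+0000) is valid
# UTF-8, so every real character is followed by an invisible NUL character.
misread = raw.decode("utf-8")

# Analogous to taking a substring over the first four positions.
first_four = misread[:4]
visible = first_four.replace("\x00", "")

print(len(text), len(misread))  # 10 20
print(repr(first_four))         # '2\x000\x00'
print(visible)                  # 20

# Re-interpreting the same bytes with the correct charset recovers the text.
recovered = misread.encode("utf-8").decode("utf-16-le")
print(recovered)                # 2015-08-24
```

Four positions cover only two visible characters, which matches the
behavior described in the report.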

--
Jacques Nadeau
CTO and Co-Founder, Dremio