Posted to dev@trafodion.apache.org by "Liu, Ming (Ming)" <mi...@esgyn.cn> on 2016/01/06 02:16:52 UTC

Re: Re: enhance TRANSLATE to support Chinese charset?

Hi all,
Thanks, everyone, for the discussion; there are a lot of very good ideas and questions here. I would like to summarize a little bit.
1. GBK vs. GB2312, and encoding.
Yes, Dave, we will support GBK first, so it should cover GB2312 as well. We don't need to support GB2312 explicitly unless we hit specific issues. I am not aware of different GBK encodings; we will use the Glibc standard API iconv() to do the conversion, and I assume it handles all the details. In some initial tests it converted real source data correctly. I need to understand GBK better here, but yes, we will support GBK, not GB2312.

2. Total support of GBK in Trafodion
I agree that fully supporting GBK is a big project, so as Qifan suggested, for this task we only allow GBK to appear in the TRANSLATE function. We can let the user change the Hive character set to GBK via HIVE_DEFAULT_CHARSET and use GBK2UTF8 in the TRANSLATE function to do the conversion. This will not affect other parts of the database, and it is not a lie. :-) And as Hans/Selva clarified, Hive cannot specify a charset per column, so I feel we don't need to make it any more complex. Qifan's proposal of letting Trafodion do implicit charset translation is great, so we don't have to write a very complex SQL statement that adds a TRANSLATE for each 'string' column in Hive. We have seen Hive tables with 300+ columns before...
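
As a rough sketch of what I have in mind (the table and column names here are made up, and GBK2UTF8 is only the proposed flavor name):

  cqd HIVE_DEFAULT_CHARSET 'GBK';
  -- explicit form, one TRANSLATE per string column:
  upsert using load into traf_utf8_tbl
    select translate(c1 using GBK2UTF8) from hive.hive.src;
  -- implicit form, relying on ALLOW_IMPLICIT_CHAR_CASTING to add the TRANSLATE for us:
  load into traf_utf8_tbl select * from hive.hive.src;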

3. Total support of GBK in Trafodion, again
In many respects, I feel GBK is very similar to UCS2: its width is fixed at 2 bytes, and it does not allow '\0' (null) in any character byte. So I hope the effort can be contained by mimicking what Trafodion already does for UCS2. But we should launch total support of GBK as a separate task, since for this TRANSLATE proposal the goal is simply to make loading GBK data into UTF-8 columns in Trafodion easier.

Thanks,
Ming

-----Original Message-----
From: Dave Birdsall [mailto:dave.birdsall@esgyn.com]
Sent: January 6, 2016 2:30
To: dev@trafodion.incubator.apache.org
Subject: RE: Re: enhance TRANSLATE to support Chinese charset?

Hi,

I was reading up on the character sets a little bit (e.g. Wikipedia articles on GB2312 and GBK). It seems GBK is an extension of GB2312, so I'm assuming that we are discussing support for GBK, correct? Also, it appears there are multiple possible encodings for GBK (no surprise, really; Unicode has multiple encodings too). We will want to specify which encoding(s) we wish to support.

Dave

-----Original Message-----
From: Qifan Chen [mailto:qifan.chen@esgyn.com]
Sent: Tuesday, January 5, 2016 10:22 AM
To: dev <de...@trafodion.incubator.apache.org>
Subject: Re: Re: enhance TRANSLATE to support Chinese charset?

Sorry, in my previous email, I should also mention that the GBK charset is only valid within the TRANSLATE function.

Thanks --Qifan

On Tue, Jan 5, 2016 at 12:17 PM, Qifan Chen <qi...@esgyn.com> wrote:

> Hi Hans,
>
> Yes, either translating GBK to UTF8 or storing it as binary data in
> ISO88591 is fine.
>
> I feel the work to allow the HIVE_DEFAULT_CHARSET cqd to take GBK, and
> to disable creating Trafodion tables with GBK character columns and GBK
> literals (both in the parser), is still reasonably sized.
>
> With this, Trafodion only recognizes GBK as a source charset for Hive 
> tables, and is able to convert the GBK source to other data format.
>
> Thanks --Qifan
>
>
> On Tue, Jan 5, 2016 at 12:02 PM, Hans Zeller <ha...@esgyn.com>
> wrote:
>
>> Just want to clarify: I hate to lie to the engine about character 
>> sets just as much. If someone can make a workable proposal to avoid 
>> it that would be great.
>>
>> Unlike in the predecessor products, which tried to interpret the 
>> data, we would treat the GBK strings like binary data stored in 
>> ISO88591, except in the one new translate function that could convert 
>> it to UTF-8. So, we would reduce the lie to the bad but already 
>> accepted one of storing binary data in ISO88591.
>>
>> Allowing a new GBK character set without adequate support for it may 
>> be more trouble than treating GBK as binary data. A proposal to do 
>> that should specify the limitations very carefully.
>>
>> Thanks,
>>
>> Hans
>>
>> On Tue, Jan 5, 2016 at 9:48 AM, Dave Birdsall 
>> <da...@esgyn.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I'm wondering how large a project it is to add a new character set 
>> > to Trafodion? I personally hate the idea of lying to the engine 
>> > (for
>> example,
>> > pretending that GBK is ISO88591 in most places except when we want 
>> > to translate to another character set; we tried this in one of our
>> predecessor
>> > products and it caused endless complications -- it is technical 
>> > debt
>> that
>> > will soon demand to be repaid).
>> >
>> > If we need to reduce the scope of the project, one way might be to 
>> > limit the new character set to Hive tables only (so we'd have to do 
>> > the metadata
>> work
>> > in Trafodion), and translate the data immediately in the scan nodes
>> where
>> > it
>> > enters the engine (so we could limit run-time support to a 
>> > translate operator; other expression operators would operate only 
>> > on known
>> character
>> > sets).
>> >
>> > Dave
>> >
>> > -----Original Message-----
>> > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
>> > Sent: Tuesday, January 5, 2016 9:21 AM
>> > To: dev <de...@trafodion.incubator.apache.org>
>> > Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
>> >
>> > Hi,
>> >
>> > Thanks, Ming, for confirming what the use case is.
>> >
>> > As Qifan and Selva have pointed out, the HIVE_DEFAULT_CHARSET CQD 
>> > determines the character set of string columns in Hive tables. I 
>> > don't think,
>> however,
>> > that it is feasible to allow GBK as a value for 
>> > HIVE_DEFAULT_CHARSET, unless you are ready to do a lot of work.
>> > That's because Hive columns have
>> regular
>> > Trafodion data types, and supporting GBK in regular data types 
>> > would be
>> a
>> > large project.
>> >
>> > Therefore, I would suggest that we use cqd HIVE_DEFAULT_CHARSET
>> 'iso88591'
>> > instead. We also don't have a BINARY data type and use ISO88591 for
>> that as
>> > well. To make that happen, you could either use some CQD as we 
>> > talked about, or a special TRANSLATE flavor, something like
>> > GBK_AS_ISO_TO_UTF8 (sorry
>> for
>> > all the long names, maybe you can shorten them). This may be better 
>> > than the CQD to suppress character set checks. I agree with Qifan 
>> > that we should
>> try
>> > hard to check all character sets. For the same reason, we should 
>> > not
>> allow
>> > UTF-8 as HIVE_DEFAULT_CHARSET when we store GBK data in Hive tables.
>> >
>> > Qifan mentions different character sets in a single Hive table. My 
>> > impression is that the Hive folks did not anticipate that, since 
>> > Hive
>> used
>> > to be just simple text files. So, I hope that won't be a common case.
>> >
>> > Thanks,
>> >
>> > Hans
>> >
>> > On Tue, Jan 5, 2016 at 8:35 AM, Selva Govindarajan < 
>> > selva.govindarajan@esgyn.com> wrote:
>> >
>> > > Hi Ming,
>> > >
>> > > I believe Hive doesn't have a way to specify the character set 
>> > > encoding of individual columns. It relies on the JVM default 
>> > > encoding for the table as a whole. The JVM default encoding 
>> > > depends on the file.encoding default or the LANG variable.
>> > >
>> > > In Trafodion, we currently support TEXT and SEQUENCE format Hive
>> > > tables only. Trafodion reads the Hive data directly as a stream of
>> > > bytes and imposes the data type conversion on that stream of bytes
>> > > to match the Hive table definition, the columns being separated by
>> > > the column delimiters. This processing in Trafodion is done in C++
>> > > and is not wide-character aware. As long as the needed character
>> > > set doesn't have null bytes embedded, like the UTF8 or ISO88591
>> > > encodings, this processing works fine.
>> > >
>> > > Currently the CQD hive_default_charset takes only UTF8 and ISO88591.
>> > > However, you can extend this CQD to take other character sets, as
>> > > long as those character sets don't have null bytes embedded in
>> > > their encoding. For other encodings, this processing would have to
>> > > be made wide-character aware.
>> > >
>> > > Selva
>> > >
>> > > -----Original Message-----
>> > > From: Qifan Chen [mailto:qifan.chen@esgyn.com]
>> > > Sent: Tuesday, January 5, 2016 7:56 AM
>> > > To: dev <de...@trafodion.incubator.apache.org>
>> > > Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
>> > >
>> > > Hi Ming,
>> > >
>> > > Thank you for explaining one of the bottlenecks in PoCs in China!
>> > >
>> > > From a data integrity point of view, it is important that the data
>> > > to be inserted into Trafodion tables is verified to be in the
>> > > stated charset, and Trafodion's TRANSLATE function can guarantee
>> > > that. So I wonder if we can consider the following possible cases.
>> > >
>> > >    1. All char columns in the hive table are in GBK. Set CQD
>> > >    hive_default_charset to GBK and rely on implicit casting to the
>> > >    target column of the Trafodion table. Note that since CQD
>> > >    ALLOW_IMPLICIT_CHAR_CASTING is on by default, implicit casting
>> > >    calls TRANSLATE under the covers.
>> > >    2. Only a small number of char columns in the hive table are in
>> > >    GBK. Apply the TRANSLATE function to these GBK columns in the
>> > >    UPSERT statement, as outlined in Hans' email. The default charset
>> > >    of the source is still UTF8. If the TRANSLATE function is not
>> > >    specified for a particular GBK column, Trafodion will give an
>> > >    error. Fix the UPSERT stmt and re-try.
>> > >    3. Only a small number of char columns in the hive table are
>> > >    *NOT* in GBK. Set CQD hive_default_charset to GBK. Apply the
>> > >    TRANSLATE function to these non-GBK columns in the UPSERT
>> > >    statement, similar to 2) above.
>> > >    4. None of the columns are in GBK. No special settings are
>> > >    necessary.
>> > >
>> > >
>> > > With the above proposal, the data integrity is maintained and the 
>> > > amount of work to prepare the UPSERT for charset conversion is 
>> > > minimized. The needed work is to implement GBK2UTF8.
>> > >
>> > > Thanks --Qifan
>> > >
>> > >
>> > >
>> > > On Mon, Jan 4, 2016 at 10:39 PM, Liu, Ming (Ming) 
>> > > <mi...@esgyn.cn>
>> > > wrote:
>> > >
>> > > > Hi, Hans,
>> > > >
>> > > > You are right, each column has a charset type, and TRANSLATE
>> > > > needs to match the source/target charsets before performing the
>> > > > conversion. I did run into this during implementation, after I
>> > > > made my proposal.
>> > > >
>> > > > My proposal is mainly for bulkloader usage, where the source is
>> > > > a hive table. I know very little about Hive, but it seems to me
>> > > > a column in a Hive table does not have the concept of a charset.
>> > > > As per my understanding (I am still learning Hive), Hive treats
>> > > > all string columns as UTF8, since Hive uses Java. Hive does not
>> > > > check the value of a string column during normal operation; it
>> > > > treats it as a binary string without any attempt to parse it.
>> > > > So when Trafodion does a bulk load, the first step is to copy
>> > > > the raw data into HDFS and create a hive external table. In some
>> > > > cases the source data is encoded in GBK, so we must convert the
>> > > > source using 'iconv' before copying that file into HDFS. This
>> > > > step is very time consuming if the source file is big; iconv is
>> > > > not very fast, and even if it could be made fast by writing a
>> > > > specific MapReduce job, it is still an extra step for Trafodion
>> > > > bulk loading. On the other hand, it is fine to leave the source
>> > > > data encoded in GBK and create the hive external table. Hive
>> > > > will not check the string values' charset, and you can select
>> > > > the GBK data via HQL; Hive will not do any conversion. So it is
>> > > > possible to skip the 'iconv' step (which is very slow, or may
>> > > > never succeed for a big file, say 100G) and use the parallel
>> > > > bulk loader of Trafodion to load the data in GBK, doing a
>> > > > TRANSLATE for each row during the load to convert it from GBK
>> > > > into UTF8. We assume this will be faster than two separate
>> > > > steps, and it is easier for the end user. This was the original
>> > > > goal of the proposal.
>> > > >
>> > > > But just as you pointed out, the source column's charset must
>> > > > match. So the GBKTOUTF8 translation requires the source column
>> > > > to be in GBK, which Trafodion doesn't support. And, for reasons
>> > > > I don't know, the column charset of any hive string column is
>> > > > UTF8 from Trafodion's point of view. So the TRANSLATE will fail.
>> > > > My current solution is to bypass the source charset type
>> > > > matching, the same as a CQD
>> > > > ALLOW_TRANSLATE_SOURCE_CHARSET_MISMATCH 'on'.
>> > > >
>> > > > In the long run, I can add GBK as one of the supported Trafodion
>> > > > column charsets. Then we can use this feature in more cases, and
>> > > > we can support columns encoded as GBK, which is also a nice
>> > > > feature. But in the short term, I think the most desired
>> > > > requirement is to allow loading GBK data into Trafodion without
>> > > > an extra iconv step.
>> > > >
>> > > > I think the CQD is a very good suggestion. Right now I ignore
>> > > > the charset type matching unconditionally; I can add this CQD
>> > > > and only bypass type checking when it is on. We can turn the CQD
>> > > > on during bulk loading and turn it off in other cases.
>> > > >
>> > > > Do you think this is better?
>> > > >
>> > > > Thanks,
>> > > > Ming
>> > > >
>> > > > -----Original Message-----
>> > > > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
>> > > > Sent: January 5, 2016 10:40
>> > > > To: dev <de...@trafodion.incubator.apache.org>
>> > > > Subject: Re: enhance TRANSLATE to support Chinese charset?
>> > > >
>> > > > Hi,
>> > > >
>> > > > Kevin's question makes me think of another question: the
>> > > > TRANSLATE function takes a value a with character set ca and
>> > > > translates it into a new value b with character set cb. Since
>> > > > we don't support GBK as a character set in data types, I wonder
>> > > > how we could call this function. People may in some cases stuff
>> > > > GBK data into an ISO88591 column. Are you thinking of that
>> > > > case, and would we then allow something like this:
>> > > >
>> > > > create table tiso(a char(32) character set iso88591);
>> > > > insert into tiso ... ; -- insert GBK data into the table without translation
>> > > > select translate(a using GBKTOUTF8N) from tiso;
>> > > >
>> > > >
>> > > > Or, maybe something like this, to load a GB2312 HDFS file into
>> > > > a Trafodion table:
>> > > >
>> > > > cqd hive_default_charset 'ISO88591'; -- data is really GB2312, but that is not supported
>> > > > cqd allow_translate_source_charset_mismatch 'on';
>> > > > -- some new CQD like the above could suppress errors on mismatched source charset for translate
>> > > > upsert using load into traf_utf8_table
>> > > >   select translate(a using GBKTOUTF8N) from hive.hive.src;
>> > > >
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Hans
>> > > >
>> > > > On Mon, Jan 4, 2016 at 5:01 PM, Liu, Ming (Ming) 
>> > > > <ming.liu@esgyn.cn
>> >
>> > > > wrote:
>> > > >
>> > > > > Hi, Kevin,
>> > > > >
>> > > > > I didn't notice GB18030 before, but after some initial
>> > > > > searching, it seems to be a must-have feature, so Trafodion
>> > > > > should support it. I will queue it after the GBK support; we
>> > > > > have seen GBK at real customer sites, but not GB18030 yet.
>> > > > > Still, we should expect wide demand for GB18030 very
>> > > > soon.
>> > > > >
>> > > > > Thanks,
>> > > > > Ming
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Kevin DeYager [mailto:kevin.deyager@esgyn.com]
>> > > > > Sent: January 5, 2016 8:11
>> > > > > To: dev@trafodion.incubator.apache.org
>> > > > > Subject: RE: enhance TRANSLATE to support Chinese charset?
>> > > > >
>> > > > > Hi Ming,
>> > > > >
>> > > > > I am no expert in this area, but is GB18030 translation also 
>> > > > > needed / desirable?
>> > > > >
>> > > > > Regards,
>> > > > > - Kevin
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
>> > > > > Sent: Monday, December 21, 2015 4:51 PM
>> > > > > To: dev@trafodion.incubator.apache.org
>> > > > > Subject: enhance TRANSLATE to support Chinese charset?
>> > > > >
>> > > > > Hello,
>> > > > >
>> > > > > Trafodion currently has a TRANSLATE function, which can do
>> > > > > charset conversion among ISO88591, SJIS, UCS2 and UTF8.
>> > > > > I would like to add GBK conversion to this function; it can
>> > > > > sometimes help with data loading. As we have seen, source
>> > > > > data is very typically encoded in GB2312, especially in
>> > > > > China, so we have to do an 'iconv' from GBK to UTF8 before
>> > > > > loading, and if the data files are huge, it takes some time.
>> > > > > If TRANSLATE supported GBKTOUTF8, the conversion could be
>> > > > > done in one step during the 'LOAD' SQL command. I think there
>> > > > > are some other use cases as well.
>> > > > >
>> > > > > Do you feel this is worthwhile? If so, I would like to file
>> > > > > a JIRA and can work on it.
>> > > > >
>> > > > > At first glance, I would like to propose several translate flavors:
>> > > > > GBKTOUTF8N: try to convert from GB2312 to UTF8; if there is an
>> > > > > error during the conversion, return NULL, raise no SQL error,
>> > > > > and silently continue.
>> > > > > GBKTOUTF8O: try to convert from GB2312 to UTF8; if there is an
>> > > > > error during the conversion, return the original string
>> > > > > without any conversion, raise no SQL error, and silently
>> > continue.
>> > > > > GBKTOUTF8: typical behavior; on a conversion error, raise a
>> > > > > SQL error.
>> > > > >
>> > > > > Thanks,
>> > > > > Ming
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Regards, --Qifan
>> > >
>> >
>>
>
>
>
> --
> Regards, --Qifan
>
>


--
Regards, --Qifan

Re: Re: Re: Re: enhance TRANSLATE to support Chinese charset?

Posted by Hans Zeller <ha...@esgyn.com>.
Hi Ming,

In my mind there is no difference between the two statements you show,
"load into traftbl select * from hive.hive.hivetbl" and "select * from
hive.hive.hivetbl where substr(TRANSLATE...". If we don't need a TRANSLATE
in the first case, why should we need it in the second case? In both cases
we should do an implicit translation to UTF8, since that is what the column
type of the hive table is.
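
For instance, with the implicit translation in place, both of these should just work with no explicit TRANSLATE (a sketch reusing the table and column names from your example):

  load into traftbl select * from hive.hive.hivetbl;
  select * from hive.hive.hivetbl where substr(address, 1, 2) = utf8'北京';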

Interesting that you say we attempt a UTF8TOISO88591 conversion when you
use a GBK file and an ISO88591 Hive table. That's not what I had in mind;
I wonder who added that code. If such code already exists, maybe that's
the place where we should handle the new HIVE_FILE_CHARSET CQD.
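
To sketch how that might look from sqlci (HIVE_FILE_CHARSET is only a proposal at this point, so this is hypothetical syntax):

  cqd HIVE_FILE_CHARSET 'GBK';      -- encoding of the bytes in the HDFS file
  cqd HIVE_DEFAULT_CHARSET 'UTF8';  -- charset the engine exposes for Hive string columns
  load into traftbl select * from hive.hive.hivetbl;  -- the scan would convert GBK to UTF8 on read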

I also want to do an experiment and load some data with NUL bytes into an
HDFS file and see whether Hive can handle that.

A similar experiment could be done with the DOS CRLF line ending. If Hive
can handle them, then maybe we should too, but I kind of doubt it. It may
be best to FTP the files in text mode to get rid of the CRs.

Thanks,

Hans

On Tue, Jan 5, 2016 at 7:18 PM, Liu, Ming (Ming) <mi...@esgyn.cn> wrote:

> Hi, again,
>
> I want to demonstrate with a made-up example here. Take a 200 GB csv file
> with 100 columns, 10 of which are strings encoded as GBK. If one wants to
> load it into Trafodion, there are two options: 1) create the Trafodion
> table with those 10 columns' charset as ISO88591, so as to treat the
> original data as binary; 2) create the Trafodion table with those 10
> columns' charset as UTF8, in which case the original csv file must first
> be converted into UTF8. This is now done via the Linux tool iconv.
> However, using iconv to convert a 200 GB file is very slow, and if the
> file is bigger, I have heard rumors that iconv may simply crash. It is
> like using wc to count words in huge data: a better solution is to run a
> MapReduce wordcount program. The Trafodion bulkloader does things in
> parallel.
> Option 1 is not good, since the 10 string columns cannot be used in any
> SQL function; the application must treat them as binary data and decode
> them itself. And I think I saw loading errors when the source is GBK,
> since Trafodion will try to do an implicit UTF8TOISO88591 conversion. So
> this option is in fact not possible. The only solution is to ask users to
> clean up their source data: it must be encoded in UTF8, and if not, they
> must convert it before loading into Trafodion. As I mentioned before, if
> the data file is big, this is frustrating. It would be nice if, with
> relatively small effort, Trafodion provided some kind of help for this
> scenario. A small but kind feature. Of course we don't want to spend huge
> effort to support GBK totally; that may not be worth it, and we had
> better ask users to convert first.
>
> So if we can use HIVE_FILE_CHARSET, here are the steps:
> 1. put the csv file into HDFS
> 2. create hive external table: hivetbl
> 3. In sqlci,
>   >CQD HIVE_FILE_CHARSET 'GBK';
>   >load into trafTbl select * from hive.hive.hivetbl;
> Now traftbl has all UTF8 data, and any SQL with any expression like
> trim()/length()/upper()/substr() can run on it, since Trafodion supports
> UTF8. This is not a big deal, but sometimes users will be very happy to
> have this.
>
> Furthermore, the new TRANSLATE function is still useful for Trafodion in
> order to work with a native Hive table that contains GBK data, for
> example, to get all rows whose address starts with Beijing in Chinese:
>  > select * from hive.hive.hivetbl where substr(address, 1, 2) = '北京';
>    -- this will cause a SQL error
>  > CQD HIVE_FILE_CHARSET 'GBK';  -- the user knows the data is encoded in GBK
>  > select * from hive.hive.hivetbl where substr(TRANSLATE(address using
> GBKTOUTF8), 1, 2) = utf8'北京';
> Without this new function, again, users have to convert the raw data
> before loading it into Hive.
>
> A similar small but sometimes annoying issue is the DOS file format:
> users have to remove the \r from each row before loading, simply because
> we don't handle \r automatically...
>
> And now I am curious how Hive handles GBK :-) Does it ask users to do
> charset conversion before loading?
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
> Sent: January 6, 2016 10:32
> To: dev@trafodion.incubator.apache.org
> Subject: Re: Re: Re: enhance TRANSLATE to support Chinese charset?
>
> Hi, Hans,
>
> You have some very thoughtful ideas. And I really misunderstood GBK
> before; I was thinking it is fixed at 2 bytes, since all I checked were
> Chinese characters, not English letters... So it is still a big effort if
> we need to handle GBK in Trafodion totally. However, we can still do
> something to just support simpler bulk loading of GBK source data.
>
> For your first comment, I actually don't think having 'invoke
> hive.hive.myhivetable' show the column as GBK is a problem. In fact, it
> tells the truth ^_^. I think Trafodion will run into issues if that GBK
> column appears in some SQL functions, for example length or substr. That
> will be a problem, and the only solution in my mind, as you said, is
> total support of GBK in Trafodion. But that is a big project.
>
> So another CQD seems a practical solution now, and I would like to try
> Hans's proposal of HIVE_FILE_CHARSET. During implicit charset checking or
> TRANSLATE handling, we check HIVE_FILE_CHARSET vs. HIVE_DEFAULT_CHARSET;
> if they don't match, we convert, else we do nothing. I feel this is
> doable and has no side effects.
>
> For the NUL byte, I think Selva means HIVE_SCAN cannot handle wide
> characters right now. The Hive scan may be using string functions like
> strcpy() and strlen(), so it will stop at a NUL byte. If the source byte
> is indeed a NUL character, that is OK, no problem; the problem is how it
> treats a non-NUL character that has '\0' in the middle of its encoding.
> For example, if a multi-byte character is encoded as [ 0x00, 0xFB ], that
> will be an issue for Trafodion, since strcpy will wrongly truncate the
> string at the first byte. So when I say GBK has no NUL in its encoding, I
> mean it will not use 0x00 in any non-NUL character; the NUL character
> itself being '\0' is not a problem. I may be wrong here again, but that
> is my understanding.
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> Sent: January 6, 2016 9:56
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: Re: Re: Re: enhance TRANSLATE to support Chinese charset?
>
> Hi again,
>
> Sorry, I have to make a few more comments.
>
>    - CQD HIVE_DEFAULT_CHARSET 'GBK': This won't work, at least not
>    without further modification. If we allowed that, an INVOKE
>    hive.hive.myhivetable would show string columns as mycol char(n)
>    character set GBK. That's what this CQD means. Maybe we need another
>    CQD, HIVE_FILE_CHARSET or something similar, that indicates the
>    character set of the file to be converted to HIVE_DEFAULT_CHARSET.
>    - If we do the implicit conversion for Hive columns, then we don't
>    really need a new TRANSLATE function, at least not for users, it would
> only
>    be used internally - or would there be any SQL expression where we would
>    have a GBK character column?
>    - You say that GBK is fixed at 2 bytes, but when I look at Wikipedia
>    <https://en.wikipedia.org/wiki/GBK>, it says that US-ASCII characters
>    (0-127) are encoded as single byte characters, and all other characters
> are
>    2 bytes.
>    - Allowing a NUL (null byte, code point 0) in strings: Actually,
>    ISO8859-1, UTF-8 and GBK all allow that. UCS2 allows a double-byte NUL
>    character. If we wouldn't allow it, we could not store binary data in
>    ISO8859-1 columns. I'm not sure why Selva said in an earlier email that
> we
>    can't handle NULs.
>
> Thanks,
>
> Hans
>
> On Tue, Jan 5, 2016 at 5:16 PM, Liu, Ming (Ming) <mi...@esgyn.cn>
> wrote:
>
> > Hi, all,
> > Thanks all for the discussions, there are a lot of very good ideas and
> > questions. I would like to summarize a little bit.
> > 1. GBK vs. GB2312, and encoding.
> > Yes, Dave, we will support GBK first, so it should cover GB2312 as well.
> > We don't need to explicitly support GB2312 unless we meet specific
> issues.
> > I am not aware of different GBK encodings, we will use Glibc standard
> > API
> > iconv() to do the conversion, it should handle all the details I assume.
> > From some initial tests, it can convert some real source data
> > correctly. I need to understand more about GBK here, but yes, we will
> > support GBK, not GB2312.
> >
> > 2. Total support of GBK in Trafodion
> > I agree to support GBK totally is a big project, so as Qifan
> > suggested, for this task, we only allow GBK to appear in the TRANSLATE
> > function. So we can allow user to change the hive character set to GBK
> > via HIVE_DEFAULT_CHARSET, and use GBK2UTF8 in the TRANSLATE function
> > to do the converting. This will not affect other part of the database,
> and not a lie.
> > :-) And as Hans/Selva clarified, Hive cannot specific charset per
> > column, so I feel we don't need to make it anymore complex. Qifan's
> > proposal of allowing Trafodion to do implicit charset translation is
> > very great, so we don't bother to write a very complex sql to add
> TRANSLATE for each 'string'
> > column in Hive. Since we saw tables with 300+ columns in hive before...
> >
> > 3. Total support of GBK in Trafodion, again To many aspects, I feel
> > GBK is very similar with UCS2. Its width is fixed as 2-bytes. And it
> > will not allow '\0' (null) in any character byte. So I hope the effort
> > will be controlled by simulating what Trafodion already do for UCS2.
> > But we should launch a total support of GBK as a separate task, since
> > for this proposal to TRANSLATE function, the goal is to allow GBK data
> > load into UTF-8 columns in Trafodion simpler.
> >
> > Thanks,
> > Ming
> >
> > -----Original Message-----
> > From: Dave Birdsall [mailto:dave.birdsall@esgyn.com]
> > Sent: January 6, 2016 2:30
> > To: dev@trafodion.incubator.apache.org
> > Subject: RE: Re: enhance TRANSLATE to support Chinese charset?
> >
> > Hi,
> >
> > I was reading up on the character sets a little bit (e.g. Wikipedia
> > articles on GB2312 and GBK). It seems GBK is an extension of GB2312,
> > so I'm assuming that we are discussing support for GBK, correct? Also,
> > it appears there are multiple possible encodings for GBK (no surprise,
> > really; Unicode has multiple encodings too). We will want to specify
> > which encoding(s) we wish to support.
> >
> > Dave
> >
> > -----Original Message-----
> > From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> > Sent: Tuesday, January 5, 2016 10:22 AM
> > To: dev <de...@trafodion.incubator.apache.org>
> > Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
> >
> > Sorry, in my previous email, I should also mention that the GBK
> > charset is only valid within the TRANSLATE function.
> >
> > Thanks --Qifan
> >
> > On Tue, Jan 5, 2016 at 12:17 PM, Qifan Chen <qi...@esgyn.com>
> wrote:
> >
> > > Hi Hans,
> > >
> > > Yes, either translate GDB to UTF8 or stored as binary data in
> > > ISO88591 is fine.
> > >
> > > I feel the work to allow HIVE_DEFAULT_CHARSET cqd to take GBK, and
> > > disabling creating trafodion tables with GBK character columns and
> > > GBK literals (both in the parser) is still reasonably sized.
> > >
> > > With this, Trafodion only recognizes GBK as a source charset for
> > > Hive tables, and is able to convert the GBK source to other data
> format.
> > >
> > > Thanks --Qifan
> > >
> > >
> > > On Tue, Jan 5, 2016 at 12:02 PM, Hans Zeller <ha...@esgyn.com>
> > > wrote:
> > >
> > >> Just want to clarify: I hate to lie to the engine about character
> > >> sets just as much. If someone can make a workable proposal to avoid
> > >> it that would be great.
> > >>
> > >> Unlike in the predecessor products, which tried to interpret the
> > >> data, we would treat the GBK strings like binary data stored in
> > >> ISO88591, except in the one new translate function that could
> > >> convert it to UTF-8. So, we would reduce the lie to the bad but
> > >> already accepted one of storing binary data in ISO88591.
> > >>
> > >> Allowing a new GBK character set without adequate support for it
> > >> may be more trouble than treating GBK as binary data. A proposal to
> > >> do that should specify the limitations very carefully.
> > >>
> > >> Thanks,
> > >>
> > >> Hans
> > >>
> > >> On Tue, Jan 5, 2016 at 9:48 AM, Dave Birdsall
> > >> <da...@esgyn.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I'm wondering how large a project it is to add a new character
> > >> > set to Trafodion? I personally hate the idea of lying to the
> > >> > engine (for
> > >> example,
> > >> > pretending that GBK is ISO88591 in most places except when we
> > >> > want to translate to another character set; we tried this in one
> > >> > of our
> > >> predecessor
> > >> > products and it caused endless complications -- it is technical
> > >> > debt
> > >> that
> > >> > will soon demand to be repaid).
> > >> >
> > >> > If we need to reduce the scope of the project, one way might be
> > >> > to limit the new character set to Hive tables only (so we'd have
> > >> > to do the metadata
> > >> work
> > >> > in Trafodion), and translate the data immediately in the scan
> > >> > nodes
> > >> where
> > >> > it
> > >> > enters the engine (so we could limit run-time support to a
> > >> > translate operator; other expression operators would operate only
> > >> > on known
> > >> character
> > >> > sets).
> > >> >
> > >> > Dave
> > >> >
> > >> > -----Original Message-----
> > >> > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> > >> > Sent: Tuesday, January 5, 2016 9:21 AM
> > >> > To: dev <de...@trafodion.incubator.apache.org>
> > >> > Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
> > >> >
> > >> > Hi,
> > >> >
> > >> > Thanks, Ming, for confirming what the use case is.
> > >> >
> > >> > As Qifan and Selva have pointed out, the HIVE_DEFAULT_CHARSET CQD
> > >> > determines the character set of string columns in Hive tables. I
> > >> > don't think,
> > >> however,
> > >> > that it is feasible to allow GBK as a value for
> > >> > HIVE_DEFAULT_CHARSET, unless you are ready to do a lot of work.
> > >> > That's because Hive columns have
> > >> regular
> > >> > Trafodion data types, and supporting GBK in regular data types
> > >> > would be
> > >> a
> > >> > large project.
> > >> >
> > >> > Therefore, I would suggest that we use cqd HIVE_DEFAULT_CHARSET
> > >> 'iso88591'
> > >> > instead. We also don't have a BINARY data type and use ISO88591
> > >> > for
> > >> that as
> > >> > well. To make that happen, you could either use some CQD as we
> > >> > talked about, or a special TRANSLATE flavor, something like
> > >> > GBK_AS_ISO_TO_UTF8 (sorry
> > >> for
> > >> > all the long names, maybe you can shorten them). This may be
> > >> > better than the CQD to suppress character set checks. I agree
> > >> > with Qifan that we should
> > >> try
> > >> > hard to check all character sets. For the same reason, we should
> > >> > not
> > >> allow
> > >> > UTF-8 as HIVE_DEFAULT_CHARSET when we store GBK data in Hive tables.
> > >> >
> > >> > Qifan mentions different character sets in a single Hive table.
> > >> > My impression is that the Hive folks did not anticipate that,
> > >> > since Hive
> > >> used
> > >> > to be just simple text files. So, I hope that won't be a common
> case.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Hans
> > >> >
> > >> > On Tue, Jan 5, 2016 at 8:35 AM, Selva Govindarajan <
> > >> > selva.govindarajan@esgyn.com> wrote:
> > >> >
> > >> > > Hi Ming,
> > >> > >
> > >> > > I believe Hive doesn't have a way to specify the character set
> > >> > > encoding of individual columns. It relies on the JVM default
> > >> > > encoding for the table as a whole. The JVM default encoding
> > >> > > depends on the file.encoding default or the LANG variable.
> > >> > >
> > >> > > In Trafodion, we currently support TEXT and SEQUENCE format
> > >> > > hive tables only. Trafodion also reads the hive data directly
> > >> > > as stream of bytes and imposes the data type conversion on
> > >> > > those stream of bytes to match with hive table definition, the
> > >> > > columns being separated by the column delimiters.
> > >> > > This
> > >> > > processing in Trafodion done in C++ and is not wide character
> aware.
> > >> > > As long as the needed character set doesn't have null byte
> > >> > > embedded like UTF8 or
> > >> > > ISO88591 encoding, this processing works fine.
> > >> > >
> > >> > > Currently the CQD hive_default_charset takes in UTF8 and
> > >> > > ISO88591
> > >> only.
> > >> > > However, you can extend this CQD to take different character
> > >> > > sets as long as these character set doesn't have null byte
> > >> > > embedded in its encoding. For other encodings, this processing
> > >> > > should be made wide character aware.
> > >> > >
> > >> > > Selva
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> > >> > > Sent: Tuesday, January 5, 2016 7:56 AM
> > >> > > To: dev <de...@trafodion.incubator.apache.org>
> > >> > > Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
> > >> > >
> > >> > > Hi Ming,
> > >> > >
> > >> > > Thank you for explaining one of the bottlenecks in PoCs in China!
> > >> > >
> > >> > > From data integrity point of view, it is important that the
> > >> > > data to be inserted into Trafodion tables is verified to be in
> > >> > > the said charset, and Trafodion's TRANSLATE function can
> > >> > > guarantee that. So I wonder if we can consider these possible
> > >> > > cases as
> > follows.
> > >> > >
> > >> > >    1. All char columns in the hive table are in GBK. Set CQD
> > >> > >    hive_default_charset to GBK and replying on implicit casting
> > >> > > to the target
> > >> > >    column of the trafodion table. Note that since CQD
> > >> > >    ALLOW_IMPLICIT_CHAR_CASTING is on by default, implicit
> > >> > > casting
> > >> calls
> > >> > >    TRANSLATE under the table.
> > >> > >    2. Only small number of char columns in the hive table are
> > >> > > in
> > GBK.
> > >> > > Apply
> > >> > >    the TRANSLATE function on these GBK columns in the UPSERT
> > >> statement,
> > >> > as
> > >> > >    outlined in the Hans' email. The default charset of the
> > >> > > source is
> > >> > still
> > >> > >    UTF8. If for a particular GBK column that the TRANSLATE
> > >> > > function is
> > >> > not
> > >> > >    specified, Trafodion will give an error. Fix the UPSERT stmt
> > >> > > and re-try.
> > >> > >    3. Only small number of char columns in the hive table are
> > >> > > *NOT* in GBK.
> > >> > >    Set CQD hive_default_charset to GBK. Apply the TRANSLATE
> > >> > > function on these
> > >> > >    non-GBK columns in the UPSERT statement, similar to 2) above.
> > >> > >    4. Non of the columns are in GBK. No special settings are
> > >> necessary.
> > >> > >
> > >> > >
> > >> > > With the above proposal, the data integrity is maintained and
> > >> > > the amount of work to prepare the UPSERT for charset conversion
> > >> > > is minimized. The needed work is to implement GBK2UTF8.
> > >> > >
> > >> > > Thanks --Qifan
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Jan 4, 2016 at 10:39 PM, Liu, Ming (Ming)
> > >> > > <mi...@esgyn.cn>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi, Hans,
> > >> > > >
> > >> > > > You are right, each column has a charset type and TRANSLATE
> > >> > > > need to match the source/target's charset before performing
> > >> > > > the conversion.
> > >> > > > I do meet this during the implementation after my proposal.
> > >> > > >
> > >> > > > My proposal is mainly for bulkloader usage, where the source
> > >> > > > is from hive table. I know very little about Hive, but it
> > >> > > > seems to me, a column in Hive table does not have a concept of
> charset.
> > >> > > > As per my understanding ( I am learning hive), Hive treat all
> > >> > > > string column as UTF8, since Hive is using java.
> > >> > > > Hive does not check the value of a string column during
> > >> > > > normal operation, it treats the string as a binary string
> > >> > > > without any attempt to parse it.
> > >> > > > So when Trafodion is doing bulk load, the first step is we
> > >> > > > copy the raw data into HDFS, and create a hive external
> > >> > > > table. In some cases, the source data is encoded in GBK, so
> > >> > > > we must convert the source using 'iconv' before copy that
> > >> > > > file into HDFS. This step is very time consuming if the
> > >> > > > source file is big, iconv is not very fast, even it is fast
> > >> > > > by writing a specific MapReduce job for this, it is still an
> > >> > > > extra step for Trafodion bulk loading.
> > >> > > > On the other hand, it is fine to leave the source data as
> > >> > > > encoded in GBK, and create the hive external file. Hive will
> > >> > > > not checking the string values's charset. And you can select
> > >> > > > those GBK data via HQL, Hive will not do any conversion. So
> > >> > > > it is possible that we can skip the 'iconv' step (which is
> > >> > > > very slow, or will never success for a big file, say 100G ) ,
> > >> > > > and using the parallel bulk loader of Trafodion to load the
> > >> > > > data in GBK, but during the loading, for each row, do a
> > >> > > > TRANSLATE, so convert it from GBK into UTF8. We assume this
> > >> > > > will be faster than two separate step, and it is easier for
> > >> > > > the end user. This is the goal of the proposal at first.
> > >> > > >
> > >> > > > But just as you pointed out, the source column's charset must
> > >> > > > be
> > >> > > matching.
> > >> > > > So the GBKTOUTF8 transaction require the source column to be
> > >> > > > in GBK, which Trafodion doesn't support. And for unknown
> > >> > > > reason, the column charset of any hive string column is UTF8
> > >> > > > from Trafodion point of view. So the TRANSLATE will fail. My
> > >> > > > current solution is to bypass the source charset type
> > >> > > > matching , same as a CQD
> ALLOW_TRANSLATE_SOURCE_CHARSTET_MISMATCH 'on'.
> > >> > > >
> > >> > > > In the long run, I can add the charset GBK as one of the
> > >> > > > supported Trafodion column charset. Then we can use this
> > >> > > > feature in more cases, and we can support a column to be
> > >> > > > encoded as GBK which is also a nice feature.
> > >> > > > But in short term, I think the most desired requirement is to
> > >> > > > allow loading GBK data into Trafodion without an extra iconv
> step.
> > >> > > >
> > >> > > > I think the CQD is a very good suggestion, I now ignore the
> > >> > > > charset type matching, I can add this CQD and only bypass
> > >> > > > type checking when this CQD is on. And we can turn this CQD
> > >> > > > on during bulk loading and turn it off in other cases.
> > >> > > >
> > >> > > > Do you think this is better?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Ming
> > >> > > >
> > >> > > > -----Original Message-----
> > >> > > > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> > >> > > > Sent: January 5, 2016 10:40
> > >> > > > To: dev <de...@trafodion.incubator.apache.org>
> > >> > > > Subject: Re: enhance TRANSLATE to support Chinese charset?
> > >> > > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > Kevin's question make me think of another question: The
> > >> > > > TRANSLATE function takes a value a with character set ca and
> > >> > > > translates it into a new value b with character set cb. Since
> > >> > > > we don't support GBK as character sets in data types, I
> > >> > > > wonder how we could call this function. People may in some
> > >> > > > cases stuff GBK data into an ISO88591 column, are you
> > >> > > > thinking of that case, and would we then allow something like
> this:
> > >> > > >
> > >> > > > create table tiso(a char(32) character set iso88591); insert
> > >> > > > into tiso
> > >> > > ...
> > >> > > > ; -- insert GBK data into the table without translation
> > >> > > > select translate(a using GBKTOUTF8N) from tiso;
> > >> > > >
> > >> > > >
> > >> > > > Or, maybe something like this, to load a GB2312 HDFS file
> > >> > > > into a Trafodion
> > >> > > > table:
> > >> > > >
> > >> > > > cqd hive_default_charset 'ISO88591'; -- data is really
> > >> > > > GB2312, but that is not supported cqd
> > >> > > > allow_translate_source_charset_mismatch
> > >> > > > 'on';
> > >> > > > -- some new CQD like the above could suppress errors on
> > >> > > > mismatched source charset for translate upsert using load
> > >> > > > into traf_utf8_table select translate(a using GBKTOUTF8N)
> > >> > > > from hive.hive.src;
> > >> > > >
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Hans
> > >> > > >
> > >> > > > On Mon, Jan 4, 2016 at 5:01 PM, Liu, Ming (Ming)
> > >> > > > <ming.liu@esgyn.cn
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi, Kevin,
> > >> > > > >
> > >> > > > > I didn't notice GB18030 before, but after some initial
> > >> > > > > search, it seems a must to have feature, so Trafodion
> > >> > > > > should support it. I will mark it after the GBK support, we
> > >> > > > > saw GBK in real customer site, but not GB18030 yet, but we
> > >> > > > > should assume wide requirement of GB18030 very
> > >> > > > soon.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Ming
> > >> > > > >
> > >> > > > > -----Original Message-----
> > >> > > > > From: Kevin DeYager [mailto:kevin.deyager@esgyn.com]
> > >> > > > > Sent: January 5, 2016 8:11
> > >> > > > > To: dev@trafodion.incubator.apache.org
> > >> > > > > Subject: RE: enhance TRANSLATE to support Chinese charset?
> > >> > > > >
> > >> > > > > Hi Ming,
> > >> > > > >
> > >> > > > > I am no expert in this area, but is GB18030 translation
> > >> > > > > also needed / desirable?
> > >> > > > >
> > >> > > > > Regards,
> > >> > > > > - Kevin
> > >> > > > >
> > >> > > > > -----Original Message-----
> > >> > > > > From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
> > >> > > > > Sent: Monday, December 21, 2015 4:51 PM
> > >> > > > > To: dev@trafodion.incubator.apache.org
> > >> > > > > Subject: enhance TRANSLATE to support Chinese charset?
> > >> > > > >
> > >> > > > > Hello,
> > >> > > > >
> > >> > > > > Trafodion currently has a TRANSLATE function, which can do
> > >> > > > > charset conversion among ISO88591, SJIS, UCS2 and UTF8.
> > >> > > > > I would like to add GBK conversion into this function, it
> > >> > > > > can help for data loading sometimes. As we saw previously,
> > >> > > > > source data are very typically encoded in GB2312,
> > >> > > > > especially in China, so we have to do a 'iconv' from GBK to
> > >> > > > > UTF8 before loading, if the data files are huge, it will take
> a some time.
> > >> > > > > If TRANSLATE can support GBKTOUTF8, so that conversion can
> > >> > > > > be done in one step during the 'LOAD' SQL command. I think
> > >> > > > > there are some other use cases as well.
> > >> > > > >
> > >> > > > > Do you feel this is worthy? If so, I would like to file a
> > >> > > > > JIRA and can work on it.
> > >> > > > >
> > >> > > > > At first glance, I would like to propose several translate
> > >> flavors:
> > >> > > > > GBKTOUTF8N : which will try to do conversion from GB2312 to
> > >> > > > > UTF8, in case there is an error during the conversion,
> > >> > > > > return NULL, no SQL Error raised, silently continue.
> > >> > > > > GBKTOUTF8O: try to do conversion from GB2312 to UTF8, in
> > >> > > > > case there is an error during the conversion, return the
> > >> > > > > original string without any conversion, no SQL Error
> > >> > > > > raised, silently
> > >> > continue.
> > >> > > > > BGKTOUTF8: typical behavior, once there is a conversion
> > >> > > > > error, raise a SQL Error.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Ming
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Regards, --Qifan
> > >> > >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Regards, --Qifan
> > >
> > >
> >
> >
> > --
> > Regards, --Qifan
> >
>

Re: Re: Re: enhance TRANSLATE to support Chinese charset?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Hi, again,

I want to demonstrate with a made-up example here. Take a 200 GB csv file with 100 columns, 10 of which are strings encoded as GBK. If one wants to load it into Trafodion, there are two options: 1) create the Trafodion table with those 10 columns' charset as ISO88591, so as to treat the original data as binary; 2) create the Trafodion table with those 10 columns' charset as UTF8, in which case the original csv file must first be converted into UTF8. This is now done via the Linux tool iconv. However, using iconv to convert a 200 GB file is very slow, and if the file is bigger, I have heard rumors that iconv may simply crash. It is like using wc to count words in huge data: a better solution is to run a MapReduce wordcount program. The Trafodion bulkloader does things in parallel.
Option 1 is not good, since the 10 string columns cannot be used in any SQL function; the application must treat them as binary data and decode them itself. And I think I saw loading errors when the source is GBK, since Trafodion will try to do an implicit UTF8TOISO88591 conversion. So this option is in fact not possible. The only solution is to ask users to clean up their source data: it must be encoded in UTF8, and if not, they must convert it before loading into Trafodion. As I mentioned before, if the data file is big, this is frustrating. It would be nice if, with relatively small effort, Trafodion provided some kind of help for this scenario. A small but kind feature. Of course we don't want to spend huge effort to support GBK totally; that may not be worth it, and we had better ask users to convert first.
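
To make the two options concrete with a toy two-column table (the table and column names are invented):

  -- option 1: keep the GBK bytes as opaque binary
  create table traftbl1 (id int, addr char(200) character set iso88591);
  -- option 2: store real UTF8, which today forces the iconv pass over the whole file first
  create table traftbl2 (id int, addr char(200) character set utf8);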

So if we can use HIVE_FILE_CHARSET, here are the steps:
1. put the csv file into HDFS
2. create hive external table: hivetbl
3. In sqlci, 
  >CQD HIVE_FILE_CHARSET 'GBK';
  >load into trafTbl select * from hive.hive.hivetbl;
Now traftbl has all UTF8 data, and any SQL with any expression like trim()/length()/upper()/substr() can run on it, since Trafodion supports UTF8. This is not a big deal, but sometimes users will be very happy to have this.
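
For step 2, the external table DDL would be something along these lines (a sketch; the column list and HDFS path are placeholders):

  create external table hivetbl (id int, addr string)
    row format delimited fields terminated by ','
    stored as textfile
    location '/user/trafodion/staging/gbk_csv';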
 
Furthermore, the new TRANSLATE function is still useful for Trafodion in order to work with a native Hive table that contains GBK data, for example, to get all rows whose address starts with Beijing in Chinese:
 > select * from hive.hive.hivetbl where substr(address, 1, 2) = '北京';   -- this will cause a SQL error
 > CQD HIVE_FILE_CHARSET 'GBK';  -- the user knows the data is encoded in GBK
 > select * from hive.hive.hivetbl where substr(TRANSLATE(address using GBKTOUTF8), 1, 2) = utf8'北京';
Without this new function, again, users have to convert the raw data before loading it into Hive.

A similar small but sometimes annoying issue is the DOS file format: users have to remove the \r from each row before loading, simply because we don't handle \r automatically...

And now I am curious how Hive handles GBK :-) Does it ask users to do charset conversion before loading?

Thanks,
Ming 

-----Original Message-----
From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
Sent: January 6, 2016 10:32
To: dev@trafodion.incubator.apache.org
Subject: Re: Re: Re: enhance TRANSLATE to support Chinese charset?

Hi, Hans,

You have some very thoughtful ideas. And I really misunderstood GBK before; I was thinking it is fixed at 2 bytes, since all I checked were Chinese characters, not English letters... So it is still a big effort if we need to handle GBK in Trafodion totally. However, we can still do something to just support simpler bulk loading of GBK source data.

For your first comment, I actually don't think having 'invoke hive.hive.myhivetable' show the column as GBK is a problem. In fact, it tells the truth ^_^. I think Trafodion will run into issues if that GBK column appears in some SQL functions, for example length or substr. That will be a problem, and the only solution in my mind, as you said, is total support of GBK in Trafodion. But that is a big project.

So another CQD seems a practical solution now, and I would like to try Hans's proposal of HIVE_FILE_CHARSET. During implicit charset checking or TRANSLATE handling, we check HIVE_FILE_CHARSET vs. HIVE_DEFAULT_CHARSET; if they don't match, we convert, else we do nothing. I feel this is doable and has no side effects.

For the NUL byte, I think Selva means HIVE_SCAN cannot handle wide characters right now. The Hive scan may be using string functions like strcpy() and strlen(), so it will stop at a NUL byte. If the source byte is indeed a NUL character, that is OK, no problem; the problem is how it treats a non-NUL character that has '\0' in the middle of its encoding. For example, if a multi-byte character is encoded as [ 0x00, 0xFB ], that will be an issue for Trafodion, since strcpy will wrongly truncate the string at the first byte. So when I say GBK has no NUL in its encoding, I mean it will not use 0x00 in any non-NUL character; the NUL character itself being '\0' is not a problem. I may be wrong here again, but that is my understanding.

Thanks,
Ming

-----Original Message-----
From: Hans Zeller [mailto:hans.zeller@esgyn.com]
Sent: January 6, 2016 9:56
To: dev <de...@trafodion.incubator.apache.org>
Subject: Re: Re: Re: enhance TRANSLATE to support Chinese charset?

Hi again,

Sorry, I have to make a few more comments.

   - CQD HIVE_DEFAULT_CHARSET 'GBK': This won't work, at least not without
   further modification. If we allowed that, an INVOKE
   hive.hive.myhivetable would show string columns as mycol char(n) character
   set GBK. That's what this CQD means. Maybe we need another CQD,
   HIVE_FILE_CHARSET or something similar, that indicates the character set of
   the file to be converted to HIVE_DEFAULT_CHARSET.
   - If we do the implicit conversion for Hive columns, then we don't
   really need a new TRANSLATE function, at least not for users, it would only
   be used internally - or would there be any SQL expression where we would
   have a GBK character column?
   - You say that GBK is fixed at 2 bytes, but when I look at Wikipedia
   <https://en.wikipedia.org/wiki/GBK>, it says that US-ASCII characters
   (0-127) are encoded as single byte characters, and all other characters are
   2 bytes.
   - Allowing a NUL (null byte, code point 0) in strings: Actually,
   ISO8859-1, UTF-8 and GBK all allow that. UCS2 allows a double-byte NUL
   character. If we wouldn't allow it, we could not store binary data in
   ISO8859-1 columns. I'm not sure why Selva said in an earlier email that we
   can't handle NULs.

Thanks,

Hans

On Tue, Jan 5, 2016 at 5:16 PM, Liu, Ming (Ming) <mi...@esgyn.cn> wrote:

> Hi, all,
> Thanks all for the discussions, there are a lot of very good ideas and 
> questions. I would like to summarize a little bit.
> 1. GBK vs. GB2312, and encoding.
> Yes, Dave, we will support GBK first, so it should cover GB2312 as well.
> We don't need to explicitly support GB2312 unless we meet specific issues.
> I am not aware of different GBK encodings, we will use Glibc standard 
> API
> iconv() to do the conversion, it should handle all the details I assume.
> From some initial tests, it can convert some real source data 
> correctly. I need to understand more about GBK here, but yes, we will 
> support GBK, not GB2312.
>
> 2. Total support of GBK in Trafodion
> I agree to support GBK totally is a big project, so as Qifan 
> suggested, for this task, we only allow GBK to appear in the TRANSLATE 
> function. So we can allow user to change the hive character set to GBK 
> via HIVE_DEFAULT_CHARSET, and use GBK2UTF8 in the TRANSLATE function 
> to do the converting. This will not affect other part of the database, and not a lie.
> :-) And as Hans/Selva clarified, Hive cannot specific charset per 
> column, so I feel we don't need to make it anymore complex. Qifan's 
> proposal of allowing Trafodion to do implicit charset translation is 
> very great, so we don't bother to write a very complex sql to add TRANSLATE for each 'string'
> column in Hive. Since we saw tables with 300+ columns in hive before...
>
> 3. Total support of GBK in Trafodion, again To many aspects, I feel 
> GBK is very similar with UCS2. Its width is fixed as 2-bytes. And it 
> will not allow '\0' (null) in any character byte. So I hope the effort 
> will be controlled by simulating what Trafodion already do for UCS2.
> But we should launch a total support of GBK as a separate task, since 
> for this proposal to TRANSLATE function, the goal is to allow GBK data 
> load into UTF-8 columns in Trafodion simpler.
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Dave Birdsall [mailto:dave.birdsall@esgyn.com]
> Sent: January 6, 2016 2:30
> To: dev@trafodion.incubator.apache.org
> Subject: RE: Re: enhance TRANSLATE to support Chinese charset?
>
> Hi,
>
> I was reading up on the character sets a little bit (e.g. Wikipedia 
> articles on GB2312 and GBK). It seems GBK is an extension of GB2312, 
> so I'm assuming that we are discussing support for GBK, correct? Also, 
> it appears there are multiple possible encodings for GBK (no surprise, 
> really; Unicode has multiple encodings too). We will want to specify 
> which encoding(s) we wish to support.
>
> Dave
>
> -----Original Message-----
> From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> Sent: Tuesday, January 5, 2016 10:22 AM
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
>
> Sorry, in my previous email, I should also mention that the GBK 
> charset is only valid within the TRANSLATE function.
>
> Thanks --Qifan
>
> On Tue, Jan 5, 2016 at 12:17 PM, Qifan Chen <qi...@esgyn.com> wrote:
>
> > Hi Hans,
> >
> > Yes, either translate GDB to UTF8 or stored as binary data in
> > ISO88591 is fine.
> >
> > I feel the work to allow HIVE_DEFAULT_CHARSET cqd to take GBK, and 
> > disabling creating trafodion tables with GBK character columns and 
> > GBK literals (both in the parser) is still reasonably sized.
> >
> > With this, Trafodion only recognizes GBK as a source charset for 
> > Hive tables, and is able to convert the GBK source to other data format.
> >
> > Thanks --Qifan
> >
> >
> > On Tue, Jan 5, 2016 at 12:02 PM, Hans Zeller <ha...@esgyn.com>
> > wrote:
> >
> >> Just want to clarify: I hate to lie to the engine about character 
> >> sets just as much. If someone can make a workable proposal to avoid 
> >> it that would be great.
> >>
> >> Unlike in the predecessor products, which tried to interpret the 
> >> data, we would treat the GBK strings like binary data stored in 
> >> ISO88591, except in the one new translate function that could 
> >> convert it to UTF-8. So, we would reduce the lie to the bad but 
> >> already accepted one of storing binary data in ISO88591.
> >>
> >> Allowing a new GBK character set without adequate support for it 
> >> may be more trouble than treating GBK as binary data. A proposal to 
> >> do that should specify the limitations very carefully.
> >>
> >> Thanks,
> >>
> >> Hans
> >>
> >> On Tue, Jan 5, 2016 at 9:48 AM, Dave Birdsall 
> >> <da...@esgyn.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm wondering how large a project it is to add a new character 
> >> > set to Trafodion? I personally hate the idea of lying to the 
> >> > engine (for
> >> example,
> >> > pretending that GBK is ISO88591 in most places except when we 
> >> > want to translate to another character set; we tried this in one 
> >> > of our
> >> predecessor
> >> > products and it caused endless complications -- it is technical 
> >> > debt
> >> that
> >> > will soon demand to be repaid).
> >> >
> >> > If we need to reduce the scope of the project, one way might be 
> >> > to limit the new character set to Hive tables only (so we'd have 
> >> > to do the metadata
> >> work
> >> > in Trafodion), and translate the data immediately in the scan 
> >> > nodes
> >> where
> >> > it
> >> > enters the engine (so we could limit run-time support to a 
> >> > translate operator; other expression operators would operate only 
> >> > on known
> >> character
> >> > sets).
> >> >
> >> > Dave
> >> >
> >> > -----Original Message-----
> >> > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> >> > Sent: Tuesday, January 5, 2016 9:21 AM
> >> > To: dev <de...@trafodion.incubator.apache.org>
> >> > Subject: Re: 答复: enhance TRANSLATE to support Chinese charset?
> >> >
> >> > Hi,
> >> >
> >> > Thanks, Ming, for confirming what the use case is.
> >> >
> >> > As Qifan and Selva have pointed out, the HIVE_DEFAULT_CHARSET CQD 
> >> > determines the character set of string columns in Hive tables. I 
> >> > don't think,
> >> however,
> >> > that it is feasible to allow GBK as a value for 
> >> > HIVE_DEFAULT_CHARSET, unless you are ready to do a lot of work.
> >> > That's because Hive columns have
> >> regular
> >> > Trafodion data types, and supporting GBK in regular data types 
> >> > would be
> >> a
> >> > large project.
> >> >
> >> > Therefore, I would suggest that we use cqd HIVE_DEFAULT_CHARSET
> >> 'iso88591'
> >> > instead. We also don't have a BINARY data type and use ISO88591 
> >> > for
> >> that as
> >> > well. To make that happen, you could either use some CQD as we 
> >> > talked about, or a special TRANSLATE flavor, something like
> >> > GBK_AS_ISO_TO_UTF8 (sorry
> >> for
> >> > all the long names, maybe you can shorten them). This may be 
> >> > better than the CQD to suppress character set checks. I agree 
> >> > with Qifan that we should
> >> try
> >> > hard to check all character sets. For the same reason, we should 
> >> > not
> >> allow
> >> > UTF-8 as HIVE_DEFAULT_CHARSET when we store GBK data in Hive tables.
> >> >
> >> > Qifan mentions different character sets in a single Hive table. 
> >> > My impression is that the Hive folks did not anticipate that, 
> >> > since Hive
> >> used
> >> > to be just simple text files. So, I hope that won't be a common case.
> >> >
> >> > Thanks,
> >> >
> >> > Hans
> >> >
> >> > On Tue, Jan 5, 2016 at 8:35 AM, Selva Govindarajan < 
> >> > selva.govindarajan@esgyn.com> wrote:
> >> >
> >> > > Hi Ming,
> >> > >
> >> > > I believe Hive doesn't have a way to specify the character set 
> >> > > encoding of individual columns. It relies on the JVM default 
> >> > > encoding for the table as a whole. The JVM default encoding 
> >> > > depends on the file.encoding default or the LANG variable.
> >> > >
> >> > > In Trafodion, we currently support TEXT and SEQUENCE format Hive
> >> > > tables only. Trafodion reads the Hive data directly as a stream of
> >> > > bytes and imposes the data type conversion on that stream of bytes
> >> > > to match the Hive table definition, the columns being separated by
> >> > > the column delimiters. This processing in Trafodion is done in C++
> >> > > and is not wide-character aware. As long as the needed character
> >> > > set doesn't have null bytes embedded in its encoding, like the UTF8
> >> > > or ISO88591 encodings, this processing works fine.
> >> > >
> >> > > Currently the CQD hive_default_charset takes in UTF8 and ISO88591
> >> > > only. However, you can extend this CQD to take different character
> >> > > sets as long as those character sets don't have null bytes embedded
> >> > > in their encoding. For other encodings, this processing would have
> >> > > to be made wide-character aware.
> >> > >
> >> > > Selva
> >> > >
> >> > > -----Original Message-----
> >> > > From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> >> > > Sent: Tuesday, January 5, 2016 7:56 AM
> >> > > To: dev <de...@trafodion.incubator.apache.org>
> >> > > Subject: Re: 答复: enhance TRANSLATE to support Chinese charset?
> >> > >
> >> > > Hi Ming,
> >> > >
> >> > > Thank you for explaining one of the bottlenecks in PoCs in China!
> >> > >
> >> > > From a data integrity point of view, it is important that the data
> >> > > to be inserted into Trafodion tables is verified to be in the stated
> >> > > charset, and Trafodion's TRANSLATE function can guarantee that. So I
> >> > > wonder if we can consider the possible cases as follows.
> >> > >
> >> > >    1. All char columns in the Hive table are in GBK. Set CQD
> >> > >    hive_default_charset to GBK and rely on implicit casting to the
> >> > >    target column of the Trafodion table. Note that since CQD
> >> > >    ALLOW_IMPLICIT_CHAR_CASTING is on by default, implicit casting
> >> > >    calls TRANSLATE under the covers.
> >> > >    2. Only a small number of char columns in the Hive table are in
> >> > >    GBK. Apply the TRANSLATE function on these GBK columns in the
> >> > >    UPSERT statement, as outlined in Hans's email. The default charset
> >> > >    of the source is still UTF8. If the TRANSLATE function is not
> >> > >    specified for a particular GBK column, Trafodion will give an
> >> > >    error. Fix the UPSERT stmt and re-try.
> >> > >    3. Only a small number of char columns in the Hive table are
> >> > >    *NOT* in GBK. Set CQD hive_default_charset to GBK. Apply the
> >> > >    TRANSLATE function on these non-GBK columns in the UPSERT
> >> > >    statement, similar to 2) above.
> >> > >    4. None of the columns is in GBK. No special settings are
> >> > >    necessary.
> >> > >
> >> > >
> >> > > With the above proposal, the data integrity is maintained and 
> >> > > the amount of work to prepare the UPSERT for charset conversion 
> >> > > is minimized. The needed work is to implement GBK2UTF8.
> >> > >
> >> > > Thanks --Qifan
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Jan 4, 2016 at 10:39 PM, Liu, Ming (Ming) 
> >> > > <mi...@esgyn.cn>
> >> > > wrote:
> >> > >
> >> > > > Hi, Hans,
> >> > > >
> >> > > > You are right, each column has a charset type, and TRANSLATE
> >> > > > needs to match the source/target charsets before performing the
> >> > > > conversion. I did run into this during the implementation after
> >> > > > my proposal.
> >> > > >
> >> > > > My proposal is mainly for bulkloader usage, where the source is
> >> > > > a Hive table. I know very little about Hive, but it seems to me
> >> > > > a column in a Hive table does not have a concept of charset.
> >> > > > As per my understanding (I am learning Hive), Hive treats all
> >> > > > string columns as UTF8, since Hive is using Java.
> >> > > > Hive does not check the value of a string column during normal
> >> > > > operation; it treats the string as a binary string without any
> >> > > > attempt to parse it.
> >> > > > So when Trafodion is doing a bulk load, the first step is to copy
> >> > > > the raw data into HDFS and create a Hive external table. In some
> >> > > > cases the source data is encoded in GBK, so we must convert the
> >> > > > source using 'iconv' before copying that file into HDFS. This
> >> > > > step is very time consuming if the source file is big; iconv is
> >> > > > not very fast, and even if we make it fast by writing a specific
> >> > > > MapReduce job, it is still an extra step for Trafodion bulk
> >> > > > loading.
> >> > > > On the other hand, it is fine to leave the source data encoded in
> >> > > > GBK and create the Hive external table; Hive will not check the
> >> > > > string values' charset, and you can select that GBK data via HQL
> >> > > > without Hive doing any conversion. So it is possible to skip the
> >> > > > 'iconv' step (which is very slow, or may never succeed for a big
> >> > > > file, say 100G), and use the parallel bulk loader of Trafodion to
> >> > > > load the data in GBK, doing a TRANSLATE on each row during the
> >> > > > load to convert it from GBK into UTF8. We assume this will be
> >> > > > faster than two separate steps, and it is easier for the end
> >> > > > user. This is the original goal of the proposal.
> >> > > >
> >> > > > But just as you pointed out, the source column's charset must be
> >> > > > matching. So the GBKTOUTF8 translation requires the source column
> >> > > > to be in GBK, which Trafodion doesn't support. And, for reasons I
> >> > > > don't know, the column charset of any Hive string column is UTF8
> >> > > > from Trafodion's point of view. So the TRANSLATE will fail. My
> >> > > > current solution is to bypass the source charset type matching,
> >> > > > the same as a CQD ALLOW_TRANSLATE_SOURCE_CHARSET_MISMATCH 'on'.
> >> > > >
> >> > > > In the long run, I can add GBK as one of the supported Trafodion
> >> > > > column charsets. Then we can use this feature in more cases, and
> >> > > > we can support columns encoded as GBK, which is also a nice
> >> > > > feature. But in the short term, I think the most desired
> >> > > > requirement is to allow loading GBK data into Trafodion without
> >> > > > an extra iconv step.
> >> > > >
> >> > > > I think the CQD is a very good suggestion. I currently ignore the
> >> > > > charset type matching; I can add this CQD and only bypass the
> >> > > > type checking when it is on. We can turn this CQD on during bulk
> >> > > > loading and turn it off in other cases.
> >> > > >
> >> > > > Do you think this is better?
> >> > > >
> >> > > > Thanks,
> >> > > > Ming
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> >> > > > Sent: January 5, 2016 10:40
> >> > > > To: dev <de...@trafodion.incubator.apache.org>
> >> > > > Subject: Re: enhance TRANSLATE to support Chinese charset?
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > Kevin's question makes me think of another question: the
> >> > > > TRANSLATE function takes a value a with character set ca and
> >> > > > translates it into a new value b with character set cb. Since we
> >> > > > don't support GBK as a character set in data types, I wonder how
> >> > > > we could call this function. People may in some cases stuff GBK
> >> > > > data into an ISO88591 column; are you thinking of that case, and
> >> > > > would we then allow something like this:
> >> > > >
> >> > > > create table tiso(a char(32) character set iso88591);
> >> > > > insert into tiso ... ; -- insert GBK data into the table without translation
> >> > > > select translate(a using GBKTOUTF8N) from tiso;
> >> > > >
> >> > > >
> >> > > > Or, maybe something like this, to load a GB2312 HDFS file 
> >> > > > into a Trafodion
> >> > > > table:
> >> > > >
> >> > > > cqd hive_default_charset 'ISO88591';
> >> > > >   -- data is really GB2312, but that is not supported
> >> > > > cqd allow_translate_source_charset_mismatch 'on';
> >> > > >   -- some new CQD like the above could suppress errors on
> >> > > >   -- mismatched source charset for translate
> >> > > > upsert using load into traf_utf8_table
> >> > > >   select translate(a using GBKTOUTF8N) from hive.hive.src;
> >> > > >
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Hans
> >> > > >
> >> > > > On Mon, Jan 4, 2016 at 5:01 PM, Liu, Ming (Ming) 
> >> > > > <ming.liu@esgyn.cn
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi, Kevin,
> >> > > > >
> >> > > > > I didn't notice GB18030 before, but after some initial
> >> > > > > searching, it seems to be a must-have feature, so Trafodion
> >> > > > > should support it. I will queue it after the GBK support; we
> >> > > > > have seen GBK at real customer sites, but not GB18030 yet,
> >> > > > > though we should expect wide demand for GB18030 very soon.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Ming
> >> > > > >
> >> > > > > -----Original Message-----
> >> > > > > From: Kevin DeYager [mailto:kevin.deyager@esgyn.com]
> >> > > > > Sent: January 5, 2016 8:11
> >> > > > > To: dev@trafodion.incubator.apache.org
> >> > > > > Subject: RE: enhance TRANSLATE to support Chinese charset?
> >> > > > >
> >> > > > > Hi Ming,
> >> > > > >
> >> > > > > I am no expert in this area, but is GB18030 translation 
> >> > > > > also needed / desirable?
> >> > > > >
> >> > > > > Regards,
> >> > > > > - Kevin
> >> > > > >
> >> > > > > -----Original Message-----
> >> > > > > From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
> >> > > > > Sent: Monday, December 21, 2015 4:51 PM
> >> > > > > To: dev@trafodion.incubator.apache.org
> >> > > > > Subject: enhance TRANSLATE to support Chinese charset?
> >> > > > >
> >> > > > > Hello,
> >> > > > >
> >> > > > > Trafodion currently has a TRANSLATE function, which can do
> >> > > > > charset conversion among ISO88591, SJIS, UCS2 and UTF8.
> >> > > > > I would like to add GBK conversion to this function; it can
> >> > > > > help with data loading sometimes. As we have seen previously,
> >> > > > > source data is very typically encoded in GB2312, especially in
> >> > > > > China, so we have to do an 'iconv' from GBK to UTF8 before
> >> > > > > loading; if the data files are huge, it will take some time.
> >> > > > > If TRANSLATE supported GBKTOUTF8, the conversion could be done
> >> > > > > in one step during the 'LOAD' SQL command. I think there are
> >> > > > > some other use cases as well.
> >> > > > >
> >> > > > > Do you feel this is worthwhile? If so, I would like to file a
> >> > > > > JIRA and can work on it.
> >> > > > >
> >> > > > > At first glance, I would like to propose several translate
> >> > > > > flavors (a small C sketch of these behaviors follows this
> >> > > > > message):
> >> > > > > GBKTOUTF8N: tries the conversion from GB2312 to UTF8; in case
> >> > > > > of an error during the conversion, it returns NULL; no SQL
> >> > > > > error is raised, and processing silently continues.
> >> > > > > GBKTOUTF8O: tries the conversion from GB2312 to UTF8; in case
> >> > > > > of an error during the conversion, it returns the original
> >> > > > > string without any conversion; no SQL error is raised, and
> >> > > > > processing silently continues.
> >> > > > > GBKTOUTF8: the typical behavior; once there is a conversion
> >> > > > > error, raise a SQL error.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Ming
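
To illustrate the three proposed flavors, here is a hedged C sketch of how the error handling could branch around an iconv()-based conversion; the function, the enum, and the NULL/original-string mapping are illustrative assumptions, not Trafodion internals:

    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>

    typedef enum { FLAVOR_N, FLAVOR_O, FLAVOR_ERR } flavor_t;

    /* Sketch of the proposed error behaviors; returns 0 on success and -1
       where the GBKTOUTF8 flavor would raise a SQL error. */
    static int gbk_to_utf8(const char *src, size_t srclen,
                           char *dst, size_t dstlen, flavor_t f) {
        iconv_t cd = iconv_open("UTF-8", "GBK");
        char *in = (char *)src, *out = dst;
        size_t inleft = srclen, outleft = dstlen;
        size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
        iconv_close(cd);
        if (rc != (size_t)-1)
            return 0;                    /* conversion succeeded */
        switch (f) {
        case FLAVOR_N:                   /* GBKTOUTF8N: return a SQL NULL */
            dst[0] = '\0';
            return 0;                    /* caller would map this to NULL */
        case FLAVOR_O:                   /* GBKTOUTF8O: return the original bytes */
            memcpy(dst, src, srclen < dstlen ? srclen : dstlen);
            return 0;
        default:                         /* GBKTOUTF8: raise a SQL error */
            return -1;
        }
    }

    int main(void) {
        char out[16] = { 0 };
        const char bad[] = "\x81\x20";   /* 0x20 is not a valid GBK trail byte */
        printf("GBKTOUTF8 on bad input -> %d\n",
               gbk_to_utf8(bad, 2, out, sizeof(out), FLAVOR_ERR));
        return 0;
    }
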
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Regards, --Qifan
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> > Regards, --Qifan
> >
> >
>
>
> --
> Regards, --Qifan
>

RE: RE: RE: enhance TRANSLATE to support Chinese charset?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Hi, Hans,

You have some very thoughtful ideas. And I really misunderstood GBK before: I was thinking it is fixed at 2 bytes, since all I had checked were Chinese characters, not English letters... So this is still a big effort if we need to handle GBK in Trafodion totally. However, we can still do something to support simpler bulk loading of GBK source data.

For your first comment, in fact I don't think having 'invoke hive.hive.myhivetable' show the column as GBK is a problem; it tells the truth ^_^. I think Trafodion will run into issues if that GBK column appears in some SQL functions, for example getting the length, or substr. That will be a problem, and the only solution in my mind, as you said, is total support of GBK in Trafodion. But that is a big project.

So another CQD seems a practical solution for now; I would like to try Hans's proposal of HIVE_FILE_CHARSET. During the implicit charset checking or TRANSLATE handling, we would check HIVE_FILE_CHARSET vs. HIVE_DEFAULT_CHARSET: if they don't match, do the conversion, else do nothing. I feel this is doable and has no side effects. (A tiny sketch of this check follows.)
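
The check itself could be as small as this standalone sketch (the CQD names come from Hans's proposal; the function is a hypothetical illustration, not actual Trafodion code):

    #include <stdio.h>
    #include <string.h>

    /* Convert only when the file charset differs from the charset the
       engine assumes for Hive string columns. */
    static int needs_conversion(const char *hive_file_charset,
                                const char *hive_default_charset) {
        return strcmp(hive_file_charset, hive_default_charset) != 0;
    }

    int main(void) {
        printf("GBK file, UTF8 default : %s\n",
               needs_conversion("GBK", "UTF8") ? "convert" : "do nothing");
        printf("UTF8 file, UTF8 default: %s\n",
               needs_conversion("UTF8", "UTF8") ? "convert" : "do nothing");
        return 0;
    }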

For the NUL byte, I think Selva means the Hive scan cannot handle wide characters right now. The Hive scan may use string functions like strcpy() and strlen(), so it will stop at a NUL byte. If the source byte is indeed a NUL character, that is OK, no problem; the problem is a non-NUL character that has '\0' in the middle of its encoding. For example, if a multi-byte character were encoded as [ 0x00, 0xFB ], that would be an issue for Trafodion, since strcpy() would wrongly truncate the string at the first byte. So what I mean by GBK not having NUL in its encoding is that it will not use 0x00 in any non-NUL character; a true NUL character encoded as '\0' is not a problem. I may be wrong here again, but that is my understanding. (A short demo follows.)
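
A short standalone C demo of that point, using UTF-16LE as a concrete example of an encoding that does embed 0x00 inside non-NUL characters (the [ 0x00, 0xFB ] pair above is just a made-up illustration):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "AB" in UTF-16LE: each ASCII letter carries a 0x00 high byte,
           so byte-oriented code such as strlen() stops after one byte. */
        const char utf16le[] = { 0x41, 0x00, 0x42, 0x00 };
        printf("UTF-16LE \"AB\": strlen() sees %zu byte(s)\n", strlen(utf16le));

        /* "ni hao" in GBK (C4 E3 BA C3): GBK lead bytes are 0x81-0xFE and
           trail bytes 0x40-0xFE (excluding 0x7F), so no character embeds a
           0x00 byte and strlen()/strcpy() stay safe. */
        const char *gbk = "\xC4\xE3\xBA\xC3";
        printf("GBK two-character string: strlen() sees %zu bytes\n", strlen(gbk));
        return 0;
    }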

Thanks,
Ming

-----Original Message-----
From: Hans Zeller [mailto:hans.zeller@esgyn.com] 
Sent: January 6, 2016 9:56
To: dev <de...@trafodion.incubator.apache.org>
Subject: Re: RE: RE: enhance TRANSLATE to support Chinese charset?

Hi again,

Sorry, I have to make a few more comments.

   - CQD HIVE_DEFAULT_CHARSET 'GBK': This won't work, at least not without
   further modification. If we allowed that, an INVOKE
   hive.hive.myhivetable would show string columns as mycol char(n) character
   set GBK. That's what this CQD means. Maybe we need another CQD,
   HIVE_FILE_CHARSET or something similar, that indicates the character set of
   the file to be converted to HIVE_DEFAULT_CHARSET.
   - If we do the implicit conversion for Hive columns, then we don't
   really need a new TRANSLATE function, at least not for users, it would only
   be used internally - or would there be any SQL expression where we would
   have a GBK character column?
   - You say that GBK is fixed at 2 bytes, but when I look at Wikipedia
   <https://en.wikipedia.org/wiki/GBK>, it says that US-ASCII characters
   (0-127) are encoded as single byte characters, and all other characters are
   2 bytes.
   - Allowing a NUL (null byte, code point 0) in strings: Actually,
   ISO8859-1, UTF-8 and GBK all allow that. UCS2 allows a double-byte NUL
   character. If we wouldn't allow it, we could not store binary data in
   ISO8859-1 columns. I'm not sure why Selva said in an earlier email that we
   can't handle NULs.

Thanks,

Hans


Re: RE: RE: enhance TRANSLATE to support Chinese charset?

Posted by Hans Zeller <ha...@esgyn.com>.
Hi again,

Sorry, I have to make a few more comments.

   - CQD HIVE_DEFAULT_CHARSET 'GBK': This won't work, at least not without
   further modification. If we allowed that, an INVOKE
   hive.hive.myhivetable would show string columns as mycol char(n) character
   set GBK. That's what this CQD means. Maybe we need another CQD,
   HIVE_FILE_CHARSET or something similar, that indicates the character set of
   the file to be converted to HIVE_DEFAULT_CHARSET.
   - If we do the implicit conversion for Hive columns, then we don't
   really need a new TRANSLATE function, at least not for users, it would only
   be used internally - or would there be any SQL expression where we would
   have a GBK character column?
   - You say that GBK is fixed at 2 bytes, but when I look at Wikipedia
   <https://en.wikipedia.org/wiki/GBK>, it says that US-ASCII characters
   (0-127) are encoded as single-byte characters, and all other characters are
   2 bytes (a small byte-count check follows at the end of this message).
   - Allowing a NUL (null byte, code point 0) in strings: Actually,
   ISO8859-1, UTF-8 and GBK all allow that. UCS2 allows a double-byte NUL
   character. If we wouldn't allow it, we could not store binary data in
   ISO8859-1 columns. I'm not sure why Selva said in an earlier email that we
   can't handle NULs.
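
As a footnote to the third point, a small standalone check of the encoding widths, assuming the glibc iconv "GBK" converter is available: converting UTF-8 input to GBK shows 'A' coming out as one byte and a Chinese character as two.

    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>

    /* Convert a UTF-8 string to GBK and return the GBK byte count. */
    static size_t gbk_bytes(const char *utf8) {
        char buf[64];
        char *in = (char *)utf8, *out = buf;
        size_t inleft = strlen(utf8), outleft = sizeof(buf);
        iconv_t cd = iconv_open("GBK", "UTF-8");
        iconv(cd, &in, &inleft, &out, &outleft);
        iconv_close(cd);
        return sizeof(buf) - outleft;
    }

    int main(void) {
        printf("'A'                     -> %zu GBK byte(s)\n", gbk_bytes("A"));
        printf("U+4F60 (UTF-8 E4 BD A0) -> %zu GBK byte(s)\n", gbk_bytes("\xE4\xBD\xA0"));
        return 0;
    }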

Thanks,

Hans
