Posted to dev@trafodion.apache.org by "Liu, Ming (Ming)" <mi...@esgyn.cn> on 2015/12/22 01:51:27 UTC

enhance TRANSLATE to support Chinese charset?

Hello,

Trafodion currently has a TRANSLATE function, which can convert between the ISO88591, SJIS, UCS2 and UTF8 charsets.
I would like to add GBK conversion to this function, since it can help with data loading. As we have seen previously, source data is very often encoded in GB2312, especially in China, so we have to run 'iconv' from GBK to UTF8 before loading. If the data files are huge, this takes some time.
If TRANSLATE supported GBKTOUTF8, the conversion could be done in one step during the 'LOAD' SQL command. I think there are some other use cases as well.

Do you feel this is worthwhile? If so, I would like to file a JIRA and can work on it.

At first glance, I would like to propose several TRANSLATE flavors:
GBKTOUTF8N: tries to convert from GB2312 to UTF8; if an error occurs during the conversion, it returns NULL, raises no SQL error, and silently continues.
GBKTOUTF8O: tries to convert from GB2312 to UTF8; if an error occurs during the conversion, it returns the original string unconverted, raises no SQL error, and silently continues.
GBKTOUTF8: the typical behavior; as soon as a conversion error occurs, it raises a SQL error.
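
The three error behaviors above can be tried out outside the database. The following is only a rough sketch in Python, using Python's built-in 'gbk' codec rather than any Trafodion code. The function names are hypothetical and simply mirror the proposed flavor names, and None stands in for SQL NULL.

```python
from typing import Optional


def gbk_to_utf8(data: bytes) -> bytes:
    """Strict flavor: raise on the first conversion error."""
    return data.decode("gbk").encode("utf-8")


def gbk_to_utf8_n(data: bytes) -> Optional[bytes]:
    """'N' flavor: return None (standing in for SQL NULL) on error."""
    try:
        return gbk_to_utf8(data)
    except UnicodeDecodeError:
        return None


def gbk_to_utf8_o(data: bytes) -> bytes:
    """'O' flavor: return the original bytes unchanged on error."""
    try:
        return gbk_to_utf8(data)
    except UnicodeDecodeError:
        return data


good = "中文".encode("gbk")
bad = good[:-1]  # truncated in the middle of a two-byte character

print(gbk_to_utf8_n(good))  # UTF-8 bytes for the two characters
print(gbk_to_utf8_n(bad))   # None
print(gbk_to_utf8_o(bad) == bad)
```

The silent flavors are thin wrappers around the strict one, so the only difference between the three is what happens after a conversion error is detected.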

Thanks,
Ming

Re: enhance TRANSLATE to support Chinese charset?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Thanks Selva,

This is a very important consideration that I missed. I am not 100% sure, but it seems GB2312 does not contain null bytes. I need to study this further.
If it does contain null bytes, I will need to check the hive scan code carefully to support wide characters.
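
One way to check the null-byte question empirically, outside Trafodion, is sketched below. It assumes Python's built-in 'gbk' codec is a fair stand-in for the encoding of the source files. In GBK, two-byte characters use a lead byte in 0x81-0xFE and a trail byte in 0x40-0xFE, so no 0x00 byte is expected inside a character; the function name is hypothetical.

```python
def gbk_has_embedded_null() -> bool:
    """Return True if any GBK-encodable character other than U+0000
    produces a 0x00 byte when encoded."""
    for cp in range(1, 0x10000):
        try:
            encoded = chr(cp).encode("gbk")
        except UnicodeEncodeError:
            continue  # this code point is not representable in GBK
        if b"\x00" in encoded:
            return True
    return False


# GBK: scan every encodable character in the Basic Multilingual Plane.
print(gbk_has_embedded_null())

# For contrast, UCS2/UTF-16 does embed null bytes, even for plain ASCII.
print(b"\x00" in "A".encode("utf-16-le"))
```

If the first check reports no embedded null bytes, GBK data would differ from UCS2 in this respect, although the hive scan code would still need to be reviewed.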

Ming

RE: enhance TRANSLATE to support Chinese charset?

Posted by Selva Govindarajan <se...@esgyn.com>.
Hi Ming,

Currently, the Trafodion hive scan doesn't support UCS2 or any character set
with embedded null bytes. Is GBK a double-byte character set that uses null
bytes within character data to represent characters? If so, you might
consider making the Trafodion hive scan double-byte character set aware
before you can flag error rows during the hive scan.

Selva

Re: enhance TRANSLATE to support Chinese charset?

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Thanks QiFan and Dave for the comments.

I will file a JIRA. For error handling, I agree with you, so I will keep the current behavior, the same as for the other charsets. If we do see requirements to tolerate errors silently, we can add a CQD for it.

Thanks,
Ming

Re: enhance TRANSLATE to support Chinese charset?

Posted by Qifan Chen <qi...@esgyn.com>.
Yes, if it is a market requirement, just code and ship it :-)

On error handling: it might be a good idea to implement the ANSI standard
flavor as the default, and use a CQD to turn on a slightly different,
localized flavor.

It may also be a good idea to log the rows with conversion errors. We
experienced a similar situation at HPIT, and a log of such rows is very
valuable in diagnosing the root cause.
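
The idea of logging rows that fail conversion could look roughly like the following. This is only an illustrative Python sketch, not Trafodion code: the function name convert_rows, the logger name, and the choice to record the row number together with the raw bytes are all assumptions.

```python
import logging
from typing import Iterable, List, Tuple

logger = logging.getLogger("gbk_load")


def convert_rows(
    rows: Iterable[bytes],
) -> Tuple[List[bytes], List[Tuple[int, bytes]]]:
    """Convert GBK rows to UTF-8, logging and collecting the failures."""
    converted: List[bytes] = []
    rejected: List[Tuple[int, bytes]] = []
    for row_number, raw in enumerate(rows, start=1):
        try:
            converted.append(raw.decode("gbk").encode("utf-8"))
        except UnicodeDecodeError as exc:
            # Keep the row number and raw bytes so the bad input can be
            # located in the source file later.
            logger.warning("row %d: %s (bytes=%r)", row_number, exc, raw)
            rejected.append((row_number, raw))
    return converted, rejected


rows = ["中文".encode("gbk"), b"\xd6", b"abc"]
converted, rejected = convert_rows(rows)
print(len(converted), rejected)
```

Keeping the raw bytes, rather than a partially converted string, makes it possible to tell whether the row was truncated or was in a different encoding altogether.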

Thanks --Qifan

-- 
Regards, --Qifan

RE: enhance TRANSLATE to support Chinese charset?

Posted by Dave Birdsall <da...@esgyn.com>.
Hi Ming,

If there is a need for it (and it sounds like there is), go for it!

I'm not sure about the error semantics. Might be good to compare what you
are proposing with the existing TRANSLATE error semantics (maybe you've
already done that?).

Dave

RE: enhance TRANSLATE to support Chinese charset?

Posted by Kevin DeYager <ke...@esgyn.com>.
Hi Ming,

I am no expert in this area, but is GB18030 translation also needed /
desirable?
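
For context on how the two encodings relate, here is a small Python sketch using the built-in 'gbk' and 'gb18030' codecs (not Trafodion code). It assumes these codecs are representative: common characters encode to the same bytes in both, while a character outside GBK's repertoire only encodes under GB18030, as a four-byte sequence.

```python
# A CJK Extension B character, outside the Basic Multilingual Plane.
ch = "\U00020000"

try:
    ch.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False  # GBK has no code for this character

gb18030_bytes = ch.encode("gb18030")
print(gbk_ok, len(gb18030_bytes))

# Common characters encode identically in both charsets.
text = "中文"
same = text.encode("gbk") == text.encode("gb18030")
print(same)
```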

Regards,
- Kevin
