You are viewing a plain text version of this content. The canonical link for it is here.
Posted to codereview@trafodion.apache.org by traflm <gi...@git.apache.org> on 2016/01/09 12:46:52 UTC

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

GitHub user traflm opened a pull request:

    https://github.com/apache/incubator-trafodion/pull/257

    for jira 1720, add support to convert gbk into utf8

    This pull request is for a fix to JIRA 1720, add support of GBKTOUTF8 translate function. Please help to review it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/traflm/incubator-trafodion TRAFODION-1720

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-trafodion/pull/257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #257
    
----
commit d51d2016d29725e3d863b351b3a548ee1b325c0c
Author: Cloud User <ce...@ming01.novalocal>
Date:   2016-01-09T11:43:21Z

    for jira 1720, add support to convert gbk into utf8

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51169533
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +// JIRA 1720
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    int copyLen = 0;
    +    int convLen = gbkToUtf8( source, sourceLen, target, targetLen);
    +    if (convLen > 0) {
    --- End diff --
    
    Should this be >=? (What happens when we convert an empy string?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49533889
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    +    size_t sl = sourceLen;
    +    int convLen = gbkToUtf8( source, sl, targetbuf, sl*4);
    +    int copyLen = 0;
    +    if (convLen > 0) {
    +      copyLen = (convLen< targetLen) ? convLen: targetLen;
    --- End diff --
    
    The new commit check the copyLen and targetLen. If targetLen is not enough, instead of doing truncate, it will raise a SQL Error. 
    This task is primarily designed to support bulk data loading of GBK data source. If the Trafodion do implicit truncation, it may not be a good idea. For example, a 8 char string will become 7 chars after loading, and may cause very hard to detect issue later.
    And it is also complex to implement UTF8 truncate method, thanks Hans point out the two APIs , I will study them carefully.
    However, given the 'raise error' handling, I think we don't need to care about Truncate in the Middle issue. Just report error and stop data loading.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49535117
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,33 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    --- End diff --
    
    very good point, new/delete is better to be avoided. let me check this. As a change for 'truncate concern' before, I let TRANSLATE raise SQL error if the target buffer is not long enough. So a better approach is to use the targetbuf directly and let iconv() to do all the dirty works. If the targetbuf is long enough, iconv() can success, if targetbuf is not long enough, iconv() will fail, we don't need to treat all details. iconv is GLibc API, tested and tuned for years. If iconv() return an error, Trafodion just raise SQL error based on the error code. This way, we no longer need to new a temp buffer , and do memcpy, should be a very good optimization.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49485546
  
    --- Diff: core/sql/sqlcomp/nadefaults.cpp ---
    @@ -1958,6 +1958,7 @@ SDDkwd__(EXE_DIAGNOSTIC_EVENTS,		"OFF"),
     
       DDkwd__(HIVE_DEFAULT_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
       DD_____(HIVE_DEFAULT_SCHEMA,                  "HIVE"),
    +  DD_____(HIVE_FILE_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
    --- End diff --
    
    If we have to join two Hive tables, one in UTF8 and the other in GBK, I suppose this method will have to be extended? Does Hive metadata store charset info? If not would a comment next to the tablename help? This is just a minor suggestion for something you may have already considered, not a request to add this now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49534494
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,18 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK || translateSource->getCharSet() == CharInfo::UnknownCharSet )
    +         charsetTarget = CharInfo::UTF8;
    +       else
    +       {
    +            if( CmpCommon::getDefaultString(HIVE_FILE_CHARSET) ==  CmpCommon::getDefaultString(HIVE_DEFAULT_CHARSET) )
    +              err4106arg = SQLCHARSETCODE_GB2312;
    +            else
    +             charsetTarget = CharInfo::UTF8;
    --- End diff --
    
    yes, I will add comments. 
    Related to what Hans suggested below, this logic need to change. 
    It is the basic for this whole task. I will change the logic and add comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51225892
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,45 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +
    +/* A method to do character set conversion , using Glibc iconv */
    +static int charsetConvert(const char *srcCharset,const char *targetCharset,char *inputbuf, size_t inputlen, char *outbuf,size_t outlen)
    +{
    +  char **ptrin = &inputbuf;
    +  char **ptrout = &outbuf;
    +
    +  iconv_t cd;
    +  cd = iconv_open(targetCharset,srcCharset);
    +
    +  if (cd==0) 
    +    return -1;
    +
    +  if (iconv(cd,ptrin,(size_t*)&inputlen,ptrout,(size_t *)&outlen) == -1) 
    +  {
    +    //error occurs
    +    iconv_close(cd);
    +    return -1;
    +  }
    +
    +  iconv_close(cd);
    +  return outlen;
    +}
    +
    +/* convert gbk string into UTF8 */
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, bool addNullAtEnd)
    +{
    +   int originalOutlen = outlen;
    +   int finalLength = charsetConvert( "gbk","utf-8", gbkString, gbklen,  result, outlen);
    +   
    +   if (finalLength == -1 ) 
    +     return -1;
    +   
    +   if ( addNullAtEnd )
    +   {
    +     if(originalOutlen >= finalLength )
    +       result[finalLength] = 0;
    --- End diff --
    
    Yes, I agree with you, This should return -1. The target buffer allocate length of buffer according to the target attribute, no special consideration an extra byte for NULL char.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49500872
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,18 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK || translateSource->getCharSet() == CharInfo::UnknownCharSet )
    +         charsetTarget = CharInfo::UTF8;
    +       else
    +       {
    +            if( CmpCommon::getDefaultString(HIVE_FILE_CHARSET) ==  CmpCommon::getDefaultString(HIVE_DEFAULT_CHARSET) )
    +              err4106arg = SQLCHARSETCODE_GB2312;
    +            else
    +             charsetTarget = CharInfo::UTF8;
    +       }
    +       break;
    +
    --- End diff --
    
    Note that we also need to calculate the length of the new type, that is done below by calling CharInfo::getMaxConvertedLenInBytes(). Again, the optimization I mentioned on line 9327 would be useful here, to avoid wasting space. However, I see a problem in Translate::synthType here, in that it always generates an output type with a size specified in characters, not bytes. That's inefficient and we should consider fixing it. For GBK->UTF8, this means that a 1000 byte input buffer will result in a 4000 byte output buffer, even though we can show that 1500 bytes would be enough.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51226147
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +// JIRA 1720
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    int copyLen = 0;
    +    int convLen = gbkToUtf8( source, sourceLen, target, targetLen);
    +    if (convLen > 0) {
    --- End diff --
    
    yes, this is correct, thanks a lot, if the source is empty string, the target will be empty string as well and the convLen will be 0. So it is a valid case. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49534657
  
    --- Diff: core/sql/sqlcomp/nadefaults.cpp ---
    @@ -1958,6 +1958,7 @@ SDDkwd__(EXE_DIAGNOSTIC_EVENTS,		"OFF"),
     
       DDkwd__(HIVE_DEFAULT_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
       DD_____(HIVE_DEFAULT_SCHEMA,                  "HIVE"),
    +  DD_____(HIVE_FILE_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
    --- End diff --
    
    yes, Hans is correct. The current logic cannot work well if in some cases, HIVE_DEFAULT_CHARSET is not UTF8. It is better to change HIVE_FILE_CHARSET as empty and change the checking logic in the synthType.cpp logic accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51160672
  
    --- Diff: core/sql/generator/GenRelScan.cpp ---
    @@ -214,22 +215,32 @@ int HbaseAccess::createAsciiColAndCastExpr(Generator * generator,
           ((CharType*)newGivenType)->setUpshifted(FALSE);
         }
     
    +  if (newGivenType->getTypeQualifier() == NA_CHARACTER_TYPE &&
    +      CmpCommon::getDefaultString(HIVE_FILE_CHARSET) == "GBK")
    --- End diff --
    
    This may need to upshift the default value first, to catch the case where the user sets the CQD to lower case 'gbk'?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49367075
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    +    size_t sl = sourceLen;
    +    int convLen = gbkToUtf8( source, sl, targetbuf, sl*4);
    +    int copyLen = 0;
    +    if (convLen > 0) {
    +      copyLen = (convLen< targetLen) ? convLen: targetLen;
    --- End diff --
    
    Might this truncate part of a UTF-8 character? I think there are already functions in the product that will copy UTF-8 strings safely without risk of truncation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49496654
  
    --- Diff: core/sql/exp/exp_clause_derived.h ---
    @@ -1542,7 +1542,9 @@ enum conv_case_index {
       CONV_UTF8_F_UCS2_V                   =248,
     
       CONV_BLOB_BLOB                       =249,
    -  CONV_BLOB_ASCII_F                    =250
    +  CONV_BLOB_ASCII_F                    =250,
    +
    +  CONV_GBK_F_UTF8_V                    =251
    --- End diff --
    
    That may come when we support the CHAR datatype in Hive? Since right now we only support STRING in Hive, maybe we never have any CHAR columns in a Hive table?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49682367
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,24 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK )
    +         charsetTarget = CharInfo::UTF8;
    +       else
    +       {
    +           /* this is a solution to support GBK before Trafodion can support GBK in total
    +            * see jira 1720 for more details
    +            * the logic here is:
    +            * when HIVE_FILE_CHARSET is not empty, it means the real charset in Hive table is not same as HIVE_DEFAULT_CHARSET
    +            * in this case, allow the converting , ignoring the source charset checking above
    +            */
    +            if( CmpCommon::getDefaultString(HIVE_FILE_CHARSET) ==  "" ) //CmpCommon::getDefaultString(HIVE_DEFAULT_CHARSET) )
    --- End diff --
    
    totally agree. 
    I will add a test case, it will take some time to understand the test suite.
    And I will look at the codeGen, I didn't understand the whole path clearly. Yes, it is the right place to do the converting. My current change allow translate function to be used in unexpected cases, so, I will have another version asap. 
    Thanks for your patient review and guidance, it is very appreciated! There will be more commits in near future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49619715
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,24 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK )
    +         charsetTarget = CharInfo::UTF8;
    +       else
    +       {
    +           /* this is a solution to support GBK before Trafodion can support GBK in total
    +            * see jira 1720 for more details
    +            * the logic here is:
    +            * when HIVE_FILE_CHARSET is not empty, it means the real charset in Hive table is not same as HIVE_DEFAULT_CHARSET
    +            * in this case, allow the converting , ignoring the source charset checking above
    +            */
    +            if( CmpCommon::getDefaultString(HIVE_FILE_CHARSET) ==  "" ) //CmpCommon::getDefaultString(HIVE_DEFAULT_CHARSET) )
    --- End diff --
    
    See my other comment about where to put the automatic conversion. If you do that, this check will not be needed. Otherwise, this logic will now allow a TRANSLATE(GBKTOUTF8) anywhere in SQL statements. Allowing this just gives people an opportunity to shoot themselves in the foot, so it may be better to avoid it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49534811
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,47 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +/* A method to do character set conversion , using Glibc iconv */
    +int code_convert(const char *from_charset,const char *to_charset,char *inbuf, size_t inlen, char *outbuf,size_t outlen)
    +{
    +  iconv_t cd;
    +  int rc;
    +  char **pin = &inbuf;
    +  char **pout = &outbuf;
    +
    +  cd = iconv_open(to_charset,from_charset);
    +  if (cd==0) return -1;
    +  memset(outbuf,0,outlen);
    +  if (iconv(cd,pin,(size_t*)&inlen,pout,(size_t *)&outlen)==-1) 
    +  {
    +    iconv_close(cd);
    +    return -1;
    +  }
    +  iconv_close(cd);
    +  return outlen;
    +}
    +/* from gbk to utf8 */
    +int gbk2utf8(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
    +{
    +  return code_convert("gbk","utf-8",inbuf,inlen,outbuf,outlen);
    +}
    +
    +/* convert gbk string into UTF8 */
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    --- End diff --
    
    yes, I will remove dup functions. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49447977
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    +    size_t sl = sourceLen;
    +    int convLen = gbkToUtf8( source, sl, targetbuf, sl*4);
    +    int copyLen = 0;
    +    if (convLen > 0) {
    +      copyLen = (convLen< targetLen) ? convLen: targetLen;
    --- End diff --
    
    yes, truncation is need to be considered. I overlook this part. Will add it asap. 
    thanks Dave for all your valuable comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49366877
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,41 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +/* A method to do character set conversion , using Glibc iconv */
    +int code_convert(const char *from_charset,const char *to_charset,char *inbuf, size_t inlen, char *outbuf,size_t outlen)
    +{
    +  iconv_t cd;
    +  int rc;
    +  char **pin = &inbuf;
    +  char **pout = &outbuf;
    +
    +  cd = iconv_open(to_charset,from_charset);
    +  if (cd==0) return -1;
    +  memset(outbuf,0,outlen);
    +  if (iconv(cd,pin,(size_t*)&inlen,pout,(size_t *)&outlen)==-1) 
    +  {
    +    iconv_close(cd);
    +    return -1;
    +  }
    +  iconv_close(cd);
    +  return outlen;
    +}
    +/* from gbk to utf8 */
    +int gbk2utf8(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
    +{
    +  return code_convert("gbk","utf-8",inbuf,inlen,outbuf,outlen);
    +}
    +
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, int addNullAtEnd)
    +{
    +
    +   int finalLength = gbk2utf8 ( gbkString, gbklen,  result, outlen);
    +   
    +   if (finalLength == -1 ) return 0;
    +   
    +   if ( addNullAtEnd > 0 )
    +      result[finalLength] = 0;
    --- End diff --
    
    Could we have a buffer overrun here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49485086
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,18 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK || translateSource->getCharSet() == CharInfo::UnknownCharSet )
    +         charsetTarget = CharInfo::UTF8;
    +       else
    +       {
    +            if( CmpCommon::getDefaultString(HIVE_FILE_CHARSET) ==  CmpCommon::getDefaultString(HIVE_DEFAULT_CHARSET) )
    +              err4106arg = SQLCHARSETCODE_GB2312;
    +            else
    +             charsetTarget = CharInfo::UTF8;
    --- End diff --
    
    I suppose this if/else logic is also something basic to this change, but I do not understand it. A comment will be helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49366818
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,41 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +/* A method to do character set conversion , using Glibc iconv */
    +int code_convert(const char *from_charset,const char *to_charset,char *inbuf, size_t inlen, char *outbuf,size_t outlen)
    +{
    +  iconv_t cd;
    +  int rc;
    +  char **pin = &inbuf;
    +  char **pout = &outbuf;
    +
    +  cd = iconv_open(to_charset,from_charset);
    +  if (cd==0) return -1;
    +  memset(outbuf,0,outlen);
    +  if (iconv(cd,pin,(size_t*)&inlen,pout,(size_t *)&outlen)==-1) 
    +  {
    +    iconv_close(cd);
    +    return -1;
    +  }
    +  iconv_close(cd);
    +  return outlen;
    +}
    +/* from gbk to utf8 */
    +int gbk2utf8(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
    +{
    +  return code_convert("gbk","utf-8",inbuf,inlen,outbuf,outlen);
    +}
    +
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, int addNullAtEnd)
    --- End diff --
    
    Would bool be a better data type for addNullAtEnd?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49483990
  
    --- Diff: core/sql/exp/exp_clause_derived.h ---
    @@ -1542,7 +1542,9 @@ enum conv_case_index {
       CONV_UTF8_F_UCS2_V                   =248,
     
       CONV_BLOB_BLOB                       =249,
    -  CONV_BLOB_ASCII_F                    =250
    +  CONV_BLOB_ASCII_F                    =250,
    +
    +  CONV_GBK_F_UTF8_V                    =251
    --- End diff --
    
    Do we also need a case for CONV_GBK_F_UTF8_F ? I am not quite how these case indexes work though. Maybe my question is if we have the user giving us GBK characters for a CHAR(N) UTF8 column, will it work with this change? Or is change applicable only when the target column in of type VARCHAR(N) UTF8 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49534202
  
    --- Diff: core/sql/exp/exp_clause_derived.h ---
    @@ -1542,7 +1542,9 @@ enum conv_case_index {
       CONV_UTF8_F_UCS2_V                   =248,
     
       CONV_BLOB_BLOB                       =249,
    -  CONV_BLOB_ASCII_F                    =250
    +  CONV_BLOB_ASCII_F                    =250,
    +
    +  CONV_GBK_F_UTF8_V                    =251
    --- End diff --
    
    Let me add this support. I didn't realize what the '_F' means in the beginning. Hive does not have CHAR or VARCHAR, it only support 'string'. So I assume Trafodion treats it as VARCHAR always, but let me try to add support of CONV_GBK_F_UTF8_F as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49484259
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    +    size_t sl = sourceLen;
    +    int convLen = gbkToUtf8( source, sl, targetbuf, sl*4);
    +    int copyLen = 0;
    +    if (convLen > 0) {
    +      copyLen = (convLen< targetLen) ? convLen: targetLen;
    --- End diff --
    
    What if there is a real overflow, i.e. user has given a GBK value which when converted to UTF8 exceeds the max varchar length? Will that be covered by this error or is that checked in a different place? Some how I don't see max length of target column in here...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49534386
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,18 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK || translateSource->getCharSet() == CharInfo::UnknownCharSet )
    +         charsetTarget = CharInfo::UTF8;
    --- End diff --
    
    I need to remove the 'unknowncharset' checking. It is wrong.
    In the beginning, I assumed Trafodion will treat the column charset of hive table as unknown, since it seems to me Hive doesn't have charset info in the column definition. So I add this checking, but it turns out Trafodion will treat Hive column charset as 'HIVE_DEFAULT_CHARSET', so it should never be 'unknown'. I will remove this checking, it is confusing and wrong assumption.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49497709
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    +    size_t sl = sourceLen;
    +    int convLen = gbkToUtf8( source, sl, targetbuf, sl*4);
    +    int copyLen = 0;
    +    if (convLen > 0) {
    +      copyLen = (convLen< targetLen) ? convLen: targetLen;
    --- End diff --
    
    If we have to check for truncation, see functions lightValidateUTF8Str() and findStartOfChar() in file core/sql/common/csconvert.h.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51159328
  
    --- Diff: core/sql/generator/GenRelScan.cpp ---
    @@ -787,6 +798,7 @@ short FileScan::codeGenForHive(Generator * generator)
       const Int32 executorPredTuppIndex = 3;
       const Int32 asciiTuppIndex = 4;
       ULng32 asciiRowLen; 
    +  ULng32 translateRowLen; 
    --- End diff --
    
    Looks like this is not needed? Same for transVids on line 811 and transVal on lines 1518 and 1614?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49485252
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,32 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    +    size_t sl = sourceLen;
    +    int convLen = gbkToUtf8( source, sl, targetbuf, sl*4);
    +    int copyLen = 0;
    +    if (convLen > 0) {
    +      copyLen = (convLen< targetLen) ? convLen: targetLen;
    --- End diff --
    
    My guess is that the targetbuf is always long enough, since it allows 4 bytes for every byte in the source. But you are right, there could be truncation when copyLen is calculated. I'm wondering if type synthesis for TRUNCATE guarantees that the output length is always large enough, though, and it is only upon assignment to a target column that we'd truncate?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-trafodion/pull/257


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49502495
  
    --- Diff: core/sql/sqlcomp/nadefaults.cpp ---
    @@ -1958,6 +1958,7 @@ SDDkwd__(EXE_DIAGNOSTIC_EVENTS,		"OFF"),
     
       DDkwd__(HIVE_DEFAULT_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
       DD_____(HIVE_DEFAULT_SCHEMA,                  "HIVE"),
    +  DD_____(HIVE_FILE_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
    --- End diff --
    
    The default value for HIVE_FILE_CHARSET should not be UTF8, since that would be wrong if we have a file with ISO8859-1 characters and set HIVE_DEFAULT_CHARSET to ISO88591. Maybe better to set it to an empty string, which would mean that HIVE_FILE_CHARSET is the same as HIVE_DEFAULT_CHARSET. Given the way we implement it, maybe we should only allow an empty string and GBK as values for HIVE_FILE_CHARSET.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51225471
  
    --- Diff: core/sql/generator/GenRelScan.cpp ---
    @@ -214,22 +215,32 @@ int HbaseAccess::createAsciiColAndCastExpr(Generator * generator,
           ((CharType*)newGivenType)->setUpshifted(FALSE);
         }
     
    +  if (newGivenType->getTypeQualifier() == NA_CHARACTER_TYPE &&
    +      CmpCommon::getDefaultString(HIVE_FILE_CHARSET) == "GBK")
    --- End diff --
    
    will change this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51158775
  
    --- Diff: core/sql/generator/GenRelScan.cpp ---
    @@ -214,22 +215,32 @@ int HbaseAccess::createAsciiColAndCastExpr(Generator * generator,
           ((CharType*)newGivenType)->setUpshifted(FALSE);
         }
     
    +  if (newGivenType->getTypeQualifier() == NA_CHARACTER_TYPE &&
    +      CmpCommon::getDefaultString(HIVE_FILE_CHARSET) == "GBK")
    --- End diff --
    
    It might be good to add a check here that HIVE_FILE_CHARSET must be an empty string or equal to HIVE_DEFAULT_CHARSET or GBK. In all other cases we would probably not do what the user expected.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49616188
  
    --- Diff: core/sql/exp/exp_clause_derived.h ---
    @@ -1542,7 +1542,9 @@ enum conv_case_index {
       CONV_UTF8_F_UCS2_V                   =248,
     
       CONV_BLOB_BLOB                       =249,
    -  CONV_BLOB_ASCII_F                    =250
    +  CONV_BLOB_ASCII_F                    =250,
    +
    +  CONV_GBK_F_UTF8_V                    =251
    --- End diff --
    
    Please see my comment below on automatically making this conversion. Given that, support for fixed characters may not be needed yet.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by sureshsubbiah <gi...@git.apache.org>.
Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49484840
  
    --- Diff: core/sql/optimizer/SynthType.cpp ---
    @@ -5207,6 +5207,18 @@ const NAType *Translate::synthesizeType()
              err4106arg = SQLCHARSETSTRING_UTF8;
            break;
     
    +     case GBK_TO_UTF8:
    +       if (translateSource->getCharSet() == CharInfo::GBK || translateSource->getCharSet() == CharInfo::UnknownCharSet )
    +         charsetTarget = CharInfo::UTF8;
    --- End diff --
    
    Why do we have || CharInfo::UnknownCharSet here? I do not see it it in neighbouring statements. This is just for my understanding.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49501425
  
    --- Diff: core/sql/sqlcomp/nadefaults.cpp ---
    @@ -1958,6 +1958,7 @@ SDDkwd__(EXE_DIAGNOSTIC_EVENTS,		"OFF"),
     
       DDkwd__(HIVE_DEFAULT_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
       DD_____(HIVE_DEFAULT_SCHEMA,                  "HIVE"),
    +  DD_____(HIVE_FILE_CHARSET,            (char *)SQLCHARSETSTRING_UTF8),
    --- End diff --
    
    Yes, but that's also a problem with the current implementation, with fully supported charsets. We can't join two tables with ISO and UTF8, either. IMHO the solution to that is called iconv.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51169002
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,45 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +
    +/* A method to do character set conversion , using Glibc iconv */
    +static int charsetConvert(const char *srcCharset,const char *targetCharset,char *inputbuf, size_t inputlen, char *outbuf,size_t outlen)
    +{
    +  char **ptrin = &inputbuf;
    +  char **ptrout = &outbuf;
    +
    +  iconv_t cd;
    +  cd = iconv_open(targetCharset,srcCharset);
    +
    +  if (cd==0) 
    +    return -1;
    +
    +  if (iconv(cd,ptrin,(size_t*)&inputlen,ptrout,(size_t *)&outlen) == -1) 
    +  {
    +    //error occurs
    +    iconv_close(cd);
    +    return -1;
    +  }
    +
    +  iconv_close(cd);
    +  return outlen;
    +}
    +
    +/* convert gbk string into UTF8 */
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, bool addNullAtEnd)
    +{
    +   int originalOutlen = outlen;
    +   int finalLength = charsetConvert( "gbk","utf-8", gbkString, gbklen,  result, outlen);
    +   
    +   if (finalLength == -1 ) 
    +     return -1;
    +   
    +   if ( addNullAtEnd )
    +   {
    +     if(originalOutlen >= finalLength )
    --- End diff --
    
    Wouldn't it need to be >, rather than >=? (Or does the caller leave an extra byte for null terminator?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49496156
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,47 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +/* A method to do character set conversion , using Glibc iconv */
    +int code_convert(const char *from_charset,const char *to_charset,char *inbuf, size_t inlen, char *outbuf,size_t outlen)
    +{
    +  iconv_t cd;
    +  int rc;
    +  char **pin = &inbuf;
    +  char **pout = &outbuf;
    +
    +  cd = iconv_open(to_charset,from_charset);
    +  if (cd==0) return -1;
    +  memset(outbuf,0,outlen);
    +  if (iconv(cd,pin,(size_t*)&inlen,pout,(size_t *)&outlen)==-1) 
    +  {
    +    iconv_close(cd);
    +    return -1;
    +  }
    +  iconv_close(cd);
    +  return outlen;
    +}
    +/* from gbk to utf8 */
    +int gbk2utf8(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
    +{
    +  return code_convert("gbk","utf-8",inbuf,inlen,outbuf,outlen);
    +}
    +
    +/* convert gbk string into UTF8 */
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    --- End diff --
    
    Having two methods with almost the same name and almost the same parameter list is very hard to read. I find that very confusing. Could we give them more different names? Or, maybe just have a single method with an optional parameter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by DaveBirdsall <gi...@git.apache.org>.
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r51169130
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,45 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +
    +/* A method to do character set conversion , using Glibc iconv */
    +static int charsetConvert(const char *srcCharset,const char *targetCharset,char *inputbuf, size_t inputlen, char *outbuf,size_t outlen)
    +{
    +  char **ptrin = &inputbuf;
    +  char **ptrout = &outbuf;
    +
    +  iconv_t cd;
    +  cd = iconv_open(targetCharset,srcCharset);
    +
    +  if (cd==0) 
    +    return -1;
    +
    +  if (iconv(cd,ptrin,(size_t*)&inputlen,ptrout,(size_t *)&outlen) == -1) 
    +  {
    +    //error occurs
    +    iconv_close(cd);
    +    return -1;
    +  }
    +
    +  iconv_close(cd);
    +  return outlen;
    +}
    +
    +/* convert gbk string into UTF8 */
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, bool addNullAtEnd)
    +{
    +   int originalOutlen = outlen;
    +   int finalLength = charsetConvert( "gbk","utf-8", gbkString, gbklen,  result, outlen);
    +   
    +   if (finalLength == -1 ) 
    +     return -1;
    +   
    +   if ( addNullAtEnd )
    +   {
    +     if(originalOutlen >= finalLength )
    +       result[finalLength] = 0;
    --- End diff --
    
    If there isn't room for a null terminator, should we return -1? (That is, does the caller always expect a null terminator to be added when "addNullAtEnd" is true?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49447681
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,41 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +/* A method to do character set conversion , using Glibc iconv */
    +int code_convert(const char *from_charset,const char *to_charset,char *inbuf, size_t inlen, char *outbuf,size_t outlen)
    +{
    +  iconv_t cd;
    +  int rc;
    +  char **pin = &inbuf;
    +  char **pout = &outbuf;
    +
    +  cd = iconv_open(to_charset,from_charset);
    +  if (cd==0) return -1;
    +  memset(outbuf,0,outlen);
    +  if (iconv(cd,pin,(size_t*)&inlen,pout,(size_t *)&outlen)==-1) 
    +  {
    +    iconv_close(cd);
    +    return -1;
    +  }
    +  iconv_close(cd);
    +  return outlen;
    +}
    +/* from gbk to utf8 */
    +int gbk2utf8(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
    +{
    +  return code_convert("gbk","utf-8",inbuf,inlen,outbuf,outlen);
    +}
    +
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, int addNullAtEnd)
    --- End diff --
    
    This 'addNullAtEnd is defined as int for all other similar functions, so following the convertion. But you are right, I will try to change it to bool.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by traflm <gi...@git.apache.org>.
Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49447927
  
    --- Diff: core/sql/common/csconvert.cpp ---
    @@ -1275,3 +1279,41 @@ char * findStartOfChar( char *someByteInChar, char *startOfBuffer )
          rtnv-- ;
       return rtnv ;
     }
    +/* A method to do character set conversion , using Glibc iconv */
    +int code_convert(const char *from_charset,const char *to_charset,char *inbuf, size_t inlen, char *outbuf,size_t outlen)
    +{
    +  iconv_t cd;
    +  int rc;
    +  char **pin = &inbuf;
    +  char **pout = &outbuf;
    +
    +  cd = iconv_open(to_charset,from_charset);
    +  if (cd==0) return -1;
    +  memset(outbuf,0,outlen);
    +  if (iconv(cd,pin,(size_t*)&inlen,pout,(size_t *)&outlen)==-1) 
    +  {
    +    iconv_close(cd);
    +    return -1;
    +  }
    +  iconv_close(cd);
    +  return outlen;
    +}
    +/* from gbk to utf8 */
    +int gbk2utf8(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
    +{
    +  return code_convert("gbk","utf-8",inbuf,inlen,outbuf,outlen);
    +}
    +
    +int gbkToUtf8(char* gbkString, size_t gbklen, 
    +              char* result ,size_t outlen, int addNullAtEnd)
    +{
    +
    +   int finalLength = gbk2utf8 ( gbkString, gbklen,  result, outlen);
    +   
    +   if (finalLength == -1 ) return 0;
    +   
    +   if ( addNullAtEnd > 0 )
    +      result[finalLength] = 0;
    --- End diff --
    
    yes, let me double check. A len checker is better here. will change it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-trafodion pull request: for jira 1720, add support to co...

Posted by zellerh <gi...@git.apache.org>.
Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/257#discussion_r49499008
  
    --- Diff: core/sql/exp/exp_conv.cpp ---
    @@ -9321,6 +9321,33 @@ convDoIt(char * source,
       };
       break;
     
    +// gb2312 -> utf8
    +  case CONV_GBK_F_UTF8_V:
    +  {
    +    char * targetbuf = new char[sourceLen*4+1];
    --- End diff --
    
    This should call CharInfo::getMaxConvertedLenInBytes() in file core/sql/common/charinfo.h. This class already knows about GBK. Also, maybe we could add an optimization? An ASCII character in GBK maps into a single-byte character in UTF-8. As far as I can tell, GBK does not go beyond the Unicode Basic Multi-lingual Plane (BMP), meaning that a 2 byte GBK sequence makes at most a 3 byte Unicode character. So, the target length is 3/2 * source length, rounded up to the nearest integer. If targetLen is sufficient for that conversion, maybe we can avoid allocating a buffer and write directly into target? Doing a new and a delete is expensive in this performance-critical context.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---