You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@carbondata.apache.org by xu...@hust.edu.cn on 2018/05/02 02:30:22 UTC

Grammar about supporting string longer than 32000 characters


Hi, community:

I'm implementing supporting string longer than 32000 characters in carbondata and have a question about the grammar of this feature. Here I'd like to explain it and want to receive your feedbacks.

DESCRIPTION：

In previous implementation, carbondata internally uses a short to store the length of a string (char,varchar,string are all treated as string) value. It does save memory and space for the regular use case by using short instead of int, but will cause problem when the length of string exceeds the range of short.

In order to support the above case and save memory/space if possible, we want to distinguish which string columns are SHORT and which are LONG. (SHORT: lengthOfValue<32000, LONG: lengthOfValue>=32000)


SOLUTION：

Solution1. Add a TableProperty in CreateTableStatement
The grammar looks like below：
For SQL case：
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description STRING, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('LONG_STRING_COLUMNS'='description', 'SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  .option('LONG_STRING_COLUMNS', 'description')
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a property called 'LONG_STRING_COLUMNS' in the statement, its value contains all the columns that are LONG. 
Note: This solution is available as PR2252.


Solution2. Add a new datatype
The grammar looks like below:
For SQL case：
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description TEXT, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  // SparkSQL does not have TEXT datatype, how to specify it?
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a new datatype called 'TEXT' and treat 'description' as TEXT in the statement.

I'd prefer to solution1 since it is compatible with hive/sparksql while solution2 has problem to migrate with them.

END



At last, how do you think the solutions provided above?

Please give your comments, moreover you can provide other solutions here to improve it.

Re: Grammar about supporting string longer than 32000 characters

Posted by ravipesala <ra...@gmail.com>.

In case of dataframe we can take the varchar(max) as default.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Grammar about supporting string longer than 32000 characters

Posted by xuchuanyin <xu...@hust.edu.cn>.

Spark dataframe does not have char/varchar datatype, it only has string type.

Suppose we are saving spark dataframe to carbonata, if the length of value is longer than 32000 and its datatype is string, how will carbondata know to store it as longstring? 
In my opinion, a property to indicate which column is longstring is inevitable in the case.

Re: Grammar about supporting string longer than 32000 characters

Posted by xm_zzc <44...@qq.com>.

+1 for solution 2 
I think it's ok to truncate the string value to N if its length is longer
than N, it's acceptable.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Grammar about supporting string longer than 32000 characters

Posted by xuchuanyin <xu...@hust.edu.cn>.

In traditional RDBMS, varchar(N) means the value contains at least N characters, at the DBMS will truncate the value if its length is longer than N.

Will we implement like this too? Truncate the string value to N if its length is longer than N?

Re: Grammar about supporting string longer than 32000 characters

Posted by Raghunandan S <ca...@gmail.com>.

+1 for solution 2
On Wed, 2 May 2018 at 9:09 PM, ravipesala <ra...@gmail.com> wrote:

> Hi,
>
> I agree with option 2 but not new datatype use varchar(size).
> There are more optimizations we can do with varchar(size) datatype like
> 1. if the size is smaller (less than 8 bytes)  then we can write in fixed
> length encoder instead of  LV encode it can save a lot of space and memory.
> 2. If the size is less than 32000 then use current our string datatype.
> 3. If size is more than 32000 then encode using int as a length in LV
> format.
>
> In spark dataframe support we can by default use string as datatype.
>
> Even if we take option 1 also carbon should internally has new datatype
> otherwise code will not be good as you need to check this property many
> places so ideally new datatype can lead to a new set of implementations and
> easier to code and maintain.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Re: Grammar about supporting string longer than 32000 characters

Posted by ravipesala <ra...@gmail.com>.

Hi,

I agree with option 2 but not new datatype use varchar(size).
There are more optimizations we can do with varchar(size) datatype like
1. if the size is smaller (less than 8 bytes)  then we can write in fixed
length encoder instead of  LV encode it can save a lot of space and memory.
2. If the size is less than 32000 then use current our string datatype.
3. If size is more than 32000 then encode using int as a length in LV
format. 

In spark dataframe support we can by default use string as datatype.

Even if we take option 1 also carbon should internally has new datatype
otherwise code will not be good as you need to check this property many
places so ideally new datatype can lead to a new set of implementations and
easier to code and maintain.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/