You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by xuchuanyin <gi...@git.apache.org> on 2018/04/30 15:57:48 UTC

[GitHub] carbondata pull request #2252: WIP: Support string longer than 32000 charact...

GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/2252

    WIP: Support string longer than 32000 characters

    Add a property in creating table 'long_string_columns' to support string columns that will contains more than 32000 characters.
    Inside carbondata, it use an integer instead of short to store the length of bytes content.
    
    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
    
     - [ ] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0428_string_longer_than_32000

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2252
    
----
commit 26cbea7f1493d204a1abb4275e052481abccd185
Author: xuchuanyin <xu...@...>
Date:   2018-04-30T15:53:22Z

    Support string longer than 32000 characters
    
    Add a property in creating table 'long_string_columns' to support string columns that will contains more than 32000 characters.
    Inside carbondata, it use an integer instead of short to store the length of bytes content.

----


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4867/



---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4644/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190103945
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
    @@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
         int[] invertedIndexesReverse = new int[0];
         int[] rlePage = null;
         DataChunk2 dimensionColumnChunk = null;
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    The same as above. V2 does not support longStringColumn, so it is fine that the `isLongStringColumn` will always be false.


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    retest it please


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5089/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5210/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5229/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190104257
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    I know, you mean just leave it false?


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5074/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5025/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @xuchuanyin Thanks for working on it, but we better have new datatype like varchar(size) or bigstring to support longer strings rather than based on property


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5567/



---

[GitHub] carbondata issue #2252: WIP: Support string longer than 32000 characters

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4642/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190104281
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
    @@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
         int[] invertedIndexesReverse = new int[0];
         int[] rlePage = null;
         DataChunk2 dimensionColumnChunk = null;
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    The same as above


---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190150092
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    No we support V3 while writing the data...v1/v2 is supported only for reading to support backward compatibility 


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @xuchuanyin how u are deciding isLongStringColumn is true or false in query??


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5091/



---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4385/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6325/



---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4933/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @xuchuanyin I think some of the changes are not required ...like method added for converting to LV format..If u check direct compressor  it is already present ..you can use the same.
    In DimensionRawColumnChunk no need to pass any Boolean based on encoder we can decide which type of dimension store object needs to be created 
    Changing the existing store chunk implementation is also not required ...add a child class if possible or add complete new implementation for storing int based LV format...
    Please check ColumnPage.compress and ColumnPage.decompress for you reference(LV)
    V1,V2 reader no need to change anything as its old format code and user will not able to load the data in this format.
    Decide based on encoder. while writing add new encoder "TEXT" and while reading use same encoder for creating DimensionDataChunkStore object 
    
     


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4913/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6306/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5176/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190160809
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/block/SegmentProperties.java ---
    @@ -849,7 +852,41 @@ public int getNumberOfDictSortColumns() {
         return this.numberOfSortColumns - this.numberOfNoDictSortColumns;
       }
     
    +  public int getNumberOfLongStringColumns() {
    +    return numberOfLongStringColumns;
    +  }
    +
       public int getLastDimensionColOrdinal() {
         return lastDimensionColOrdinal;
       }
    +
    +  @Override public String toString() {
    --- End diff --
    
    Why this method is required ??...can u add some comment 


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4935/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6338/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5265/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @xuchuanyin 
    1. still use 'long_string_columns' instead of varchar datatype to make it consistent it spark/hive
           Are u facing any problem with varchar??
    2. use an integer (previously short) to store the length of bytes content.
         Only for text data type??



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r194479546
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/store/impl/safe/SafeVariableLengthDimensionDataChunkStore.java ---
    @@ -56,7 +57,8 @@ public SafeVariableLengthDimensionDataChunkStore(boolean isInvertedIndex, int nu
        * @param invertedIndexReverse inverted index reverse to be stored
        * @param data                 data to be stored
        */
    -  @Override public void putArray(final int[] invertedIndex, final int[] invertedIndexReverse,
    +  @Override
    +  public void putArray(final int[] invertedIndex, final int[] invertedIndexReverse,
    --- End diff --
    
    @xuchuanyin In Case of varchar column I think we will not be able to store complete data in single byte array as it may overflow, in case of short it is fine as maximum it will be of  32000(max number of bytes in each value) * 32000(number of records in page) + 64000(length for each value in a page), so it will be 1024064000 value and as maximum u can create an array of integer max value so it is fine, but in varchar case length of each value can be maximum of Integer max it self so we will not be able to store in single byte array. This handling is required in both data loading and query.


---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r189913075
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v2/CompressedDimensionChunkFileBasedReaderV2.java ---
    @@ -121,6 +121,7 @@ public DimensionColumnPage decodeColumnPage(
         int[] invertedIndexesReverse = new int[0];
         int[] rlePage = null;
         DataChunk2 dimensionColumnChunk = null;
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    In V2 case it will be always false...as new encoder type will be only supported for V3 format


---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190222169
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/result/BlockletScannedResult.java ---
    @@ -369,6 +379,9 @@ public void fillDataChunks() {
         long startTime = System.currentTimeMillis();
         for (int i = 0; i < dimensionColumnPages.length; i++) {
           if (dimensionColumnPages[i][pageCounter] == null && dimRawColumnChunks[i] != null) {
    +        // the long string columns is at the end
    --- End diff --
    
    @kumarvishal09
    During query, we can get the dimensions that this query requires.
    Also we can know how many longStringColumns this query requires.
    While writing dimension columns, we write the normal-shortStringColumns first and longStringColumns at last.
    So during query, suppose we use `n` dimensions containing `m` longStringColumns.
    Then, the first `n-m` columns will be normal-shortStringColumns and the last `m` columns will be longStringColumns.
    This line of code can be found below.


---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r189912968
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    In V1 case it will be always false...as new encoder type will be only supported for V3 format


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5548/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4405/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @ravipesala @kumarvishal09 @jackylk 
    Based on @ravipesala 's advice, I changed the name of internal datatype from `text` to `varchar`
    
    Later I'll raise another PR to make it possible for user to directly use `varchar(size)` in DDL statement.


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @kumarvishal09 
    
    1. There isn't a problem. I've discussed with Jacky and Ravindra, they agreed that user can specify the longStringColumn by using the ‘long_string_columns’ property.
    They also agreed that we can provide `varchar` for the longStringColumn.
    
    In this initial implementation, I want to keep it simple. 
    Using `varchar` will have more things to deal with such as dataframe only has StringType, so I also need to consider the writing DF to CarbonData.
    Besides, in Hive document, Hive will truncate varchar/char to specified length, while in Spark, spark will handle varchar as String. 
    In a word, if we use varchar, more things need to be considered.
    
    2. yeah, only for text data type now. 
    
    @kumarvishal09 As an initial implementation, I think it's already easy to use for users. How do you think?


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6097/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190220126
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/block/SegmentProperties.java ---
    @@ -849,7 +852,41 @@ public int getNumberOfDictSortColumns() {
         return this.numberOfSortColumns - this.numberOfNoDictSortColumns;
       }
     
    +  public int getNumberOfLongStringColumns() {
    +    return numberOfLongStringColumns;
    +  }
    +
       public int getLastDimensionColOrdinal() {
         return lastDimensionColOrdinal;
       }
    +
    +  @Override public String toString() {
    --- End diff --
    
    It's not required. I used it for debug output. Will remove it.


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5090/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190103691
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
    @@ -92,8 +93,10 @@ public DimensionColumnPage decodeColumnPage(int pageNumber) {
        * @param index
        * @return
        */
    -  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
    +  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index,
    +      boolean isLongStringColumn) {
    --- End diff --
    
    What does `reader` mean?


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5564/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5144/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6026/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5078/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    I deleted the origin branch unexpectedly,so I raised another PR #2379


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @kumarvishal09 yeah, it’s an option to add an encoder type and make L-V related class abstract to eliminate the duplicate code.


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6072/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190161224
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
    @@ -32,13 +32,14 @@
      *  by specifying page number.
      */
     public class DimensionRawColumnChunk extends AbstractRawColumnChunk {
    -
       private DimensionColumnPage[] dataChunks;
     
       private DimensionColumnChunkReader chunkReader;
     
       private FileReader fileReader;
    +  private boolean isLongStringColumn;
    --- End diff --
    
    how u are deciding it is isLongStringColumn or not ??


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5039/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5163/



---

[GitHub] carbondata issue #2252: WIP: Support string longer than 32000 characters

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4383/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r194350793
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/VariableLengthDimensionColumnPage.java ---
    @@ -30,21 +30,19 @@
     
       /**
        * Constructor for this class
    -   * @param dataChunks
    -   * @param invertedIndex
    -   * @param invertedIndexReverse
    -   * @param numberOfRows
        */
       public VariableLengthDimensionColumnPage(byte[] dataChunks, int[] invertedIndex,
    -      int[] invertedIndexReverse, int numberOfRows) {
    +      int[] invertedIndexReverse, int numberOfRows, boolean isVarcharEncoded) {
    --- End diff --
    
    Its better to pass store type it self from the caller instead for boolean 


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6259/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4859/



---

[GitHub] carbondata issue #2252: WIP: Support string longer than 32000 characters

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4641/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5027/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5097/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r189911364
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---
    @@ -92,8 +93,10 @@ public DimensionColumnPage decodeColumnPage(int pageNumber) {
        * @param index
        * @return
        */
    -  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
    +  public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index,
    +      boolean isLongStringColumn) {
    --- End diff --
    
    why this is required???...as mentioned in earlier comment...reader does not know which type of data it is reading


---

[GitHub] carbondata issue #2252: WIP: Support string longer than 32000 characters

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5546/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r194597446
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/VariableLengthDimensionColumnPage.java ---
    @@ -30,21 +30,19 @@
     
       /**
        * Constructor for this class
    -   * @param dataChunks
    -   * @param invertedIndex
    -   * @param invertedIndexReverse
    -   * @param numberOfRows
        */
       public VariableLengthDimensionColumnPage(byte[] dataChunks, int[] invertedIndex,
    -      int[] invertedIndexReverse, int numberOfRows) {
    +      int[] invertedIndexReverse, int numberOfRows, boolean isVarcharEncoded) {
    --- End diff --
    
    OK~


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    retest this please


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6240/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @ravipesala I've considered to add a datatype such as TEXT, but quit the idea due to that the grammar is not general, at least it is not compatible with Spark/Hive. It will cause problem to migrate from/to Carbondata.


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4664/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420][32K] Support string longer ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin closed the pull request at:

    https://github.com/apache/carbondata/pull/2252


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @kumarvishal09 You suggested to add a new TEXT encoder and using this encoder while writing&reading.
    But currently in CarbonData, all dimensions are considered as string, there is no specified encoder for it.
    In the previous description,
    ```
    For DimensionStoreType, I changed VARIABLELENGTH to VARIABLE_INT_LENGTH and VARIABLE_SHORT_LENGTH, they are used for encoding/decoding dimensions
    ```
    We can consider the DimensionStoreType as an encoder. It had two values: FIXEDLENGTH and VARIABLELENGTH and I extended it to tree:FIXED_LENGTH、VARIABLE_SHORT_LENGTH、VARIABLE_INT_LENGTH.
    
    Does this meet your suggestion?


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @xuchuanyin For supporting String column with more than 32K character we need below changes.
    **Create**
    1. Support new data type varchar already mentioned by @ravipesala 
    **Loading:**
    1. Add new encoder . Add this encoder for all the varchar columns in datachunk2 while writing the data to carbondata file. Please check DataChunk2 in carbondata.thrift we are adding encoder for each column
    2. Use DirectCompressCodec for compressing the data already code is present in ColumnPage.getLVFlattenedBytePage()
    3. Add stats collector for computing max/min for varchar columns implement new class for handling the same
    4. No need to add sartkey and endkey for varchar columns,
    **Reading**
    1. Add new implementation for DimensionDataChunkStore to store INT LV format data.(already handled)
    2. Based on encoder present in datachunk2 use DimensionDataChunkStore implementation. Like for dictionary encoder we are creating fixedLengthStoreChunk object 
    3. For varchar column just uncompress the data and store the same data with LV format in store (No need to convert LV formatted data to 2D byte array)
    
    **Note:**: We need to handle the same changes for complex data type.
    Please take care of backward compatibility :-)
    Please let me know for any clarification.
     
    @ravipesala @jackylk . Please check if I missed anything.
    



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420][32K] Support string longer than 32...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5288/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r190103851
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v1/CompressedDimensionChunkFileBasedReaderV1.java ---
    @@ -99,6 +99,7 @@ public CompressedDimensionChunkFileBasedReaderV1(final BlockletInfo blockletInfo
     
       @Override public DimensionColumnPage decodeColumnPage(
           DimensionRawColumnChunk dimensionRawColumnChunk, int pageNumber) throws IOException {
    +    boolean isLongStringColumn = dimensionRawColumnChunk.isLongStringColumn();
    --- End diff --
    
    Should we support longStringColumn in V1?
    
    Carbondata now only support writing V1 format.


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @xuchuanyin Its better to add one encoder type to store long string. In current code reader does not know which type of data it is reading/storing and chunk store object is created based on encoder (fixed/variable) type. In your PR most of the classes you have added one boolean to check its Long string. It's not required, you can add one encoder type(Text) and instead of handling everything in same class(UnsafevariableLengthChunkStore/SafevariableLengthStore) add one more implementation for handling Long String.  


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    To eliminate if-else on longString as much as possible,
    + For DataMapSchemaType, I changed VARIABLE to VARIABLE_INT and VARIABLE_SHORT, they are used for BlockletDataMap;
    + For DimensionStoreType, I changed VARIABLELENGTH to VARIABLE_INT_LENGTH and VARIABLE_SHORT_LENGTH, they are used for encoding/decoding dimensions;
    + For ColumnPageStatCollector, I changed LVStringStatCollector to LVShortStringStatCollector and LVLongStringStatCollector, they are used for column statistics;
    
    While deep into the code, I found that there is no need to add an internal datatype such as TEXT.
    + The dimensions all are considered as String, we only need a boolean array to indicate whether it is long string. We don't need an array to indicate all the datatype of the dimensions.
    + The procedure for String and Text are nearly the same. A boolean (array) and proper abstraction is enough.


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5087/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6017/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    @kumarvishal09 please review the latest update.
    
    1. still use 'long_string_columns' instead of `varchar` datatype to make it consistent it spark/hive
    2. internally add a new datatype called `text` to represent the long string column
    3. add a new encoding called DIRECT_COMPRESS_TEXT to the text column page meta
    4. use an integer (previously short) to store the length of bytes content.
    5. A test was added to test query/select on the text columns


---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4660/



---

[GitHub] carbondata pull request #2252: [CARBONDATA-2420] Support string longer than ...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2252#discussion_r194603907
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/store/impl/safe/SafeVariableLengthDimensionDataChunkStore.java ---
    @@ -56,7 +57,8 @@ public SafeVariableLengthDimensionDataChunkStore(boolean isInvertedIndex, int nu
        * @param invertedIndexReverse inverted index reverse to be stored
        * @param data                 data to be stored
        */
    -  @Override public void putArray(final int[] invertedIndex, final int[] invertedIndexReverse,
    +  @Override
    +  public void putArray(final int[] invertedIndex, final int[] invertedIndexReverse,
    --- End diff --
    
    @kumarvishal09 yeah, you are right. Considering that scenario will make the code much more complex.
    As we all know, a maximum of string/bytearray is about 2GB. I think it is enough for current scenarios. 2GB can support average 67104 bytes per column (2GB/32000) in one column page. 
    
    Having discussed with @jackylk , we have the following conclusions:
    1. Add an error message to indicate that the maximum size of one column page is 2GB.
    2. We can reduce the row size in the page to support longer characters, for example if the content is 67104*2, user can reduce the row size from 32000 to 16000.
    
    1 will be included in this PR and 2 will be implemented in another PR.
    Is that OK? @kumarvishal09 


---

[GitHub] carbondata issue #2252: WIP:[CARBONDATA-2420] Support string longer than 320...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6094/



---

[GitHub] carbondata issue #2252: [CARBONDATA-2420] Support string longer than 32000 c...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2252
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4402/



---