You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@carbondata.apache.org by lion-x <gi...@git.apache.org> on 2016/09/27 01:07:54 UTC

[GitHub] incubator-carbondata pull request #200: addtrimoption

GitHub user lion-x opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/200

    addtrimoption

    # Why raise this PR?
    Fix a bug and add trim option.
    Bug: When string is contains LeadingWhiteSpace or TrailingWhiteSpace, query result is null. This is because the dictionary ignore the LeadingWhiteSpace and TrailingWhiteSpace and the csvInput dose not.
    
    # How to test?
    Pass all the test cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lion-x/incubator-carbondata addTrimOption

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/200.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #200
    
----
commit 857ef775b5f25cac5b7bed8d997bbc8f66a6107b
Author: lion-x <xl...@gmail.com>
Date:   2016-09-27T00:56:36Z

    addtrimoption

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by lion-x <gi...@git.apache.org>.

Github user lion-x commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r83039531
  
    --- Diff: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala ---
    @@ -0,0 +1,78 @@
    +package org.apache.carbondata.spark.testsuite.dataload
    +
    +import java.io.File
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.spark.sql.common.util.CarbonHiveContext._
    +import org.apache.spark.sql.common.util.QueryTest
    +import org.scalatest.BeforeAndAfterAll
    +import org.apache.spark.sql.Row
    +
    +/**
    +  * Created by x00381807 on 2016/9/26.
    --- End diff --
    
    Oh, my fault


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82977743
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    So better to set this while creating the table as column properties metadata


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r85957411
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenMeta.java ---
    @@ -1694,5 +1699,19 @@ public void setTableOption(String tableOption) {
       public TableOptionWrapper getTableOptionWrapper() {
         return tableOptionWrapper;
       }
    +
    +  public String getIsUseTrim() {
    +    return isUseTrim;
    +  }
    +
    +  public void setIsUseTrim(Boolean[] isUseTrim) {
    +    for (Boolean flag: isUseTrim) {
    +      if (flag) {
    +        this.isUseTrim += "T";
    --- End diff --
    
    Use  TRUE/FALSE for better readability


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82523962
  
    --- Diff: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestDataLoadWithTrimOption.scala ---
    @@ -0,0 +1,78 @@
    +package org.apache.carbondata.spark.testsuite.dataload
    +
    +import java.io.File
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.spark.sql.common.util.CarbonHiveContext._
    +import org.apache.spark.sql.common.util.QueryTest
    +import org.scalatest.BeforeAndAfterAll
    +import org.apache.spark.sql.Row
    +
    +/**
    +  * Created by x00381807 on 2016/9/26.
    --- End diff --
    
    please remove


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r85957010
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/graphgenerator/GraphGenerator.java ---
    @@ -998,4 +1000,24 @@ private void prepareIsUseInvertedIndex(List<CarbonDimension> dims,
         graphConfig.setIsUseInvertedIndex(
             isUseInvertedIndexList.toArray(new Boolean[isUseInvertedIndexList.size()]));
       }
    +
    +  /**
    +   * Preparing the boolean [] to map whether the dimension use trim or not.
    +   *
    +   * @param dims
    +   * @param graphConfig
    +   */
    +  private void prepareIsUseTrim(List<CarbonDimension> dims,
    +                                GraphConfigurationInfo graphConfig) {
    +    List<Boolean> isUseTrimList = new ArrayList<Boolean>();
    +    for (CarbonDimension dimension : dims) {
    +      if (dimension.isUseTrim()) {
    --- End diff --
    
    Can we add this trim option as property of the column, i mean inside a column property rather than setting directly as CarbonDimension property, Already in CarbonColumn there is a property map which holds column related properties, i think we can use that, please check the feasibility.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...

Posted by lion-x <gi...@git.apache.org>.

Github user lion-x closed the pull request at:

    https://github.com/apache/incubator-carbondata/pull/200


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #200: [CARBONDATA-276]add trim property for cols

Posted by chenliang613 <gi...@git.apache.org>.

Github user chenliang613 commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/200
  
    @lion-x  please rebase it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r81281079
  
    --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java ---
    @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info
         model.setEscapeCharacter(schmaModel.getEscapeCharacter());
         model.setQuoteCharacter(schmaModel.getQuoteCharacter());
         model.setCommentCharacter(schmaModel.getCommentCharacter());
    +    model.setTrim(schmaModel.getTrim());
    --- End diff --
    
    Not clear about the usecase, can you make it more clear by providing more details


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82758567
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    Same you need to handle in CarbonFilters.scala also, since while processing filter expressions the spaces are not getting trimmed, so here also you need to take care based on your feature


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #200: [CARBONDATA-276]add trim property for cols

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/200
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by lion-x <gi...@git.apache.org>.

Github user lion-x commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r81306658
  
    --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java ---
    @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info
         model.setEscapeCharacter(schmaModel.getEscapeCharacter());
         model.setQuoteCharacter(schmaModel.getQuoteCharacter());
         model.setCommentCharacter(schmaModel.getCommentCharacter());
    +    model.setTrim(schmaModel.getTrim());
    --- End diff --
    
    In some cases, users want to trim some string field in csv, he set in Load DML option, and we need to handle to in csvinput step. So we the trim should be transformed to CSVinput step.
    
    For Example, in some usecases, the LeadingWhiteSpace in a String, like " fsdfsd" maybe meaningful. Users want to keep the LeadingWhiteSpace. So we add this option for users, let them choose.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by lion-x <gi...@git.apache.org>.

Github user lion-x commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82960653
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    hi sujith,
    I agree with Eason, when user query with some space filter, it should be considered as forbidden action.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r85957803
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java ---
    @@ -472,6 +475,7 @@ public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws K
                   break;
               }
             }
    +<<<<<<< HEAD
    --- End diff --
    
    is this file is having any conflict?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82977592
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    pros of this approach will be suppose in one load user loaded with dirty data and suddenly he realizes no i need to trim then in the next load he will enable the option and load the data, this will increase the dictionary space also, also in query dictionary lookup overhead will increase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by eason-lyx <gi...@git.apache.org>.

Github user eason-lyx commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82923303
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    hi sujith
    i am thinking if user already trim the data with the option setting,then when user query with some space filter,it would no getting result.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r81281050
  
    --- Diff: hadoop/src/test/java/org/apache/carbondata/hadoop/test/util/StoreCreator.java ---
    @@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info
         model.setEscapeCharacter(schmaModel.getEscapeCharacter());
         model.setQuoteCharacter(schmaModel.getQuoteCharacter());
         model.setCommentCharacter(schmaModel.getCommentCharacter());
    +    model.setTrim(schmaModel.getTrim());
    --- End diff --
    
    why we need to set this in schemamodel?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82968468
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    Guys i think if while data loading we are reading from configuration inorder to trim it or not same we need to do while doing filter also, based on configuration value decide.
    ex:  while loading i enabled trim property, so system will trim and load the data, now in filter query also if user provides while space then it needs to be trimmed while creating filter model. This will provide more system consistentency. if user enable trim then we wont trim it while loading and also while querying.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim option

Posted by sujith71955 <gi...@git.apache.org>.

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/200#discussion_r82968804
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -102,8 +102,8 @@ public void initialize() throws IOException {
         parserSettings.setMaxColumns(
             getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
         parserSettings.setNullValue("");
    -    parserSettings.setIgnoreLeadingWhitespaces(false);
    -    parserSettings.setIgnoreTrailingWhitespaces(false);
    +    parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
    --- End diff --
    
    Also one more point it will be better to set this property in column level while creating the table itself as its column properties , this will avoid user to provide this option every time while data loading


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---