Posted to dev@carbondata.apache.org by manishgupta88 <gi...@git.apache.org> on 2016/08/31 06:30:15 UTC

[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

GitHub user manishgupta88 opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/111

    [CARBONDATA-194] ArrayIndexOfBoundException thrown when number of columns in row more than the max number of columns in univocity parser settings

    ISSUE ID: https://issues.apache.org/jira/browse/CARBONDATA-194
    
    Problem: When a row in the CSV data file contains more columns than the maximum configured in the univocity parser settings, the parser throws an ArrayIndexOutOfBoundsException.
    
    Reason: The maximum number of columns in CSVParserSettings is set to the number of columns in the schema plus 10. If a row still contains more columns than that, the univocity parser throws an ArrayIndexOutOfBoundsException.
    
    Solution: Configure a higher default for the maximum number of columns, and pass the larger of the schema column count and that default to the univocity parser settings.
    
    Impact: Data load flow
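
The solution described above amounts to taking the larger of the schema column count and a fixed default. A minimal sketch in plain Java follows, assuming a default of 2000 (the constant that appears in the diff under review). The class name `MaxColumnsSketch` is hypothetical, and the step that hands the result to the univocity settings object is left out.

```java
// Sketch only: class and method names are illustrative, not the PR's code.
final class MaxColumnsSketch {

    // Assumed default upper bound on columns per row; 2000 mirrors the diff.
    static final int DEFAULT_MAX_COLUMNS = 2000;

    // Returns the column limit to configure on the parser: never lower than
    // the default, and never lower than the schema column count.
    static int maxColumnsForParsing(int columnCountInSchema) {
        return Math.max(columnCountInSchema, DEFAULT_MAX_COLUMNS);
    }

    public static void main(String[] args) {
        // A narrow schema falls back to the default limit.
        System.out.println(maxColumnsForParsing(12));
        // A schema wider than the default keeps its own column count.
        System.out.println(maxColumnsForParsing(3500));
    }
}
```

With this rule a CSV row may carry up to 2000 columns even when the schema selects only a few of them; rows wider than the returned limit would still be rejected by the parser.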

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/incubator-carbondata univocity_max_columns_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #111
    
----
commit ef686d505472f33c4b8afe73079a637ff5611a48
Author: manishgupt88 <to...@gmail.com>
Date:   2016-08-31T06:18:53Z

    Problem: When a row in the CSV data file contains more columns than the schema, the parser throws an ArrayIndexOutOfBoundsException.
    
    Reason: The maximum number of columns in CSVParserSettings is set to the number of columns in the schema plus 10. If a row still contains more columns than that, the univocity parser throws an ArrayIndexOutOfBoundsException.
    
    Solution: Configure a higher default for the maximum number of columns, and pass the larger of the schema column count and that default to the univocity parser settings.
    
    Impact: Data load flow

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-carbondata/pull/111



[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

Posted by gvramana <gi...@git.apache.org>.
Github user gvramana commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/111#discussion_r76936740
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -104,6 +108,20 @@ public void initialize() throws IOException {
       }
     
       /**
    +   * This method will decide the number of columns to be parsed for a row by univocity parser
    +   *
    +   * @param columnCountInSchema total number of columns in schema
    +   * @return
    +   */
    +  private int getMaxColumnsForParsing(int columnCountInSchema) {
    +    int maxNumberOfColumnsForParsing = columnCountInSchema;
    +    if (columnCountInSchema < MAX_NUMBER_OF_COLUMNS_FOR_PARSING) {
    +      maxNumberOfColumnsForParsing = MAX_NUMBER_OF_COLUMNS_FOR_PARSING;
    +    }
    +    return maxNumberOfColumnsForParsing;
    --- End diff --
    
    Add +10 if schema columns are considered.
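
One possible reading of this suggestion, sketched in plain Java: when the schema column count is the value that ends up being used, keep the earlier headroom of 10 extra columns. The class name `HeadroomSketch` and the constant names are hypothetical, and the default of 2000 is taken from the diff under review.

```java
// Sketch only: one interpretation of the review comment, not the merged code.
final class HeadroomSketch {

    // Assumed default limit, mirroring the constant in the diff.
    static final int DEFAULT_MAX_COLUMNS = 2000;

    // Extra columns tolerated beyond the schema width (the "+10" in the thread).
    static final int SCHEMA_HEADROOM = 10;

    static int maxColumnsForParsing(int columnCountInSchema) {
        if (columnCountInSchema < DEFAULT_MAX_COLUMNS) {
            // The default already exceeds the schema width, so use it as-is.
            return DEFAULT_MAX_COLUMNS;
        }
        // The schema width is the limit, so add headroom for stray columns.
        return columnCountInSchema + SCHEMA_HEADROOM;
    }

    public static void main(String[] args) {
        System.out.println(maxColumnsForParsing(12));
        System.out.println(maxColumnsForParsing(3500));
    }
}
```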



[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/111#discussion_r77002455
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -41,6 +41,10 @@
     public class UnivocityCsvParser {
     
       /**
    +   * Max number of columns that will be parsed for a row by univocity parsing
    +   */
    +  private static final int MAX_NUMBER_OF_COLUMNS_FOR_PARSING = 2000;
    --- End diff --
    
    I think the user will not be able to provide this. They may receive a CSV with more columns than the few they are interested in loading. Depending on the user to set this value would be bug-prone.



[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/111#discussion_r76936492
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -41,6 +41,10 @@
     public class UnivocityCsvParser {
     
       /**
    +   * Max number of columns that will be parsed for a row by univocity parsing
    +   */
    +  private static final int MAX_NUMBER_OF_COLUMNS_FOR_PARSING = 2000;
    --- End diff --
    
    I think it will fail if the CSV file has more than 2000 columns and the schema selects fewer columns. Better to expose a property so the user can also configure the maximum number of columns in the CSV file.



[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

Posted by kumarvishal09 <gi...@git.apache.org>.
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/111#discussion_r76973485
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -104,6 +108,20 @@ public void initialize() throws IOException {
       }
     
       /**
    +   * This method will decide the number of columns to be parsed for a row by univocity parser
    +   *
    +   * @param columnCountInSchema total number of columns in schema
    +   * @return
    +   */
    +  private int getMaxColumnsForParsing(int columnCountInSchema) {
    +    int maxNumberOfColumnsForParsing = columnCountInSchema;
    +    if (columnCountInSchema < MAX_NUMBER_OF_COLUMNS_FOR_PARSING) {
    +      maxNumberOfColumnsForParsing = MAX_NUMBER_OF_COLUMNS_FOR_PARSING;
    +    }
    +    return maxNumberOfColumnsForParsing;
    --- End diff --
    
    I added +10 to avoid this bug. I think there is no need to add 10 now, since we are allowing the user to specify the maximum number of columns. @gvramana, please comment.



[GitHub] incubator-carbondata pull request #111: [CARBONDATA-194] ArrayIndexOfBoundEx...

Posted by gvramana <gi...@git.apache.org>.
Github user gvramana commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/111#discussion_r76946838
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -41,6 +41,10 @@
     public class UnivocityCsvParser {
     
       /**
    +   * Max number of columns that will be parsed for a row by univocity parsing
    +   */
    +  private static final int MAX_NUMBER_OF_COLUMNS_FOR_PARSING = 2000;
    --- End diff --
    
    If the array is created for every row, then we should certainly control it.
    If the array is created only once, then 2000 can be the default, and a DataLoad command option can be added to control it, which is mainly useful for wide tables.
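
A rough sketch of the second case, in plain Java: a default of 2000 that a load option can override. The option key `maxcolumns`, the class name, and the use of a plain `Map` for load options are all assumptions made for illustration; the thread does not name the option or say how it would be passed.

```java
import java.util.Map;

// Sketch only: the option key and this helper are hypothetical.
final class MaxColumnsOptionSketch {

    // Assumed default limit, matching the value discussed in the thread.
    static final int DEFAULT_MAX_COLUMNS = 2000;

    // Hypothetical name for the user-facing load option.
    static final String OPTION_KEY = "maxcolumns";

    static int resolveMaxColumns(Map<String, String> loadOptions, int columnCountInSchema) {
        int configured = DEFAULT_MAX_COLUMNS;
        String raw = loadOptions.get(OPTION_KEY);
        if (raw != null) {
            try {
                int parsed = Integer.parseInt(raw.trim());
                if (parsed > 0) {
                    configured = parsed;
                }
            } catch (NumberFormatException e) {
                // Unparseable value: keep the default rather than fail the load.
            }
        }
        // The limit must never drop below the number of columns in the schema.
        return Math.max(configured, columnCountInSchema);
    }
}
```

For a wide table, a call such as `resolveMaxColumns(Collections.singletonMap("maxcolumns", "5000"), 10)` would return 5000, while an empty option map falls back to 2000. Silently ignoring an invalid value is a design choice of this sketch; rejecting it with an error would be equally reasonable.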

