You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@orc.apache.org by Citrullin <gi...@git.apache.org> on 2017/04/10 16:38:46 UTC

[GitHub] orc pull request #104: Core documentation fixes

GitHub user Citrullin opened a pull request:

    https://github.com/apache/orc/pull/104

    Core documentation fixes

    There's an issue with the size count and the MapColumnVector is named ListColumnVector
    Also wrote a jira ticket for it:
    https://issues.apache.org/jira/browse/ORC-168

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Citrullin/orc master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #104
    
----
commit c1083247a917f18289761bcf13f902555b5cbda8
Author: Philipp Blum <ph...@webtrekk.com>
Date:   2017-04-10T16:26:06Z

    fix wrong batch.size counter

commit b4c75761115d3fc48f53ebed51bb64a6e8a204c8
Author: Philipp Blum <ph...@webtrekk.com>
Date:   2017-04-10T16:29:06Z

    Rename ListColumnVector to MapColumnVector

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #104: Core documentation fixes

Posted by Citrullin <gi...@git.apache.org>.

Github user Citrullin commented on a diff in the pull request:

    https://github.com/apache/orc/pull/104#discussion_r111478051
  
    --- Diff: site/_docs/core-java.md ---
    @@ -233,14 +233,15 @@ VectorizedRowBatch batch = schema.createRowBatch();
     LongColumnVector x = (LongColumnVector) batch.cols[0];
     LongColumnVector y = (LongColumnVector) batch.cols[1];
     for(int r=0; r < 10000; ++r) {
    -  int row = batch.size++;
    +  int row = batch.size;
       x.vector[row] = r;
    --- End diff --
    
    Hi, I'm not that good in Java. I'm more familiar with Scala. But you take a look into the Library I wrote. I created two branches where I changed only the position of the up counting.
    
    In example-1 I count both batch.size and rowBatchSize up before I add a row to VectorizedBatch. 
    [See more here](https://github.com/Citrullin/scalaOrcWriter/blob/ORC-168-wrong-example-1/src/main/scala/citrullin/orcwriter/OrcWriter.scala#L77)
    
    In example 2 I count only batch.size up before I add the row to the batch. rowBatchSize will up counted after a row is written.
    [See more here](https://github.com/Citrullin/scalaOrcWriter/blob/ORC-168-wrong-example-2/src/main/scala/citrullin/orcwriter/OrcWriter.scala#L77)
    
    The working example is in the dev branch. 
    [More here](https://github.com/Citrullin/scalaOrcWriter/blob/dev/src/main/scala/citrullin/orcwriter/OrcWriter.scala#L77)
    
    You can run the Implicit ComplexMap example to see the differences. 
    [Here is the source](https://github.com/Citrullin/scalaOrcWriter/blob/dev/src/main/scala/citrullin/orcwriter/examples/orcwriterimplicitapi/WriteComplexMap.scala)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #104: Core documentation fixes

Posted by Citrullin <gi...@git.apache.org>.

Github user Citrullin commented on the issue:

    https://github.com/apache/orc/pull/104
  
    Thanks @omalley,
    I forgot that is a post increment. So, yes, works fine :) I only have to do it a bit differently in scala. :) Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #104: Core documentation fixes

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/104
  
    As a double check, I tried the two variants:
    
    ```java
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        Writer writer = OrcFile.createWriter(testFilePath,
            OrcFile.writerOptions(conf)
                .setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch(10);
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];
        for(int r=0; r < 15; ++r) {
          int row = batch.size++;
          x.vector[row] = r;
          y.vector[row] = r * 3;
          // If the batch is full, write it out and start over.
          if (batch.size == batch.getMaxSize()) {
            writer.addRowBatch(batch);
            batch.reset();
          }
        }
        if (batch.size != 0) {
          writer.addRowBatch(batch);
          batch.reset();
        }
        writer.close();
    ```
    
    and
    
    ```java
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        Writer writer = OrcFile.createWriter(testFilePath,
            OrcFile.writerOptions(conf)
                .setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch(10);
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];
        for(int r=0; r < 15; ++r) {
          int row = batch.size;
          x.vector[row] = r;
          y.vector[row] = r * 3;
          // If the batch is full, write it out and start over.
          if (batch.size == batch.getMaxSize()) {
            writer.addRowBatch(batch);
            batch.reset();
          }
          batch.size++;
        }
        if (batch.size != 0) {
          writer.addRowBatch(batch);
          batch.reset();
        }
        writer.close();
    ```
    
    The first generated the expected output and the second got an array out of bounds exception. The correct output is:
    
    ```json
    {"x": 0, "y": 0}
    {"x": 1, "y": 3}
    {"x": 2, "y": 6}
    {"x": 3, "y": 9}
    {"x": 4, "y": 12}
    {"x": 5, "y": 15}
    {"x": 6, "y": 18}
    {"x": 7, "y": 21}
    {"x": 8, "y": 24}
    {"x": 9, "y": 27}
    {"x": 10, "y": 30}
    {"x": 11, "y": 33}
    {"x": 12, "y": 36}
    {"x": 13, "y": 39}
    {"x": 14, "y": 42}
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #104: Core documentation fixes

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/104#discussion_r111253053
  
    --- Diff: site/_docs/core-java.md ---
    @@ -233,14 +233,15 @@ VectorizedRowBatch batch = schema.createRowBatch();
     LongColumnVector x = (LongColumnVector) batch.cols[0];
     LongColumnVector y = (LongColumnVector) batch.cols[1];
     for(int r=0; r < 10000; ++r) {
    -  int row = batch.size++;
    +  int row = batch.size;
       x.vector[row] = r;
    --- End diff --
    
    batch.size must be incremented before the call to writer.addRowBatch on line 241. This change would break the example.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #104: Core documentation fixes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/104


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #104: Core documentation fixes

Posted by omalley <gi...@git.apache.org>.

Github user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/104#discussion_r111254734
  
    --- Diff: site/_docs/core-java.md ---
    @@ -233,14 +233,15 @@ VectorizedRowBatch batch = schema.createRowBatch();
     LongColumnVector x = (LongColumnVector) batch.cols[0];
     LongColumnVector y = (LongColumnVector) batch.cols[1];
     for(int r=0; r < 10000; ++r) {
    -  int row = batch.size++;
    +  int row = batch.size;
       x.vector[row] = r;
       y.vector[row] = r * 3;
       // If the batch is full, write it out and start over.
       if (batch.size == batch.getMaxSize()) {
         writer.addRowBatch(batch);
         batch.reset();
       }
    +  batch.size++;
     }
     writer.close();
    --- End diff --
    
    There should be a final call to add the batch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #104: Core documentation fixes

Posted by Citrullin <gi...@git.apache.org>.

Github user Citrullin commented on a diff in the pull request:

    https://github.com/apache/orc/pull/104#discussion_r111481491
  
    --- Diff: site/_docs/core-java.md ---
    @@ -233,14 +233,15 @@ VectorizedRowBatch batch = schema.createRowBatch();
     LongColumnVector x = (LongColumnVector) batch.cols[0];
     LongColumnVector y = (LongColumnVector) batch.cols[1];
     for(int r=0; r < 10000; ++r) {
    -  int row = batch.size++;
    +  int row = batch.size;
       x.vector[row] = r;
       y.vector[row] = r * 3;
       // If the batch is full, write it out and start over.
       if (batch.size == batch.getMaxSize()) {
         writer.addRowBatch(batch);
         batch.reset();
       }
    +  batch.size++;
     }
     writer.close();
    --- End diff --
    
    That's correct. I already fixed it in my code. Had also this issue when the batch smaller than the max size.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---