You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/08/09 10:50:46 UTC

[GitHub] [orc] guiyanakuang opened a new pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

guiyanakuang opened a new pull request #841:
URL: https://github.com/apache/orc/pull/841


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   Create a new class RowFilterInputState to hold duplicate code.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Easier to maintain.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Pass the CIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #841:
URL: https://github.com/apache/orc/pull/841#issuecomment-895493983


   Please try the following, @guiyanakuang .
   ```
   --- a/java/bench/hive/src/findbugs/exclude.xml
   +++ b/java/bench/hive/src/findbugs/exclude.xml
   @@ -14,7 +14,7 @@
    -->
    <FindBugsFilter>
      <Match>
   -    <Bug pattern="EI_EXPOSE_REP,EI_EXPOSE_REP2,MS_EXPOSE_REP"/>
   +    <Bug pattern="EI_EXPOSE_REP,EI_EXPOSE_REP2,MS_EXPOSE_REP,DM_EXIT"/>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on a change in pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on a change in pull request #841:
URL: https://github.com/apache/orc/pull/841#discussion_r685613269



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/rowfilter/RowFilterInputState.java
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.bench.hive.rowfilter;
+
+import org.apache.commons.lang.reflect.FieldUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcFilterContext;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+import java.io.IOException;
+import java.util.Random;
+
+@State(Scope.Thread)
+public abstract class RowFilterInputState {
+
+    private static final Path root = new Path(System.getProperty("user.dir"));
+
+    Configuration conf = new Configuration();
+    FileSystem fs;
+    TypeDescription schema;
+    VectorizedRowBatch batch;
+    Path path;
+    boolean[] include;
+    Reader reader;
+    Reader.Options readerOptions;
+    boolean[] filterValues = null;

Review comment:
       I've updated the description. Yes, I should have mentioned some special modifications.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #841:
URL: https://github.com/apache/orc/pull/841#issuecomment-895489411


   cc @omalley , @pgaref , @williamhyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #841:
URL: https://github.com/apache/orc/pull/841#issuecomment-895594763


   > Thank you for making a PR, @guiyanakuang .
   > 
   > For findbugs `DM_EXIT`, you can add an exclusion like the following.
   > 
   > ```
   > $ git grep DM_EXIT
   > java/examples/src/findbugs/exclude.xml:    <Bug pattern="EI_EXPOSE_REP,EI_EXPOSE_REP2,DM_EXIT"/>
   > java/tools/src/findbugs/exclude.xml:    <Bug pattern="EI_EXPOSE_REP,EI_EXPOSE_REP2,DM_EXIT"/>
   > ```
   
   Thanks to @dongjoon-hyun for the tip. We are more than 10 hours apart in time zone, so I am late in replying.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #841:
URL: https://github.com/apache/orc/pull/841#discussion_r685600601



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/rowfilter/RowFilterInputState.java
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.bench.hive.rowfilter;
+
+import org.apache.commons.lang.reflect.FieldUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcFilterContext;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+import java.io.IOException;
+import java.util.Random;
+
+@State(Scope.Thread)
+public abstract class RowFilterInputState {
+
+    private static final Path root = new Path(System.getProperty("user.dir"));
+
+    Configuration conf = new Configuration();
+    FileSystem fs;
+    TypeDescription schema;
+    VectorizedRowBatch batch;
+    Path path;
+    boolean[] include;
+    Reader reader;
+    Reader.Options readerOptions;
+    boolean[] filterValues = null;
+
+    @Setup
+    public void setup() throws IOException, IllegalAccessException {
+        TypeDescription.RowBatchVersion version =
+            (TypeDescription.RowBatchVersion)FieldUtils.readField(this, "version", true);
+        TypeDescription.Category benchType = (TypeDescription.Category)FieldUtils.readField(this, "benchType", true);
+        String filterPerc = (String)FieldUtils.readField(this, "filterPerc", true);
+        int filterColsNum = (int)FieldUtils.readField(this, "filterColsNum", true);
+        String dataRelativePath = (String)FieldUtils.readField(this, "dataRelativePath", true);
+        String schemaName = (String)FieldUtils.readField(this, "schemaName", true);
+        String filterColumn = (String)FieldUtils.readField(this, "filterColumn", true);
+
+        fs = FileSystem.getLocal(conf).getRaw();
+        path = new Path(root, dataRelativePath);
+        schema = Utilities.loadSchema(schemaName);
+        batch = schema.createRowBatch(version, 1024);
+        include = new boolean[schema.getMaximumId() + 1];
+        for(TypeDescription child: schema.getChildren()) {
+            if (schema.getFieldNames().get(child.getId()-1).compareTo(filterColumn) == 0) {
+                System.out.println("Apply Filter on column: " + schema.getFieldNames().get(child.getId()-1));
+                include[child.getId()] = true;
+            } else if (child.getCategory() == benchType) {
+                System.out.println("Skip column(s): " + schema.getFieldNames().get(child.getId()-1));
+                include[child.getId()] = true;
+                if (--filterColsNum == 0) break;
+            }
+        }
+        if (filterColsNum != 0) {
+            System.err.println("Dataset does not contain type: " + benchType);
+            System.exit(-1);
+        }
+        generateRandomSet(Double.parseDouble(filterPerc));
+        reader = OrcFile.createReader(path,
+                OrcFile.readerOptions(conf).filesystem(fs));
+        // just read the Boolean columns
+        readerOptions = reader.options().include(include);
+    }
+
+    public void generateRandomSet(double percentage) throws IllegalArgumentException {

Review comment:
       Ditto. This was `public static` before.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun merged pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun merged pull request #841:
URL: https://github.com/apache/orc/pull/841


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #841:
URL: https://github.com/apache/orc/pull/841#discussion_r685600498



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/rowfilter/RowFilterInputState.java
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.bench.hive.rowfilter;
+
+import org.apache.commons.lang.reflect.FieldUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcFilterContext;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+import java.io.IOException;
+import java.util.Random;
+
+@State(Scope.Thread)
+public abstract class RowFilterInputState {
+
+    private static final Path root = new Path(System.getProperty("user.dir"));
+
+    Configuration conf = new Configuration();
+    FileSystem fs;
+    TypeDescription schema;
+    VectorizedRowBatch batch;
+    Path path;
+    boolean[] include;
+    Reader reader;
+    Reader.Options readerOptions;
+    boolean[] filterValues = null;

Review comment:
       When you change things like this, please mention in the PR description. Otherwise, some reviewers might not notice the difference.
   
   This was `static boolean[]` before this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #841:
URL: https://github.com/apache/orc/pull/841#issuecomment-895734124


   Thanks for the reviewing and merging! @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on a change in pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on a change in pull request #841:
URL: https://github.com/apache/orc/pull/841#discussion_r685149454



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/rowfilter/RowFilterInputState.java
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.bench.hive.rowfilter;
+
+import org.apache.commons.lang.reflect.FieldUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcFilterContext;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+import java.io.IOException;
+import java.util.Random;
+
+@State(Scope.Thread)
+public abstract class RowFilterInputState {
+
+    private static final Path root = new Path(System.getProperty("user.dir"));
+
+    Configuration conf = new Configuration();
+    FileSystem fs;
+    TypeDescription schema;
+    VectorizedRowBatch batch;
+    Path path;
+    boolean[] include;
+    Reader reader;
+    Reader.Options readerOptions;
+    boolean[] filterValues = null;
+
+    @Setup
+    public void setup() throws IOException, IllegalAccessException {
+        TypeDescription.RowBatchVersion version =
+            (TypeDescription.RowBatchVersion)FieldUtils.readField(this, "version", true);
+        TypeDescription.Category benchType = (TypeDescription.Category)FieldUtils.readField(this, "benchType", true);
+        String filterPerc = (String)FieldUtils.readField(this, "filterPerc", true);
+        int filterColsNum = (int)FieldUtils.readField(this, "filterColsNum", true);
+        String dataRelativePath = (String)FieldUtils.readField(this, "dataRelativePath", true);
+        String schemaName = (String)FieldUtils.readField(this, "schemaName", true);
+        String filterColumn = (String)FieldUtils.readField(this, "filterColumn", true);
+
+        fs = FileSystem.getLocal(conf).getRaw();
+        path = new Path(root, dataRelativePath);
+        schema = Utilities.loadSchema(schemaName);
+        batch = schema.createRowBatch(version, 1024);
+        include = new boolean[schema.getMaximumId() + 1];
+        for(TypeDescription child: schema.getChildren()) {
+            if (schema.getFieldNames().get(child.getId()-1).compareTo(filterColumn) == 0) {
+                System.out.println("Apply Filter on column: " + schema.getFieldNames().get(child.getId()-1));
+                include[child.getId()] = true;
+            } else if (child.getCategory() == benchType) {
+                System.out.println("Skip column(s): " + schema.getFieldNames().get(child.getId()-1));
+                include[child.getId()] = true;
+                if (--filterColsNum == 0) break;
+            }
+        }
+        if (filterColsNum != 0) {
+            System.err.println("Dataset does not contain type: " + benchType);
+            System.exit(-1);

Review comment:
       findbugs prompt
   ```
   [INFO] org.apache.orc.bench.hive.rowfilter.RowFilterInputState.setup() invokes System.exit(...), which shuts down the entire virtual machine [org.apache.orc.bench.hive.rowfilter.RowFilterInputState] At RowFilterInputState.java:[line 80] DM_EXIT
   ```
   Also use System.exit before modifying,I don't know how to get past the findbugs check




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #841:
URL: https://github.com/apache/orc/pull/841#issuecomment-895629482


   No problem, @guiyanakuang . Thank you for updating.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #841: ORC-927: Extracting duplicate codes for RowFilterBenchmark

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #841:
URL: https://github.com/apache/orc/pull/841#issuecomment-895666127


   > Please follow the guideline by using`2-space indentation` in the new file.
   > 
   > * https://orc.apache.org/develop/coding/
   > 
   > ```
   > Indentation should be 2 spaces.
   > ```
   
   Modified indentation issues,I will follow this guide.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org