Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/05/25 04:15:30 UTC

[GitHub] [orc] mwlon opened a new pull request, #1141: ORC-1189

mwlon opened a new pull request, #1141:
URL: https://github.com/apache/orc/pull/1141

   ### What changes were proposed in this pull request?
   Get benchmark suite and documentation suite working
   
   
   ### Why are the changes needed?
   * more helpful git ignore and package command for working with benchmarks
   * graceful error handling of misconfigured benchmarks
   * updated taxi dataset source and schema (the old one no longer exists)
   * dependency plugin now ignores test dependencies unused in main
   * removed unused mapreduce dependency
   
   ### How was this patch tested?
   This does not change functionality or tests (except for the updated taxi dataset).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r884331174


##########
java/bench/README.md:
##########
@@ -24,7 +24,7 @@ To fetch the source data:
 
 ```% ./fetch-data.sh```
 
-> :warning: Script will fetch 7GB of data
+> :warning: Script will fetch 500MB of data

Review Comment:
   Ah, you're right, I had downsized the GitHub data. I'll update it to 4GB.





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881206239


##########
java/bench/core/src/java/org/apache/orc/bench/core/convert/GenerateVariants.java:
##########
@@ -221,8 +221,8 @@ public static BatchReader createReader(Path root,
                                          long salesRecords) throws IOException {
     switch (dataName) {
       case "taxi":
-        return new RecursiveReader(new Path(root, "sources/" + dataName), "csv",
-            schema, conf, CompressionKind.ZLIB);
+        return new RecursiveReader(new Path(root, "sources/" + dataName), "parquet",
+            schema, conf, CompressionKind.NONE);

Review Comment:
   https://github.com/apache/orc/pull/1141#issuecomment-1136712610 





[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881585045


##########
java/mapreduce/pom.xml:
##########
@@ -63,10 +63,6 @@
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-storage-api</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.slf4j</groupId>
-      <artifactId>slf4j-api</artifactId>
-    </dependency>

Review Comment:
   Ah, actually it looks like I was running with `-Dmaven.test.skip`; when I changed that to `-DskipTests` as you suggested, it was fixed. Updated.





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r884332274


##########
java/bench/README.md:
##########
@@ -24,7 +24,7 @@ To fetch the source data:
 
 ```% ./fetch-data.sh```
 
-> :warning: Script will fetch 7GB of data
+> :warning: Script will fetch 500MB of data

Review Comment:
   Thanks!





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1140627368

   Here is the PR for the `orc.none`-related failure.
   - https://github.com/apache/orc/pull/1144




[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881203095


##########
java/mapreduce/pom.xml:
##########
@@ -63,10 +63,6 @@
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-storage-api</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.slf4j</groupId>
-      <artifactId>slf4j-api</artifactId>
-    </dependency>

Review Comment:
   The maven package command needed to run the benchmarks fails without it. Should I still make it a separate PR that we merge in first?





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881206633


##########
java/bench/core/src/java/org/apache/orc/bench/core/convert/GenerateVariants.java:
##########
@@ -221,8 +221,8 @@ public static BatchReader createReader(Path root,
                                          long salesRecords) throws IOException {
     switch (dataName) {
       case "taxi":
-        return new RecursiveReader(new Path(root, "sources/" + dataName), "csv",
-            schema, conf, CompressionKind.ZLIB);
+        return new RecursiveReader(new Path(root, "sources/" + dataName), "parquet",
+            schema, conf, CompressionKind.NONE);

Review Comment:
   I overlooked this sentence, `updated taxi dataset source and schema (the old one no longer exists)`, in the initial review.





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1140624385

   Merged to main/1.8/1.7.




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881205171


##########
java/mapreduce/pom.xml:
##########
@@ -63,10 +63,6 @@
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-storage-api</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.slf4j</groupId>
-      <artifactId>slf4j-api</artifactId>
-    </dependency>

Review Comment:
   What command did you try?





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r882344779


##########
java/bench/fetch-data.sh:
##########
@@ -15,8 +15,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 mkdir -p data/sources/taxi
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.csv | gzip > yellow_tripdata_2015-11.csv.gz )
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv | gzip > yellow_tripdata_2015-12.csv.gz )
+(cd data/sources/taxi; wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.parquet )

Review Comment:
   BTW, could you update the following accordingly, because the new files are about 10 times smaller now?
   https://github.com/apache/orc/blob/1afc31d6c04729d7e194a6423c690af4519aab33/java/bench/README.md#L27
   
   ```
   m1max orc:$ ls -alh yello*
   -rw-r--r--  1 dongjoon  staff   1.7G May 25 23:05 yellow_tripdata_2015-11.csv
   -rw-r--r--  1 dongjoon  staff   150M May 25 23:06 yellow_tripdata_2015-11.parquet
   ```
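
One way to double-check the size quoted in the README warning is to total the downloaded files directly. The following is a minimal Java sketch, not code from this PR: the class name `DataSizeCheck` and the default path `data/sources` are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper for illustration; it is not part of the ORC benchmark module.
final class DataSizeCheck {
  /** Sums the sizes, in bytes, of all regular files under root (recursively). */
  static long totalBytes(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths.filter(Files::isRegularFile)
          .mapToLong(p -> {
            try {
              return Files.size(p);
            } catch (IOException e) {
              // Lambdas cannot throw checked exceptions, so wrap it.
              throw new UncheckedIOException(e);
            }
          })
          .sum();
    }
  }

  public static void main(String[] args) throws IOException {
    // Assumed default location; pass another directory as the first argument.
    Path root = Paths.get(args.length > 0 ? args[0] : "data/sources");
    if (!Files.isDirectory(root)) {
      System.err.println("Not a directory: " + root);
      return;
    }
    long bytes = totalBytes(root);
    System.out.printf("%s: %.1f MiB%n", root, bytes / (1024.0 * 1024.0));
  }
}
```

Running it after `fetch-data.sh` finishes gives a figure that can be compared with the number in the README warning.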





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1138798392

   Please rebase and use a different JIRA ID, because https://github.com/apache/orc/pull/1142 was already merged as ORC-1189.




[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1138179487

   Shall we spin off the following at least as another PR? I can help you by merging it first.
   - Minor git and docs changes: java/bench/data and README.md
   - A typo fix in help message: `convert` -> `generate`
   
   Please add the following example in your new PR description.
   ```
   $ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data --help
   usage: convert <root>
    -c,--compress <arg>   List of compression
    -d,--data <arg>       List of data sets
    -f,--format <arg>     List of formats
    -h,--help             Provide help
    -s,--sales <arg>      Number of records for sales
   ```




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881201109


##########
java/bench/core/src/java/org/apache/orc/bench/core/Driver.java:
##########
@@ -18,10 +18,7 @@
 
 package org.apache.orc.bench.core;
 
-import java.util.Arrays;
-import java.util.Map;
-import java.util.ServiceLoader;
-import java.util.TreeMap;
+import java.util.*;

Review Comment:
   Please recover this.





[GitHub] [orc] mwlon commented on pull request #1141: ORC-1191

Posted by GitBox <gi...@apache.org>.
mwlon commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1140277000

   @dongjoon-hyun this is ready for review when you get a chance




[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1191

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1139716610

   #1143 is merged.




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881202692


##########
java/bench/core/src/java/org/apache/orc/bench/core/convert/GenerateVariants.java:
##########
@@ -221,8 +221,8 @@ public static BatchReader createReader(Path root,
                                          long salesRecords) throws IOException {
     switch (dataName) {
       case "taxi":
-        return new RecursiveReader(new Path(root, "sources/" + dataName), "csv",
-            schema, conf, CompressionKind.ZLIB);
+        return new RecursiveReader(new Path(root, "sources/" + dataName), "parquet",
+            schema, conf, CompressionKind.NONE);

Review Comment:
   I believe this should be recovered along with https://github.com/apache/orc/pull/1141/files#r881201813





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881201813


##########
java/bench/fetch-data.sh:
##########
@@ -15,8 +15,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 mkdir -p data/sources/taxi
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.csv | gzip > yellow_tripdata_2015-11.csv.gz )
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv | gzip > yellow_tripdata_2015-12.csv.gz )
+(cd data/sources/taxi; wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.parquet )
+(cd data/sources/taxi; wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.parquet )

Review Comment:
   ~I guess there is a reason why Apache ORC didn't download `.parquet` format. ;)~





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881202692


##########
java/bench/core/src/java/org/apache/orc/bench/core/convert/GenerateVariants.java:
##########
@@ -221,8 +221,8 @@ public static BatchReader createReader(Path root,
                                          long salesRecords) throws IOException {
     switch (dataName) {
       case "taxi":
-        return new RecursiveReader(new Path(root, "sources/" + dataName), "csv",
-            schema, conf, CompressionKind.ZLIB);
+        return new RecursiveReader(new Path(root, "sources/" + dataName), "parquet",
+            schema, conf, CompressionKind.NONE);

Review Comment:
   ~I believe this should be recovered along with https://github.com/apache/orc/pull/1141/files#r881201813~





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1140536424

   I'll resume the verification now.




[GitHub] [orc] dongjoon-hyun closed pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset
URL: https://github.com/apache/orc/pull/1141




[GitHub] [orc] mwlon commented on pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
mwlon commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1140453919

   @dongjoon-hyun done.




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881201813


##########
java/bench/fetch-data.sh:
##########
@@ -15,8 +15,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 mkdir -p data/sources/taxi
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.csv | gzip > yellow_tripdata_2015-11.csv.gz )
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv | gzip > yellow_tripdata_2015-12.csv.gz )
+(cd data/sources/taxi; wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.parquet )
+(cd data/sources/taxi; wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.parquet )

Review Comment:
   I guess there is a reason why Apache ORC didn't download `.parquet` format. ;)





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881202362


##########
java/mapreduce/pom.xml:
##########
@@ -63,10 +63,6 @@
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-storage-api</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.slf4j</groupId>
-      <artifactId>slf4j-api</artifactId>
-    </dependency>

Review Comment:
   Could you spin this off as a separate PR? We prefer to keep one theme per PR. In particular, we avoid mixing 1) bug fixes, 2) improvements, and 3) clean-up. This looks like a clean-up that is unrelated to the intention of your PR.





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881202469


##########
java/pom.xml:
##########
@@ -314,6 +314,7 @@
           </executions>
           <configuration>
             <failOnWarning>true</failOnWarning>
+            <ignoreNonCompile>true</ignoreNonCompile>

Review Comment:
   Do we need this?





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r882325625


##########
java/bench/core/src/java/org/apache/orc/bench/core/Driver.java:
##########
@@ -32,8 +34,14 @@ public class Driver {
 
   private static Map<String, OrcBenchmark> getBenchmarks() {
     Map<String, OrcBenchmark> result = new TreeMap<>();
-    for(OrcBenchmark bench: loader) {
-      result.put(bench.getName(), bench);
+    Iterator<OrcBenchmark> iterator = loader.iterator();
+    while (iterator.hasNext()) {
+      try {
+        OrcBenchmark bench = iterator.next();
+        result.put(bench.getName(), bench);
+      } catch (ServiceConfigurationError e) {
+        System.out.println("WARN (Driver): misconfigured benchmark exists:" + e);

Review Comment:
   Shall we use `System.err`?
   ```java
   System.out.println("WARN (Driver): misconfigured benchmark exists:" + e);
   System.err.println("Misconfigured benchmark exists:" + e);
   ```





[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881204329


##########
java/bench/core/src/java/org/apache/orc/bench/core/convert/GenerateVariants.java:
##########
@@ -221,8 +221,8 @@ public static BatchReader createReader(Path root,
                                          long salesRecords) throws IOException {
     switch (dataName) {
       case "taxi":
-        return new RecursiveReader(new Path(root, "sources/" + dataName), "csv",
-            schema, conf, CompressionKind.ZLIB);
+        return new RecursiveReader(new Path(root, "sources/" + dataName), "parquet",
+            schema, conf, CompressionKind.NONE);

Review Comment:
   What do you mean by "recovered"? The CSV files are no longer available.





[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881206569


##########
java/mapreduce/pom.xml:
##########
@@ -63,10 +63,6 @@
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-storage-api</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.slf4j</groupId>
-      <artifactId>slf4j-api</artifactId>
-    </dependency>

Review Comment:
   `./mvnw clean package -Pbenchmark`, as stated in the benchmark documentation





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r882329160


##########
java/bench/core/src/java/org/apache/orc/bench/core/Driver.java:
##########
@@ -32,8 +34,14 @@ public class Driver {
 
   private static Map<String, OrcBenchmark> getBenchmarks() {
     Map<String, OrcBenchmark> result = new TreeMap<>();
-    for(OrcBenchmark bench: loader) {
-      result.put(bench.getName(), bench);
+    Iterator<OrcBenchmark> iterator = loader.iterator();

Review Comment:
   Why do we need to define `iterator` variable separately?





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1191

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1140308923

   Thank you. Could you update the PR title, @mwlon ?




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r884324617


##########
java/bench/README.md:
##########
@@ -24,7 +24,7 @@ To fetch the source data:
 
 ```% ./fetch-data.sh```
 
-> :warning: Script will fetch 7GB of data
+> :warning: Script will fetch 500MB of data

Review Comment:
   During the final verification at the merging step, I found that this is wrong. Maybe you didn't run `fetch-data.sh` successfully?
   ```
   $ ./fetch-data.sh
   ...
   FINISHED --2022-05-29 14:08:32--
   Total wall clock time: 7m 21s
   Downloaded: 360 files, 3.5G in 4m 34s (13.2 MB/s)
   
   $ du -h data
   3.6G	data/sources/github
   321M	data/sources/taxi
   3.9G	data/sources
   3.9G	data
   
   $ tree data
   data
   └── sources
       ├── github
       │   ├── 2015-11-01-0.json.gz
       │   ├── 2015-11-01-1.json.gz
       │   ├── 2015-11-01-10.json.gz
       │   ├── 2015-11-01-11.json.gz
       │   ├── 2015-11-01-12.json.gz
       │   ├── 2015-11-01-13.json.gz
       │   ├── 2015-11-01-14.json.gz
       │   ├── 2015-11-01-15.json.gz
       │   ├── 2015-11-01-16.json.gz
       │   ├── 2015-11-01-17.json.gz
       │   ├── 2015-11-01-18.json.gz
       │   ├── 2015-11-01-19.json.gz
       │   ├── 2015-11-01-2.json.gz
       │   ├── 2015-11-01-20.json.gz
       │   ├── 2015-11-01-21.json.gz
       │   ├── 2015-11-01-22.json.gz
       │   ├── 2015-11-01-23.json.gz
       │   ├── 2015-11-01-3.json.gz
       │   ├── 2015-11-01-4.json.gz
       │   ├── 2015-11-01-5.json.gz
       │   ├── 2015-11-01-6.json.gz
       │   ├── 2015-11-01-7.json.gz
       │   ├── 2015-11-01-8.json.gz
       │   ├── 2015-11-01-9.json.gz
       │   ├── 2015-11-02-0.json.gz
       │   ├── 2015-11-02-1.json.gz
       │   ├── 2015-11-02-10.json.gz
       │   ├── 2015-11-02-11.json.gz
       │   ├── 2015-11-02-12.json.gz
       │   ├── 2015-11-02-13.json.gz
       │   ├── 2015-11-02-14.json.gz
       │   ├── 2015-11-02-15.json.gz
       │   ├── 2015-11-02-16.json.gz
       │   ├── 2015-11-02-17.json.gz
       │   ├── 2015-11-02-18.json.gz
       │   ├── 2015-11-02-19.json.gz
       │   ├── 2015-11-02-2.json.gz
       │   ├── 2015-11-02-20.json.gz
       │   ├── 2015-11-02-21.json.gz
       │   ├── 2015-11-02-22.json.gz
       │   ├── 2015-11-02-23.json.gz
       │   ├── 2015-11-02-3.json.gz
       │   ├── 2015-11-02-4.json.gz
       │   ├── 2015-11-02-5.json.gz
       │   ├── 2015-11-02-6.json.gz
       │   ├── 2015-11-02-7.json.gz
       │   ├── 2015-11-02-8.json.gz
       │   ├── 2015-11-02-9.json.gz
       │   ├── 2015-11-03-0.json.gz
       │   ├── 2015-11-03-1.json.gz
       │   ├── 2015-11-03-10.json.gz
       │   ├── 2015-11-03-11.json.gz
       │   ├── 2015-11-03-12.json.gz
       │   ├── 2015-11-03-13.json.gz
       │   ├── 2015-11-03-14.json.gz
       │   ├── 2015-11-03-15.json.gz
       │   ├── 2015-11-03-16.json.gz
       │   ├── 2015-11-03-17.json.gz
       │   ├── 2015-11-03-18.json.gz
       │   ├── 2015-11-03-19.json.gz
       │   ├── 2015-11-03-2.json.gz
       │   ├── 2015-11-03-20.json.gz
       │   ├── 2015-11-03-21.json.gz
       │   ├── 2015-11-03-22.json.gz
       │   ├── 2015-11-03-23.json.gz
       │   ├── 2015-11-03-3.json.gz
       │   ├── 2015-11-03-4.json.gz
       │   ├── 2015-11-03-5.json.gz
       │   ├── 2015-11-03-6.json.gz
       │   ├── 2015-11-03-7.json.gz
       │   ├── 2015-11-03-8.json.gz
       │   ├── 2015-11-03-9.json.gz
       │   ├── 2015-11-04-0.json.gz
       │   ├── 2015-11-04-1.json.gz
       │   ├── 2015-11-04-10.json.gz
       │   ├── 2015-11-04-11.json.gz
       │   ├── 2015-11-04-12.json.gz
       │   ├── 2015-11-04-13.json.gz
       │   ├── 2015-11-04-14.json.gz
       │   ├── 2015-11-04-15.json.gz
       │   ├── 2015-11-04-16.json.gz
       │   ├── 2015-11-04-17.json.gz
       │   ├── 2015-11-04-18.json.gz
       │   ├── 2015-11-04-19.json.gz
       │   ├── 2015-11-04-2.json.gz
       │   ├── 2015-11-04-20.json.gz
       │   ├── 2015-11-04-21.json.gz
       │   ├── 2015-11-04-22.json.gz
       │   ├── 2015-11-04-23.json.gz
       │   ├── 2015-11-04-3.json.gz
       │   ├── 2015-11-04-4.json.gz
       │   ├── 2015-11-04-5.json.gz
       │   ├── 2015-11-04-6.json.gz
       │   ├── 2015-11-04-7.json.gz
       │   ├── 2015-11-04-8.json.gz
       │   ├── 2015-11-04-9.json.gz
       │   ├── 2015-11-05-0.json.gz
       │   ├── 2015-11-05-1.json.gz
       │   ├── 2015-11-05-10.json.gz
       │   ├── 2015-11-05-11.json.gz
       │   ├── 2015-11-05-12.json.gz
       │   ├── 2015-11-05-13.json.gz
       │   ├── 2015-11-05-14.json.gz
       │   ├── 2015-11-05-15.json.gz
       │   ├── 2015-11-05-16.json.gz
       │   ├── 2015-11-05-17.json.gz
       │   ├── 2015-11-05-18.json.gz
       │   ├── 2015-11-05-19.json.gz
       │   ├── 2015-11-05-2.json.gz
       │   ├── 2015-11-05-20.json.gz
       │   ├── 2015-11-05-21.json.gz
       │   ├── 2015-11-05-22.json.gz
       │   ├── 2015-11-05-23.json.gz
       │   ├── 2015-11-05-3.json.gz
       │   ├── 2015-11-05-4.json.gz
       │   ├── 2015-11-05-5.json.gz
       │   ├── 2015-11-05-6.json.gz
       │   ├── 2015-11-05-7.json.gz
       │   ├── 2015-11-05-8.json.gz
       │   ├── 2015-11-05-9.json.gz
       │   ├── 2015-11-06-0.json.gz
       │   ├── 2015-11-06-1.json.gz
       │   ├── 2015-11-06-10.json.gz
       │   ├── 2015-11-06-11.json.gz
       │   ├── 2015-11-06-12.json.gz
       │   ├── 2015-11-06-13.json.gz
       │   ├── 2015-11-06-14.json.gz
       │   ├── 2015-11-06-15.json.gz
       │   ├── 2015-11-06-16.json.gz
       │   ├── 2015-11-06-17.json.gz
       │   ├── 2015-11-06-18.json.gz
       │   ├── 2015-11-06-19.json.gz
       │   ├── 2015-11-06-2.json.gz
       │   ├── 2015-11-06-20.json.gz
       │   ├── 2015-11-06-21.json.gz
       │   ├── 2015-11-06-22.json.gz
       │   ├── 2015-11-06-23.json.gz
       │   ├── 2015-11-06-3.json.gz
       │   ├── 2015-11-06-4.json.gz
       │   ├── 2015-11-06-5.json.gz
       │   ├── 2015-11-06-6.json.gz
       │   ├── 2015-11-06-7.json.gz
       │   ├── 2015-11-06-8.json.gz
       │   ├── 2015-11-06-9.json.gz
       │   ├── 2015-11-07-0.json.gz
       │   ├── 2015-11-07-1.json.gz
       │   ├── 2015-11-07-10.json.gz
       │   ├── 2015-11-07-11.json.gz
       │   ├── 2015-11-07-12.json.gz
       │   ├── 2015-11-07-13.json.gz
       │   ├── 2015-11-07-14.json.gz
       │   ├── 2015-11-07-15.json.gz
       │   ├── 2015-11-07-16.json.gz
       │   ├── 2015-11-07-17.json.gz
       │   ├── 2015-11-07-18.json.gz
       │   ├── 2015-11-07-19.json.gz
       │   ├── 2015-11-07-2.json.gz
       │   ├── 2015-11-07-20.json.gz
       │   ├── 2015-11-07-21.json.gz
       │   ├── 2015-11-07-22.json.gz
       │   ├── 2015-11-07-23.json.gz
       │   ├── 2015-11-07-3.json.gz
       │   ├── 2015-11-07-4.json.gz
       │   ├── 2015-11-07-5.json.gz
       │   ├── 2015-11-07-6.json.gz
       │   ├── 2015-11-07-7.json.gz
       │   ├── 2015-11-07-8.json.gz
       │   ├── 2015-11-07-9.json.gz
       │   ├── 2015-11-08-0.json.gz
       │   ├── 2015-11-08-1.json.gz
       │   ├── 2015-11-08-10.json.gz
       │   ├── 2015-11-08-11.json.gz
       │   ├── 2015-11-08-12.json.gz
       │   ├── 2015-11-08-13.json.gz
       │   ├── 2015-11-08-14.json.gz
       │   ├── 2015-11-08-15.json.gz
       │   ├── 2015-11-08-16.json.gz
       │   ├── 2015-11-08-17.json.gz
       │   ├── 2015-11-08-18.json.gz
       │   ├── 2015-11-08-19.json.gz
       │   ├── 2015-11-08-2.json.gz
       │   ├── 2015-11-08-20.json.gz
       │   ├── 2015-11-08-21.json.gz
       │   ├── 2015-11-08-22.json.gz
       │   ├── 2015-11-08-23.json.gz
       │   ├── 2015-11-08-3.json.gz
       │   ├── 2015-11-08-4.json.gz
       │   ├── 2015-11-08-5.json.gz
       │   ├── 2015-11-08-6.json.gz
       │   ├── 2015-11-08-7.json.gz
       │   ├── 2015-11-08-8.json.gz
       │   ├── 2015-11-08-9.json.gz
       │   ├── 2015-11-09-0.json.gz
       │   ├── 2015-11-09-1.json.gz
       │   ├── 2015-11-09-10.json.gz
       │   ├── 2015-11-09-11.json.gz
       │   ├── 2015-11-09-12.json.gz
       │   ├── 2015-11-09-13.json.gz
       │   ├── 2015-11-09-14.json.gz
       │   ├── 2015-11-09-15.json.gz
       │   ├── 2015-11-09-16.json.gz
       │   ├── 2015-11-09-17.json.gz
       │   ├── 2015-11-09-18.json.gz
       │   ├── 2015-11-09-19.json.gz
       │   ├── 2015-11-09-2.json.gz
       │   ├── 2015-11-09-20.json.gz
       │   ├── 2015-11-09-21.json.gz
       │   ├── 2015-11-09-22.json.gz
       │   ├── 2015-11-09-23.json.gz
       │   ├── 2015-11-09-3.json.gz
       │   ├── 2015-11-09-4.json.gz
       │   ├── 2015-11-09-5.json.gz
       │   ├── 2015-11-09-6.json.gz
       │   ├── 2015-11-09-7.json.gz
       │   ├── 2015-11-09-8.json.gz
       │   ├── 2015-11-09-9.json.gz
       │   ├── 2015-11-10-0.json.gz
       │   ├── 2015-11-10-1.json.gz
       │   ├── 2015-11-10-10.json.gz
       │   ├── 2015-11-10-11.json.gz
       │   ├── 2015-11-10-12.json.gz
       │   ├── 2015-11-10-13.json.gz
       │   ├── 2015-11-10-14.json.gz
       │   ├── 2015-11-10-15.json.gz
       │   ├── 2015-11-10-16.json.gz
       │   ├── 2015-11-10-17.json.gz
       │   ├── 2015-11-10-18.json.gz
       │   ├── 2015-11-10-19.json.gz
       │   ├── 2015-11-10-2.json.gz
       │   ├── 2015-11-10-20.json.gz
       │   ├── 2015-11-10-21.json.gz
       │   ├── 2015-11-10-22.json.gz
       │   ├── 2015-11-10-23.json.gz
       │   ├── 2015-11-10-3.json.gz
       │   ├── 2015-11-10-4.json.gz
       │   ├── 2015-11-10-5.json.gz
       │   ├── 2015-11-10-6.json.gz
       │   ├── 2015-11-10-7.json.gz
       │   ├── 2015-11-10-8.json.gz
       │   ├── 2015-11-10-9.json.gz
       │   ├── 2015-11-11-0.json.gz
       │   ├── 2015-11-11-1.json.gz
       │   ├── 2015-11-11-10.json.gz
       │   ├── 2015-11-11-11.json.gz
       │   ├── 2015-11-11-12.json.gz
       │   ├── 2015-11-11-13.json.gz
       │   ├── 2015-11-11-14.json.gz
       │   ├── 2015-11-11-15.json.gz
       │   ├── 2015-11-11-16.json.gz
       │   ├── 2015-11-11-17.json.gz
       │   ├── 2015-11-11-18.json.gz
       │   ├── 2015-11-11-19.json.gz
       │   ├── 2015-11-11-2.json.gz
       │   ├── 2015-11-11-20.json.gz
       │   ├── 2015-11-11-21.json.gz
       │   ├── 2015-11-11-22.json.gz
       │   ├── 2015-11-11-23.json.gz
       │   ├── 2015-11-11-3.json.gz
       │   ├── 2015-11-11-4.json.gz
       │   ├── 2015-11-11-5.json.gz
       │   ├── 2015-11-11-6.json.gz
       │   ├── 2015-11-11-7.json.gz
       │   ├── 2015-11-11-8.json.gz
       │   ├── 2015-11-11-9.json.gz
       │   ├── 2015-11-12-0.json.gz
       │   ├── 2015-11-12-1.json.gz
       │   ├── 2015-11-12-10.json.gz
       │   ├── 2015-11-12-11.json.gz
       │   ├── 2015-11-12-12.json.gz
       │   ├── 2015-11-12-13.json.gz
       │   ├── 2015-11-12-14.json.gz
       │   ├── 2015-11-12-15.json.gz
       │   ├── 2015-11-12-16.json.gz
       │   ├── 2015-11-12-17.json.gz
       │   ├── 2015-11-12-18.json.gz
       │   ├── 2015-11-12-19.json.gz
       │   ├── 2015-11-12-2.json.gz
       │   ├── 2015-11-12-20.json.gz
       │   ├── 2015-11-12-21.json.gz
       │   ├── 2015-11-12-22.json.gz
       │   ├── 2015-11-12-23.json.gz
       │   ├── 2015-11-12-3.json.gz
       │   ├── 2015-11-12-4.json.gz
       │   ├── 2015-11-12-5.json.gz
       │   ├── 2015-11-12-6.json.gz
       │   ├── 2015-11-12-7.json.gz
       │   ├── 2015-11-12-8.json.gz
       │   ├── 2015-11-12-9.json.gz
       │   ├── 2015-11-13-0.json.gz
       │   ├── 2015-11-13-1.json.gz
       │   ├── 2015-11-13-10.json.gz
       │   ├── 2015-11-13-11.json.gz
       │   ├── 2015-11-13-12.json.gz
       │   ├── 2015-11-13-13.json.gz
       │   ├── 2015-11-13-14.json.gz
       │   ├── 2015-11-13-15.json.gz
       │   ├── 2015-11-13-16.json.gz
       │   ├── 2015-11-13-17.json.gz
       │   ├── 2015-11-13-18.json.gz
       │   ├── 2015-11-13-19.json.gz
       │   ├── 2015-11-13-2.json.gz
       │   ├── 2015-11-13-20.json.gz
       │   ├── 2015-11-13-21.json.gz
       │   ├── 2015-11-13-22.json.gz
       │   ├── 2015-11-13-23.json.gz
       │   ├── 2015-11-13-3.json.gz
       │   ├── 2015-11-13-4.json.gz
       │   ├── 2015-11-13-5.json.gz
       │   ├── 2015-11-13-6.json.gz
       │   ├── 2015-11-13-7.json.gz
       │   ├── 2015-11-13-8.json.gz
       │   ├── 2015-11-13-9.json.gz
       │   ├── 2015-11-14-0.json.gz
       │   ├── 2015-11-14-1.json.gz
       │   ├── 2015-11-14-10.json.gz
       │   ├── 2015-11-14-11.json.gz
       │   ├── 2015-11-14-12.json.gz
       │   ├── 2015-11-14-13.json.gz
       │   ├── 2015-11-14-14.json.gz
       │   ├── 2015-11-14-15.json.gz
       │   ├── 2015-11-14-16.json.gz
       │   ├── 2015-11-14-17.json.gz
       │   ├── 2015-11-14-18.json.gz
       │   ├── 2015-11-14-19.json.gz
       │   ├── 2015-11-14-2.json.gz
       │   ├── 2015-11-14-20.json.gz
       │   ├── 2015-11-14-21.json.gz
       │   ├── 2015-11-14-22.json.gz
       │   ├── 2015-11-14-23.json.gz
       │   ├── 2015-11-14-3.json.gz
       │   ├── 2015-11-14-4.json.gz
       │   ├── 2015-11-14-5.json.gz
       │   ├── 2015-11-14-6.json.gz
       │   ├── 2015-11-14-7.json.gz
       │   ├── 2015-11-14-8.json.gz
       │   ├── 2015-11-14-9.json.gz
       │   ├── 2015-11-15-0.json.gz
       │   ├── 2015-11-15-1.json.gz
       │   ├── 2015-11-15-10.json.gz
       │   ├── 2015-11-15-11.json.gz
       │   ├── 2015-11-15-12.json.gz
       │   ├── 2015-11-15-13.json.gz
       │   ├── 2015-11-15-14.json.gz
       │   ├── 2015-11-15-15.json.gz
       │   ├── 2015-11-15-16.json.gz
       │   ├── 2015-11-15-17.json.gz
       │   ├── 2015-11-15-18.json.gz
       │   ├── 2015-11-15-19.json.gz
       │   ├── 2015-11-15-2.json.gz
       │   ├── 2015-11-15-20.json.gz
       │   ├── 2015-11-15-21.json.gz
       │   ├── 2015-11-15-22.json.gz
       │   ├── 2015-11-15-23.json.gz
       │   ├── 2015-11-15-3.json.gz
       │   ├── 2015-11-15-4.json.gz
       │   ├── 2015-11-15-5.json.gz
       │   ├── 2015-11-15-6.json.gz
       │   ├── 2015-11-15-7.json.gz
       │   ├── 2015-11-15-8.json.gz
       │   └── 2015-11-15-9.json.gz
       └── taxi
           ├── yellow_tripdata_2015-11.parquet
           └── yellow_tripdata_2015-12.parquet
   
   3 directories, 362 files
   ```





[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r882686122


##########
java/bench/core/src/java/org/apache/orc/bench/core/Driver.java:
##########
@@ -32,8 +34,14 @@ public class Driver {
 
   private static Map<String, OrcBenchmark> getBenchmarks() {
     Map<String, OrcBenchmark> result = new TreeMap<>();
-    for(OrcBenchmark bench: loader) {
-      result.put(bench.getName(), bench);
+    Iterator<OrcBenchmark> iterator = loader.iterator();

Review Comment:
   We need to try/catch each `.next()` call. I'm not aware of any way to do that with an `Iterable` other than to create an `Iterator` from it, but let me know if there is a way.
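A minimal sketch of the pattern described in this comment, not the code from the PR. It assumes `loader` is a `java.util.ServiceLoader`, whose iterator reports a broken provider by throwing `ServiceConfigurationError` from `hasNext()` or `next()`. The names `SafeBenchmarkLoading`, `collectByName`, and `flakyIterator` are hypothetical, and plain strings stand in for `OrcBenchmark` instances so the example runs on its own.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.ServiceConfigurationError;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper class; the real Driver.java is not reproduced here.
final class SafeBenchmarkLoading {

  // Walks the iterator by hand so that each hasNext()/next() call can be
  // guarded individually. A for-each loop would abort at the first error.
  static <T> Map<String, T> collectByName(Iterator<T> iterator,
                                          Function<T, String> nameOf) {
    Map<String, T> result = new TreeMap<>();
    while (true) {
      try {
        if (!iterator.hasNext()) {
          break;
        }
        T item = iterator.next();
        result.put(nameOf.apply(item), item);
      } catch (ServiceConfigurationError e) {
        // Skip the broken entry and keep going with the remaining ones.
        System.err.println("Skipping misconfigured entry: " + e.getMessage());
      }
    }
    return result;
  }

  // Stand-in for a ServiceLoader iterator (assumption for illustration):
  // yields three entries, the second of which fails to load.
  static Iterator<String> flakyIterator() {
    return new Iterator<String>() {
      private int index = 0;

      @Override
      public boolean hasNext() {
        return index < 3;
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        int current = index++;
        if (current == 1) {
          throw new ServiceConfigurationError("entry " + current + " is broken");
        }
        return "bench" + current;
      }
    };
  }

  public static void main(String[] args) {
    Map<String, String> loaded = collectByName(flakyIterator(), name -> name);
    System.out.println(loaded.keySet()); // prints [bench0, bench2]
  }
}
```

The loop relies on the iterator advancing past a failed entry. The `ServiceLoader` documentation describes that recovery as best effort, so a real implementation may want an upper bound on consecutive failures.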





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1137334530

   I'll do the second review by running your PR today. Thanks, @mwlon .




[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1191: Updated TLC Taxi Benchmark Dataset

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r884331174


##########
java/bench/README.md:
##########
@@ -24,7 +24,7 @@ To fetch the source data:
 
 ```% ./fetch-data.sh```
 
-> :warning: Script will fetch 7GB of data
+> :warning: Script will fetch 500MB of data

Review Comment:
   Ah you're right, I had downsized the github data. I'll update to 4GB





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r882344779


##########
java/bench/fetch-data.sh:
##########
@@ -15,8 +15,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 mkdir -p data/sources/taxi
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.csv | gzip > yellow_tripdata_2015-11.csv.gz )
-(cd data/sources/taxi; wget -O - https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv | gzip > yellow_tripdata_2015-12.csv.gz )
+(cd data/sources/taxi; wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-11.parquet )

Review Comment:
   BTW, could you update the following accordingly, because the new files are very small now?
   https://github.com/apache/orc/blob/1afc31d6c04729d7e194a6423c690af4519aab33/java/bench/README.md#L27





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881206816


##########
java/bench/core/src/java/org/apache/orc/bench/core/convert/GenerateVariants.java:
##########
@@ -221,8 +221,8 @@ public static BatchReader createReader(Path root,
                                          long salesRecords) throws IOException {
     switch (dataName) {
       case "taxi":
-        return new RecursiveReader(new Path(root, "sources/" + dataName), "csv",
-            schema, conf, CompressionKind.ZLIB);
+        return new RecursiveReader(new Path(root, "sources/" + dataName), "parquet",
+            schema, conf, CompressionKind.NONE);

Review Comment:
   So, never mind about Parquet downloading part. I'll revise those comments.





[GitHub] [orc] dongjoon-hyun commented on pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1141:
URL: https://github.com/apache/orc/pull/1141#issuecomment-1136712610

   BTW, I saw your comment on JIRA which is not a part of this PR description.
   > NYC Taxi dataset used in benchmarks no longer exists as CSV's; has been replaced with Parquet
   
   Actually, I used the data I had already downloaded, so I didn't realize that. In that case, I agree with you that we should download the Parquet data.




[GitHub] [orc] mwlon commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
mwlon commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881203809


##########
java/pom.xml:
##########
@@ -314,6 +314,7 @@
           </executions>
           <configuration>
             <failOnWarning>true</failOnWarning>
+            <ignoreNonCompile>true</ignoreNonCompile>

Review Comment:
   I believe so - the maven dependency plugin gives warnings for test-scoped dependencies not used in main (which causes packaging to fail) without this





[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1141: ORC-1189

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1141:
URL: https://github.com/apache/orc/pull/1141#discussion_r881210652


##########
java/bench/README.md:
##########
@@ -16,7 +16,7 @@ There are three sub-modules to try to mitigate dependency hell:
 To build this library, run the following in the parent directory:
 
 ```
-% ./mvnw clean package -Pbenchmark
+% ./mvnw clean package -Pbenchmark -Dmaven.test.skip

Review Comment:
   Please use `-DskipTests` if you want this.


