Posted to issues@orc.apache.org by "cxzl25 (via GitHub)" <gi...@apache.org> on 2024/03/01 12:55:50 UTC

[PR] ORC-1644: Add merge tool to merge multiple ORC files to produce a single ORC file [orc]

cxzl25 opened a new pull request, #1834:
URL: https://github.com/apache/orc/pull/1834

   ### What changes were proposed in this pull request?
   This PR adds a `merge` tool that merges multiple ORC files into a single ORC file.
   
   ### Why are the changes needed?
   In ORC 1.3.0, [ORC-132](https://issues.apache.org/jira/browse/ORC-132) introduced the `OrcFile#mergeFiles` method, which merges multiple ORC files into one ORC file.
   However, using it requires writing Java code that calls the method.
   There is no simple command that can be invoked directly.
   
   ### How was this patch tested?
   Added unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1834:
URL: https://github.com/apache/orc/pull/1834#issuecomment-1989482803

   Thank you, @williamhyun .




Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1511701304


##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * <p/>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p/>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.tools;
+
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.DefaultParser;
+import org.apache.commons.cli.HelpFormatter;
+import org.apache.commons.cli.Option;
+import org.apache.commons.cli.Options;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.orc.OrcFile;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Merges multiple ORC files that all have the same schema to produce a single ORC file.
+ */
+public class MergeFiles {
+
+  public static void main(Configuration conf, String[] args) throws Exception {
+    Options opts = createOptions();
+    CommandLine cli = new DefaultParser().parse(opts, args);
+    HelpFormatter formatter = new HelpFormatter();
+    if (cli.hasOption('h')) {
+      formatter.printHelp("merge", opts);
+      return;
+    }
+    String inputDir = cli.getOptionValue("inputDir");
+    if (inputDir == null || inputDir.isEmpty()) {
+      System.err.println("inputDir is null");
+      formatter.printHelp("merge", opts);
+      return;
+    }
+    String outputPath = cli.getOptionValue("outputPath");
+    if (outputPath == null || outputPath.isEmpty()) {
+      System.err.println("outputPath is null");
+      formatter.printHelp("merge", opts);
+      return;
+    }
+    boolean ignoreExtension = cli.hasOption("ignoreExtension");
+
+    List<Path> inputFiles = new ArrayList<>();
+    OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(conf);
+
+    Path rootPath = new Path(inputDir);
+    FileSystem fs = rootPath.getFileSystem(conf);
+    for (RemoteIterator<LocatedFileStatus> itr = fs.listFiles(rootPath, true); itr.hasNext(); ) {
+      LocatedFileStatus status = itr.next();
+      if (status.isFile() && (ignoreExtension || status.getPath().getName().endsWith(".orc"))) {
+        inputFiles.add(status.getPath());
+      }
+    }
+    if (inputFiles.isEmpty()) {
+      System.err.println("No files found.");
+      System.exit(1);
+    }
+
+    List<Path> mergedFiles = OrcFile.mergeFiles(new Path(outputPath), writerOptions, inputFiles);
+
+    List<Path> unSuccessMergedFiles = new ArrayList<>();
+    if (mergedFiles.size() != inputFiles.size()) {
+      Set<Path> mergedFilesSet = new HashSet<>(mergedFiles);
+      for (Path inputFile : inputFiles) {
+        if (!mergedFilesSet.contains(inputFile)) {
+          unSuccessMergedFiles.add(inputFile);
+        }
+      }
+    }
+
+    if (!unSuccessMergedFiles.isEmpty()) {
+      System.err.println("List of files that could not be merged:");
+      unSuccessMergedFiles.forEach(path -> System.err.println(path.toString()));
+    }
+
+    System.out.printf("Input directory: %s, Output path: %s, " +
+            "Input files size: %d, Merge files size: %d%n",
+        inputDir, outputPath, inputFiles.size(), mergedFiles.size());
+    if (!unSuccessMergedFiles.isEmpty()) {
+      System.exit(1);
+    }
+  }
+
+  private static Options createOptions() {
+    Options result = new Options();
+
+    result.addOption(Option.builder("id")
+        .longOpt("inputDir")
+        .desc("Input orc directory to be merged")
+        .hasArg()
+        .build());
+
+    result.addOption(Option.builder("op")
+        .longOpt("outputPath")

Review Comment:
   Why do we need two chars, `op`, when there are no conflicts with other options?





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1511696311


##########
java/tools/src/java/org/apache/orc/tools/Driver.java:
##########
@@ -94,6 +94,7 @@ public static void main(String[] args) throws Exception {
       System.err.println("   meta - print the metadata about the ORC file");
       System.err.println("   scan - scan the ORC file");
       System.err.println("   sizes - list size on disk of each column");
+      System.err.println("   merge - Merges multiple ORC files to produce a single ORC file");

Review Comment:
   Also, please revise the description as follows: (1) use lower case for the first character, (2) drop the `s` at the end of `Merges`, (3) simplify `to produce` to `into`.
   ```
   - Merges multiple ORC files to produce a single ORC file
   - merge multiple ORC files into a single ORC file
   ```





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1511700488


##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -0,0 +1,132 @@
+  private static Options createOptions() {
+    Options result = new Options();
+
+    result.addOption(Option.builder("id")

Review Comment:
   `id`? Is this consistent with the other Apache ORC tool arguments? This looks very misleading to me.





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.
cxzl25 commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1512453915


##########
java/tools/src/java/org/apache/orc/tools/Driver.java:
##########
@@ -94,6 +94,7 @@ public static void main(String[] args) throws Exception {
       System.err.println("   meta - print the metadata about the ORC file");
       System.err.println("   scan - scan the ORC file");
       System.err.println("   sizes - list size on disk of each column");
+      System.err.println("   merge - Merges multiple ORC files to produce a single ORC file");

Review Comment:
   Thanks, I fixed it.





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.
cxzl25 commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1512460364


##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -0,0 +1,132 @@
+  private static Options createOptions() {
+    Options result = new Options();
+
+    result.addOption(Option.builder("id")
+        .longOpt("inputDir")
+        .desc("Input orc directory to be merged")
+        .hasArg()
+        .build());
+
+    result.addOption(Option.builder("op")
+        .longOpt("outputPath")

Review Comment:
   I now use the `-o` and `--output` options, consistent with the other tool commands.
   
   https://github.com/apache/orc/blob/01ebb961ba30f25efd33777a5220225feedc45c2/java/tools/src/java/org/apache/orc/tools/KeyTool.java#L101-L103
   
   https://github.com/apache/orc/blob/01ebb961ba30f25efd33777a5220225feedc45c2/java/tools/src/java/org/apache/orc/tools/convert/ConvertTool.java#L257-L259
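
The command-line shape described in this comment (an output option plus any number of input paths) can be sketched with the standard library alone. This is only an illustration under stated assumptions: the class `MergeArgs` and its hand-rolled loop are hypothetical and stand in for the Apache Commons CLI parsing that the PR itself uses, and treating every non-option argument as an input path is an assumption about the revised tool, not something shown in the quoted diff.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the tool's option handling; not the PR's commons-cli code. */
final class MergeArgs {
  final String output;        // value of -o / --output, or null when absent
  final List<String> inputs;  // remaining positional arguments (assumed to be input paths)

  private MergeArgs(String output, List<String> inputs) {
    this.output = output;
    this.inputs = inputs;
  }

  static MergeArgs parse(String[] args) {
    String output = null;
    List<String> inputs = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (arg.equals("-o") || arg.equals("--output")) {
        // The option takes exactly one value: the next argument.
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("missing value for " + arg);
        }
        output = args[++i];
      } else {
        // Everything else is collected as an input file or directory.
        inputs.add(arg);
      }
    }
    return new MergeArgs(output, inputs);
  }

  public static void main(String[] args) {
    MergeArgs parsed = parse(args);
    System.out.println("output=" + parsed.output + ", inputs=" + parsed.inputs);
  }
}
```

A single-character short option paired with a descriptive long option keeps the flag easy to type while staying readable in scripts, which is the consistency point raised in the review.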





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.
cxzl25 commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1512455636


##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -0,0 +1,132 @@
+  private static Options createOptions() {
+    Options result = new Options();
+
+    result.addOption(Option.builder("id")
+        .longOpt("inputDir")
+        .desc("Input orc directory to be merged")

Review Comment:
   I removed the input path parameter. Now we can support multiple directories and files without this limitation.
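
Accepting a mix of files and directories amounts to expanding each argument into the regular files beneath it. The sketch below shows one way to do that; it is an assumption-laden illustration, not the PR's code. It uses `java.nio.file` on the local file system instead of the Hadoop `FileSystem` API seen in the quoted diff, the class name `MergeInputs` is hypothetical, and applying the `.orc` extension filter to explicitly named files as well as to directory contents is a choice made here for simplicity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: expands a mix of files and directories into a flat input list. */
final class MergeInputs {

  private MergeInputs() {}

  /**
   * @param args files and/or directories, in the order given
   * @param ignoreExtension when false, only names ending in ".orc" are kept
   */
  static List<Path> collect(List<String> args, boolean ignoreExtension) throws IOException {
    List<Path> inputs = new ArrayList<>();
    for (String arg : args) {
      Path root = Path.of(arg);
      // Files.walk on a regular file yields just that file, so files and
      // directories share one code path. Directories are walked recursively.
      try (Stream<Path> walk = Files.walk(root)) {
        walk.filter(Files::isRegularFile)
            .filter(p -> ignoreExtension || p.getFileName().toString().endsWith(".orc"))
            .sorted()
            .forEach(inputs::add);
      }
    }
    return inputs;
  }

  public static void main(String[] args) throws IOException {
    collect(List.of(args), false).forEach(System.out::println);
  }
}
```

With this shape, merging two individual files needs no temporary directory: both paths are passed directly and each expands to itself.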





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "williamhyun (via GitHub)" <gi...@apache.org>.
williamhyun closed pull request #1834: ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file
URL: https://github.com/apache/orc/pull/1834




Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1511697786


##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -0,0 +1,132 @@
+  private static Options createOptions() {
+    Options result = new Options();
+
+    result.addOption(Option.builder("id")
+        .longOpt("inputDir")
+        .desc("Input orc directory to be merged")

Review Comment:
   This looks like a limitation. Do you mean we cannot simply merge two files without putting them into a new artificial directory?





Re: [PR] ORC-1644: Add `merge` tool to merge multiple ORC files into a single ORC file [orc]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1834:
URL: https://github.com/apache/orc/pull/1834#discussion_r1511693010


##########
java/tools/src/java/org/apache/orc/tools/Driver.java:
##########
@@ -94,6 +94,7 @@ public static void main(String[] args) throws Exception {
       System.err.println("   meta - print the metadata about the ORC file");
       System.err.println("   scan - scan the ORC file");
       System.err.println("   sizes - list size on disk of each column");
+      System.err.println("   merge - Merges multiple ORC files to produce a single ORC file");

Review Comment:
   `Commands` are sorted in alphabetical order. Could you move this to the correct position?


