Posted to dev@orc.apache.org by omalley <gi...@git.apache.org> on 2017/06/16 15:57:43 UTC

[GitHub] orc pull request #131: ORC-199. Add convert from CSV.

GitHub user omalley opened a pull request:

    https://github.com/apache/orc/pull/131

    ORC-199. Add convert from CSV.

    This patch incorporates Carter's converter into the tool jar. I haven't made any tests yet.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/omalley/orc orc-199

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #131
    
----
commit 0389dccf8d8f223be52788163abdbf130d484455
Author: Owen O'Malley <om...@apache.org>
Date:   2017-05-23T20:54:44Z

    ORC-199. Add convert from CSV.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #131: ORC-199. Add convert from CSV.

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/131



[GitHub] orc pull request #131: ORC-199. Add convert from CSV.

Posted by spasam <gi...@git.apache.org>.
GitHub user spasam commented on a diff in the pull request:

    https://github.com/apache/orc/pull/131#discussion_r127508667
  
    --- Diff: java/tools/src/java/org/apache/orc/tools/convert/ConvertTool.java ---
    @@ -18,53 +18,178 @@
     package org.apache.orc.tools.convert;
     
     import org.apache.commons.cli.CommandLine;
    -import org.apache.commons.cli.GnuParser;
    +import org.apache.commons.cli.DefaultParser;
     import org.apache.commons.cli.HelpFormatter;
     import org.apache.commons.cli.Option;
     import org.apache.commons.cli.Options;
     import org.apache.commons.cli.ParseException;
     import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.fs.FSDataInputStream;
    +import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
     import org.apache.orc.OrcFile;
    +import org.apache.orc.Reader;
     import org.apache.orc.RecordReader;
     import org.apache.orc.TypeDescription;
     import org.apache.orc.Writer;
     import org.apache.orc.tools.json.JsonSchemaFinder;
     
     import java.io.IOException;
    +import java.io.InputStream;
    +import java.io.InputStreamReader;
    +import java.nio.charset.StandardCharsets;
    +import java.util.ArrayList;
    +import java.util.List;
    +import java.util.zip.GZIPInputStream;
     
     /**
    - * A conversion tool to convert JSON files into ORC files.
    + * A conversion tool to convert CSV or JSON files into ORC files.
      */
     public class ConvertTool {
    +  private final List<FileInformation> fileList;
    +  private final TypeDescription schema;
    +  private final char csvSeparator;
    +  private final char csvQuote;
    +  private final char csvEscape;
    +  private final int csvHeaderLines;
    +  private final String csvNullString;
    +  private final Writer writer;
    +  private final VectorizedRowBatch batch;
     
    -  static TypeDescription computeSchema(String[] filename) throws IOException {
    +  TypeDescription buildSchema(List<FileInformation> files,
    +                              Configuration conf) throws IOException {
         JsonSchemaFinder schemaFinder = new JsonSchemaFinder();
    -    for(String file: filename) {
    -      System.err.println("Scanning " + file + " for schema");
    -      schemaFinder.addFile(file);
    +    for(FileInformation file: files) {
    +      if (file.format == Format.JSON) {
    +        System.err.println("Scanning " + file.path + " for schema");
    +        schemaFinder.addFile(file.getReader(file.filesystem.open(file.path)));
    +      } else if (file.format == Format.ORC) {
    +        System.err.println("Merging schema from " + file.path);
    +        Reader reader = OrcFile.createReader(file.path,
    +            OrcFile.readerOptions(conf)
    +                .filesystem(file.filesystem));
    +        schemaFinder.addSchema(reader.getSchema());
    +      }
         }
         return schemaFinder.getSchema();
    --- End diff --
    
    This throws an NPE when no command-line options are specified other than the CSV file:
    
    ```
    Exception in thread "main" java.lang.NullPointerException
    	at org.apache.orc.tools.json.JsonSchemaFinder.getSchema(JsonSchemaFinder.java:321)
    	at org.apache.orc.tools.convert.ConvertTool.buildSchema(ConvertTool.java:75)
    
    ```
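
One way to avoid the NPE reported above would be to fail fast with a readable message when no schema is available. The sketch below is not the ConvertTool code: `SchemaGuard`, `inferSchema`, and `resolveSchema` are hypothetical names, and the inferred schema string is a placeholder. It assumes the NPE arises because a CSV-only input list gives the schema finder nothing to infer from.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not the actual ConvertTool implementation.
final class SchemaGuard {
  enum Format { CSV, JSON, ORC }

  // Stand-in for schema discovery: only JSON and ORC inputs can contribute
  // a schema, so a CSV-only list yields an empty result instead of null.
  static Optional<String> inferSchema(List<Format> formats) {
    for (Format format : formats) {
      if (format == Format.JSON || format == Format.ORC) {
        return Optional.of("struct<inferred:string>"); // placeholder schema
      }
    }
    return Optional.empty();
  }

  // Prefer an explicitly supplied schema; otherwise infer one, and report a
  // clear error rather than letting a null propagate.
  static String resolveSchema(String explicitSchema, List<Format> formats) {
    if (explicitSchema != null) {
      return explicitSchema;
    }
    return inferSchema(formats).orElseThrow(() ->
        new IllegalArgumentException(
            "No schema given and none could be inferred from the input files"));
  }
}
```

With a guard of this shape, a CSV-only invocation without a schema would end in an `IllegalArgumentException` carrying an explanatory message instead of a bare stack trace.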



[GitHub] orc issue #131: ORC-199. Add convert from CSV.

Posted by cartershanklin <gi...@git.apache.org>.
GitHub user cartershanklin commented on the issue:

    https://github.com/apache/orc/pull/131
  
    I think Owen mentioned he had incorporated the code into another tool; if so, we should close the ticket with the right pointer.
    
    The original project had customizable null strings and a strict mode that were put there with Postgres in mind. I'm interested to hear about your experience when you try it out.



[GitHub] orc issue #131: ORC-199. Add convert from CSV.

Posted by spasam <gi...@git.apache.org>.
GitHub user spasam commented on the issue:

    https://github.com/apache/orc/pull/131
  
    @omalley I am interested in testing this with PostgreSQL dump output (CSV and TEXT). Is something blocking you from merging this? Thanks.



[GitHub] orc issue #131: ORC-199. Add convert from CSV.

Posted by spasam <gi...@git.apache.org>.
GitHub user spasam commented on the issue:

    https://github.com/apache/orc/pull/131
  
    I was finally able to test this end to end. I had to tweak CsvReader and added custom converters for boolean and timestamp. I can upload the changes in a separate pull request.
    
    Ideally, the **json-schema** Driver command-line option would be renamed to **schema**, and the CSV schema would be determined on the fly (assuming there is a header line with column names). That would be awesome.
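
As a rough illustration of determining a CSV schema from a header line, the sketch below turns the column names into an ORC-style `struct<...>` schema string. It is not part of the patch: `CsvHeaderSchema` and `schemaFromHeader` are hypothetical names, every column is assumed to be `string` because a header carries no type information, and the header is assumed to contain no quoted or escaped separators.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical sketch: derive a schema string from a CSV header line.
final class CsvHeaderSchema {

  // Builds "struct<name1:string,name2:string,...>" from the header.
  // Assumes plain column names with no quoting; all columns become strings.
  static String schemaFromHeader(String headerLine, char separator) {
    if (headerLine == null || headerLine.trim().isEmpty()) {
      throw new IllegalArgumentException("CSV input has no header line");
    }
    StringJoiner fields = new StringJoiner(",", "struct<", ">");
    // Quote the separator so characters such as '|' are treated literally;
    // the -1 limit keeps trailing empty names so they can be rejected.
    String[] names = headerLine.split(Pattern.quote(String.valueOf(separator)), -1);
    for (String name : names) {
      String column = name.trim();
      if (column.isEmpty()) {
        throw new IllegalArgumentException("Empty column name in CSV header");
      }
      fields.add(column + ":string");
    }
    return fields.toString();
  }
}
```

A string in this form could then be handed to the schema parser (for example `TypeDescription.fromString` in ORC), with the custom boolean and timestamp converters mentioned above refining the column types where needed.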



[GitHub] orc pull request #131: ORC-199. Add convert from CSV.

Posted by spasam <gi...@git.apache.org>.
GitHub user spasam commented on a diff in the pull request:

    https://github.com/apache/orc/pull/131#discussion_r127508336
  
    --- Diff: java/tools/src/java/org/apache/orc/tools/convert/ConvertTool.java ---
    @@ -84,7 +225,22 @@ static CommandLine parseOptions(String[] args) throws ParseException {
         options.addOption(
             Option.builder("o").longOpt("output").desc("Output filename")
                 .hasArg().build());
    -    CommandLine cli = new GnuParser().parse(options, args);
    +    options.addOption(
    --- End diff --
    
    Could you document the default values for these options in the help text?

