Posted to dev@orc.apache.org by omalley <gi...@git.apache.org> on 2016/04/20 08:08:34 UTC

[GitHub] orc pull request: ORC-1 Import of ORC code from Hive.

GitHub user omalley opened a pull request:

    https://github.com/apache/orc/pull/23

    ORC-1 Import of ORC code from Hive.

    This patch pulls the current Java code for the ORC reader and writer out of Hive. Under the java directory there are three modules:
    * storage-api - a temporary copy of Hive's storage API until Hive releases a version with the changes we need
    * core - the core vectorized reader and writer
    * mapreduce - an implementation of InputFormat and OutputFormat that uses core to read and write row by row

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/omalley/orc orc-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/23.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23
    

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request: ORC-1 Import of ORC code from Hive.

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/23



[GitHub] orc pull request: ORC-1 Import of ORC code from Hive.

Posted by nahguam <gi...@git.apache.org>.
Github user nahguam commented on the pull request:

    https://github.com/apache/orc/pull/23#issuecomment-214777817
  
    Hi, what's the best place for comments?
    
    The diff is too big to comment directly so I'll list a couple here:
    
    1. OrcRecordReader.next L94 - seems to be a dead-end?
    
    2. How is one to use OrcStruct as an end user? getFieldValue and setFieldValue are package private. Previously you'd access via the StructObjectInspector, but we seem to have no ObjectInspectors or StructFields.
    
        FileInputFormat.setInputPaths(conf, path);
        OrcInputFormat<OrcStruct> inputFormat = new OrcInputFormat<>();
        InputSplit[] splits = inputFormat.getSplits(conf, 1);
        RecordReader<NullWritable, OrcStruct> recordReader = inputFormat.getRecordReader(splits[0], conf, null);
    
        NullWritable key = recordReader.createKey();
        OrcStruct value = recordReader.createValue();
    
        while (recordReader.next(key, value)) {
          // How do I interrogate value for its fields' values?
        }
    
        recordReader.close();
        
    3. Perhaps for another ticket, but it would be nice to have a mechanism to access a struct's fields by name as well as by index.
    
    4. Is there any particular reason that the value generic type is V extends Writable instead of OrcStruct?



[GitHub] orc pull request: ORC-1 Import of ORC code from Hive.

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the pull request:

    https://github.com/apache/orc/pull/23#issuecomment-214806455
  
    1. You're right that you can't actually pass null in to value. *smile*
    2. I made the accessors public.
    3. I added getters and setters that take the field name.
    4. The input format doesn't need to have a struct as the root type. Look at TestOrcOutputFormat.testLongRoot for an example.
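
To illustrate points 2 and 3, here is a minimal, self-contained sketch of a struct value that exposes public accessors by index and by field name. `SketchStruct` and all of its members are hypothetical stand-ins written for this illustration; this is not the code from the patch, and the real `OrcStruct` signatures may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a struct value; NOT the real OrcStruct class.
final class SketchStruct {
    // Maps each field name to its position in the values array.
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final Object[] values;

    SketchStruct(String... fieldNames) {
        for (int i = 0; i < fieldNames.length; i++) {
            indexByName.put(fieldNames[i], i);
        }
        this.values = new Object[fieldNames.length];
    }

    // Index-based accessors (point 2: public rather than package private).
    public Object getFieldValue(int index) {
        return values[index];
    }

    public void setFieldValue(int index, Object value) {
        values[index] = value;
    }

    // Name-based accessors (point 3): resolve the name, then delegate.
    public Object getFieldValue(String name) {
        return values[indexOf(name)];
    }

    public void setFieldValue(String name, Object value) {
        values[indexOf(name)] = value;
    }

    public int getNumFields() {
        return values.length;
    }

    private int indexOf(String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return index;
    }
}
```

With accessors like these, the body of a read loop can inspect a row either way, for example `row.getFieldValue(0)` or `row.getFieldValue("name")`, assuming the row has a field called `name`.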



[GitHub] orc pull request: ORC-1 Import of ORC code from Hive.

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the pull request:

    https://github.com/apache/orc/pull/23#issuecomment-215112415
  
    1. The storage-api is a clone of Hive's until they release a version that has the bits that we need. So the method VectorizedRowBatch.toUTF8 is used by Hive.
    2. Done.
    3. Done.
    4. Done.
    
    I also moved TestOrcOutputFormat to a different package so that it can only access the public API, which should prevent similar problems.
    
    Thanks for your reviews.



[GitHub] orc pull request: ORC-1 Import of ORC code from Hive.

Posted by nahguam <gi...@git.apache.org>.
Github user nahguam commented on the pull request:

    https://github.com/apache/orc/pull/23#issuecomment-215073080
  
    Excellent, thanks!
    
    I've just been going over all the types and have a few more:
    
    1. `VectorizedRowBatch.toUTF8` appears to be unused
    2. `OrcMap` & `OrcList` - please could the constructors be public?
    3. `OrcList` - please could we have another constructor so we can pass in the initialCapacity as per `ArrayList`?
    4. `OrcUnion` - please could the class itself and the `set` method be public?
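
For request 3, the shape being asked for might look roughly like the sketch below: a list type with a public default constructor plus one that forwards an initial capacity to `ArrayList(int)`. `SketchList` is a hypothetical name used only for illustration; it is not the `OrcList` class from the patch, whose constructors may take other arguments.

```java
import java.util.ArrayList;

// Hypothetical stand-in for a list value; NOT the real OrcList class.
class SketchList<E> extends ArrayList<E> {
    // Public default constructor, so code outside the package can create one.
    public SketchList() {
        super();
    }

    // Pre-sizes the backing array, mirroring ArrayList(int initialCapacity).
    public SketchList(int initialCapacity) {
        super(initialCapacity);
    }
}
```

Pre-sizing only avoids repeated resizing when the element count is known up front; the list still grows on demand and behaves the same either way.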
    
    I'm just looking at the Formats/RecordReader/RecordWriter now.

