Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2007/01/26 22:50:49 UTC

[jira] Commented: (HADOOP-941) Make Hadoop Record I/O Easier to use outside Hadoop

    [ https://issues.apache.org/jira/browse/HADOOP-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467928 ] 

Doug Cutting commented on HADOOP-941:
-------------------------------------

> generate class code for records that would not have Hadoop dependency on WritableComparable interface

I don't understand the motivation for this.  The Writable and WritableComparable interfaces are small and standalone.  Having record-defined classes that cannot easily be used with the rest of Hadoop seems like it could cause more confusion than implementing these interfaces does.
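For reference, the two interfaces (in org.apache.hadoop.io) are essentially just:

  public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  public interface WritableComparable extends Writable, Comparable {
  }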

> As part of Hadoop build process, produce a tar bundle for Record I/O alone.

Are you proposing a separate release cycle or just separate release artifacts?  If a separate release cycle, then this should be placed in a separate sub-project.  Each sub-project requires a diverse developer community, which I'm not sure record I/O alone has.

As a counter-proposal, we might consider splitting Hadoop's jar into a layered set of jars: hadoop-common (util, config, io, record), hadoop-fs, hadoop-mapred, and perhaps others, all still bundled into a single release artifact.  The javadoc could be structured into corresponding sections.  I suspect that many users of records would also find SequenceFile useful.
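Roughly (a sketch of the layering, not exact package lists):

  hadoop-common.jar   util, config, io, record
  hadoop-fs.jar       fs      (depends on hadoop-common)
  hadoop-mapred.jar   mapred  (depends on hadoop-common and hadoop-fs)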

More generally, what is the goal?  Is the Hadoop jar file, at 1MB, unwieldy?  Or is there some other problem?  Creating more release artifacts could create more confusion (will folks know what to download?).  Multiple jars are less confusing, since most developers don't deal explicitly with jar files, but just use the set included in the download.  Should metrics also be released separately?  How about fs without mapred?  There are just too many possible ways of bundling things.  I think the general mapping is one project, one release artifact, unless there's a very strong reason otherwise, and I've not yet seen one here.

> Make Hadoop Record I/O Easier to use outside Hadoop
> ---------------------------------------------------
>
>                 Key: HADOOP-941
>                 URL: https://issues.apache.org/jira/browse/HADOOP-941
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: record
>    Affects Versions: 0.10.1
>         Environment: All
>            Reporter: Milind Bhandarkar
>         Assigned To: Milind Bhandarkar
>             Fix For: 0.11.0
>
>
> Hadoop record I/O can be used effectively outside of Hadoop. It would increase its utility if developers could use it without having to import Hadoop classes or depend on Hadoop jars. The following changes to the current translator and runtime are proposed.
> Proposed Changes:
> 1. Use java.lang.String as the native type for ustring (instead of Text).
> 2. Provide a Buffer class as a native Java type for buffer (instead of BytesWritable), so that BytesWritable could later be implemented with the following DDL:
> module org.apache.hadoop.io {
>   record BytesWritable {
>     buffer value;
>   }
> }
> 3. Member names in generated classes should not have an 'm' prefix. In the above example, the private member name would be 'value', not 'mvalue' as it is now.
> 4. Convert getters and setters to CamelCase, e.g. in the above example the getter would be:
>   public Buffer getValue();
> 5. Provide a 'swiggable' C binding, so that processing the generated C code with swig allows it to be used in scripting languages such as Python and Perl.
> 6. The default --language="java" target would generate class code for records that would not have a Hadoop dependency on the WritableComparable interface, but would instead declare "implements Record, Comparable" (i.e. it would not have write() and readFields() methods). An additional option, "--writable", would need to be specified on the rcc command line to generate classes that declare "implements Record, WritableComparable". (The two generated class headers are sketched below.)
> 7. Optimize generated write() and readFields() methods, so that they do not have to create a BinaryOutputArchive or BinaryInputArchive every time these methods are called on a record. (One possible shape is sketched below.)
> 8. Implement ByteInStream and ByteOutStream for the C++ runtime, as they will be needed for using Hadoop Record I/O with the forthcoming C++ MapReduce framework (currently, only FileStreams are provided).
> 9. Generate clone() methods for records in Java, i.e. the generated classes should implement Cloneable.
> 10. As part of the Hadoop build process, produce a tar bundle for Record I/O alone. This tar bundle will contain the translator classes and ant task (lib/rcc.jar), the translator script (bin/rcc), the Java runtime (recordio.jar) that includes org.apache.hadoop.record.*, sources for the Java runtime (src/java), and C/C++ runtime sources with Makefiles (src/c++, src/c).
> 11. Make the generated Java code for maps and vectors use Java generics. (See the example below.)
> These are the proposed user-visible changes. Internally, the translator will be restructured so that it is easier to plug in translators for different targets.
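> To illustrate (6), the generated class headers would differ roughly as follows (a sketch, not exact generator output):
>
>   // default: rcc --language="java"
>   public class BytesWritable implements Record, Comparable { ... }
>
>   // with the additional --writable option
>   public class BytesWritable implements Record, WritableComparable { ... }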
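> For (7), one possible shape for the optimized write() method, reusing a cached archive rather than allocating one per call (a sketch; the caching helper BinaryOutputArchive.get() is hypothetical):
>
>   public void write(DataOutput out) throws IOException {
>     // hypothetical per-stream cache, instead of 'new BinaryOutputArchive(out)' on every call
>     BinaryOutputArchive archive = BinaryOutputArchive.get(out);
>     serialize(archive, "");  // delegate to the record's archive-based serialization
>   }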
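> And for (11), a DDL field such as 'vector<ustring> names;' would then generate a typed member, e.g.
>
>   private ArrayList<String> names;
>
> rather than a raw ArrayList (given (1), ustring maps to java.lang.String).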
