Posted to dev@avro.apache.org by "Joe Crobak (JIRA)" <ji...@apache.org> on 2011/07/28 22:00:10 UTC

[jira] [Created] (AVRO-867) Allow tools to read files via hadoop FileSystem class

Allow tools to read files via hadoop FileSystem class
-----------------------------------------------------

                 Key: AVRO-867
                 URL: https://issues.apache.org/jira/browse/AVRO-867
             Project: Avro
          Issue Type: New Feature
          Components: java
            Reporter: Joe Crobak
            Assignee: Joe Crobak


It would be great if I could use the various tools to read/parse files that are in HDFS, S3, etc. via the [FileSystem|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html] API. We could retain backwards compatibility by assuming that unqualified URLs are "file://" while allowing files to be read from fully qualified URLs such as hdfs://. The required APIs are already part of the avro-tools uber jar to support the TetherTool.
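
A minimal sketch of the backwards-compatibility rule described above, using only the JDK: an argument with no scheme is treated as a local file, while a fully qualified URL is passed through unchanged. The class and method names (PathQualifier, qualify) are hypothetical and are not part of avro-tools.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper (not an avro-tools class): qualifies a command-line path. */
final class PathQualifier {

  /**
   * Returns the argument as a URI. Anything without a scheme, or anything
   * that does not parse as a URI at all (for example a name containing
   * spaces), is treated as a path on the local file system.
   */
  static URI qualify(String arg) {
    try {
      URI uri = new URI(arg);
      if (uri.getScheme() != null) {
        return uri; // already qualified, e.g. hdfs://...
      }
    } catch (URISyntaxException e) {
      // Not a valid URI; fall through and treat it as a local path.
    }
    return new File(arg).toURI(); // yields a file: URI for the local path
  }

  public static void main(String[] args) {
    System.out.println(qualify("users.avro"));
    System.out.println(qualify("hdfs://namenode/data/users.avro"));
  }
}
```

One limitation of this sketch: a Windows-style argument such as C:/data/users.avro parses as a URI whose scheme is "C", so a real implementation would probably need an extra check for drive letters.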

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-867) Allow tools to read files via hadoop FileSystem class

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073806#comment-13073806 ] 

Doug Cutting commented on AVRO-867:
-----------------------------------

That sounds great!  +1


        

[jira] [Commented] (AVRO-867) Allow tools to read files via hadoop FileSystem class

Posted by "Joe Crobak (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072520#comment-13072520 ] 

Joe Crobak commented on AVRO-867:
---------------------------------

bq. I assume you're proposing to move something like Util#fileOrStdin and #fileOrStdin into another module? That sounds reasonable. These could probably go into the mapred module, since it already depends on HDFS.

Ah, I hadn't realized that Util#fileOrStdin does exactly this. In that case, this is more about updating all the tools to use #fileOrStdin if that makes sense (e.g. DataFileReader and DataFileGetSchema don't use it).


        

[jira] [Commented] (AVRO-867) Allow tools to read files via hadoop FileSystem class

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072530#comment-13072530 ] 

Doug Cutting commented on AVRO-867:
-----------------------------------

If DataFileReader were to incorporate this, then the core Avro pom might depend on Hadoop.  Some have complained about this before, since Hadoop depends on Avro, creating a circular dependency.  (In practice this is not an issue as long as both provide some backwards compatibility.  Avro can build against an older, published version of Hadoop and vice-versa.)

Perhaps this could be implemented using reflection, e.g., something like:

Class.forName("org.apache.hadoop.fs.FileSystem").getMethod("open").invoke(...)

That way it'd work if Hadoop is on the classpath, but would not require a dependency on Hadoop.
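
The one-liner above is shorthand. A slightly fuller sketch of the reflection idea follows, assuming the standard Hadoop classes Configuration, Path and FileSystem (with FileSystem.get(URI, Configuration) and FileSystem.open(Path)) are what gets looked up. Method#getMethod needs the parameter types, so Path is resolved reflectively as well. The class name ReflectiveOpen and its helper methods are hypothetical, and in this sketch only arguments without a scheme are opened locally.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch: uses Hadoop only if it happens to be on the classpath. */
final class ReflectiveOpen {
  static final String HADOOP_FS = "org.apache.hadoop.fs.FileSystem";

  /** True if the named class can be loaded from the current classpath. */
  static boolean isAvailable(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Returns the URI scheme of the argument, or null if it has none. */
  static String schemeOf(String arg) {
    try {
      return new URI(arg).getScheme();
    } catch (URISyntaxException e) {
      return null; // not a URI, so treat it as a plain local path
    }
  }

  /** Opens a local path directly, and anything with a scheme via Hadoop. */
  static InputStream open(String arg) throws IOException {
    if (schemeOf(arg) == null) {
      return new FileInputStream(arg);
    }
    if (!isAvailable(HADOOP_FS)) {
      throw new IOException("Hadoop is not on the classpath; cannot open " + arg);
    }
    try {
      Class<?> confClass = Class.forName("org.apache.hadoop.conf.Configuration");
      Class<?> pathClass = Class.forName("org.apache.hadoop.fs.Path");
      Class<?> fsClass = Class.forName(HADOOP_FS);
      Object conf = confClass.getConstructor().newInstance();
      Object path = pathClass.getConstructor(String.class).newInstance(arg);
      // Equivalent to FileSystem.get(uri, conf).open(path), without a compile-time dependency.
      Object fs = fsClass.getMethod("get", URI.class, confClass)
          .invoke(null, URI.create(arg), conf);
      return (InputStream) fsClass.getMethod("open", pathClass).invoke(fs, path);
    } catch (ReflectiveOperationException e) {
      throw new IOException("Could not open " + arg + " via Hadoop", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("Hadoop available: " + isAvailable(HADOOP_FS));
  }
}
```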

As a middle ground, Hadoop could be required for compilation but only used at runtime when an HDFS URI is passed in.

Alternately, we might add a UriResolver interface and a base implementation that just works for local files.  Then Avro's mapred module could add an implementation that supports HDFS too.  The default factory might first look for an org.apache.avro.mapred.FileSystemResolver class, and, if that doesn't exist, use the base implementation.
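
A rough sketch of how the resolver idea could look. Only the names UriResolver and org.apache.avro.mapred.FileSystemResolver come from the comment above; the shape of the interface, the LocalFileResolver class and the Resolvers factory are assumptions made for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical factory for the UriResolver idea; not an existing Avro class. */
final class Resolvers {

  /** Assumed shape of the proposed interface: turn an argument into a stream. */
  interface UriResolver {
    InputStream open(String uri) throws IOException;
  }

  /** Base implementation that only understands local file paths. */
  static final class LocalFileResolver implements UriResolver {
    @Override
    public InputStream open(String uri) throws IOException {
      return new FileInputStream(uri);
    }
  }

  /** Class the mapred module would supply under this proposal. */
  static final String HADOOP_RESOLVER = "org.apache.avro.mapred.FileSystemResolver";

  /** Prefers the HDFS-aware resolver when present, else the local one. */
  static UriResolver defaultResolver() {
    return load(HADOOP_RESOLVER);
  }

  /** Instantiates the named resolver, falling back to LocalFileResolver. */
  static UriResolver load(String className) {
    try {
      return Class.forName(className)
          .asSubclass(UriResolver.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
      // Class missing, not a UriResolver, or not instantiable: use the base implementation.
      return new LocalFileResolver();
    }
  }

  public static void main(String[] args) {
    System.out.println(defaultResolver().getClass().getName());
  }
}
```

Compared with calling Hadoop reflectively at each call site, this keeps the reflection in one place, and the core module never has to name a Hadoop class.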


        

[jira] [Commented] (AVRO-867) Allow tools to read files via hadoop FileSystem class

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072516#comment-13072516 ] 

Doug Cutting commented on AVRO-867:
-----------------------------------

I assume you're proposing to move something like Util#fileOrStdin and #fileOrStdin into another module?  That sounds reasonable.  These could probably go into the mapred module, since it already depends on HDFS.


        

[jira] [Commented] (AVRO-867) Allow tools to read files via hadoop FileSystem class

Posted by "Joe Crobak (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072571#comment-13072571 ] 

Joe Crobak commented on AVRO-867:
---------------------------------

bq. If DataFileReader were to incorporate this, then the core Avro pom might depend on Hadoop. Some have complained about this before, since Hadoop depends on Avro, creating a circular dependency. (In practice this is not an issue as long as both provide some backwards compatibility. Avro can build against an older, published version of Hadoop and vice-versa.)

Sorry -- when I mentioned DataFileReader I really meant DataFileReaderTool (same goes for DataFileGetSchemaTool).  My thought was to modify DataFileReaderTool as follows...

Rather than:

{code}
GenericDatumReader<Object> reader = new GenericDatumReader<Object>();
FileReader<Object> fileReader =
      DataFileReader.openReader(new File(args.get(0)), reader);
...
for (Object datum : fileReader) {
  ...
}
{code}

use the DataFileStream like:

{code}
GenericDatumReader<Object> reader = new GenericDatumReader<Object>();
DataFileStream<Object> streamReader =
      new DataFileStream<Object>(Util.fileOrStdin(args.get(0)), reader);
...
for (Object datum : streamReader) {
 ...
}
{code}

There are a few other Tools that could be simplified with the usage of fileOrStdin, too. How does this sound?
