Posted to dev@avro.apache.org by "Steven Willis (JIRA)" <ji...@apache.org> on 2014/06/06 15:16:02 UTC

[jira] [Commented] (AVRO-1130) MapReduce Jobs can output write SortedKeyValueFiles directly

    [ https://issues.apache.org/jira/browse/AVRO-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019825#comment-14019825 ] 

Steven Willis commented on AVRO-1130:
-------------------------------------

I'd very much like this as well. How do you imagine the output would be structured? A normal {{SortedKeyValueFile}} is a single directory containing exactly two files, {{data}} and {{index}}. With a MapReduce job that has multiple reducers, I wonder how this should look.

Maybe:

{noformat}
output_path/data-part-00000
output_path/data-part-00001
output_path/data-part-00002
output_path/index-part-00000
output_path/index-part-00001
output_path/index-part-00002
{noformat}

But then if you wanted to treat {{output_path}} as a {{SortedKeyValueFile}}, you'd have to modify the code to allow for multiple data and index files. Perhaps any directory containing exactly the same number of {{data*}} and {{index*}} files could be treated as an {{SKVF}}, as long as the trailing portion of each {{data}} filename matches the trailing portion of an {{index}} filename.
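
A minimal sketch of that matching rule, in plain Java with no Avro or Hadoop dependency. The class and method names are hypothetical, and ignoring files that start with neither prefix (such as a {{_SUCCESS}} marker) is an assumption, not something the existing code does:

```java
import java.util.Collection;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical helper; not part of Avro.
class MultiPartSkvfLayout {
    static final String DATA_PREFIX = "data";
    static final String INDEX_PREFIX = "index";

    /**
     * Given the file names found in one directory, returns the sorted set of
     * suffixes shared by the data* and index* files, or empty if the two
     * sets of suffixes differ (or if there are no data/index files at all).
     */
    static Optional<TreeSet<String>> matchedSuffixes(Collection<String> fileNames) {
        TreeSet<String> dataSuffixes = new TreeSet<>();
        TreeSet<String> indexSuffixes = new TreeSet<>();
        for (String name : fileNames) {
            if (name.startsWith(DATA_PREFIX)) {
                dataSuffixes.add(name.substring(DATA_PREFIX.length()));
            } else if (name.startsWith(INDEX_PREFIX)) {
                indexSuffixes.add(name.substring(INDEX_PREFIX.length()));
            }
            // Assumption: any other file in the directory is ignored.
        }
        // Equal suffix sets imply equal counts and a one-to-one pairing.
        boolean paired = !dataSuffixes.isEmpty() && dataSuffixes.equals(indexSuffixes);
        return paired ? Optional.of(dataSuffixes) : Optional.empty();
    }
}
```

Under this rule a plain {{data}}/{{index}} pair also qualifies, with an empty suffix, so the existing single-file layout would remain a special case of the multi-part one.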

Or would something like this be better:

{noformat}
output_path/part-00000/data
output_path/part-00000/index
output_path/part-00001/data
output_path/part-00001/index
output_path/part-00002/data
output_path/part-00002/index
{noformat}

That way, each part is an {{SKVF}} and works with the existing code. But then you wouldn't be able to treat {{output_path}} as an {{SKVF}}. Maybe the new {{SKVFInputFormat}} could allow the input path to be either an {{SKVF}} directory or a directory containing {{SKVF}} directories.
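
A rough sketch of how an input path might be resolved under this second layout. It uses {{java.nio.file}} on the local filesystem as a stand-in for the Hadoop {{FileSystem}} API that a real input format would use, and the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper; not part of Avro.
class SkvfInputPaths {
    /** True if the directory directly contains both a data file and an index file. */
    static boolean isSkvfDir(Path dir) {
        return Files.isRegularFile(dir.resolve("data"))
            && Files.isRegularFile(dir.resolve("index"));
    }

    /**
     * Returns the input itself if it is an SKVF directory; otherwise returns
     * its immediate child directories that are SKVF directories, sorted by path.
     */
    static List<Path> resolve(Path input) throws IOException {
        if (isSkvfDir(input)) {
            return Collections.singletonList(input);
        }
        List<Path> parts = new ArrayList<>();
        try (Stream<Path> children = Files.list(input)) {
            children.filter(Files::isDirectory)
                    .filter(SkvfInputPaths::isSkvfDir)
                    .sorted()
                    .forEach(parts::add);
        }
        return parts;
    }
}
```

Each returned path could then be opened with the existing single-directory reader, which is the appeal of this layout.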

I think I'd lean towards the first approach myself.

> MapReduce Jobs can output write SortedKeyValueFiles directly
> ------------------------------------------------------------
>
>                 Key: AVRO-1130
>                 URL: https://issues.apache.org/jira/browse/AVRO-1130
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>    Affects Versions: 1.7.1
>            Reporter: Jeremy Lewi
>            Assignee: Harsh J
>            Priority: Minor
>
> It would be nice if MapReduce jobs could write directly to SortedKeyValueFile's.
> harsh@'s response on this thread http://goo.gl/OT1rN for some more information on what needs to be done.



--
This message was sent by Atlassian JIRA
(v6.2#6252)