Posted to dev@avro.apache.org by "Priyo Mustafi (JIRA)" <ji...@apache.org> on 2013/02/01 01:13:13 UTC

[jira] [Commented] (AVRO-1215) AvroMultipleOutputs not working when specifying baseOutputPath

    [ https://issues.apache.org/jira/browse/AVRO-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568269#comment-13568269 ] 

Priyo Mustafi commented on AVRO-1215:
-------------------------------------

Hi Ashish,
Are you planning a new API?  If so, would your patch also cover these two overloads, which I use?
    public void write(String namedOutput, Object key, Object value, String baseOutputPath)
    public void write(Object key, Object value, String baseOutputPath) 

I just found another issue.  I have a MapReduce job that writes records with different schemas to different outputs.  My driver registers each output using this method in AvroMultipleOutputs.  The mapper's output schema is a union, because each mapper can write to several named outputs, each with its own schema.

  public static void addNamedOutput(Job job, String namedOutput, Class<? extends OutputFormat> outputFormatClass,
            Schema keySchema, Schema valueSchema)

When I run locally, everything works fine: each output Avro file has its own key and value schema.  When I run on a cluster, all the Avro files have the union schema that is set on the main output; otherwise I get a ClassCastException.

Looking deeper into AvroMultipleOutputs, I notice that the addNamedOutput() registration method above adds the key and value schemas to a "private static" map, and these schemas are never added to the configuration.  So this map cannot be populated on the reducer side: when the reducer looks up the schemas using namedOutput+"_KEYSCHEMA" etc., they are not there, and it falls back to the main output schema, which in my case is the union.  If we put these schemas in the configuration and read them back on the reducer, this should work.
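
To make the suggestion concrete, here is a minimal sketch of the idea, not the actual AvroMultipleOutputs code.  It assumes a plain Map<String, String> as a stand-in for the Hadoop job Configuration and passes schemas around as JSON strings (in real code these would presumably come from Schema.toString() and be read back with new Schema.Parser().parse(...)).  The class name, the method names and the "_VALSCHEMA" suffix are hypothetical; only the "_KEYSCHEMA" suffix is taken from the lookup described above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map<String, String> stands in for the job Configuration,
// which (unlike a private static map in the driver JVM) is shipped to tasks.
class NamedOutputSchemas {
    // "_KEYSCHEMA" is the suffix mentioned above; "_VALSCHEMA" is a hypothetical counterpart.
    static final String KEY_SUFFIX = "_KEYSCHEMA";
    static final String VALUE_SUFFIX = "_VALSCHEMA";

    // Driver side: store both schemas (as JSON text) under keys derived from the named output.
    static void setSchemas(Map<String, String> conf, String namedOutput,
                           String keySchemaJson, String valueSchemaJson) {
        conf.put(namedOutput + KEY_SUFFIX, keySchemaJson);
        conf.put(namedOutput + VALUE_SUFFIX, valueSchemaJson);
    }

    // Task side: read the key schema back, falling back to the main output schema if absent.
    static String getKeySchema(Map<String, String> conf, String namedOutput,
                               String mainOutputSchemaJson) {
        return conf.getOrDefault(namedOutput + KEY_SUFFIX, mainOutputSchemaJson);
    }

    // Task side: same lookup for the value schema.
    static String getValueSchema(Map<String, String> conf, String namedOutput,
                                 String mainOutputSchemaJson) {
        return conf.getOrDefault(namedOutput + VALUE_SUFFIX, mainOutputSchemaJson);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        String unionSchema = "[\"string\",\"long\"]"; // main output schema (a union)

        // Driver registers one named output with its own schemas.
        setSchemas(conf, "clicks", "\"string\"", "\"long\"");

        // A registered output resolves to its own schema; an unregistered one falls back to the union.
        System.out.println(getKeySchema(conf, "clicks", unionSchema));
        System.out.println(getKeySchema(conf, "views", unionSchema));
    }
}
```

The point of the sketch is only that the lookup on the task side reads from something that travels with the job, so the fallback to the main output schema happens only for outputs that were never registered.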
                
> AvroMultipleOutputs not working when specifying baseOutputPath
> --------------------------------------------------------------
>
>                 Key: AVRO-1215
>                 URL: https://issues.apache.org/jira/browse/AVRO-1215
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.2
>            Reporter: Matthew Hayes
>            Assignee: Ashish Nagavaram
>              Labels: avro, mapreduce
>         Attachments: avro-1215.patch, AVRO-1215.patch, AVRO-1215-v2.patch, AVRO-1215-v3.patch
>
>
> I'm calling the write() method of AvroMultipleOutputs which takes the baseOutputPath.  The reducer appears to begin hanging once it tries writing to a baseOutputPath value not already encountered.  It then fails with:
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file ... because current leaseholder is trying to recreate file.
> I think the problem has to do with this line in AvroMultipleOutputs:
> {code}
> // get the record writer from context output format
> //FileOutputFormat.setOutputName(taskContext, baseFileName);
> {code}
> This line is not commented out in the similar code from Hadoop.  So I think the baseOutputPath is ignored.  As a result when each record writer is created it uses the same path, leading to the exception.
> Uncommenting this line does not work because of visibility of the method.  However what this method does is set "mapreduce.output.basename".  But setting this doesn't work either.  
> After digging through Avro code I found that AvroOutputFormatBase is using "avro.mo.config.namedOutput" to create the path.  If I replace the commented out line with this it seems to work:
> {code}
> taskContext.getConfiguration().set("avro.mo.config.namedOutput", baseFileName);  
> {code}
