Posted to dev@avro.apache.org by "Harsh J Chouraria (JIRA)" <ji...@apache.org> on 2010/04/30 14:32:53 UTC

[jira] Created: (AVRO-534) AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema

AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema
-------------------------------------------------------------------------------

                 Key: AVRO-534
                 URL: https://issues.apache.org/jira/browse/AVRO-534
             Project: Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.4.0
         Environment: ArchLinux, JAVA 1.6, Apache Hadoop (0.20.2), Apache Avro (trunk -- 1.4.0 SNAPSHOT), Using Avro Generic API (JAVA)
            Reporter: Harsh J Chouraria
            Priority: Trivial
             Fix For: 1.4.0


Consider an Avro file of a single record type with about 70 fields, in the order (str, str, str, long, str, double, ...) [let's consider only the first 6].
To pass this into a simple MapReduce job I do AvroInputFormat.addInputPath(...), and it works well with an IdentityMapper.

Now I'd like to read only three fields, say fields 0, 1 and 3, so I supply a projection schema containing just those 3 fields as (str (0), str (1), long (2)) using AvroJob.setInputGeneric(..., mySchema). This makes the MapReduce job fail, because AvroRecordReader reads the file with its entire embedded schema (of 70 fields) and tries to decode the 'str' value sitting at index 2 of the actual schema as my 'long' field (meaning it's using the schema embedded in the file, not the one I supplied!).
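For concreteness, here's a minimal sketch of my setup (the record and field names are made up, and I'm assuming the exact setInputGeneric signature from trunk; note that Avro resolves record fields by name, so the projection's record name and field names must match those in the file's schema):

{code}
// Minimal sketch of the setup described above. Record/field names are
// hypothetical; the projection keeps the fields at positions 0, 1 and 3
// of the full 70-field schema, matched by name (and the record name must
// match the file's record name for schema resolution to work).
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ProjectionJobSetup {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ProjectionJobSetup.class);

    // Reader ("projection") schema: only the three fields we care about.
    Schema projection = Schema.parse(
        "{\"type\":\"record\", \"name\":\"MyRecord\", \"fields\":["
        + "{\"name\":\"f0\", \"type\":\"string\"},"
        + "{\"name\":\"f1\", \"type\":\"string\"},"
        + "{\"name\":\"f3\", \"type\":\"long\"}]}");

    AvroInputFormat.addInputPath(job, new Path(args[0])); // inherited from FileInputFormat
    AvroJob.setInputGeneric(job, projection);             // should make the reader use this schema
    // ... configure the mapper and output, then run the job
  }
}
{code}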

AvroRecordReader must support reading with the schema specified by the user via AvroJob.setInputGeneric.

I've written a patch that does this, but I'm not sure it's actually the right solution (should MAP_OUTPUT_SCHEMA be used instead?).
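
The idea, as a sketch (not the attached diff verbatim; I'm assuming the JobConf key that AvroJob.setInputGeneric stores the schema under is "avro.input.schema"):

{code}
// Sketch of the idea behind the fix. If the JobConf carries a
// user-supplied schema, hand it to the datum reader as the *expected*
// (reader) schema; the writer schema still comes from the file header,
// and Avro's schema resolution performs the projection.
// The config key "avro.input.schema" is an assumption.
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

class ReaderSchemaSketch<T> {
  DataFileReader<T> open(JobConf job, FileSplit split) throws IOException {
    GenericDatumReader<T> datumReader = new GenericDatumReader<T>();
    String json = job.get("avro.input.schema"); // set by AvroJob.setInputGeneric (assumed key)
    if (json != null)
      datumReader.setExpected(Schema.parse(json)); // reader schema; writer schema comes from the file
    return new DataFileReader<T>(new FsInput(split.getPath(), job), datumReader);
  }
}
{code}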

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-534) AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema

Posted by "Harsh J Chouraria (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862738#action_12862738 ] 

Harsh J Chouraria commented on AVRO-534:
----------------------------------------

Hello Doug,

Could you tell me, in simple points, how to go about doing that? I haven't been in Java development for long, but I'm willing to do this :)

I see a WordCount test for Avro in trunk; shall I extend that or write a custom one?



[jira] Commented: (AVRO-534) AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862732#action_12862732 ] 

Doug Cutting commented on AVRO-534:
-----------------------------------

This looks right.  Can you please add a test for this?  Thanks!



[jira] Commented: (AVRO-534) AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862747#action_12862747 ] 

Doug Cutting commented on AVRO-534:
-----------------------------------

> I see a WordCount test for Avro in trunk; shall I extend that or write a custom one?

We might add an @Test testProjection() method to TestWordCountGeneric that reads the job's output file with AvroRecordReader using a different schema.  We can add a new field to the schema with a default value, and remove one of the existing fields.  So the schema might look like:

{code}
{"type":"record", "name":"org.apache.avro.mapred.WordCount",
 "fields":[
     {"name":"count", "type":"int"},
     {"name":"rank", "type":"int", "default": -1}
 ]
}
{code}

Then we check, e.g., that there are the expected number of counts and that they sum to the expected total, that the rank is always -1, and that the field "word" is not present in the record.  How's that sound?
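
Something like the following, roughly (an untested sketch to slot into TestWordCountGeneric with the usual JUnit imports; the output path and the exact AvroRecordReader construction are assumptions based on the current trunk code):

{code}
// Untested sketch of the suggested test; the output location is
// hypothetical and would come from the word-count job run earlier
// in the test class.
@Test
public void testProjection() throws Exception {
  JobConf job = new JobConf();

  // Reader schema: drop "word", add "rank" with a default of -1.
  Schema readerSchema = Schema.parse(
      "{\"type\":\"record\", \"name\":\"org.apache.avro.mapred.WordCount\","
      + " \"fields\":["
      + "{\"name\":\"count\", \"type\":\"int\"},"
      + "{\"name\":\"rank\", \"type\":\"int\", \"default\": -1}]}");
  AvroJob.setInputGeneric(job, readerSchema);

  Path output = new Path("target/test/wordcount/out/part-00000.avro"); // hypothetical path
  long length = output.getFileSystem(job).getFileStatus(output).getLen();
  AvroRecordReader<GenericData.Record> reader =
      new AvroRecordReader<GenericData.Record>(job, new FileSplit(output, 0, length, job));

  AvroWrapper<GenericData.Record> key = reader.createKey();
  NullWritable value = reader.createValue();
  int records = 0, total = 0;
  while (reader.next(key, value)) {
    GenericData.Record r = key.datum();
    records++;
    total += (Integer) r.get("count");
    assertEquals(-1, ((Integer) r.get("rank")).intValue()); // filled in from the default
    assertEquals(2, r.getSchema().getFields().size());      // "word" was projected away
  }
  reader.close();
  // finally, assert 'records' and 'total' against the known word-count input
}
{code}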

Also, instead of deleting the output file after the test, we might delete it before the test runs so we don't get "attempt to overwrite" errors.  Leaving the file after the test runs also makes debugging easier.



[jira] Updated: (AVRO-534) AvroRecordReader (org.apache.avro.mapred) should support a JobConf-given schema

Posted by "Harsh J Chouraria (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh J Chouraria updated AVRO-534:
-----------------------------------

    Attachment: avro.mapreduce.r1.diff

Patch to fix the issue with AvroRecordReader.
