Posted to user@avro.apache.org by Lurga <lu...@gmail.com> on 2010/04/13 03:13:45 UTC

Question on writing/reading file with different schema

Hello,
I create a "Person" record (3 fields: first, last, age) and an "Extract" record (2 fields: first, last). Then I use "Person" to write some objects to a file. When I use "Extract" to read data from the file, I get an exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -14.
It seems like GenericDatumReader.readRecord won't skip the trailing fields. How can I read the data correctly?

My code is below:
public void browseName() throws IOException {
  List<Field> fields = new ArrayList<Field>();
  fields.add(new Field("First", Schema.create(Type.STRING), null, null));
  fields.add(new Field("Last", Schema.create(Type.STRING), null, null)); 
  Schema extractSchema = Schema.createRecord(fields);
  DataFileReader<Record> reader = new DataFileReader<Record>(new File(
    fileName), new GenericDatumReader<Record>(extractSchema));
  try {
    while (reader.hasNext()) {
      Record person = reader.next();
      System.out.print(person.get("First").toString() + " " + person.get("Last").toString() + "\t");
    }
  } finally {
    reader.close();
  }
}

Regards,

2010-04-13 



Lurga 


Re: Question on writing/reading file with different schema

Posted by Scott Carey <sc...@richrelevance.com>.
On Apr 12, 2010, at 7:42 PM, Scott Carey wrote:

> Try something like:
> 
> GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
> dr.setExpected(extractSchema);
> DataFileReader<Record> reader = new DataFileReader<Record>(new File(
>  fileName), dr);
> 

If the above still fails, please provide the full stack trace that results.


Re: Question on writing/reading file with different schema

Posted by Scott Carey <sc...@richrelevance.com>.
So a concrete example of the workaround is to change:

{code}
public void browseName() throws IOException {
 List<Field> fields = new ArrayList<Field>();
 fields.add(new Field("First", Schema.create(Type.STRING), null, null));
 fields.add(new Field("Last", Schema.create(Type.STRING), null, null)); 
 Schema extractSchema = Schema.createRecord(fields);
 DataFileReader<Record> reader = new DataFileReader<Record>(new File(
   fileName), new GenericDatumReader<Record>(extractSchema));
 try {
   while (reader.hasNext()) {
     Record person = reader.next();
     System.out.print(person.get("First").toString() + " " + person.get("Last").toString() + "\t");
   }
 } finally {
   reader.close();
 }
}
{code}

to:

{code}
public void browseName() throws IOException {
 DataFileReader<Record> reader = new DataFileReader<Record>(new File(
   fileName), new GenericDatumReader<Record>());
 try {
   while (reader.hasNext()) {
     Record person = reader.next();
     System.out.print(person.get("First").toString() + " " + person.get("Last").toString() + "\t");
   }
 } finally {
   reader.close();
 }
}
{code}
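
For reference, run against the three records written by the AddressBook example later in this thread, this version should print all three names on one line (tab-separated), roughly:

{code}
Dante Hicks	Randal Graves	Steve Jobs	
{code}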



Re: Question on writing/reading file with different schema

Posted by Scott Carey <sc...@richrelevance.com>.
It appears there is a bug in the ResolvingDecoder when the actual schema has trailing fields not in the expected schema.  I have not had time to track it down.  I filed a JIRA ticket: https://issues.apache.org/jira/browse/AVRO-517.

I have a suggested work-around.  You probably don't want to explicitly use a different reader schema than the file's.  The primary use case of schema resolution is schema evolution and migration; most of the time, a single version of an application will want to use a single schema to represent the data.
If you simply want to read two of the three fields, read two of the three fields from the full schema -- don't define a schema with only two of them.
Every client can use the full "Person" schema, but wrapper classes or helper methods can read just the subset of fields they want, as in the sketch below.
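
A minimal sketch of that approach (a hypothetical helper class; the names are illustrative, and it assumes the file was written with the full personSchema, so no explicit reader schema is needed):

{code}
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericDatumReader;

public class NameBrowser {
  /** Returns "First Last" for each record; Age is decoded but simply ignored. */
  public static List<String> readNames(File file) throws IOException {
    List<String> names = new ArrayList<String>();
    // No schema passed: the reader uses the schema embedded in the file,
    // so every written field is consumed and the stream never gets misaligned.
    DataFileReader<Record> reader = new DataFileReader<Record>(
        file, new GenericDatumReader<Record>());
    try {
      while (reader.hasNext()) {
        Record person = reader.next();
        names.add(person.get("First") + " " + person.get("Last"));
      }
    } finally {
      reader.close();
    }
    return names;
  }
}
{code}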

In your example, browseAge() and browseName() can use personSchema; there is no need to create or manage the other two schemas.  I am not sure whether that applies to your real-world usage, but it likely does.

-Scott



Re: Re: Question on writing/reading file with different schema

Posted by Lurga <lu...@gmail.com>.
Thanks a lot!
I'm a newbie. I want to learn how to use avro/java, but there are few examples. When I followed the example in [1], I ran into this problem.
[1] http://hadoop.apache.org/avro/docs/current/api/c/index.html#_examples 

2010-04-16 



Lurga 




Re: Re: Question on writing/reading file with different schema

Posted by Lurga <lu...@gmail.com>.
Hi,
Here is the output (browseAge() succeeds; browseName() fails after the first record):
27
20
31
Dante Hicks	
Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.EOFException
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:184)
	at cn.znest.test.avro.AddressBook.browseName(AddressBook.java:91)
	at cn.znest.test.avro.AddressBook.main(AddressBook.java:43)
Caused by: java.io.EOFException
	at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:163)
	at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:262)
	at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:93)
	at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:277)
	at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:271)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:105)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:77)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:70)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:195)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:182)
	... 2 more

My code is below. In this example, I create three record schemas: person (3 fields: First, Last, Age), age (Age only), and extract (First, Last). The "age" schema keeps the last field of "Person", so AddressBook.browseAge() executes successfully. But the "extract" schema omits the last field of "Person", so executing AddressBook.browseName() causes an exception.
In avro/c, read_record (datum_read.c) loops over the writer schema's fields.
In avro/java, GenericDatumReader.readRecord loops over the reader schema's fields. I think that's the point, as the sketch below spells out.
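
To spell that out (a paraphrase of the two loops, not the actual source):

{code}
// Paraphrased from the two implementations (not the actual source):
//
// avro/c, datum_read.c, read_record:
//   the loop is driven by the WRITER's schema, so every written field is
//   consumed from the stream; fields absent from the reader schema are
//   simply discarded after being read.
//
// avro/java, GenericDatumReader.readRecord:
//   the loop is driven by the READER's (expected) schema, so a trailing
//   writer-only field like "Age" is never decoded; the stream is left
//   mid-record and the next read fails (the EOFException above).
{code}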

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.util.Utf8;

public class AddressBook {
	String fileName = "AddressBook.db";
	String prefix = "{\"type\":\"record\",\"name\": \"Person\",\"fields\":[";
	String suffix = "]}";
	String fieldFirst = "{\"name\":\"First\",\"type\":\"string\"}";
	String fieldLast = "{\"name\":\"Last\",\"type\":\"string\"}";
	String fieldAge = "{\"name\":\"Age\",\"type\":\"int\"}";
	Schema personSchema = Schema.parse(prefix + fieldFirst + "," + fieldLast + ","  + fieldAge + suffix);
	Schema ageSchema = Schema.parse(prefix + fieldAge + suffix);
	Schema extractSchema = Schema.parse(prefix + fieldFirst + "," + fieldLast + suffix);
	/**
	 * @param args
	 * @throws IOException
	 */
	public static void main(String[] args) throws IOException {
		AddressBook ab = new AddressBook();
		ab.init();
		ab.browseAge();
		ab.browseName();
	}

	public void init() throws IOException {		
		DataFileWriter<Record> writer = new DataFileWriter<Record>(
				new GenericDatumWriter<Record>(personSchema)).create(
						personSchema, new File(fileName));
		try {
			writer.append(createPerson("Dante", "Hicks", 27));
			writer.append(createPerson("Randal", "Graves", 20));
			writer.append(createPerson("Steve", "Jobs", 31));
		} finally {
			writer.close();
		}
	}
	
	private Record createPerson(String first, String last, int age) {
		Record person = new GenericData.Record(personSchema);
		person.put("First", new Utf8(first));
		person.put("Last", new Utf8(last));
		person.put("Age", age);
		return person;
	}
	
	public void browseAge() throws IOException {		
		GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
		dr.setExpected(ageSchema);
		DataFileReader<Record> reader = new DataFileReader<Record>(new File(
		  fileName), dr);
		
		try {
			while (reader.hasNext()) {
				Record person = reader.next();
				System.out.println(person.get("Age").toString());
			}
		} finally {
			reader.close();
		}
	}
	
	public void browseName() throws IOException {		
		GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
		dr.setExpected(extractSchema);
		DataFileReader<Record> reader = new DataFileReader<Record>(new File(
		  fileName), dr);
		
		try {
			while (reader.hasNext()) {
				Record person = reader.next();
				System.out.println(person.get("First").toString() + " " + person.get("Last").toString() + "\t");
			}
		} finally {
			reader.close();
		}
	}
}


2010-04-13 
Lurga            

From: Scott Carey 
Sent: 2010-04-13 10:55:41 
To: avro-user@hadoop.apache.org 
Cc: 
Subject: Re: Question on writing/reading file with different schema 
 
On Apr 12, 2010, at 7:42 PM, Scott Carey wrote:
> Try something like:
> 
> GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
> dr.setExpected(extractSchema);
> DataFileReader<Record> reader = new DataFileReader<Record>(new File(
>  fileName), dr);
> 
If the above still fails, please provide the full stack trace that results.

Re: Question on writing/reading file with different schema

Posted by Scott Carey <sc...@richrelevance.com>.
On Apr 12, 2010, at 6:13 PM, Lurga wrote:

> Hello,
> I create a "Person" record (3 fields: first, last, age) and an "Extract" record (2 fields: first, last). Then I use "Person" to write some objects to a file. When I use "Extract" to read data from the file, I get an exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -14.
> It seems like GenericDatumReader.readRecord won't skip the trailing fields. How can I read the data correctly?
> 
> My code is below:
> public void browseName() throws IOException {
>  List<Field> fields = new ArrayList<Field>();
>  fields.add(new Field("First", Schema.create(Type.STRING), null, null));
>  fields.add(new Field("Last", Schema.create(Type.STRING), null, null)); 
>  Schema extractSchema = Schema.createRecord(fields);
>  DataFileReader<Record> reader = new DataFileReader<Record>(new File(
>    fileName), new GenericDatumReader<Record>(extractSchema));
>  try {
>    while (reader.hasNext()) {
>      Record person = reader.next();
>      System.out.print(person.get("First").toString() + " " + person.get("Last").toString() + "\t");
>    }
>  } finally {
>    reader.close();
>  }
> }
> 

Try configuring the 'expected' schema.

The schema you are creating above is the expected (reader's) schema, but you are setting it as the actual 'data' (writer's) schema.

See
GenericDatumReader.setExpected(Schema expected);

It looks like the above needs javadoc improvement.

setSchema() sets the schema of the data being read (what is in the file).  The DataFileReader calls setSchema() on its own with the schema it finds in the file (overwriting what you passed in).  But you will have to set your expected schema yourself.

Try something like:

GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
dr.setExpected(extractSchema);
DataFileReader<Record> reader = new DataFileReader<Record>(new File(
  fileName), dr);

