Posted to common-user@hadoop.apache.org by Jim the Standing Bear <st...@gmail.com> on 2008/05/28 04:19:20 UTC

How to make a lucene Document hadoop Writable?

Hello,

I am not sure if this is a genuine hadoop question or more towards a
core-java question.  I am hoping to create a wrapper over Lucene
Document, so that this wrapper can be used for the value field of a
Hadoop SequenceFile, and therefore, this wrapper must also implement
the Writable interface.

Lucene's Document is already made serializable, which is quite nice.
However, the Writable interface definition gives only DataInput and
DataOutput, and I am having a hard time trying to figure out how to
serialize/deserialize a Lucene Document object using
DataInput/DataOutput.  In other words, how do I go from DataInput to
ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.

-- Jim
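
For reference, a minimal sketch of the kind of wrapper being described here might look like the following; the class name LuceneDocumentWritable is made up, and the empty write/readFields bodies are exactly the open question in this thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.lucene.document.Document;

// Hypothetical wrapper: holds a Lucene Document and implements Hadoop's Writable
// so it can be used as the value class of a SequenceFile.
public class LuceneDocumentWritable implements Writable {

  private Document doc;

  public LuceneDocumentWritable() {            // Writables need a no-arg constructor
  }

  public LuceneDocumentWritable(Document doc) {
    this.doc = doc;
  }

  public Document get() {
    return doc;
  }

  public void write(DataOutput out) throws IOException {
    // open question: how to serialize the Document given only a DataOutput
  }

  public void readFields(DataInput in) throws IOException {
    // open question: how to rebuild the Document given only a DataInput
  }
}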

Re: How to make a lucene Document hadoop Writable?

Posted by David Chung <da...@gmail.com>.
unsubscribe



Re: How to make a lucene Document hadoop Writable?

Posted by Jim the Standing Bear <st...@gmail.com>.
Apparently things have changed since Nutch 0.9.  Hadoop's API has also
changed quite a bit since then: deprecated methods such as
JobConf.setInputKeyClass and JobConf.setInputValueClass, which were used
extensively in Nutch 0.9, are no longer available.

-- Jim
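
For what it's worth, in the 0.17-era org.apache.hadoop.mapred API the key/value classes are declared on the output side rather than the input side; a rough, unverified sketch (method names from memory, class name JobSetupSketch is made up):

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class JobSetupSketch {
  // Sketch only: the old setInputKeyClass/setInputValueClass calls are gone;
  // input types now come from the InputFormat, and only output types are declared.
  public static JobConf configure() {
    JobConf job = new JobConf(JobSetupSketch.class);
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(ObjectWritable.class);
    return job;
  }
}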

On Tue, May 27, 2008 at 11:26 PM, Dennis Kubes <ku...@apache.org> wrote:
> In the nutch trunk svn it is like this:
>
> output.collect(key, new LuceneDocumentWrapper(doc));
>
> But that is only a passthrough to the output format.  Write and readFields
> isn't implmented for the writable, it just passes the object through to the
> output format which creates the lucene index.
>
> Dennis
>
> Jim the Standing Bear wrote:
>>
>> I am replying to myself because I just found something interesting in
>> Nutch, yet it raises more questions.
>>
>> In Nutch 0.9 source code, in org.apache.nutch.indexer.Indexer.java,
>> there is a line that says:
>>
>> output.collect(key, new ObjectWritable(doc));
>>
>> where doc is a lucene Document object.  This seems to be casting a
>> Document to a Hadoop ObjectWritable object.
>>
>> However, in Hadoop's (v0.17.0) ObjectWritable.java, I found the following
>> lines:
>>
>>   } else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
>>      UTF8.writeString(out, instance.getClass().getName());
>>      ((Writable)instance).write(out);
>>
>>    } else {
>>      throw new IOException("Can't write: "+instance+" as "+declaredClass);
>>    }
>>
>> where instance is an Object object, set in the constructor, and
>> declaredClass is the class of the object.  But I am a bit suspicious
>> on the check and wonder how it will ever be true:
>>
>> Writable.class.isAssignableFrom(Document)
>>
>> Is it because Nutch 0.9 is using an older version of Hadoop as well as
>> lucene?  I am really confused.  Thanks.
>>
>> -- Jim
>>
>>
>>
>>
>> On Tue, May 27, 2008 at 11:02 PM, Jim the Standing Bear
>> <st...@gmail.com> wrote:
>>>
>>> Thanks for the quick response, Dennis.  However, your code snippet was
>>> about how to serialize/deserialize using
>>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>>> making the question clear enough - I was wondering if and how I can
>>> serialize/deserialize using only DataInput and DataOutput.
>>>
>>> This is because the Writable Interface defined by Hadoop has the
>>> following two methods:
>>>
>>> void    readFields(DataInput in)
>>>         Deserialize the fields of this object from in.
>>> void    write(DataOutput out)
>>>         Serialize the fields of this object to out
>>>
>>> so I must start with DataInput and DataOutput, and work my way to
>>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>>> to go from DataInput to ObjectInputStream.  Any ideas?
>>>
>>> -- Jim
>>>
>>>
>>>
>>>
>>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>>>>
>>>> You can use something like the code below to go back and forth from
>>>> serializables.  The problem with lucene documents is that fields which
>>>> are
>>>> not stored will be lost during the serialization / deserialization
>>>> process.
>>>>
>>>> Dennis
>>>>
>>>> public static Object toObject(byte[] bytes, int start)
>>>>  throws IOException, ClassNotFoundException {
>>>>
>>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>>   return null;
>>>>  }
>>>>
>>>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>>>  bais.skip(start);
>>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>>
>>>>  Object bObject = ois.readObject();
>>>>
>>>>  bais.close();
>>>>  ois.close();
>>>>
>>>>  return bObject;
>>>> }
>>>>
>>>> public static byte[] fromObject(Serializable toBytes)
>>>>  throws IOException {
>>>>
>>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>>
>>>>  oos.writeObject(toBytes);
>>>>  oos.flush();
>>>>
>>>>  byte[] objBytes = baos.toByteArray();
>>>>
>>>>  baos.close();
>>>>  oos.close();
>>>>
>>>>  return objBytes;
>>>> }
>>>>
>>>>
>>>> Jim the Standing Bear wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am not sure if this is a genuine hadoop question or more towards a
>>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>>> Document, so that this wrapper can be used for the value field of a
>>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>>> the Writable interface.
>>>>>
>>>>> Lucene's Document is already made serializable, which is quite nice.
>>>>> However, the Writable interface definition gives only DataInput and
>>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>>> serialize/deserialize an lucene Document object using
>>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>>>
>>>>> -- Jim
>>>
>>>
>>> --
>>> --------------------------------------
>>> Standing Bear Has Spoken
>>> --------------------------------------
>>>
>>
>>
>>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: How to make a lucene Document hadoop Writable?

Posted by Dennis Kubes <ku...@apache.org>.
In the Nutch trunk (svn) it is done like this:

output.collect(key, new LuceneDocumentWrapper(doc));

But that is only a pass-through to the output format.  write and
readFields aren't implemented for the Writable; it just passes the object
through to the output format, which creates the Lucene index.

Dennis
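
A rough sketch of what such a pass-through wrapper looks like (names and details are approximate; this mirrors the description above rather than the exact Nutch source):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.lucene.document.Document;

// Pass-through wrapper: it only carries the Document to the OutputFormat,
// which writes the Lucene index itself; real serialization never happens.
public class LuceneDocumentWrapper implements Writable {

  private final Document doc;

  public LuceneDocumentWrapper(Document doc) {
    this.doc = doc;
  }

  public Document get() {
    return doc;
  }

  public void write(DataOutput out) throws IOException {
    // intentionally unimplemented: the wrapper never crosses a SequenceFile boundary
  }

  public void readFields(DataInput in) throws IOException {
    // intentionally unimplemented
  }
}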

Jim the Standing Bear wrote:
> I am replying to myself because I just found something interesting in
> Nutch, yet it raises more questions.
> 
> In Nutch 0.9 source code, in org.apache.nutch.indexer.Indexer.java,
> there is a line that says:
> 
> output.collect(key, new ObjectWritable(doc));
> 
> where doc is a lucene Document object.  This seems to be casting a
> Document to a Hadoop ObjectWritable object.
> 
> However, in Hadoop's (v0.17.0) ObjectWritable.java, I found the following lines:
> 
>    } else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
>       UTF8.writeString(out, instance.getClass().getName());
>       ((Writable)instance).write(out);
> 
>     } else {
>       throw new IOException("Can't write: "+instance+" as "+declaredClass);
>     }
> 
> where instance is an Object object, set in the constructor, and
> declaredClass is the class of the object.  But I am a bit suspicious
> on the check and wonder how it will ever be true:
> 
> Writable.class.isAssignableFrom(Document)
> 
> Is it because Nutch 0.9 is using an older version of Hadoop as well as
> lucene?  I am really confused.  Thanks.
> 
> -- Jim
> 
> 
> 
> 
> On Tue, May 27, 2008 at 11:02 PM, Jim the Standing Bear
> <st...@gmail.com> wrote:
>> Thanks for the quick response, Dennis.  However, your code snippet was
>> about how to serialize/deserialize using
>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>> making the question clear enough - I was wondering if and how I can
>> serialize/deserialize using only DataInput and DataOutput.
>>
>> This is because the Writable Interface defined by Hadoop has the
>> following two methods:
>>
>> void    readFields(DataInput in)
>>          Deserialize the fields of this object from in.
>> void    write(DataOutput out)
>>          Serialize the fields of this object to out
>>
>> so I must start with DataInput and DataOutput, and work my way to
>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>> to go from DataInput to ObjectInputStream.  Any ideas?
>>
>> -- Jim
>>
>>
>>
>>
>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>>> You can use something like the code below to go back and forth from
>>> serializables.  The problem with lucene documents is that fields which are
>>> not stored will be lost during the serialization / deserialization process.
>>>
>>> Dennis
>>>
>>> public static Object toObject(byte[] bytes, int start)
>>>  throws IOException, ClassNotFoundException {
>>>
>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>    return null;
>>>  }
>>>
>>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>>  bais.skip(start);
>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>
>>>  Object bObject = ois.readObject();
>>>
>>>  bais.close();
>>>  ois.close();
>>>
>>>  return bObject;
>>> }
>>>
>>> public static byte[] fromObject(Serializable toBytes)
>>>  throws IOException {
>>>
>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>
>>>  oos.writeObject(toBytes);
>>>  oos.flush();
>>>
>>>  byte[] objBytes = baos.toByteArray();
>>>
>>>  baos.close();
>>>  oos.close();
>>>
>>>  return objBytes;
>>> }
>>>
>>>
>>> Jim the Standing Bear wrote:
>>>> Hello,
>>>>
>>>> I am not sure if this is a genuine hadoop question or more towards a
>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>> Document, so that this wrapper can be used for the value field of a
>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>> the Writable interface.
>>>>
>>>> Lucene's Document is already made serializable, which is quite nice.
>>>> However, the Writable interface definition gives only DataInput and
>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>> serialize/deserialize an lucene Document object using
>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>>
>>>> -- Jim
>>
>>
>> --
>> --------------------------------------
>> Standing Bear Has Spoken
>> --------------------------------------
>>
> 
> 
> 

Re: How to make a lucene Document hadoop Writable?

Posted by Jim the Standing Bear <st...@gmail.com>.
I am replying to myself because I just found something interesting in
Nutch, yet it raises more questions.

In Nutch 0.9 source code, in org.apache.nutch.indexer.Indexer.java,
there is a line that says:

output.collect(key, new ObjectWritable(doc));

where doc is a Lucene Document object.  This seems to wrap a
Document in a Hadoop ObjectWritable object.

However, in Hadoop's (v0.17.0) ObjectWritable.java, I found the following lines:

   } else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
      UTF8.writeString(out, instance.getClass().getName());
      ((Writable)instance).write(out);

    } else {
      throw new IOException("Can't write: "+instance+" as "+declaredClass);
    }

where instance is an Object set in the constructor, and
declaredClass is the class of that object.  But I am a bit suspicious
of that check and wonder how it will ever be true:

Writable.class.isAssignableFrom(Document.class)

Is it because Nutch 0.9 is using an older version of Hadoop as well as
of Lucene?  I am really confused.  Thanks.

-- Jim
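
One way to see the concern: isAssignableFrom only answers whether one type implements or extends the other, so unless Document itself implements Writable, the quoted branch can never be taken for it.  A minimal illustration:

import org.apache.hadoop.io.Writable;
import org.apache.lucene.document.Document;

public class AssignableCheck {
  public static void main(String[] args) {
    // true only if Document implements Writable somewhere in its type hierarchy;
    // for a plain Lucene Document this prints false.
    System.out.println(Writable.class.isAssignableFrom(Document.class));
  }
}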




On Tue, May 27, 2008 at 11:02 PM, Jim the Standing Bear
<st...@gmail.com> wrote:
> Thanks for the quick response, Dennis.  However, your code snippet was
> about how to serialize/deserialize using
> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
> making the question clear enough - I was wondering if and how I can
> serialize/deserialize using only DataInput and DataOutput.
>
> This is because the Writable Interface defined by Hadoop has the
> following two methods:
>
> void    readFields(DataInput in)
>          Deserialize the fields of this object from in.
> void    write(DataOutput out)
>          Serialize the fields of this object to out
>
> so I must start with DataInput and DataOutput, and work my way to
> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
> to go from DataInput to ObjectInputStream.  Any ideas?
>
> -- Jim
>
>
>
>
> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>> You can use something like the code below to go back and forth from
>> serializables.  The problem with lucene documents is that fields which are
>> not stored will be lost during the serialization / deserialization process.
>>
>> Dennis
>>
>> public static Object toObject(byte[] bytes, int start)
>>  throws IOException, ClassNotFoundException {
>>
>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>    return null;
>>  }
>>
>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>  bais.skip(start);
>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>
>>  Object bObject = ois.readObject();
>>
>>  bais.close();
>>  ois.close();
>>
>>  return bObject;
>> }
>>
>> public static byte[] fromObject(Serializable toBytes)
>>  throws IOException {
>>
>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>
>>  oos.writeObject(toBytes);
>>  oos.flush();
>>
>>  byte[] objBytes = baos.toByteArray();
>>
>>  baos.close();
>>  oos.close();
>>
>>  return objBytes;
>> }
>>
>>
>> Jim the Standing Bear wrote:
>>>
>>> Hello,
>>>
>>> I am not sure if this is a genuine hadoop question or more towards a
>>> core-java question.  I am hoping to create a wrapper over Lucene
>>> Document, so that this wrapper can be used for the value field of a
>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>> the Writable interface.
>>>
>>> Lucene's Document is already made serializable, which is quite nice.
>>> However, the Writable interface definition gives only DataInput and
>>> DataOutput, and I am having a hard time trying to figure out how to
>>> serialize/deserialize an lucene Document object using
>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>
>>> -- Jim
>>
>
>
>
> --
> --------------------------------------
> Standing Bear Has Spoken
> --------------------------------------
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: How to make a lucene Document hadoop Writable?

Posted by Jim the Standing Bear <st...@gmail.com>.
I see.  I just realized that the lines I grabbed from the Lucene tutorial:

>> document.add(Field.Text("author", author));
>>        document.add(Field.Text("title", title));
>>        document.add(Field.Text("topic", topic));

are using obsolete APIs.  The latest versions use the Field constructor
instead, and it becomes very clear what you meant by "not stored": it
means the field was created with Field.Store.NO.

public Field(String name,
             String value,
             Field.Store store,
             Field.Index index,
             Field.TermVector termVector)

Thanks!

-- Jim
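
A small illustration of the difference, using the constructor-style API above (the field names are made up, and the Index constant names are from the Lucene 2.x line):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoredVsUnstored {
  public static Document build(String author, String body) {
    Document doc = new Document();
    // stored and indexed: the value comes back when the doc is read from the index
    doc.add(new Field("author", author, Field.Store.YES, Field.Index.TOKENIZED));
    // indexed but NOT stored: searchable, but the value is gone when the doc is
    // read back from the index (the "not stored" case discussed in this thread)
    doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
    return doc;
  }
}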

On Tue, May 27, 2008 at 11:29 PM, Dennis Kubes <ku...@apache.org> wrote:
> When reading docs from a processed lucene index, say off disk, any fields
> that are not stored are not repopulated and will not appear in documents
> fields.  You also won't be able to read a document and then pass it to
> another indexwriter (say if you wanted to do an index splitter) and have it
> keep the unstored fields.
>
> Dennis
>
> Jim the Standing Bear wrote:
>>
>> Hi Dennis,
>>
>> Now I see the picture.  I would love to see the code you have for
>> creating complex writables - thanks for sharing it!
>>
>> Since I just started to look at lucence the other day, I may once
>> again misunderstand what you were saying by
>> "serialization/deserialization of lucene document will lose its fields
>> that are not stored".  So if I do
>>
>> Document document = new Document();
>>        document.add(Field.Text("author", author));
>>        document.add(Field.Text("title", title));
>>        document.add(Field.Text("topic", topic));
>>
>> and then serialize document to a file or something, the fields will
>> not be serialized?  It seems a bit odd since the Field class has also
>> implemented Serializable interface.
>>
>> -- Jim
>>
>> On Tue, May 27, 2008 at 11:11 PM, Dennis Kubes <ku...@apache.org> wrote:
>>>
>>> You can get the bytes using those methods and write them to a data
>>> output.
>>>  You would probably also want to write an int before it in the stream to
>>> tell the number of bytes for the object.  If you are wanting to not use
>>> the
>>> java serialization process and translate an object to bytes that is a
>>> little
>>> harder.
>>>
>>> To do it involves using reflection to get the fields of an object
>>> recursively and translate those fields into their byte equivalents. Just
>>> so
>>> happens that I have that functionality already developed.  We are going
>>> to
>>> use it in nutch 2 to make it easy to create complex writables.  Let me
>>> know
>>> if you would like the code and I will send it to you.
>>>
>>> Also I spoke to soon about the serialization / deserialization process.
>>>  Reading a document from a Lucene index will also lose the fields that
>>> are
>>> not stored so it may have nothing to do with the serialization process.
>>>
>>> Dennis
>>>
>>> Jim the Standing Bear wrote:
>>>>
>>>> Thanks for the quick response, Dennis.  However, your code snippet was
>>>> about how to serialize/deserialize using
>>>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>>>> making the question clear enough - I was wondering if and how I can
>>>> serialize/deserialize using only DataInput and DataOutput.
>>>>
>>>> This is because the Writable Interface defined by Hadoop has the
>>>> following two methods:
>>>>
>>>> void    readFields(DataInput in)
>>>>         Deserialize the fields of this object from in.
>>>> void    write(DataOutput out)
>>>>         Serialize the fields of this object to out
>>>>
>>>> so I must start with DataInput and DataOutput, and work my way to
>>>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>>>> to go from DataInput to ObjectInputStream.  Any ideas?
>>>>
>>>> -- Jim
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>>>>>
>>>>> You can use something like the code below to go back and forth from
>>>>> serializables.  The problem with lucene documents is that fields which
>>>>> are
>>>>> not stored will be lost during the serialization / deserialization
>>>>> process.
>>>>>
>>>>> Dennis
>>>>>
>>>>> public static Object toObject(byte[] bytes, int start)
>>>>>  throws IOException, ClassNotFoundException {
>>>>>
>>>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>>>  return null;
>>>>>  }
>>>>>
>>>>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>>>>  bais.skip(start);
>>>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>>>
>>>>>  Object bObject = ois.readObject();
>>>>>
>>>>>  bais.close();
>>>>>  ois.close();
>>>>>
>>>>>  return bObject;
>>>>> }
>>>>>
>>>>> public static byte[] fromObject(Serializable toBytes)
>>>>>  throws IOException {
>>>>>
>>>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>>>
>>>>>  oos.writeObject(toBytes);
>>>>>  oos.flush();
>>>>>
>>>>>  byte[] objBytes = baos.toByteArray();
>>>>>
>>>>>  baos.close();
>>>>>  oos.close();
>>>>>
>>>>>  return objBytes;
>>>>> }
>>>>>
>>>>>
>>>>> Jim the Standing Bear wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am not sure if this is a genuine hadoop question or more towards a
>>>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>>>> Document, so that this wrapper can be used for the value field of a
>>>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>>>> the Writable interface.
>>>>>>
>>>>>> Lucene's Document is already made serializable, which is quite nice.
>>>>>> However, the Writable interface definition gives only DataInput and
>>>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>>>> serialize/deserialize an lucene Document object using
>>>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>>>>
>>>>>> -- Jim
>>>>
>>>>
>>
>>
>>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: How to make a lucene Document hadoop Writable?

Posted by Dennis Kubes <ku...@apache.org>.
When reading docs from a processed Lucene index, say off disk, any
fields that are not stored are not repopulated and will not appear in
the document's fields.  You also won't be able to read a document and
then pass it to another IndexWriter (say, if you wanted to build an
index splitter) and have it keep the unstored fields.

Dennis
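
A sketch of what that looks like when reading back from an index (API calls from the Lucene 2.x line; the index path and field names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class ReadBackSketch {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/tmp/some-index");  // hypothetical path
    Document doc = reader.document(0);
    System.out.println(doc.get("author"));  // stored field: value is present
    System.out.println(doc.get("body"));    // unstored field: returns null
    reader.close();
  }
}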

Jim the Standing Bear wrote:
> Hi Dennis,
> 
> Now I see the picture.  I would love to see the code you have for
> creating complex writables - thanks for sharing it!
> 
> Since I just started to look at lucence the other day, I may once
> again misunderstand what you were saying by
> "serialization/deserialization of lucene document will lose its fields
> that are not stored".  So if I do
> 
> Document document = new Document();
>         document.add(Field.Text("author", author));
>         document.add(Field.Text("title", title));
>         document.add(Field.Text("topic", topic));
> 
> and then serialize document to a file or something, the fields will
> not be serialized?  It seems a bit odd since the Field class has also
> implemented Serializable interface.
> 
> -- Jim
> 
> On Tue, May 27, 2008 at 11:11 PM, Dennis Kubes <ku...@apache.org> wrote:
>> You can get the bytes using those methods and write them to a data output.
>>  You would probably also want to write an int before it in the stream to
>> tell the number of bytes for the object.  If you are wanting to not use the
>> java serialization process and translate an object to bytes that is a little
>> harder.
>>
>> To do it involves using reflection to get the fields of an object
>> recursively and translate those fields into their byte equivalents. Just so
>> happens that I have that functionality already developed.  We are going to
>> use it in nutch 2 to make it easy to create complex writables.  Let me know
>> if you would like the code and I will send it to you.
>>
>> Also I spoke to soon about the serialization / deserialization process.
>>  Reading a document from a Lucene index will also lose the fields that are
>> not stored so it may have nothing to do with the serialization process.
>>
>> Dennis
>>
>> Jim the Standing Bear wrote:
>>> Thanks for the quick response, Dennis.  However, your code snippet was
>>> about how to serialize/deserialize using
>>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>>> making the question clear enough - I was wondering if and how I can
>>> serialize/deserialize using only DataInput and DataOutput.
>>>
>>> This is because the Writable Interface defined by Hadoop has the
>>> following two methods:
>>>
>>> void    readFields(DataInput in)
>>>          Deserialize the fields of this object from in.
>>> void    write(DataOutput out)
>>>          Serialize the fields of this object to out
>>>
>>> so I must start with DataInput and DataOutput, and work my way to
>>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>>> to go from DataInput to ObjectInputStream.  Any ideas?
>>>
>>> -- Jim
>>>
>>>
>>>
>>>
>>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>>>> You can use something like the code below to go back and forth from
>>>> serializables.  The problem with lucene documents is that fields which
>>>> are
>>>> not stored will be lost during the serialization / deserialization
>>>> process.
>>>>
>>>> Dennis
>>>>
>>>> public static Object toObject(byte[] bytes, int start)
>>>>  throws IOException, ClassNotFoundException {
>>>>
>>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>>   return null;
>>>>  }
>>>>
>>>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>>>  bais.skip(start);
>>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>>
>>>>  Object bObject = ois.readObject();
>>>>
>>>>  bais.close();
>>>>  ois.close();
>>>>
>>>>  return bObject;
>>>> }
>>>>
>>>> public static byte[] fromObject(Serializable toBytes)
>>>>  throws IOException {
>>>>
>>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>>
>>>>  oos.writeObject(toBytes);
>>>>  oos.flush();
>>>>
>>>>  byte[] objBytes = baos.toByteArray();
>>>>
>>>>  baos.close();
>>>>  oos.close();
>>>>
>>>>  return objBytes;
>>>> }
>>>>
>>>>
>>>> Jim the Standing Bear wrote:
>>>>> Hello,
>>>>>
>>>>> I am not sure if this is a genuine hadoop question or more towards a
>>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>>> Document, so that this wrapper can be used for the value field of a
>>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>>> the Writable interface.
>>>>>
>>>>> Lucene's Document is already made serializable, which is quite nice.
>>>>> However, the Writable interface definition gives only DataInput and
>>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>>> serialize/deserialize an lucene Document object using
>>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>>>
>>>>> -- Jim
>>>
>>>
> 
> 
> 

Re: How to make a lucene Document hadoop Writable?

Posted by Jim the Standing Bear <st...@gmail.com>.
Hi Dennis,

Now I see the picture.  I would love to see the code you have for
creating complex writables - thanks for sharing it!

Since I just started to look at Lucene the other day, I may once
again misunderstand what you were saying by
"serialization/deserialization of a Lucene Document will lose its fields
that are not stored".  So if I do

Document document = new Document();
        document.add(Field.Text("author", author));
        document.add(Field.Text("title", title));
        document.add(Field.Text("topic", topic));

and then serialize the document to a file or something, the fields will
not be serialized?  It seems a bit odd, since the Field class also
implements the Serializable interface.

-- Jim

On Tue, May 27, 2008 at 11:11 PM, Dennis Kubes <ku...@apache.org> wrote:
> You can get the bytes using those methods and write them to a data output.
>  You would probably also want to write an int before it in the stream to
> tell the number of bytes for the object.  If you are wanting to not use the
> java serialization process and translate an object to bytes that is a little
> harder.
>
> To do it involves using reflection to get the fields of an object
> recursively and translate those fields into their byte equivalents. Just so
> happens that I have that functionality already developed.  We are going to
> use it in nutch 2 to make it easy to create complex writables.  Let me know
> if you would like the code and I will send it to you.
>
> Also I spoke to soon about the serialization / deserialization process.
>  Reading a document from a Lucene index will also lose the fields that are
> not stored so it may have nothing to do with the serialization process.
>
> Dennis
>
> Jim the Standing Bear wrote:
>>
>> Thanks for the quick response, Dennis.  However, your code snippet was
>> about how to serialize/deserialize using
>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>> making the question clear enough - I was wondering if and how I can
>> serialize/deserialize using only DataInput and DataOutput.
>>
>> This is because the Writable Interface defined by Hadoop has the
>> following two methods:
>>
>> void    readFields(DataInput in)
>>          Deserialize the fields of this object from in.
>> void    write(DataOutput out)
>>          Serialize the fields of this object to out
>>
>> so I must start with DataInput and DataOutput, and work my way to
>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>> to go from DataInput to ObjectInputStream.  Any ideas?
>>
>> -- Jim
>>
>>
>>
>>
>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>>>
>>> You can use something like the code below to go back and forth from
>>> serializables.  The problem with lucene documents is that fields which
>>> are
>>> not stored will be lost during the serialization / deserialization
>>> process.
>>>
>>> Dennis
>>>
>>> public static Object toObject(byte[] bytes, int start)
>>>  throws IOException, ClassNotFoundException {
>>>
>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>   return null;
>>>  }
>>>
>>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>>  bais.skip(start);
>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>
>>>  Object bObject = ois.readObject();
>>>
>>>  bais.close();
>>>  ois.close();
>>>
>>>  return bObject;
>>> }
>>>
>>> public static byte[] fromObject(Serializable toBytes)
>>>  throws IOException {
>>>
>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>
>>>  oos.writeObject(toBytes);
>>>  oos.flush();
>>>
>>>  byte[] objBytes = baos.toByteArray();
>>>
>>>  baos.close();
>>>  oos.close();
>>>
>>>  return objBytes;
>>> }
>>>
>>>
>>> Jim the Standing Bear wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am not sure if this is a genuine hadoop question or more towards a
>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>> Document, so that this wrapper can be used for the value field of a
>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>> the Writable interface.
>>>>
>>>> Lucene's Document is already made serializable, which is quite nice.
>>>> However, the Writable interface definition gives only DataInput and
>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>> serialize/deserialize an lucene Document object using
>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>>
>>>> -- Jim
>>
>>
>>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: How to make a lucene Document hadoop Writable?

Posted by Dennis Kubes <ku...@apache.org>.
You can get the bytes using those methods and write them to a DataOutput.
You would probably also want to write an int before them in the stream to
tell the number of bytes for the object.  If you want to avoid the Java
serialization process and translate an object to bytes yourself, that is
a little harder.

Doing that involves using reflection to get the fields of an object
recursively and translating those fields into their byte equivalents.
It just so happens that I have that functionality already developed.  We
are going to use it in Nutch 2 to make it easy to create complex
Writables.  Let me know if you would like the code and I will send it to
you.

Also, I spoke too soon about the serialization / deserialization process.
Reading a document from a Lucene index will also lose the fields that
are not stored, so it may have nothing to do with the serialization process.

Dennis
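
Putting that together with the toObject/fromObject helpers quoted further down in this thread, a write/readFields pair along these lines would do it.  The class name SerializedDocumentWritable and the SerializationUtil holder for the helpers are placeholders, and this is an untested sketch of the length-prefix approach described above:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.lucene.document.Document;

public class SerializedDocumentWritable implements Writable {

  private Document doc;

  public SerializedDocumentWritable() {}                 // required no-arg constructor

  public SerializedDocumentWritable(Document doc) {
    this.doc = doc;
  }

  public Document get() {
    return doc;
  }

  public void write(DataOutput out) throws IOException {
    byte[] bytes = SerializationUtil.fromObject(doc);    // java-serialize the Document
    out.writeInt(bytes.length);                          // length prefix, as suggested above
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    int length = in.readInt();
    byte[] bytes = new byte[length];
    in.readFully(bytes);                                 // read exactly 'length' bytes
    try {
      doc = (Document) SerializationUtil.toObject(bytes, 0);
    } catch (ClassNotFoundException e) {
      throw new IOException(e.toString());
    }
  }
}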

Jim the Standing Bear wrote:
> Thanks for the quick response, Dennis.  However, your code snippet was
> about how to serialize/deserialize using
> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
> making the question clear enough - I was wondering if and how I can
> serialize/deserialize using only DataInput and DataOutput.
> 
> This is because the Writable Interface defined by Hadoop has the
> following two methods:
> 
> void 	readFields(DataInput in)
>           Deserialize the fields of this object from in.
> void 	write(DataOutput out)
>           Serialize the fields of this object to out
> 
> so I must start with DataInput and DataOutput, and work my way to
> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
> to go from DataInput to ObjectInputStream.  Any ideas?
> 
> -- Jim
> 
> 
> 
> 
> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
>> You can use something like the code below to go back and forth from
>> serializables.  The problem with lucene documents is that fields which are
>> not stored will be lost during the serialization / deserialization process.
>>
>> Dennis
>>
>> public static Object toObject(byte[] bytes, int start)
>>  throws IOException, ClassNotFoundException {
>>
>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>    return null;
>>  }
>>
>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>  bais.skip(start);
>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>
>>  Object bObject = ois.readObject();
>>
>>  bais.close();
>>  ois.close();
>>
>>  return bObject;
>> }
>>
>> public static byte[] fromObject(Serializable toBytes)
>>  throws IOException {
>>
>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>
>>  oos.writeObject(toBytes);
>>  oos.flush();
>>
>>  byte[] objBytes = baos.toByteArray();
>>
>>  baos.close();
>>  oos.close();
>>
>>  return objBytes;
>> }
>>
>>
>> Jim the Standing Bear wrote:
>>> Hello,
>>>
>>> I am not sure if this is a genuine hadoop question or more towards a
>>> core-java question.  I am hoping to create a wrapper over Lucene
>>> Document, so that this wrapper can be used for the value field of a
>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>> the Writable interface.
>>>
>>> Lucene's Document is already made serializable, which is quite nice.
>>> However, the Writable interface definition gives only DataInput and
>>> DataOutput, and I am having a hard time trying to figure out how to
>>> serialize/deserialize an lucene Document object using
>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>
>>> -- Jim
> 
> 
> 

Re: How to make a lucene Document hadoop Writable?

Posted by Jim the Standing Bear <st...@gmail.com>.
Thanks for the quick response, Dennis.  However, your code snippet was
about how to serialize/deserialize using
ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
making the question clear enough - I was wondering if and how I can
serialize/deserialize using only DataInput and DataOutput.

This is because the Writable Interface defined by Hadoop has the
following two methods:

void    readFields(DataInput in)
        Deserialize the fields of this object from in.
void    write(DataOutput out)
        Serialize the fields of this object to out.

so I must start with DataInput and DataOutput, and work my way to
ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
to go from DataInput to ObjectInputStream.  Any ideas?

-- Jim




On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <ku...@apache.org> wrote:
> You can use something like the code below to go back and forth from
> serializables.  The problem with lucene documents is that fields which are
> not stored will be lost during the serialization / deserialization process.
>
> Dennis
>
> public static Object toObject(byte[] bytes, int start)
>  throws IOException, ClassNotFoundException {
>
>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>    return null;
>  }
>
>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>  bais.skip(start);
>  ObjectInputStream ois = new ObjectInputStream(bais);
>
>  Object bObject = ois.readObject();
>
>  bais.close();
>  ois.close();
>
>  return bObject;
> }
>
> public static byte[] fromObject(Serializable toBytes)
>  throws IOException {
>
>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>
>  oos.writeObject(toBytes);
>  oos.flush();
>
>  byte[] objBytes = baos.toByteArray();
>
>  baos.close();
>  oos.close();
>
>  return objBytes;
> }
>
>
> Jim the Standing Bear wrote:
>>
>> Hello,
>>
>> I am not sure if this is a genuine hadoop question or more towards a
>> core-java question.  I am hoping to create a wrapper over Lucene
>> Document, so that this wrapper can be used for the value field of a
>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>> the Writable interface.
>>
>> Lucene's Document is already made serializable, which is quite nice.
>> However, the Writable interface definition gives only DataInput and
>> DataOutput, and I am having a hard time trying to figure out how to
>> serialize/deserialize an lucene Document object using
>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>
>> -- Jim
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: How to make a lucene Document hadoop Writable?

Posted by Dennis Kubes <ku...@apache.org>.
You can use something like the code below to go back and forth from
serializables.  The problem with Lucene documents is that fields which
are not stored will be lost during the serialization/deserialization
process.

Dennis

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Deserialize an object from a byte array, starting at the given offset.
public static Object toObject(byte[] bytes, int start)
   throws IOException, ClassNotFoundException {

   if (bytes == null || bytes.length == 0 || start >= bytes.length) {
     return null;
   }

   // Skip to the offset where the serialized object begins.
   ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
   bais.skip(start);
   ObjectInputStream ois = new ObjectInputStream(bais);

   Object bObject = ois.readObject();

   ois.close();
   bais.close();

   return bObject;
}

// Serialize any Serializable object to a byte array.
public static byte[] fromObject(Serializable toBytes)
   throws IOException {

   ByteArrayOutputStream baos = new ByteArrayOutputStream();
   ObjectOutputStream oos = new ObjectOutputStream(baos);

   oos.writeObject(toBytes);
   oos.flush();

   byte[] objBytes = baos.toByteArray();

   oos.close();
   baos.close();

   return objBytes;
}
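
A quick round-trip usage of these helpers might look like this, assuming they live in some utility class (here called SerializationUtil, a made-up name):

import org.apache.lucene.document.Document;

public class RoundTripDemo {
  public static void main(String[] args) throws Exception {
    Document doc = new Document();
    byte[] bytes = SerializationUtil.fromObject(doc);                 // object -> bytes
    Document copy = (Document) SerializationUtil.toObject(bytes, 0);  // bytes -> object
    System.out.println(copy != null);                                 // prints true
  }
}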


Jim the Standing Bear wrote:
> Hello,
> 
> I am not sure if this is a genuine hadoop question or more towards a
> core-java question.  I am hoping to create a wrapper over Lucene
> Document, so that this wrapper can be used for the value field of a
> Hadoop SequenceFile, and therefore, this wrapper must also implement
> the Writable interface.
> 
> Lucene's Document is already made serializable, which is quite nice.
> However, the Writable interface definition gives only DataInput and
> DataOutput, and I am having a hard time trying to figure out how to
> serialize/deserialize an lucene Document object using
> DataInput/DataOutput.  In other words, how do I go from DataInput to
> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
> 
> -- Jim