Posted to common-user@hadoop.apache.org by Arif Iqbal <it...@gmail.com> on 2006/12/08 07:45:23 UTC

Finding term vector per host using hadoop

In the MapReduce paper published by Google at OSDI 2004, they give an
example of term vector per host as a use of MapReduce. They write:

""" Term-Vector per Host: A term vector summarizes the most important words
that occur in a document or a set of documents as a list of
(word; frequency) pairs. The map function emits a (hostname; term vector)
pair for each input document (where the hostname is extracted from the URL
of the document). The reduce function is passed all per-document term
vectors for a given host. It adds these term vectors together, throwing away
infrequent terms, and then emits a final (hostname; term vector) pair. """

I want to implement the same thing and was wondering whether this is
possible with Hadoop. In this case the map function emits (hostname, term
vector) pairs. Is this possible with Hadoop? If so, could someone paste
some sample code for me?

Cheers,
Arif

Re: Finding term vector per host using hadoop

Posted by Dennis Kubes <nu...@dragonflymc.com>.

Lukas Vlcek wrote:
> Hi,
>
> I would appreciate it if you could send me your code.
> I am not sure whether I will use the serializable approach, but it could
> be a useful source of inspiration :-)
>
> Regards,
> Lukas
>

Re: Finding term vector per host using hadoop

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,
I have already sent it to Arif (on behalf of Dennis).
Lukas

On 12/14/06, Arif Iqbal <it...@gmail.com> wrote:
>
> Dennis,
>
> I also want this code. Kindly send it to me.
>
> Cheers,
> AI
>

Re: Finding term vector per host using hadoop

Posted by Arif Iqbal <it...@gmail.com>.
Dennis,

I also want this code. Kindly send it to me.

Cheers,
AI

On 12/14/06, Lukas Vlcek <lu...@gmail.com> wrote:
>
> Hi,
>
> I would appreciate it if you could send me your code.
> I am not sure whether I will use the serializable approach, but it could
> be a useful source of inspiration :-)
>
> Regards,
> Lukas

Re: Finding term vector per host using hadoop

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

I would appreciate it if you could send me your code.
I am not sure whether I will use the serializable approach, but it could be
a useful source of inspiration :-)

Regards,
Lukas

On 12/13/06, Dennis Kubes <nu...@dragonflymc.com> wrote:
>
> One thing we have done in the past is to have a SerializableWritable
> that writes out a Serializable object to a byte stream, which is then
> stored in the writable.  If anyone is interested, email me and I will
> send you the code for the SerializableWritable.
>
> Dennis

Re: Finding term vector per host using hadoop

Posted by Dennis Kubes <nu...@dragonflymc.com>.
One thing we have done in the past is to have a SerializableWritable
that writes out a Serializable object to a byte stream, which is then
stored in the writable. If anyone is interested, email me and I will
send you the code for the SerializableWritable.
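
Dennis's class was only shared off-list, so the sketch below is not his
code, just a guess at the usual shape of such a wrapper: serialize the
wrapped object with standard Java serialization and store the resulting
bytes, length-prefixed, in the writable's stream.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import org.apache.hadoop.io.Writable;

public class SerializableWritable implements Writable {

  private Serializable instance;

  public SerializableWritable() {}                 // no-arg constructor for Hadoop

  public SerializableWritable(Serializable instance) {
    this.instance = instance;
  }

  public Serializable get() {
    return instance;
  }

  public void write(DataOutput out) throws IOException {
    // Serialize the wrapped object into a byte array ...
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(buffer);
    oos.writeObject(instance);
    oos.close();
    // ... and store the bytes, length first, in the writable's stream.
    byte[] bytes = buffer.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    try {
      ObjectInputStream ois =
          new ObjectInputStream(new ByteArrayInputStream(bytes));
      instance = (Serializable) ois.readObject();
      ois.close();
    } catch (ClassNotFoundException e) {
      throw new IOException("Cannot deserialize wrapped object: " + e);
    }
  }
}

Anything Serializable (a LinkedList, a Hashtable, a custom tree) can then
travel through a job as a value, at the cost of Java serialization overhead.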

Dennis

Lukas Vlcek wrote:
> Hi,
>
> I have just found the ObjectWritable class in the org.apache.hadoop.io
> package. However, it does not support any type from the Java Collections
> framework. For a tree-like data structure it is useful to use a
> LinkedList for the node children (as opposed to a fixed-size array).
> This is not directly supported by Hadoop as of now.
>
> Do you think it would be hard to extend ObjectWritable so that it
> handles Collections as well? Would this be a useful feature/contribution
> for the Hadoop community?
>
> Regards,
> Lukas

Re: Finding term vector per host using hadoop

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

I have just found the ObjectWritable class in the org.apache.hadoop.io
package. However, it does not support any type from the Java Collections
framework. For a tree-like data structure it is useful to use a LinkedList
for the node children (as opposed to a fixed-size array). This is not
directly supported by Hadoop as of now.

Do you think it would be hard to extend ObjectWritable so that it handles
Collections as well? Would this be a useful feature/contribution for the
Hadoop community?
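
For what it's worth, one plausible shape for such an extension, written
against ObjectWritable's public static helpers; CollectionWritable is an
invented name, not an existing Hadoop class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collection;
import java.util.LinkedList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Writable;

public class CollectionWritable implements Writable {

  private Collection<Object> elements = new LinkedList<Object>();
  // A real version should take the job's Configuration instead.
  private Configuration conf = new Configuration();

  public Collection<Object> get() {
    return elements;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(elements.size());
    for (Object element : elements) {
      // Delegate each element to ObjectWritable's existing type handling.
      ObjectWritable.writeObject(out, element, element.getClass(), conf);
    }
  }

  public void readFields(DataInput in) throws IOException {
    elements = new LinkedList<Object>();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      elements.add(ObjectWritable.readObject(in, conf));
    }
  }
}

This only covers element types ObjectWritable itself already knows
(primitives, Strings, Writables, arrays), but it would let a LinkedList of
such elements ride through a job.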

Regards,
Lukas

On 12/12/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> These illustrate serialization of Map-like structures:
>
> org.apache.nutch.crawl.MapWritable
> org.apache.nutch.metadata.Metadata
>
> I don't think we have examples of tree-like structures, but the
> serialization parts would look similar; you would just need to traverse
> the tree depth-first.
>
> And if you need to process values stored in several different classes
> you could use ObjectWritable to wrap them.

Re: Finding term vector per host using hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Lukas Vlcek wrote:
> Hi,
>
> Is there any good example of how to wrap a complex custom data structure
> into Writable (or WritableComparable)? I haven't found anything on the Wiki.
>
> Let's imagine that I need to wrap a tree-like structure (nodes, edges and
> a couple of other properties for each node). Is there any existing code in
> Hadoop where I can get inspiration?

These illustrate serialization of Map-like structures:

org.apache.nutch.crawl.MapWritable
org.apache.nutch.metadata.Metadata

I don't think we have examples of tree-like structures, but the
serialization parts would look similar; you would just need to traverse
the tree depth-first.

And if you need to process values stored in several different classes 
you could use ObjectWritable to wrap them.
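
As a minimal sketch of that depth-first traversal, here is an invented
TreeWritable whose children sit in a LinkedList; all names and fields are
illustrative, not existing Hadoop code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.io.Writable;

public class TreeWritable implements Writable {

  private String label = "";
  private final List<TreeWritable> children = new LinkedList<TreeWritable>();

  public TreeWritable() {}                         // no-arg constructor for Hadoop

  public TreeWritable(String label) {
    this.label = label;
  }

  public void addChild(TreeWritable child) {
    children.add(child);
  }

  // Depth-first: write this node, then the child count, then each subtree.
  public void write(DataOutput out) throws IOException {
    out.writeUTF(label);
    out.writeInt(children.size());
    for (TreeWritable child : children) {
      child.write(out);
    }
  }

  // Mirror of write(): read this node, then recurse into each child.
  public void readFields(DataInput in) throws IOException {
    label = in.readUTF();
    children.clear();                              // Writables may be reused
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      TreeWritable child = new TreeWritable();
      child.readFields(in);
      children.add(child);
    }
  }
}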

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Finding term vector per host using hadoop

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

Is there any good example of how to wrap a complex custom data structure
into Writable (or WritableComparable)? I haven't found anything on the Wiki.

Let's imagine that I need to wrap a tree-like structure (nodes, edges and
a couple of other properties for each node). Is there any existing code in
Hadoop where I can get inspiration?

Thanks,
Lukas

On 12/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Sure, so long as they implement Writable (for values) or
> WritableComparable (for keys). Practically speaking, you need to wrap
> any internal data structure that you use in an implementation of
> Writable / WritableComparable.

Re: Finding term vector per host using hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Arif Iqbal wrote:
> Can we emit (word, termVector) in the map function? I am new to Hadoop
> and I don't think that any of the given examples in Hadoop emit any data
> type other than IntWritable or Text. Is it possible to emit other data
> types like termVector, Hashtable, etc.?

Sure, so long as they implement Writable (for values) or
WritableComparable (for keys). Practically speaking, you need to wrap
any internal data structure that you use in an implementation of
Writable / WritableComparable.
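
As an illustration of that wrapping, here is a minimal hand-rolled Writable
for a term vector, kept as a String-to-frequency map; the class name and
layout are made up for the example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Writable;

public class TermVectorWritable implements Writable {

  private final Map<String, Integer> freqs = new HashMap<String, Integer>();

  public void add(String term, int count) {
    Integer old = freqs.get(term);
    freqs.put(term, old == null ? count : old + count);
  }

  public Map<String, Integer> get() {
    return freqs;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(freqs.size());                    // entry count first
    for (Map.Entry<String, Integer> e : freqs.entrySet()) {
      out.writeUTF(e.getKey());                    // then each (term, freq) pair
      out.writeInt(e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    freqs.clear();                                 // Writables are reused; reset state
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      String term = in.readUTF();
      freqs.put(term, in.readInt());
    }
  }
}

Once such a class exists, it can be used as the value type of a map output
just like Text or IntWritable.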

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Finding term vector per host using hadoop

Posted by Arif Iqbal <it...@gmail.com>.
Can we emit (word, termVector) in the map function? I am new to Hadoop and
I don't think that any of the given examples in Hadoop emit any data type
other than IntWritable or Text. Is it possible to emit other data types
like termVector, Hashtable, etc.?
Thanks.

On 12/8/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Yes, in fact this should be quite easy; you can follow exactly the steps
> described above. You can use Lucene's MemoryIndex to quickly create
> term vectors from each document, then output them in the map() operation
> as <host, termVector> for each input document, and finally in reduce()
> you will need to summarize the term vectors. Look at Grep.java or
> WordCount.java in the examples; your mapred job will follow a very
> similar pattern.

Re: Finding term vector per host using hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Arif Iqbal wrote:
> In the MapReduce paper published by Google at OSDI 2004, they give an
> example of term vector per host as a use of MapReduce. They write:
>
> """ Term-Vector per Host: A term vector summarizes the most important
> words that occur in a document or a set of documents as a list of
> (word; frequency) pairs. The map function emits a (hostname; term vector)
> pair for each input document (where the hostname is extracted from the
> URL of the document). The reduce function is passed all per-document term
> vectors for a given host. It adds these term vectors together, throwing
> away infrequent terms, and then emits a final (hostname; term vector)
> pair. """
>
> I want to implement the same thing and was wondering whether this is
> possible with Hadoop. In this case the map function emits (hostname, term
> vector) pairs. Is this possible with Hadoop? If so, could someone paste
> some sample code for me?

Yes, in fact this should be quite easy; you can follow exactly the steps
described above. You can use Lucene's MemoryIndex to quickly create
term vectors from each document, then output them in the map() operation as
<host, termVector> for each input document, and finally in reduce() you
will need to summarize the term vectors. Look at Grep.java or
WordCount.java in the examples; your mapred job will follow a very similar
pattern.
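
A rough sketch of that job, using the org.apache.hadoop.mapred interfaces
and a MapWritable as the term-vector value (at the time of this thread
MapWritable lived in Nutch, as noted earlier in the thread; it appears in
org.apache.hadoop.io in later releases). The input assumption (records of
<url, document text>, e.g. from a SequenceFile), the whitespace tokenizer
standing in for MemoryIndex, the class names, and the MIN_FREQ cutoff are
all invented for illustration.

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TermVectorPerHost {

  // Emits (hostname, per-document term vector) for each <url, text> record.
  public static class TVMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, MapWritable> {
    public void map(Text url, Text doc,
        OutputCollector<Text, MapWritable> output, Reporter reporter)
        throws IOException {
      String host = new URL(url.toString()).getHost();
      MapWritable vector = new MapWritable();      // term -> frequency
      for (String term : doc.toString().toLowerCase().split("\\s+")) {
        if (term.length() == 0) continue;
        Text key = new Text(term);
        IntWritable freq = (IntWritable) vector.get(key);
        vector.put(key, new IntWritable(freq == null ? 1 : freq.get() + 1));
      }
      output.collect(new Text(host), vector);
    }
  }

  // Sums the per-document vectors for each host, then throws away
  // infrequent terms; MIN_FREQ is an arbitrary illustrative cutoff.
  public static class TVReducer extends MapReduceBase
      implements Reducer<Text, MapWritable, Text, MapWritable> {
    private static final int MIN_FREQ = 5;

    public void reduce(Text host, Iterator<MapWritable> vectors,
        OutputCollector<Text, MapWritable> output, Reporter reporter)
        throws IOException {
      MapWritable sum = new MapWritable();
      while (vectors.hasNext()) {
        for (Map.Entry<Writable, Writable> e : vectors.next().entrySet()) {
          Text term = new Text((Text) e.getKey());  // copy: objects are reused
          IntWritable total = (IntWritable) sum.get(term);
          int add = ((IntWritable) e.getValue()).get();
          sum.put(term, new IntWritable(total == null ? add : total.get() + add));
        }
      }
      MapWritable pruned = new MapWritable();
      for (Map.Entry<Writable, Writable> e : sum.entrySet()) {
        if (((IntWritable) e.getValue()).get() >= MIN_FREQ) {
          pruned.put(e.getKey(), e.getValue());
        }
      }
      output.collect(host, pruned);
    }
  }
}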

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com