Posted to hdfs-user@hadoop.apache.org by Natalia Connolly <na...@gmail.com> on 2014/04/08 17:46:42 UTC

MapReduce for complex key/value pairs?

Dear All,

    I was wondering if the following is possible using MapReduce.

    I would like to create a job that loops over a bunch of documents,
tokenizes them into ngrams, and stores the ngrams and not only the counts
of ngrams but also _which_ document(s) had this particular ngram.  In other
words, the key would be the ngram but the value would be an integer (the
count) _and_ an array of document id's.

    Is this something that can be done?  Any pointers would be appreciated.


    I am using Java, btw.

   Thank you,

   Natalia Connolly
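
[Editor's note: for reference, the job being asked for can be sketched in
plain Java, with an in-memory map standing in for Hadoop's shuffle/sort
phase. Class and method names here are illustrative, not part of any
Hadoop API.]

```java
import java.util.*;

// Plain-Java sketch of the job's logic: tokenize each document into
// ngrams, then aggregate, per ngram, a total count plus the set of
// source documents. An in-memory map stands in for the shuffle phase.
class NgramIndexSketch {

    // Word-level ngrams of size n.
    static List<String> ngrams(String text, int n) {
        String[] t = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= t.length; i++)
            out.add(String.join(" ", Arrays.copyOfRange(t, i, i + n)));
        return out;
    }

    // ngram -> total count across all documents.
    final Map<String, Integer> counts = new TreeMap<>();
    // ngram -> ids of the documents that contained it.
    final Map<String, Set<String>> docIds = new TreeMap<>();

    // The "map" step for one document, with the aggregation folded in.
    void add(String docId, String text, int n) {
        for (String g : ngrams(text, n)) {
            counts.merge(g, 1, Integer::sum);
            docIds.computeIfAbsent(g, k -> new TreeSet<>()).add(docId);
        }
    }
}
```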

Re: MapReduce for complex key/value pairs?

Posted by Jay Vyas <ja...@gmail.com>.
- Adding parsing logic in your mappers/reducers is the simplest, least
elegant way to do it; writing the values out as JSON strings is one easy
variant of that.

- A more advanced approach is to write custom writables that serialise
and parse the data themselves.

- The truly portable and "right" way is to define a schema and use Avro
to handle the serialisation.  Unlike manually adding parsing to app logic,
or adding JSON deserialisation to your mappers/reducers, proper Avro
serialization has the benefit of increasing performance and app
portability while also making the code more maintainable (it
interoperates with pure Java domain objects)
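
[Editor's note: as a concrete illustration of the first (simplest)
option, the composite value can be flattened into a single delimited
string that one side writes and the other parses. The format used here
("count<TAB>id1,id2,...") is a hypothetical sketch, not a standard.]

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the "parse it yourself" approach: pack (count, docIds) into
// a single delimited string suitable for a Hadoop Text value, and parse
// it back out on the other side.
class DelimitedNgramValue {

    // Encode a count and its document ids as e.g. "3\tdoc1,doc2".
    static String encode(int count, List<String> docIds) {
        return count + "\t" + String.join(",", docIds);
    }

    // Recover the count from the part before the tab.
    static int parseCount(String value) {
        return Integer.parseInt(value.substring(0, value.indexOf('\t')));
    }

    // Recover the document ids from the comma-separated part after it.
    static List<String> parseDocIds(String value) {
        return Arrays.asList(value.substring(value.indexOf('\t') + 1).split(","));
    }
}
```

This works as long as ids never contain the delimiters, which is exactly
the kind of fragility the schema-based options avoid.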


On Tue, Apr 8, 2014 at 2:30 PM, Harsh J <ha...@cloudera.com> wrote:

> Yes, you can write custom writable classes that detail and serialise
> your required data structure. If you have Hadoop: The Definitive
> Guide, check out its section "Serialization" under chapter "Hadoop
> I/O".
>
> On Tue, Apr 8, 2014 at 9:16 PM, Natalia Connolly
> <na...@gmail.com> wrote:
> > Dear All,
> >
> >     I was wondering if the following is possible using MapReduce.
> >
> >     I would like to create a job that loops over a bunch of documents,
> > tokenizes them into ngrams, and stores the ngrams and not only the
> counts of
> > ngrams but also _which_ document(s) had this particular ngram.  In other
> > words, the key would be the ngram but the value would be an integer (the
> > count) _and_ an array of document id's.
> >
> >     Is this something that can be done?  Any pointers would be
> appreciated.
> >
> >     I am using Java, btw.
> >
> >    Thank you,
> >
> >    Natalia Connolly
> >
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: MapReduce for complex key/value pairs?

Posted by Harsh J <ha...@cloudera.com>.
Yes, you can write custom writable classes that detail and serialise
your required data structure. If you have Hadoop: The Definitive
Guide, check out its section "Serialization" under chapter "Hadoop
I/O".
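
[Editor's note: to make this concrete, such a writable would carry the
count and the document ids, serialised via the java.io DataOutput and
DataInput interfaces that org.apache.hadoop.io.Writable specifies. The
sketch below shows just that serialisation logic; the `implements
Writable` clause and Hadoop imports are omitted so it stands alone, and
the class name is illustrative.]

```java
import java.io.*;
import java.util.*;

// Sketch of a custom writable holding a count plus the ids of documents
// containing the ngram. In a real job this class would declare
// "implements org.apache.hadoop.io.Writable"; write() and readFields()
// below match that interface's method signatures.
class NgramStatsWritable {
    int count;
    List<String> docIds = new ArrayList<>();

    NgramStatsWritable() {}  // Writables need a no-arg constructor

    NgramStatsWritable(int count, List<String> docIds) {
        this.count = count;
        this.docIds = new ArrayList<>(docIds);
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());        // length prefix for the id list
        for (String id : docIds) out.writeUTF(id);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        int n = in.readInt();
        docIds = new ArrayList<>(n);
        for (int i = 0; i < n; i++) docIds.add(in.readUTF());
    }
}
```

A round trip through write()/readFields() restores both fields; in a
real job the class would also be set as the job's map output value class.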

On Tue, Apr 8, 2014 at 9:16 PM, Natalia Connolly
<na...@gmail.com> wrote:
> Dear All,
>
>     I was wondering if the following is possible using MapReduce.
>
>     I would like to create a job that loops over a bunch of documents,
> tokenizes them into ngrams, and stores the ngrams and not only the counts of
> ngrams but also _which_ document(s) had this particular ngram.  In other
> words, the key would be the ngram but the value would be an integer (the
> count) _and_ an array of document id's.
>
>     Is this something that can be done?  Any pointers would be appreciated.
>
>     I am using Java, btw.
>
>    Thank you,
>
>    Natalia Connolly
>



-- 
Harsh J
