Posted to common-user@hadoop.apache.org by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/11/10 10:56:33 UTC

Lucene + Hadoop

Hi,

I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based on the contents of the files (i.e. if the author is "hrishikesh", the file should be added to an index for "hrishikesh"; there has to be a separate index for every author). For this, I am keeping multiple IndexWriters open, one per author, and maintaining them in a hashmap in the map() function. I parse each incoming file, and if its author is one for which I already have an open IndexWriter, I just add the file to that index; otherwise I create a new IndexWriter for the new author. As the number of authors might run into the thousands, I close the IndexWriters and clear the hashmap once it reaches a certain threshold, and start over. There is no reduce function.
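The scheme described above can be sketched in plain Java. This is a minimal, stdlib-only illustration of the pattern (a capped cache of per-author writers, flushed when it reaches a threshold); FakeWriter is a hypothetical stand-in for Lucene's IndexWriter, not the real API:

```java
import java.util.HashMap;
import java.util.Map;

public class AuthorWriterCache {
    // Hypothetical stand-in for Lucene's IndexWriter.
    static class FakeWriter {
        final String author;
        int docs = 0;
        FakeWriter(String author) { this.author = author; }
        void addDocument(String doc) { docs++; }
        void close() { /* a real IndexWriter would commit and release files here */ }
    }

    private final Map<String, FakeWriter> writers = new HashMap<>();
    private final int threshold;
    int flushes = 0; // how many times the cache was flushed and cleared

    AuthorWriterCache(int threshold) { this.threshold = threshold; }

    // Equivalent of the map() body: route a document to its author's writer,
    // creating the writer on first sight and flushing once over the threshold.
    void add(String author, String doc) {
        FakeWriter w = writers.get(author);
        if (w == null) {
            if (writers.size() >= threshold) flushAll();
            w = new FakeWriter(author);
            writers.put(author, w);
        }
        w.addDocument(doc);
    }

    // Close every open writer and clear the map, as described in the post.
    void flushAll() {
        for (FakeWriter w : writers.values()) w.close();
        writers.clear();
        flushes++;
    }

    int openWriters() { return writers.size(); }
}
```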

Does this logic sound correct? Is there any other way of implementing this requirement?

--Hrishi

DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: Lucene + Hadoop

Posted by "Eason.Lee" <le...@gmail.com>.
I think you'd be better off using map to group all the files belonging to the same
author together, and using reduce to index the files.
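That split might look like the following stdlib-only sketch: map emits (author, file) pairs, the shuffle groups them by author, and each reduce call is where a single IndexWriter for that author's index would be opened, filled, and closed. The grouping below simulates Hadoop's shuffle; the record format and names are illustrative, not the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByAuthor {
    // map(): split a record of the form "author\tfile" into a (key, value) pair.
    static String[] map(String record) {
        int tab = record.indexOf('\t');
        return new String[] { record.substring(0, tab), record.substring(tab + 1) };
    }

    // Stand-in for Hadoop's shuffle: group all values under their key.
    static Map<String, List<String>> shuffle(List<String> records) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String r : records) {
            String[] kv = map(r);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // reduce(): called once per author with all of that author's files;
    // this is where one IndexWriter would be opened, filled, and closed.
    static int reduce(String author, List<String> files) {
        return files.size(); // placeholder for "documents indexed"
    }
}
```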


Re: Lucene + Hadoop

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I think that sounds right.
I believe that's what I did when I implemented this type of functionality for http://simpy.com/

I'm not sure why this is a Hadoop thing, though.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR





Re: Lucene + Hadoop

Posted by Sagar <sn...@attributor.com>.
Check out MultipleOutputFormat
(it is essentially the same as your implementation).

Having a separate index for each author may not be a good idea.
You can have one index for all authors and query it per author.
But I'm not sure of your requirements.
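The single-index alternative works by storing the author as a field and retrieving per author with a term query on that field (in Lucene this would be a TermQuery on an "author" field). A minimal in-memory sketch of the idea, not the Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class SingleIndex {
    // A document with the author stored alongside the body.
    static class Doc {
        final String author, body;
        Doc(String author, String body) { this.author = author; this.body = body; }
    }

    private final List<Doc> docs = new ArrayList<>();

    // One index for all authors: everything goes into the same store.
    void add(String author, String body) { docs.add(new Doc(author, body)); }

    // Per-author retrieval, analogous to a term query on the author field.
    List<Doc> byAuthor(String author) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) if (d.author.equals(author)) hits.add(d);
        return hits;
    }
}
```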

-Sagar