Posted to java-user@lucene.apache.org by "Michael J. Prichard" <mi...@mac.com> on 2006/07/27 18:29:31 UTC

Indexing large sets of documents?

I built an indexer that runs through email and its attachments, rips out 
content and what not and then creates a Document and adds it to an 
index.  It works w/ no problem.  The issue is that it takes around 3-5 
seconds per email and I have seen up to 10-15 seconds for email w/ 
attachments.  I need to index 750k emails, and at that rate it will
take FOREVER!  I am trying to find places to cut a second or two here and
there, but are there any suggestions as to what I can do?  Should I look
into parallelizing indexing?  Help?!

Thanks,
Michael
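
Since IndexWriter.addDocument() is thread-safe, one common first step is to
run the expensive part, the MIME parsing and attachment text extraction,
across several threads that all feed a single writer. Below is a minimal
sketch of that idea against the Lucene 2.x-era API; listEmailFiles() and
parseEmail() are hypothetical stand-ins for the poster's existing code:

    import java.io.File;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class ParallelEmailIndexer {

        public static void main(String[] args) throws Exception {
            // One writer shared by all threads; addDocument() is thread-safe.
            final IndexWriter writer =
                    new IndexWriter("/tmp/mail-index", new StandardAnalyzer(), true);

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (final File eml : listEmailFiles()) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            // Parsing runs in parallel; only addDocument()
                            // briefly contends on the shared writer.
                            Document doc = parseEmail(eml);
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
            writer.optimize();
            writer.close();
        }

        // Hypothetical stand-ins for the existing mail-store walk and
        // MIME/attachment extraction.
        static List<File> listEmailFiles() { return Collections.<File>emptyList(); }
        static Document parseEmail(File f) { return new Document(); }
    }

Whether this helps depends on where the time goes: if extraction dominates,
more threads help; if disk I/O dominates, they will not.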


Re: Indexing large sets of documents?

Posted by Rafael Rossini <ra...@gmail.com>.
OK, there is a patch (http://issues.apache.org/jira/browse/LUCENE-532); that
is what I saw. But I still have a question (it's probably better to ask it on
the Hadoop mailing list): the Hadoop project implements a DFS and the whole
MapReduce paradigm. Is it possible to do the indexing and searching using all
of that? The whole point of a DFS and MapReduce is an app that is scalable
and fast, like Google. I'm sure this is not easy at all, but it would be nice
if one day this kind of stuff became a commodity.

Regards,
    Rossini


On 7/27/06, Otis Gospodnetic <ot...@yahoo.com> wrote:
>
> Rossini,
>
> I think what you read may have been that searching a Lucene
> index that lives in HDFS would be slow.  As far as I understand things,
> the thing to do is to copy the index out of HDFS to a local disk, and then
> search it with Lucene from there.
>
> Otis
>
> ----- Original Message ----
> From: Rafael Rossini <ra...@gmail.com>
> To: java-user@lucene.apache.org; Otis Gospodnetic <otis_gospodnetic@yahoo.com>
> Sent: Thursday, July 27, 2006 4:23:56 PM
> Subject: Re: Indexing large sets of documents?
>
> Otis,
>
>   You mentioned the Hadoop project. I checked it out not long ago, and
> I read something saying it did not support the Lucene index format. Is it
> possible to index and then search in HDFS?
>
> Regards,
>     Rossini
>
>
> On 7/27/06, Otis Gospodnetic <ot...@yahoo.com> wrote:
> >
> > Michael,
> >
> > Certainly parallelizing on a set of servers would work (hmm... Hadoop?),
> > but if you want to do this on a single machine you should tune some of the
> > IndexWriter params.  You didn't mention them, so I assume you didn't tune
> > anything yet.  If you have Lucene in Action, check out section 2.7.1,
> > "Tuning indexing performance", which starts on page 42 under section 2.7
> > (Controlling the indexing process) in chapter 2 (Indexing)
> > (found via: http://lucenebook.com/search?query=index+tuning )
> >
> > If not, check maxBufferedDocs and mergeFactor in the IndexWriter
> > javadocs.  This is likely in the FAQ, too, but I didn't check.
> >
> > Otis
> >
> > ----- Original Message ----
> > From: Michael J. Prichard
> > To: java-user@lucene.apache.org
> > Sent: Thursday, July 27, 2006 12:29:31 PM
> > Subject: Indexing large sets of documents?
> >
> > I built an indexer that runs through email and its attachments, rips out
> > content and what not and then creates a Document and adds it to an
> > index.  It works w/ no problem.  The issue is that it takes around 3-5
> > seconds per email and I have seen up to 10-15 seconds for email w/
> > attachments.  I need to index 750k emails, and at that rate it will
> > take FOREVER!  I am trying to find places to cut a second or two here and
> > there, but are there any suggestions as to what I can do?  Should I look
> > into parallelizing indexing?  Help?!
> >
> > Thanks,
> > Michael
> >

Re: Indexing large sets of documents?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Rossini,

I think what you read may have been that searching a Lucene index that lives in HDFS would be slow.  As far as I understand things, the thing to do is to copy the index out of HDFS to a local disk, and then search it with Lucene from there.

Otis
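
As a rough illustration of that copy step, here is a minimal sketch against
the Hadoop FileSystem API and the Lucene 2.x-era IndexSearcher; both paths
are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.search.IndexSearcher;

    public class CopyIndexOutOfHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem hdfs = FileSystem.get(new Configuration());

            // Pull the finished index out of HDFS onto a local disk...
            hdfs.copyToLocalFile(new Path("/indexes/mail-index"),
                                 new Path("/tmp/mail-index"));

            // ...and search the local copy with plain Lucene.
            IndexSearcher searcher = new IndexSearcher("/tmp/mail-index");
            // ... run queries against searcher ...
            searcher.close();
        }
    }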

----- Original Message ----
From: Rafael Rossini <ra...@gmail.com>
To: java-user@lucene.apache.org; Otis Gospodnetic <ot...@yahoo.com>
Sent: Thursday, July 27, 2006 4:23:56 PM
Subject: Re: Indexing large sets of documents?

Otis,

   You mentioned the Hadoop project. I checked it out not long ago, and
I read something saying it did not support the Lucene index format. Is it
possible to index and then search in HDFS?

Regards,
     Rossini


On 7/27/06, Otis Gospodnetic <ot...@yahoo.com> wrote:
>
> Michael,
>
> Certainly parallelizing on a set of servers would work (hmm... Hadoop?),
> but if you want to do this on a single machine you should tune some of the
> IndexWriter params.  You didn't mention them, so I assume you didn't tune
> anything yet.  If you have Lucene in Action, check out section 2.7.1,
> "Tuning indexing performance", which starts on page 42 under section 2.7
> (Controlling the indexing process) in chapter 2 (Indexing)
> (found via: http://lucenebook.com/search?query=index+tuning )
>
> If not, check maxBufferedDocs and mergeFactor in the IndexWriter
> javadocs.  This is likely in the FAQ, too, but I didn't check.
>
> Otis
>
> ----- Original Message ----
> From: Michael J. Prichard
> To: java-user@lucene.apache.org
> Sent: Thursday, July 27, 2006 12:29:31 PM
> Subject: Indexing large sets of documents?
>
> I built an indexer that runs through email and its attachments, rips out
> content and what not and then creates a Document and adds it to an
> index.  It works w/ no problem.  The issue is that it takes around 3-5
> seconds per email and I have seen up to 10-15 seconds for email w/
> attachments.  I need to index 750k emails, and at that rate it will
> take FOREVER!  I am trying to find places to cut a second or two here and
> there, but are there any suggestions as to what I can do?  Should I look
> into parallelizing indexing?  Help?!
>
> Thanks,
> Michael
>


Re: Indexing large sets of documents?

Posted by Rafael Rossini <ra...@gmail.com>.
Otis,

   You mentioned the Hadoop project. I checked it out not long ago, and
I read something saying it did not support the Lucene index format. Is it
possible to index and then search in HDFS?

Regards,
     Rossini


On 7/27/06, Otis Gospodnetic <ot...@yahoo.com> wrote:
>
> Michael,
>
> Certainly parallelizing on a set of servers would work (hmm... Hadoop?),
> but if you want to do this on a single machine you should tune some of the
> IndexWriter params.  You didn't mention them, so I assume you didn't tune
> anything yet.  If you have Lucene in Action, check out section 2.7.1,
> "Tuning indexing performance", which starts on page 42 under section 2.7
> (Controlling the indexing process) in chapter 2 (Indexing)
> (found via: http://lucenebook.com/search?query=index+tuning )
>
> If not, check maxBufferedDocs and mergeFactor in the IndexWriter
> javadocs.  This is likely in the FAQ, too, but I didn't check.
>
> Otis
>
> ----- Original Message ----
> From: Michael J. Prichard
> To: java-user@lucene.apache.org
> Sent: Thursday, July 27, 2006 12:29:31 PM
> Subject: Indexing large sets of documents?
>
> I built an indexer that runs through email and its attachments, rips out
> content and what not and then creates a Document and adds it to an
> index.  It works w/ no problem.  The issue is that it takes around 3-5
> seconds per email and I have seen up to 10-15 seconds for email w/
> attachments.  I need to index 750k emails, and at that rate it will
> take FOREVER!  I am trying to find places to cut a second or two here and
> there, but are there any suggestions as to what I can do?  Should I look
> into parallelizing indexing?  Help?!
>
> Thanks,
> Michael
>

Re: Indexing large sets of documents?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Michael, 
 
Certainly parallelizing on a set of servers would work (hmm... Hadoop?), but if you want to do this on a single machine you should tune some of the IndexWriter params.  You didn't mention them, so I assume you didn't tune anything yet.  If you have Lucene in Action, check out section 2.7.1, "Tuning indexing performance", which starts on page 42 under section 2.7 (Controlling the indexing process) in chapter 2 (Indexing) (found via: http://lucenebook.com/search?query=index+tuning ).

If not, check maxBufferedDocs and mergeFactor in the IndexWriter javadocs.  This is likely in the FAQ, too, but I didn't check.

Otis
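
A minimal sketch of those two knobs against the Lucene 2.x-era IndexWriter;
the values are illustrative, not recommendations, since the right numbers
depend on how much RAM the indexing JVM has:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class TunedIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                    new IndexWriter("/tmp/mail-index", new StandardAnalyzer(), true);

            // Buffer more documents in RAM before flushing a new segment,
            // and merge segments less eagerly.
            writer.setMaxBufferedDocs(1000);
            writer.setMergeFactor(50);

            // ... addDocument() loop goes here ...

            writer.close();
        }
    }

Raising maxBufferedDocs trades heap for fewer disk flushes; raising
mergeFactor defers merge cost until later in the run (or to an optimize()).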

----- Original Message ---- 
From: Michael J. Prichard  
To: java-user@lucene.apache.org 
Sent: Thursday, July 27, 2006 12:29:31 PM 
Subject: Indexing large sets of documents? 
 
I built an indexer that runs through email and its attachments, rips out  
content and what not and then creates a Document and adds it to an  
index.  It works w/ no problem.  The issue is that it takes around 3-5  
seconds per email and I have seen up to 10-15 seconds for email w/  
attachments.  I need to index 750k emails, and at that rate it will
take FOREVER!  I am trying to find places to cut a second or two here and
there, but are there any suggestions as to what I can do?  Should I look
into parallelizing indexing?  Help?! 
 
Thanks, 
Michael 
 


RE: Indexing large sets of documents?

Posted by Dejan Nenov <de...@jollyobject.com>.
Yes, parallelizing works great. We built a share-nothing, JavaSpaces-based
system at X1, and on an 11-way cluster we were able to index 350 office
documents per second, including the binary-to-text conversion using the
Stellent INSO libraries. The trick is to create separate indexes and, if you
do not have a federated search setup, merge them into one big index after
they are completed.

Dejan
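
That merge step maps onto IndexWriter.addIndexes(). A minimal sketch with
hypothetical shard paths, again against the Lucene 2.x-era API:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeShards {
        public static void main(String[] args) throws Exception {
            // One independent index per indexing worker.
            Directory[] shards = new Directory[] {
                FSDirectory.getDirectory("/data/index-0", false),
                FSDirectory.getDirectory("/data/index-1", false),
                FSDirectory.getDirectory("/data/index-2", false),
            };

            IndexWriter merged =
                    new IndexWriter("/data/index-merged", new StandardAnalyzer(), true);
            merged.addIndexes(shards);  // merges every shard into the new index
            merged.optimize();
            merged.close();
        }
    }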

-----Original Message-----
From: Michael J. Prichard [mailto:michael_prichard@mac.com] 
Sent: Thursday, July 27, 2006 9:30 AM
To: java-user@lucene.apache.org
Subject: Indexing large sets of documents?

I built an indexer that runs through email and its attachments, rips out 
content and what not and then creates a Document and adds it to an 
index.  It works w/ no problem.  The issue is that it takes around 3-5 
seconds per email and I have seen up to 10-15 seconds for email w/ 
attachments.  I need to index 750k emails, and at that rate it will
take FOREVER!  I am trying to find places to cut a second or two here and
there, but are there any suggestions as to what I can do?  Should I look
into parallelizing indexing?  Help?!

Thanks,
Michael



Re: Indexing large sets of documents?

Posted by MALCOLM CLARK <ma...@btinternet.com>.


Is this the W3 Ent collection you are indexing?
 
MC