Posted to java-user@lucene.apache.org by "m.harig" <m....@gmail.com> on 2009/07/22 08:07:45 UTC

indexing 100GB of data

hello all

             We've got 100GB of data in various formats (doc, txt, pdf, ppt,
etc.) and a separate parser for each file format, so we're going to index
the data with Lucene. (We didn't use Nutch because we were put off by its
setup.) My doubt is: will it be scalable when I index those documents? We
planned to build a separate index for each file format and to use a multi
index reader for searching. Please can anyone advise me on the following:

          1. Are we going about this the right way?
          2. Please suggest settings for mergeFactor and segments.
          3. How large an index can Lucene handle?
          4. Will it cause a Java OOM (OutOfMemoryError)?
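
The plan in the question, one index per file format searched through a single multi-index reader, could be sketched roughly as below with the Lucene 2.4-era API. The directory names ("index-doc", "index-pdf", ...) are illustrative assumptions, not taken from the thread.

```java
// Hedged sketch (Lucene 2.4-era API): one index per file format,
// searched together through a single MultiReader.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

public class MultiFormatSearch {
    public static void main(String[] args) throws Exception {
        String[] formats = { "doc", "txt", "pdf", "ppt" };
        IndexReader[] readers = new IndexReader[formats.length];
        for (int i = 0; i < formats.length; i++) {
            // assumes each format was indexed into its own directory, e.g. "index-pdf"
            readers[i] = IndexReader.open("index-" + formats[i]);
        }
        // a MultiReader makes the per-format indexes searchable as one logical index
        IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
        // ... build a Query and call searcher.search(...) as usual ...
        searcher.close();
    }
}
```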
-- 
View this message in context: http://www.nabble.com/indexing-100GB-of-data-tp24600563p24600563.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: indexing 100GB of data

Posted by Shai Erera <se...@gmail.com>.
Generally you shouldn't hit OOM, but it may change depending on how you use
the index. For example, if you have millions of documents spread across the
100 GB and you sort on various fields, that will consume lots of RAM. Also,
if you run hundreds of queries in parallel, each with a dozen terms, that
will consume a considerable amount of RAM too.

But if you don't do anything extreme w/ it, and you can allocate enough heap
size, then you should be ok.
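
A rough way to size the sorting cost mentioned above (a back-of-envelope model of my own, not from the thread): the field cache Lucene builds for sorting holds about one 4-byte entry per document per sort field, so its RAM use grows linearly with document count. String sorts also keep the term values themselves, so treat this as a lower bound.

```java
// Back-of-envelope model (an assumption for illustration, not a Lucene API):
// sorting populates a field cache holding roughly one 4-byte entry per
// document per sort field; string sorts also hold the term values, so this
// is a lower bound.
public class SortRamEstimate {

    /** Rough lower bound, in bytes, for sorting numDocs docs on sortFields fields. */
    static long estimateBytes(long numDocs, int sortFields) {
        return numDocs * 4L * sortFields;
    }

    public static void main(String[] args) {
        // e.g. 50 million documents, sorting on two fields
        long bytes = estimateBytes(50000000L, 2);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints "381 MB"
    }
}
```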

The way I make such decisions is I design a test which mimics the
typical/common scenario I expect to face, and then I run it on a machine I
believe will be used in production (or as close as I can get), and analyze
the results.

If you choose to do that, and you're not satisfied w/ the results, you're
welcome to post back w/ the machine statistics and exact use case, and I
believe there are plenty of folks here who'd be willing to help you optimize
the usage of Lucene by your app. Or at least then we'll be able to tell you:
"for this index and this machine, you cannot run a 100GB index".

Shai

On Thu, Jul 23, 2009 at 10:42 AM, m.harig <m....@gmail.com> wrote:

>
> Thanks all,
>
>               Very thankful to all. I'm tired of the Hadoop settings; is it
> good to read such a large index with Lucene alone? Will it go OOM?
> Anyone please advise me.
> --
> View this message in context:
> http://www.nabble.com/indexing-100GB-of-data-tp24600563p24620846.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>

RE: indexing 100GB of data

Posted by "m.harig" <m....@gmail.com>.
Thanks all,

               Very thankful to all. I'm tired of the Hadoop settings; is it
good to read such a large index with Lucene alone? Will it go OOM?
Anyone please advise me.
-- 
View this message in context: http://www.nabble.com/indexing-100GB-of-data-tp24600563p24620846.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




RE: indexing 100GB of data

Posted by Steven A Rowe <sa...@syr.edu>.
You may also be interested in Andrzej Bialecki's patch to Solr that provides distributed indexing using Hadoop:

   https://issues.apache.org/jira/browse/SOLR-1301

Steve

> -----Original Message-----
> From: Phil Whelan [mailto:phil123@gmail.com]
> Sent: Wednesday, July 22, 2009 12:46 PM
> To: java-user@lucene.apache.org
> Subject: Re: indexing 100GB of data
> 
> On Wed, Jul 22, 2009 at 5:46 AM, m.harig<m....@gmail.com> wrote:
> 
> > Is there any article or forum for using Hadoop with Lucene? Please can
> > anyone help me?
> 
> Hi M,
> 
> Katta is a project that is combining Lucene and Hadoop. Check it out
> here...
> http://katta.sourceforge.net/
> 
> Thanks,
> Phil
> 




Re: indexing 100GB of data

Posted by Phil Whelan <ph...@gmail.com>.
On Wed, Jul 22, 2009 at 5:46 AM, m.harig<m....@gmail.com> wrote:

> Is there any article or forum for using Hadoop with Lucene? Please can
> anyone help me?

Hi M,

Katta is a project that is combining Lucene and Hadoop. Check it out here...
http://katta.sourceforge.net/

Thanks,
Phil



Re: indexing 100GB of data

Posted by "m.harig" <m....@gmail.com>.
Is there any article or forum for using Hadoop with Lucene? Please can
anyone help me?
-- 
View this message in context: http://www.nabble.com/indexing-100GB-of-data-tp24600563p24605164.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




RE: indexing 100GB of data

Posted by Dan OConnor <do...@acquiremedia.com>.
Hi Jamie,

I would appreciate it if you could provide details on the hardware/OS you are running this system on and what kind of search response times you are getting, as well as how you add email data to your index.

Thanks,
Dan


-----Original Message-----
From: Jamie [mailto:jamie@stimulussoft.com] 
Sent: Wednesday, July 22, 2009 8:51 AM
To: java-user@lucene.apache.org
Subject: Re: indexing 100GB of data

Hi there,

We have Lucene searching across several terabytes of email data and 
there is no problem at all.

Regards,

Jamie



Shai Erera wrote:
> There shouldn't be a problem to search such index. It depends on the machine
> you use. If it's a strong enough machine, I don't think you should have any
> problems.
>
> But like I said, you can always try it out on your machine before you make a
> decision.
>
> Also, Lucene has a Benchmark package which includes some indexing and search
> algorithms through which you can test the performance on your machine.
>
> On Wed, Jul 22, 2009 at 11:30 AM, m.harig <m....@gmail.com> wrote:
>
>   
>> Thanks Shai
>>
>>           So there won't be a problem searching that kind of large index,
>> am I right?
>>
>>           Can anyone tell me, is it possible to use Hadoop with Lucene?
>> --
>> View this message in context:
>> http://www.nabble.com/indexing-100GB-of-data-tp24600563p24602064.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
>>     
>
>   


-- 
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-343-8824 ext 100
UK Tel: +44-20-80991035 ext 100
Email:  jamie@stimulussoft.com
Web: http://www.mailarchiva.com
To receive MailArchiva Enterprise Edition product announcements, send a message to: <ma...@stimulussoft.com>






Re: indexing 100GB of data

Posted by Jamie <ja...@stimulussoft.com>.
Hi there,

We have Lucene searching across several terabytes of email data and 
there is no problem at all.

Regards,

Jamie



Shai Erera wrote:
> There shouldn't be a problem to search such index. It depends on the machine
> you use. If it's a strong enough machine, I don't think you should have any
> problems.
>
> But like I said, you can always try it out on your machine before you make a
> decision.
>
> Also, Lucene has a Benchmark package which includes some indexing and search
> algorithms through which you can test the performance on your machine.
>
> On Wed, Jul 22, 2009 at 11:30 AM, m.harig <m....@gmail.com> wrote:
>
>   
>> Thanks Shai
>>
>>           So there won't be a problem searching that kind of large index,
>> am I right?
>>
>>           Can anyone tell me, is it possible to use Hadoop with Lucene?
>> --
>> View this message in context:
>> http://www.nabble.com/indexing-100GB-of-data-tp24600563p24602064.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
>>     
>
>   


-- 
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-343-8824 ext 100
UK Tel: +44-20-80991035 ext 100
Email:  jamie@stimulussoft.com
Web: http://www.mailarchiva.com
To receive MailArchiva Enterprise Edition product announcements, send a message to: <ma...@stimulussoft.com>




Re: indexing 100GB of data

Posted by Shai Erera <se...@gmail.com>.
There shouldn't be a problem to search such index. It depends on the machine
you use. If it's a strong enough machine, I don't think you should have any
problems.

But like I said, you can always try it out on your machine before you make a
decision.

Also, Lucene has a Benchmark package which includes some indexing and search
algorithms through which you can test the performance on your machine.

On Wed, Jul 22, 2009 at 11:30 AM, m.harig <m....@gmail.com> wrote:

>
> Thanks Shai
>
>           So there won't be a problem searching that kind of large index,
> am I right?
>
>           Can anyone tell me, is it possible to use Hadoop with Lucene?
> --
> View this message in context:
> http://www.nabble.com/indexing-100GB-of-data-tp24600563p24602064.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>

Re: indexing 100GB of data

Posted by prashant ullegaddi <pr...@gmail.com>.
Yes, you can use Hadoop with Lucene. Borrow some code from Nutch. Look at
org.apache.nutch.indexer.IndexerMapReduce and
org.apache.nutch.indexer.Indexer.

Prashant.

On Wed, Jul 22, 2009 at 2:00 PM, m.harig <m....@gmail.com> wrote:

>
> Thanks Shai
>
>           So there won't be a problem searching that kind of large index,
> am I right?
>
>           Can anyone tell me, is it possible to use Hadoop with Lucene?
> --
> View this message in context:
> http://www.nabble.com/indexing-100GB-of-data-tp24600563p24602064.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>

Re: indexing 100GB of data

Posted by "m.harig" <m....@gmail.com>.
Thanks Shai

           So there won't be a problem searching that kind of large index,
am I right?

           Can anyone tell me, is it possible to use Hadoop with Lucene?
-- 
View this message in context: http://www.nabble.com/indexing-100GB-of-data-tp24600563p24602064.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: indexing 100GB of data

Posted by Shai Erera <se...@gmail.com>.
From my experience, you shouldn't have any problems indexing that amount of
content, even into one index. I've successfully indexed 450 GB of data w/
Lucene, and I believe it can scale much higher when rich text documents are
indexed. Though I haven't tried yet, I believe it can scale into the 1-5 TB
domain, on a modern CPU + HD and enough RAM.

Usually, when rich text documents are involved, considerable time is
spent converting them into raw text. The raw text extracted from a rich text
document (PDF, DOC, HTML) is usually (based on my measurements) 15-20% of the
original size, and that is compressed even more when added to Lucene.
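
To make that concrete, here is the plain arithmetic behind applying the 15-20% figure to the 100 GB in question (illustrative numbers only):

```java
// Applies the 15-20% raw-text extraction ratio from above to a 100 GB
// corpus of rich documents; plain arithmetic, just to make the estimate
// concrete.
public class RawTextEstimate {

    /** Estimated raw-text bytes extracted from richBytes of rich documents. */
    static long rawTextBytes(long richBytes, double ratio) {
        return (long) (richBytes * ratio);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long corpus = 100L * gb;
        System.out.println(rawTextBytes(corpus, 0.15) / gb + " GB"); // prints "15 GB" (low end)
        System.out.println(rawTextBytes(corpus, 0.20) / gb + " GB"); // prints "20 GB" (high end)
    }
}
```

The index built from that raw text would then be smaller still, since Lucene compresses what it stores.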

I hope this helps. BTW, you can always just try to index that amount of
content in one index on your machine and decide if the machine can handle
that amount of data.

Shai

On Wed, Jul 22, 2009 at 9:07 AM, m.harig <m....@gmail.com> wrote:

>
> hello all
>
>             We've got 100GB of data in various formats (doc, txt, pdf, ppt,
> etc.) and a separate parser for each file format, so we're going to index
> the data with Lucene. (We didn't use Nutch because we were put off by its
> setup.) My doubt is: will it be scalable when I index those documents? We
> planned to build a separate index for each file format and to use a multi
> index reader for searching. Please can anyone advise me on the following:
>
>          1. Are we going about this the right way?
>          2. Please suggest settings for mergeFactor and segments.
>          3. How large an index can Lucene handle?
>          4. Will it cause a Java OOM (OutOfMemoryError)?
> --
> View this message in context:
> http://www.nabble.com/indexing-100GB-of-data-tp24600563p24600563.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>