Posted to user@lucenenet.apache.org by Li Bing <lb...@gmail.com> on 2009/06/03 12:12:04 UTC

How Large can the size of the Index Lucene generated Be?

Hi, all,

I am planning to use Lucene to build an index over a large number of
URLs gathered by crawling, so the total data size should be huge. If
so, I worry that Lucene's performance will degrade as the number of
crawled pages increases. Have you ever encountered such a problem?
How large can a Lucene-generated index be? What about the
performance? I would like to use multiple machines for load balancing
if the size exceeds the maximum limit.

Thanks so much!
LB Labs

Re: How Large can the size of the Index Lucene generated Be?

Posted by Shashi Kant <sh...@gmail.com>.
You need to consider an appropriate partitioning strategy depending on
the types of queries you will run, the performance you expect, the
types of data/fields you store, and so on.

Hence you need to experiment and see what works for *you*. There are
several best practices shared on the Apache Lucene website, the Solr
site, and related message boards that should get you started.

Bingoogle is your buddy.
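To make the partitioning idea concrete, here is a minimal sketch (not
Lucene API, just the routing logic) of hash-based sharding: each
document is assigned to a shard by hashing its URL, so the same URL
always lands in the same partition. The shard count of 4 is a made-up
number you would tune experimentally.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count; tune for your own data and hardware


def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash routing: the same URL always maps to the same shard."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each shard would hold its own independent Lucene index; queries fan
# out to every shard and the per-shard results are merged afterwards.
urls = ["http://example.com/a", "http://example.com/b"]
assignments = {u: shard_for(u) for u in urls}
```

Because the routing is deterministic, re-crawling a URL updates the
document in the same partition instead of creating a duplicate
elsewhere.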




Re: How Large can the size of the Index Lucene generated Be?

Posted by Li Bing <lb...@gmail.com>.
Hi, Todd Carrico,

I have NOT counted the exact size of the total data. What I am trying
to crawl is a popular directory site in China. Most successful
business sites are included, and many small, lesser-known sites are
there as well. The total number of sites is somewhat more than ten
thousand. Since this is my first time trying this, I am not sure
whether Lucene is a good choice in terms of performance.

Thanks,
LB


Re: How Large can the size of the Index Lucene generated Be?

Posted by Li Bing <lb...@gmail.com>.
I might not have explained that clearly. The more than ten thousand
entries are Web sites, not URLs. Those sites are listed on a popular
Web portal in China, and a lot of people access the Internet through
its directories. So the total number of URLs to be indexed must be huge.

thanks,
LB


Re: How Large can the size of the Index Lucene generated Be?

Posted by Wayne Douglas <wa...@codingvista.com>.
Around 10,000 indexed URLs would be a relatively small index, I'd have
thought.
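A quick back-of-envelope calculation supports that. The numbers below
are pure assumptions for illustration (10 KB of extracted text per
page, an index roughly 30% of the raw text size), not measurements:

```python
def estimate_index_mb(num_docs: int,
                      avg_text_kb: float = 10.0,
                      index_ratio: float = 0.3) -> float:
    """Very rough index size: raw text volume times an assumed
    index-to-text ratio. Both defaults are illustrative guesses."""
    return num_docs * avg_text_kb * index_ratio / 1024.0


# ~10,000 one-page sites under these assumptions comes out to a few
# tens of megabytes -- tiny by Lucene standards.
small = estimate_index_mb(10_000)
```

Actual sizes depend heavily on which fields you store versus merely
index, so treat this strictly as an order-of-magnitude check.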




-- 
Cheers,

w://

RE: How Large can the size of the Index Lucene generated Be?

Posted by Todd Carrico <To...@match.com>.
Ok, so what I'm seeing is between 10k and 100k sites. If your crawler
just hits the first page of each, that would be between 10,000 and
100,000 documents.

If your crawler goes deeper into the sites, then it depends on how
much deeper. Going, say, three pages deep with on the order of 100
links per page, you could be looking at roughly 10^10 to 10^11
documents worst case. The rest depends on how much you actually add
to your index.

I don't think you'll have an issue with this as long as you have
decent hardware.
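The depth-based worst case above is easy to sanity-check with a small
calculation. The branching factor of 100 links per page is purely an
assumption; real sites vary widely:

```python
def worst_case_pages(num_sites: int, depth: int,
                     links_per_page: int = 100) -> int:
    """Worst-case page count if every page links to `links_per_page`
    new pages, counting all levels from the front page (depth 0)
    down to `depth`."""
    return num_sites * sum(links_per_page ** d for d in range(depth + 1))


front_pages_only = worst_case_pages(10_000, 0)   # just the 10,000 front pages
three_deep = worst_case_pages(10_000, 3)         # ~1.01e10 pages worst case
```

The point of the exercise is how fast the worst case explodes with
depth, which is why real crawlers cap depth and deduplicate URLs
aggressively.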


RE: How Large can the size of the Index Lucene generated Be?

Posted by Todd Carrico <To...@match.com>.
Can you be a bit more specific with sizing? Words like "large" and
"huge" are relative. How many documents are you sizing for?

My understanding is that hardware, specifically disk I/O, would be a
big part of it. There are several strategies inside Lucene for
dealing with largish indexes, and beyond that you may be able to
partition indexes across servers.

We have folks on this list operating in the terabyte range, I believe,
and not complaining about perf issues.

tc
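Partitioning indexes across servers implies a scatter-gather search:
send the query to every partition, then merge the per-partition hits
into one ranked list. A minimal sketch of the merge step (the
`(score, doc_id)` result format here is invented for illustration;
Lucene's own multi-index searchers do the equivalent internally):

```python
import heapq


def merge_shard_hits(shard_results, k):
    """Merge per-shard lists of (score, doc_id) hits into a single
    global top-k list, highest score first."""
    all_hits = [hit for hits in shard_results for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda h: h[0])


shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.7, "b1"), (0.2, "b2")]
top2 = merge_shard_hits([shard_a, shard_b], 2)  # [(0.9, "a1"), (0.7, "b1")]
```

One caveat worth knowing: scores from independent indexes are only
directly comparable if the shards have similar term statistics, so in
practice people either shard randomly (as the hash routing does) or
normalize scores before merging.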
