Posted to user@nutch.apache.org by v k <vk...@gmail.com> on 2007/12/18 17:21:28 UTC

Infrastructure Question

Hello,

I am using Lucene to build an index from roughly 10 million documents.
The documents are about 4 TB in total.

After some trial runs indexing a subset of the documents, I am trying
to figure out a hosting service configuration to create a full index
from the entire 10 TB of data. As I am still unsure how this project
will turn out, I am not purchasing hardware/RAM but considering a web
host for the following purposes:
1) Downloading the data and starting to index it.
2) Hosting the web front end to this index, which will be a Python
framework (e.g. Django).

I am seriously contemplating signing up with Joyent for this plan:
AMD Opteron x64 multi-core servers with 4GiB RAM per core
1/16 (Burstable up to 95%)
1 TB bandwidth/month, 1 GB RAM, plus as much NAS storage as I can
afford to pay for.

My QUESTION is: will this RAM and CPU be sufficient during
development of the search application and building the index, etc., or
is it so under-equipped in terms of hardware that the
development version of my application will not work?
I understand that having more RAM is always good, but is 1 GB as good as nothing?

This setup is NOT for production but for development, so I can get
my hands dirty with Lucene, which will require plenty of tweaks as the
project moves along.

What initial configuration would you recommend for a development
version given the corpus size? I am not even sure how large my index
will be at this point.

I hope to build my indexes this way, and once the search
infrastructure is working and the web front end is complete, I plan to
worry about redundancy, availability and scalability for the many
users I hope to provide this free service to :-)

Many of you in this forum have built successful products with Lucene.
To name a few I am aware of: Ken Krugler, James Ryley, Dennis Kubes.

Some of you must have started with small machines, test set-ups, etc.
where you built your initial search apps. I hope to receive some
advice about my plan and approach to start building an infrastructure
to support my Lucene app.

Thank you.

Venkat

Re: Infrastructure Question

Posted by Dennis Kubes <ku...@apache.org>.
When we started we had 5 machines that were 800 MHz and maybe 512 MB to
1 GB of RAM.  It was enough to get started and start testing things,
although I wouldn't recommend that setup because looking back I don't
think it was enough.  And of course we started getting OutOfMemory
errors pretty quickly as our data grew.

Remember that in search, serving is the hardware-intensive part.  For
getting your hands dirty and processing the data, the hardware you
propose should be more than sufficient.  Amazon's EC2, especially the
large and extra-large instances, would also work very well for this and
give you the opportunity to grow your serving computer if/when needed.

Dennis Kubes



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Infrastructure Question

Posted by v k <vk...@gmail.com>.
Sorry about that. For some reason, my post did not show up in the
mailing list and I still cannot see it (maybe a settings issue). I
don't mean to barrage the mailing list with the same question. Thanks
for the advice.


On Dec 18, 2007 11:43 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Hi Venkat,
>
> There is no need to post your question multiple times or cross-post.
> People are distributed all around the world on this list and aren't
> always available or able to answer your question.  Having to wait
> 11 hours for an answer on a free mailing list is not at all
> unreasonable.
>
> If you are just looking to get your hands dirty with Lucene, why not
> just start with a subset on a machine you already own and work to scale
> up?  This way, you could start with what you have available and get a
> feel for your memory usage, etc.  Then you will be in a better
> position to decide what your needs are.
>
> If there is one thing that is true about search it is the fact that
> everyone's situation is different.
>
> Cheers,
> Grant


Re: Infrastructure Question

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Venkat,

There is no need to post your question multiple times or cross-post.
People are distributed all around the world on this list and aren't
always available or able to answer your question.  Having to wait
11 hours for an answer on a free mailing list is not at all
unreasonable.

If you are just looking to get your hands dirty with Lucene, why not
just start with a subset on a machine you already own and work to scale
up?  This way, you could start with what you have available and get a
feel for your memory usage, etc.  Then you will be in a better
position to decide what your needs are.
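
One way to turn a subset run into a rough sizing number is a simple
extrapolation. The Python sketch below is only a back-of-envelope
illustration: it assumes index size grows roughly linearly with input
size (which may not hold for a given corpus or analyzer), and the
function name and the sample figures are hypothetical, not measurements
from this thread.

```python
def extrapolate_index_size(subset_input_bytes, subset_index_bytes, full_input_bytes):
    """Estimate the full index size from a subset run.

    Assumption: the index-to-input size ratio seen on the subset
    also holds for the full corpus (linear scaling).
    """
    if subset_input_bytes <= 0:
        raise ValueError("subset_input_bytes must be positive")
    return full_input_bytes * subset_index_bytes / subset_input_bytes


GB = 10**9
TB = 10**12

# Hypothetical figures for illustration only: a 50 GB subset
# that produced a 10 GB index, scaled up to a 4 TB corpus.
estimate = extrapolate_index_size(50 * GB, 10 * GB, 4 * TB)
print(f"Estimated full index size: {estimate / GB:.0f} GB")
```

Running the same measurement on two or three subsets of different sizes
would show whether the linear assumption is reasonable before relying
on the estimate.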

If there is one thing that is true about search it is the fact that  
everyone's situation is different.

Cheers,
Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



