You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@nutch.apache.org by v k <vk...@gmail.com> on 2007/12/18 17:21:28 UTC
Infrastructure Question
Hello,
I am using Lucene to build an index from roughly 10 million documents
in number. The documents are about 4 TB in total.
After some trial runs, indexing a subset of the documents I am trying
to figure out a hosting service configuration to create a full index
from the entire 10 TB of data. As I am still unsure how this project
will turn out I am not purchasing hardware/ram but considering a web
host.
for the purpose of :
1) download the data and to start indexing it.
2) The web front end to access this index will be a python framework (
eg. Django etc)
I am seriously contemplating signing up with Joyent for this plan:
AMD Opteron x64 multi-core servers with 4GiB RAM per core
1/16 (Burstable up to 95%)
1 TB - Bandwidth/month, 1 GB RAM, + as such as NAS storage as I can
afford to pay for.
My QUESTION is - Will this RAM and CPU be sufficient during
development of the search application and building the index, etc. or
is it so abysmal and under-equipped in terms of hardware that the
development version of my application will not work.
I understand that having more RAM is always good, but is 1GB as good as nothing?
This setup is NOT for production but for for development so I can get
my hands dirty with lucene which will require plenty of tweaks as the
project moves along.
What initial configuration would you recommend for a development
version given the corpus size. I am not even sure how large my index
will look like at this point.
I hope to build an my indexes this way and once the search
infrastructure is working and the web-front end complete, I plan to
worry about Redundancy, availability and scalability for the many
users I hope to provide this free service for :-)
Many of you in this forum have built successful products with Lucene.
To name a few I am aware of - Ken Krugle, James Ryley, Dennis Kubes
Some of you must have started with small machines,test set-ups etc
where you built your initial search apps. I hope to receive some
advise about my plan and approach to start building an infrastructure
to support my Lucene app.
Thank you.
Venkat
Re: Infrastructure Question
Posted by Dennis Kubes <ku...@apache.org>.
When we started we had 5 machines that were 800mghz and maybe 512M to 1G
of ram. It was enough to get started and start testing things
although I wouldn't recommend that setup because looking back I don't
think it was enough. And of course we started getting OutOfMemory
errors pretty quickly as our data grew.
Remember that in search the serving is the hardware intensive part. For
getting your hands dirty and processing the data, the hardware you
propose should be more than sufficient. Amazon's EC2, especially the
large and extra-large instances would also work very well for this and
give you the opportunity to grow your serving computer if/when needed.
Dennis Kubes
v k wrote:
> Sorry about that. For some reason, my post did not show up in the
> mailing list and I still cannot see it ( maybe a settings issue). I
> don't mean to barrage the mailing list with the same question. Thanks
> for the advise.
>
>
> On Dec 18, 2007 11:43 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Hi Venkat,
>>
>> There is no need to post your question multiple times or cross-post.
>> People are distributed all around the world on this list and aren't
>> always available or capable to answer your question. Having to wait
>> 11 hours for an answer on a free mailing list is not at all
>> unreasonable.
>>
>> If you are just looking to get your hands dirty with Lucene, why not
>> just start w/ a subset on a machine you already own and work to scale
>> up? This way, you could start with what you have available and get a
>> feel for your memory usage, etc. Then you will be in a better
>> position to decide what your needs are.
>>
>> If there is one thing that is true about search it is the fact that
>> everyone's situation is different.
>>
>> Cheers,
>> Grant
>>
>>
>> On Dec 18, 2007, at 11:21 AM, v k wrote:
>>
>>> Hello,
>>>
>>> I am using Lucene to build an index from roughly 10 million documents
>>> in number. The documents are about 4 TB in total.
>>>
>>> After some trial runs, indexing a subset of the documents I am trying
>>> to figure out a hosting service configuration to create a full index
>>> from the entire 10 TB of data. As I am still unsure how this project
>>> will turn out I am not purchasing hardware/ram but considering a web
>>> host.
>>> for the purpose of :
>>> 1) download the data and to start indexing it.
>>> 2) The web front end to access this index will be a python framework (
>>> eg. Django etc)
>>>
>>> I am seriously contemplating signing up with Joyent for this plan:
>>> AMD Opteron x64 multi-core servers with 4GiB RAM per core
>>> 1/16 (Burstable up to 95%)
>>> 1 TB - Bandwidth/month, 1 GB RAM, + as such as NAS storage as I
>>> can
>>> afford to pay for.
>>>
>>> My QUESTION is - Will this RAM and CPU be sufficient during
>>> development of the search application and building the index, etc. or
>>> is it so abysmal and under-equipped in terms of hardware that the
>>> development version of my application will not work.
>>> I understand that having more RAM is always good, but is 1GB as good
>>> as nothing?
>>>
>>> This setup is NOT for production but for for development so I can get
>>> my hands dirty with lucene which will require plenty of tweaks as the
>>> project moves along.
>>>
>>> What initial configuration would you recommend for a development
>>> version given the corpus size. I am not even sure how large my index
>>> will look like at this point.
>>>
>>> I hope to build an my indexes this way and once the search
>>> infrastructure is working and the web-front end complete, I plan to
>>> worry about Redundancy, availability and scalability for the many
>>> users I hope to provide this free service for :-)
>>>
>>> Many of you in this forum have built successful products with Lucene.
>>> To name a few I am aware of - Ken Krugle, James Ryley, Dennis Kubes
>>>
>>> Some of you must have started with small machines,test set-ups etc
>>> where you built your initial search apps. I hope to receive some
>>> advise about my plan and approach to start building an infrastructure
>>> to support my Lucene app.
>>>
>>> Thank you.
>>>
>>> Venkat
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Infrastructure Question
Posted by v k <vk...@gmail.com>.
Sorry about that. For some reason, my post did not show up in the
mailing list and I still cannot see it ( maybe a settings issue). I
don't mean to barrage the mailing list with the same question. Thanks
for the advise.
On Dec 18, 2007 11:43 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Hi Venkat,
>
> There is no need to post your question multiple times or cross-post.
> People are distributed all around the world on this list and aren't
> always available or capable to answer your question. Having to wait
> 11 hours for an answer on a free mailing list is not at all
> unreasonable.
>
> If you are just looking to get your hands dirty with Lucene, why not
> just start w/ a subset on a machine you already own and work to scale
> up? This way, you could start with what you have available and get a
> feel for your memory usage, etc. Then you will be in a better
> position to decide what your needs are.
>
> If there is one thing that is true about search it is the fact that
> everyone's situation is different.
>
> Cheers,
> Grant
>
>
> On Dec 18, 2007, at 11:21 AM, v k wrote:
>
> > Hello,
> >
> > I am using Lucene to build an index from roughly 10 million documents
> > in number. The documents are about 4 TB in total.
> >
> > After some trial runs, indexing a subset of the documents I am trying
> > to figure out a hosting service configuration to create a full index
> > from the entire 10 TB of data. As I am still unsure how this project
> > will turn out I am not purchasing hardware/ram but considering a web
> > host.
> > for the purpose of :
> > 1) download the data and to start indexing it.
> > 2) The web front end to access this index will be a python framework (
> > eg. Django etc)
> >
> > I am seriously contemplating signing up with Joyent for this plan:
> > AMD Opteron x64 multi-core servers with 4GiB RAM per core
> > 1/16 (Burstable up to 95%)
> > 1 TB - Bandwidth/month, 1 GB RAM, + as such as NAS storage as I
> > can
> > afford to pay for.
> >
> > My QUESTION is - Will this RAM and CPU be sufficient during
> > development of the search application and building the index, etc. or
> > is it so abysmal and under-equipped in terms of hardware that the
> > development version of my application will not work.
> > I understand that having more RAM is always good, but is 1GB as good
> > as nothing?
> >
> > This setup is NOT for production but for for development so I can get
> > my hands dirty with lucene which will require plenty of tweaks as the
> > project moves along.
> >
> > What initial configuration would you recommend for a development
> > version given the corpus size. I am not even sure how large my index
> > will look like at this point.
> >
> > I hope to build an my indexes this way and once the search
> > infrastructure is working and the web-front end complete, I plan to
> > worry about Redundancy, availability and scalability for the many
> > users I hope to provide this free service for :-)
> >
> > Many of you in this forum have built successful products with Lucene.
> > To name a few I am aware of - Ken Krugle, James Ryley, Dennis Kubes
> >
> > Some of you must have started with small machines,test set-ups etc
> > where you built your initial search apps. I hope to receive some
> > advise about my plan and approach to start building an infrastructure
> > to support my Lucene app.
> >
> > Thank you.
> >
> > Venkat
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Infrastructure Question
Posted by Grant Ingersoll <gs...@apache.org>.
Hi Venkat,
There is no need to post your question multiple times or cross-post.
People are distributed all around the world on this list and aren't
always available or capable to answer your question. Having to wait
11 hours for an answer on a free mailing list is not at all
unreasonable.
If you are just looking to get your hands dirty with Lucene, why not
just start w/ a subset on a machine you already own and work to scale
up? This way, you could start with what you have available and get a
feel for your memory usage, etc. Then you will be in a better
position to decide what your needs are.
If there is one thing that is true about search it is the fact that
everyone's situation is different.
Cheers,
Grant
On Dec 18, 2007, at 11:21 AM, v k wrote:
> Hello,
>
> I am using Lucene to build an index from roughly 10 million documents
> in number. The documents are about 4 TB in total.
>
> After some trial runs, indexing a subset of the documents I am trying
> to figure out a hosting service configuration to create a full index
> from the entire 10 TB of data. As I am still unsure how this project
> will turn out I am not purchasing hardware/ram but considering a web
> host.
> for the purpose of :
> 1) download the data and to start indexing it.
> 2) The web front end to access this index will be a python framework (
> eg. Django etc)
>
> I am seriously contemplating signing up with Joyent for this plan:
> AMD Opteron x64 multi-core servers with 4GiB RAM per core
> 1/16 (Burstable up to 95%)
> 1 TB - Bandwidth/month, 1 GB RAM, + as such as NAS storage as I
> can
> afford to pay for.
>
> My QUESTION is - Will this RAM and CPU be sufficient during
> development of the search application and building the index, etc. or
> is it so abysmal and under-equipped in terms of hardware that the
> development version of my application will not work.
> I understand that having more RAM is always good, but is 1GB as good
> as nothing?
>
> This setup is NOT for production but for for development so I can get
> my hands dirty with lucene which will require plenty of tweaks as the
> project moves along.
>
> What initial configuration would you recommend for a development
> version given the corpus size. I am not even sure how large my index
> will look like at this point.
>
> I hope to build an my indexes this way and once the search
> infrastructure is working and the web-front end complete, I plan to
> worry about Redundancy, availability and scalability for the many
> users I hope to provide this free service for :-)
>
> Many of you in this forum have built successful products with Lucene.
> To name a few I am aware of - Ken Krugle, James Ryley, Dennis Kubes
>
> Some of you must have started with small machines,test set-ups etc
> where you built your initial search apps. I hope to receive some
> advise about my plan and approach to start building an infrastructure
> to support my Lucene app.
>
> Thank you.
>
> Venkat
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org