Posted to dev@nutch.apache.org by Nutch User <nu...@gmail.com> on 2007/01/02 22:24:15 UTC

Nutch Programmer Wanted

 Hello Nutch Developers,

I hope this post is appropriate for the list, and apologize if it is not.

Our company currently uses Nutch 0.7 for a 4-6 billion page search
engine.  This engine is used by both internal staff and external users to
search internet content.  As you well know, there are many issues
associated with an index this large.  We were hoping some of these issues
would be addressed in 0.8, but we don't think 0.8 is quite ready for
prime time yet.

Therefore, we would like to hire a Nutch programmer to help us make Nutch
into a more viable solution for large indexes such as ours.  We prefer a
full-time person to work on-site with us in the US, but will consider
possible remote work as well.  If you are interested, please reply to this
e-mail address (nutchjob@gmail.com) with your resume and salary
requirements.  Please include any Java experience, Nutch-specific
experience, and any experience with large data sets (particularly with large
URL databases).  We (the company) prefer to remain confidential for now, but
will discuss details with candidates.

Thank you for your time,
Nutch User

Re: Nutch Programmer Wanted

Posted by Andrzej Bialecki <ab...@getopt.org>.
e w wrote:
> (The message below was posted to nutch-dev a few days ago.) Can anyone
> (anonymous or otherwise) confirm whether it's possible to use Nutch 0.7 for
> a "4-6 billion page search engine"? Is this a typo or for real? Just curious
> and if it's true what were the major issues e.g. time, RAM, (storage
> presumably)? My understanding was that the practical limit on 0.7 was about
> 100 million pages whatever hardware you have.

Unless we are talking about an extensively rewritten version of 0.7, I'd
say it's next to impossible to use an out-of-the-box 0.7 for anything
more than 200-300 million URLs, if even that many. The main bottleneck was
the DB operations, which on any type of hardware could take days
to complete.

These limitations have been largely removed in 0.8 and later, due to the 
Hadoop framework.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Programmer Wanted

Posted by e w <ep...@gmail.com>.
(The message below was posted to nutch-dev a few days ago.) Can anyone
(anonymous or otherwise) confirm whether it's possible to use Nutch 0.7 for
a "4-6 billion page search engine"? Is this a typo or for real? Just curious,
and if it's true, what were the major issues, e.g. time, RAM, (storage
presumably)? My understanding was that the practical limit on 0.7 was about
100 million pages whatever hardware you have.

-Ed

On 1/3/07, Nutch User <nu...@gmail.com> wrote:
>
> Hello Nutch Developers,
>
> I hope this post is appropriate for the list, and apologize if it is not.
>
> Our company is currently utilizing Nutch 0.7 for a 4-6 billion page search
> engine.  This engine is used both by internal staff and external users for
> searching on internet content.  As you well know, there are many many issues
> associated with this large of an index.  We were hoping some of these issues
> would be addressed in 0.8, but we don't think 0.8 is quite ready yet for
> prime time.
>
> Therefore, we would like to hire a Nutch programmer to help us make Nutch
> into a more viable solution for large indexes such as ours.  We prefer a
> full-time person to work on-site with us in the US, but will consider
> possible remote work as well.  If you are interested, please reply to this
> e-mail address ( nutchjob@gmail.com) with your resume and salary
> requirements.  Please include any java experience, Nutch-specific
> experience, and any experience with large data sets (particularly with large
> url databases).  We (the company) prefer to remain confidential for now, but
> will discuss details with candidates.
>
> Thank you for your time,
> Nutch User
>
>