Posted to user@nutch.apache.org by Winton Davies <wd...@cs.stanford.edu> on 2008/06/25 02:03:28 UTC

Wiki Index

So, kudos - finally had a turnkey indexing of a local file:///
install of static Wikipedia 07 (3 million pages) - some fiddling with
Tomcat and stuff, but the crawl and index ran perfectly.

However....

a) No anchor text ever seems to be picked up in the index.
b) There doesn't appear to be any PageRank-style score computed (less important).
c) Search performance on a fairly decent EC2 machine is kinda slow (1+ seconds per query).

My guess is that either the file:/// URLs or the fact that this is a
single-site index is causing the loss of (a) and (b).  Does anyone know
how to get these computed, and even better how to get them computed
incrementally (so I don't have to spend another 30 hours recrawling,
parsing, etc.)?
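
In case it helps frame the question, this is roughly what I imagine is
needed - a sketch only, assuming the stock bin/nutch tools and a crawl
directory named "crawl"; corrections welcome:

     # Sketch: rebuild the link database from the existing segments, then
     # re-index against it so anchor text from the linkdb ends up in the index.
     # For a single-host / file:/// crawl, links within the same host are
     # ignored by default, so the linkdb may come out empty unless
     # db.ignore.internal.links is set to false in conf/nutch-site.xml.
     bin/nutch invertlinks crawl/linkdb -dir crawl/segments
     bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb crawl/segments/*
     bin/nutch merge crawl/index-new crawl/indexes-new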

The crawl database directory is 33 GB. What parts could I eliminate
if I were to copy it to another machine? Do I just need the
crawl/index subdir? Anything I can do to speed up searches (other
than memory, CPU, and disk)?
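
To make the copy question concrete, my understanding (which may be
wrong) is that the search webapp only reads what searcher.dir points
at - index/ (or indexes/), linkdb/, and segments/ for summaries - so
something like this sketch should be enough, without the crawldb:

     # Sketch: copy only the parts the searcher needs to another machine.
     # crawldb/ shouldn't be needed at search time.
     rsync -a crawl/index crawl/linkdb crawl/segments otherhost:/data/crawl/
     # then set searcher.dir in nutch-site.xml on that machine to /data/crawl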

Cheers,
  Winton


EC2 machine:

     1.7 GB memory
     5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
     350 GB instance storage (340 GB plus 10 GB root partition)
     32-bit platform
     I/O Performance: Moderate



Re: Wiki Index

Posted by Winton Davies <wd...@cs.stanford.edu>.
No one has any idea? Should I ping nutch-dev?

Winton

>So, kudos - finally had a turnkey indexing of a local file:///
>install of static Wikipedia 07 (3 million pages) - some fiddling
>with Tomcat and stuff, but the crawl and index ran perfectly.
>
>However....
>
>a) No anchor text ever seems to be picked up in the index.
>b) There doesn't appear to be any PageRank-style score computed (less important).
>c) Search performance on a fairly decent EC2 machine is kinda slow (1+ seconds per query).
>
>My guess is that either the file:/// URLs or the fact that this is a
>single-site index is causing the loss of (a) and (b).  Does anyone know
>how to get these computed, and even better how to get them computed
>incrementally (so I don't have to spend another 30 hours recrawling,
>parsing, etc.)?
>
>The crawl database directory is 33 GB. What parts could I eliminate
>if I were to copy it to another machine? Do I just need the
>crawl/index subdir? Anything I can do to speed up searches (other
>than memory, CPU, and disk)?
>
>Cheers,
>  Winton
>
>
>EC2 machine:
>
>     1.7 GB memory
>     5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
>     350 GB instance storage (340 GB plus 10 GB root partition)
>     32-bit platform
>     I/O Performance: Moderate