Posted to user@nutch.apache.org by Winton Davies <wd...@cs.stanford.edu> on 2008/06/25 02:03:28 UTC
Wiki Index
So, kudos - I finally have a turnkey index of a local file:///
install of a static Wikipedia '07 dump (3 million pages). Some fiddling with
Tomcat and such, but the crawl and index ran perfectly.
However....
a) No anchor text ever seems to be used.
b) There doesn't appear to be any PageRank computed (less important).
c) Query performance on a fairly decent EC2 machine is somewhat slow (1+ seconds).
My guess is that either the file:/// URLs or the fact that this is a
single-site index is causing the loss of (a) and (b). Does anyone know
how to get these to run, and even better, how to get them computed
incrementally (so I don't have to spend another 30 hours recrawling,
parsing, etc.)?
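[One way the link inversion and indexing steps are often re-run over an existing crawl without a full recrawl is via Nutch's CLI; a minimal sketch, assuming a Nutch 0.x-era layout under crawl/ (paths are illustrative, not verified against this setup):]

```shell
# Rebuild the link database (the source of anchor text) from the
# already-fetched segments, then re-index using it. No recrawl needed.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```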
The crawl database directory is 33 GB. Which parts could I eliminate
if I were to copy it to another machine? Do I just need the
crawl/index subdirectory? Is there anything I can do to speed up searches
(other than more memory, CPU, or disk)?
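[For reference, a minimal copy for query serving might look like the sketch below. The assumption (not confirmed in this thread) is that the search webapp reads the index, the linkdb (for anchors), and the segments (for summaries and cached content), while the crawldb is only needed for further crawling; `user@remote` is a placeholder host.]

```shell
# Hypothetical minimal copy for serving searches only; crawl/crawldb is
# assumed to be needed just for recrawling, not at query time.
rsync -a crawl/index crawl/linkdb crawl/segments user@remote:/data/crawl/
```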
Cheers,
Winton
EC2 machine:
1.7 GB memory
5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
350 GB instance storage (340 GB plus 10 GB root partition)
32-bit platform
I/O Performance: Moderate
Re: Wiki Index
Posted by Winton Davies <wd...@cs.stanford.edu>.
No-one has any idea? Should I ping nutch-dev?
Winton