Posted to user@nutch.apache.org by Alexander E Genaud <lx...@pobox.com> on 2006/03/06 12:53:34 UTC

Offline search (Vicaya 0.1)

Hello,

I've just released a modified version of Nutch 0.7.1 and Tomcat 5.0 that
runs off a CD-ROM or local hard drive, cross-platform:

http://sf.net/projects/vicaya

My ambitions are not 'the whole web' but a small and static collection
of pages. I intend to allow users to use nutch offline with the
occasional online content and index update (RSS, webstart, and/or
Subversion). Please let me know if such questions are out of scope.

I have found that reading the segments from CD-ROM is the biggest
performance bottleneck. However, I do not want to require that users
copy the entire segments directory to disk. Is it possible to separate
some data - such as the reverse index - from the other fields? Would
this require a change to Lucene's or Nutch's source code?
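To make the idea concrete, one workaround that needs no Lucene or Nutch
change is to copy only the comparatively small index files to local disk
at startup and leave the bulky segment data (raw content, parsed text)
on the disc. A minimal sketch in modern Java follows; the "index"
subdirectory name and paths are illustrative, not necessarily Nutch's
exact on-disc layout:

```java
import java.io.IOException;
import java.nio.file.*;

/**
 * Sketch: copy the (small) index directory from the CD-ROM to a local
 * temp directory at startup, leaving the (large) segments on the disc.
 */
public class IndexLocalizer {

    /** Recursively copy srcDir into dstDir. */
    static void copyTree(Path srcDir, Path dstDir) throws IOException {
        Files.walk(srcDir).forEach(src -> {
            try {
                Path dst = dstDir.resolve(srcDir.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);       // keep directory structure
                } else {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path cdrom = Paths.get(args[0]);                 // CD-ROM mount point
        Path local = Files.createTempDirectory("vicaya-index");
        copyTree(cdrom.resolve("index"), local);         // index only, not segments/
        System.out.println("index copied to " + local);
    }
}
```

The search webapp would then be pointed at the local copy while summaries
and cached content are still read from the CD-ROM.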

I am considering importing content and index segments into an SVN
repository so that users may receive periodic updates. Will the
segments directory lend itself well to SVN patches? I have
experimented mostly with intranet search, but I've noticed that whole
web search creates dated indices. Might it be a matter of adding new
crawl segments since the last update?
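Since Nutch names crawl segments with a sortable timestamp (e.g.
20060306125334), an incremental update could in principle ship only the
directories whose names sort after the newest segment the user already
has. A hedged sketch of that selection step - the "last seen" marker
convention here is made up for illustration:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

/**
 * Sketch: pick out segment directories newer than the last update,
 * relying on Nutch's timestamp-style segment names sorting
 * chronologically as plain strings.
 */
public class SegmentDelta {

    /** Return segment directory names that sort after lastSeen, oldest first. */
    static List<String> newSegments(Path segmentsDir, String lastSeen) throws IOException {
        try (Stream<Path> dirs = Files.list(segmentsDir)) {
            return dirs.filter(Files::isDirectory)
                       .map(p -> p.getFileName().toString())
                       .filter(name -> name.compareTo(lastSeen) > 0)
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

Whether SVN diffs the binary segment files efficiently is a separate
question; shipping whole new segment directories sidesteps it.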

Thanks,
Alex
--
Those who can make you believe absurdities can make you commit atrocities
-- François Marie Arouet (Voltaire)
http://cph.blogsome.com
http://genaud.org/alex/key.asc
--
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1

Re: Offline search (Vicaya 0.1)

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
storing the index on the hard drive would be a good idea.
Take a look at the NutchBean init method to get an idea of what you
need to change.
It should be simple: just allow providing a location for the index
that is different from the segments folder.
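In outline, the change amounts to something like the following - a
hedged sketch, not NutchBean's actual code; the property name
"vicaya.index.dir" is invented for illustration:

```java
import java.io.File;

/**
 * Sketch: resolve the index location from an (invented) system
 * property, falling back to the default location next to the
 * segments folder.
 */
public class IndexLocation {

    /** Prefer an explicitly configured index dir; fall back to <searchDir>/index. */
    static File resolveIndexDir(File searchDir) {
        String configured = System.getProperty("vicaya.index.dir");
        if (configured != null && !configured.isEmpty()) {
            return new File(configured);       // e.g. a copy on the local hard drive
        }
        return new File(searchDir, "index");   // default: alongside the segments
    }
}
```

The webapp would pass the configured directory to the searcher while the
segments stay on the CD-ROM.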

Stefan

---------------------------------------------------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net