Posted to user@nutch.apache.org by Doğacan Güney <do...@gmail.com> on 2008/10/01 09:36:46 UTC

Re: Dumping raw html and javascript

On Mon, Sep 29, 2008 at 9:19 PM, Kevin MacDonald <ke...@hautesecure.com> wrote:
> Once I have done a crawl I have a need to pass all of the raw HTML and
> javascript that has been fetched through a custom parser. During a fetch
> does nutch store all of the raw content including HTML tags on disk?

Yes, as long as fetcher.store.content is set to true (which it is by default).

The raw content of each page is saved under the <segment>/content directory.
To retrieve the content for a particular URL, you can try this:

bin/nutch readseg -get <segment> <url> -noparse -noparsedata -nofetch \
    -nogenerate -noparsetext
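
If you need the raw content for many URLs rather than one, the same
command can be wrapped in a small loop. This is only a rough sketch:
NUTCH_BIN, dump_raw, dump_all and the URL list file are hypothetical names
made up for illustration, and it assumes a plain text file with one URL
per line. It uses the same readseg flags as the command above.

```shell
#!/bin/sh
# Sketch: print the raw stored content for each URL listed in a file.
# NUTCH_BIN, dump_raw and dump_all are hypothetical names for illustration.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# Print the raw content stored in one segment for one URL.
dump_raw() {
    segment="$1"
    url="$2"
    # Switch off every other section so only the raw content is printed.
    "$NUTCH_BIN" readseg -get "$segment" "$url" \
        -noparse -noparsedata -nofetch -nogenerate -noparsetext
}

# Call dump_raw once per non-empty line of the given URL list file.
dump_all() {
    segment="$1"
    url_file="$2"
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        dump_raw "$segment" "$url"
    done < "$url_file"
}
```

Calling dump_all <segment> urls.txt and redirecting the output to a file
would then give you something to feed to your custom parser. Running
readseg once per URL is simple but may be slow on a large crawl.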

> Thanks
>
> Kevin
>

-- 
Doğacan Güney