
Posted to dev@nutch.apache.org by Charlie Williams <cw...@gmail.com> on 2007/04/25 16:42:39 UTC

retrieving original html from database

I have an index of pages from the web, a bit over 1 million. The fetch took
several weeks to complete, since it was mainly over a small set of domains.
Once we had a completed fetch and index, we began trying to work with the
retrieved text, and found that the cached text is just that, flat text. Is
the original HTML cached anywhere that it can be accessed after the initial
fetch? It would be a shame to have to recrawl all those pages. We are using
Nutch 0.8.

Thanks for any help.

-Charlie

Re: retrieving original html from database

Posted by Charlie Williams <cw...@gmail.com>.

thank you, I will give it a try :)

On 4/25/07, Doğacan Güney <do...@gmail.com> wrote:
>
> On 4/25/07, Charlie Williams <cw...@gmail.com> wrote:
> > I have an index of pages from the web, a bit over 1 million. The fetch
> > took several weeks to complete, since it was mainly over a small set of
> > domains. Once we had a completed fetch and index, we began trying to
> > work with the retrieved text, and found that the cached text is just
> > that, flat text. Is the original HTML cached anywhere that it can be
> > accessed after the initial fetch? It would be a shame to have to
> > recrawl all those pages. We are using Nutch 0.8.
>
> If you have fetcher.store.content set to true then Nutch has stored a
> copy of all the pages in <segment_dir>/content. You can extract
> individual contents with the command "./nutch readseg -get
> <segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext
> -noparsedata".
>
> >
> > Thanks for any help.
> >
> > -Charlie
> >
>
>
> --
> Doğacan Güney
>

Re: retrieving original html from database

Posted by Doğacan Güney <do...@gmail.com>.

On 4/25/07, Charlie Williams <cw...@gmail.com> wrote:
> I have an index of pages from the web, a bit over 1 million. The fetch took
> several weeks to complete, since it was mainly over a small set of domains.
> Once we had a completed fetch and index, we began trying to work with the
> retrieved text, and found that the cached text is just that, flat text. Is
> the original HTML cached anywhere that it can be accessed after the initial
> fetch? It would be a shame to have to recrawl all those pages. We are using
> Nutch 0.8.

If you have fetcher.store.content set to true then Nutch has stored a
copy of all the pages in <segment_dir>/content. You can extract
individual contents with the command "./nutch readseg -get
<segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext
-noparsedata".
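
As a rough sketch of how that command could be wrapped for repeated use,
the shell function below passes the same readseg switches for a given
segment and URL. The function name get_raw_content and the NUTCH_BIN
variable are made up for illustration (NUTCH_BIN just defaults to
bin/nutch), and this assumes fetcher.store.content was true at fetch time:

```shell
# Hypothetical override for the launcher script; defaults to bin/nutch.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# get_raw_content SEGMENT_DIR URL
# Prints the stored content record for URL from SEGMENT_DIR.
get_raw_content() {
    segment_dir="$1"
    url="$2"
    # The -no* switches suppress every segment part except the raw content.
    "$NUTCH_BIN" readseg -get "$segment_dir" "$url" \
        -noparse -nofetch -nogenerate -noparsetext -noparsedata
}
```

For example, "get_raw_content <segment_dir> <url> > page.txt" would save
the record to a file. The output may carry some record metadata ahead of
the page body, so it may need trimming before it is treated as plain HTML.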

>
> Thanks for any help.
>
> -Charlie
>


-- 
Doğacan Güney

Re: Re: retrieving original html from database

Posted by songjue <so...@gmail.com>.

That's just what I need! Thanks, Briggs.




songjue
2007-04-29



From: Briggs
Sent: 2007-04-28 00:12:36
To: nutch-dev@lucene.apache.org
Cc:
Subject: Re: retrieving original html from database

If you need an API for getting the content, can't you just look into
the cachedContent.jsp of the demo search application?  That shows how
to retrieve the original text/html that is stored within the segments.

Perhaps I am missing something.





On 4/27/07, songjue <songjue@gmail.com> wrote:
> You can try this command:  bin/nutch readseg (-dump ... | -get ...) .
> If you need an API instead of the command line, you may have to hack
> the segment/SegmentReader.java? I'm also wondering this.
>
> BTW, make sure you set the 'http.content.limit' property to -1 to avoid
> content truncation.
>
>
>
>
> songjue
> 2007-04-27
>
>
>
> From: Charlie Williams
> Sent: 2007-04-25 22:43:12
> To: nutch-dev@lucene.apache.org
> Cc:
> Subject: retrieving original html from database
>
> I have an index of pages from the web, a bit over 1 million. The fetch took
> several weeks to complete, since it was mainly over a small set of domains.
> Once we had a completed fetch and index, we began trying to work with the
> retrieved text, and found that the cached text is just that, flat text. Is
> the original HTML cached anywhere that it can be accessed after the initial
> fetch? It would be a shame to have to recrawl all those pages. We are using
> Nutch 0.8.
>
> Thanks for any help.
>
> -Charlie
>


-- 
"Conscious decisions by conscious minds are what make reality real"

Re: retrieving original html from database

Posted by Briggs <ac...@gmail.com>.

If you need an API for getting the content, can't you just look into
the cachedContent.jsp of the demo search application?  That shows how
to retrieve the original text/html that is stored within the segments.

Perhaps I am missing something.





On 4/27/07, songjue <so...@gmail.com> wrote:
> You can try this command:  bin/nutch readseg (-dump ... | -get ...) .
> If you need an API instead of the command line, you may have to hack
> the segment/SegmentReader.java? I'm also wondering this.
>
> BTW, make sure you set the 'http.content.limit' property to -1 to avoid
> content truncation.
>
>
>
>
> songjue
> 2007-04-27
>
>
>
> From: Charlie Williams
> Sent: 2007-04-25 22:43:12
> To: nutch-dev@lucene.apache.org
> Cc:
> Subject: retrieving original html from database
>
> I have an index of pages from the web, a bit over 1 million. The fetch took
> several weeks to complete, since it was mainly over a small set of domains.
> Once we had a completed fetch and index, we began trying to work with the
> retrieved text, and found that the cached text is just that, flat text. Is
> the original HTML cached anywhere that it can be accessed after the initial
> fetch? It would be a shame to have to recrawl all those pages. We are using
> Nutch 0.8.
>
> Thanks for any help.
>
> -Charlie
>


-- 
"Conscious decisions by conscious minds are what make reality real"

Re: retrieving original html from database

Posted by songjue <so...@gmail.com>.

You can try this command:  bin/nutch readseg (-dump ... | -get ...) .
If you need an API instead of the command line, you may have to hack 
the segment/SegmentReader.java? I'm also wondering this.

BTW, make sure you set the 'http.content.limit' property to -1 to avoid 
content truncation.
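
For pulling the stored content out of every segment rather than one URL at
a time, a loop around readseg -dump might look like the sketch below. It
assumes the segments sit on the local filesystem under one directory and
that -dump takes the same -no* switches as -get; dump_all_content,
NUTCH_BIN, and the output layout are hypothetical names chosen for the
example:

```shell
# Hypothetical override for the launcher script; defaults to bin/nutch.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# dump_all_content SEGMENTS_ROOT OUT_ROOT
# Runs readseg -dump once per segment directory under SEGMENTS_ROOT and
# writes each dump to OUT_ROOT/<segment name>.
dump_all_content() {
    segments_root="$1"
    out_root="$2"
    for segment_dir in "$segments_root"/*; do
        # Skip plain files and the literal pattern left by an empty glob.
        [ -d "$segment_dir" ] || continue
        "$NUTCH_BIN" readseg -dump "$segment_dir" \
            "$out_root/$(basename "$segment_dir")" \
            -noparse -nofetch -nogenerate -noparsetext -noparsedata
    done
}
```

Content that was truncated at fetch time stays truncated in the dump, so
the http.content.limit setting only helps for pages fetched after it was
changed.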
 



songjue
2007-04-27



From: Charlie Williams
Sent: 2007-04-25 22:43:12
To: nutch-dev@lucene.apache.org
Cc:
Subject: retrieving original html from database

I have an index of pages from the web, a bit over 1 million. The fetch took
several weeks to complete, since it was mainly over a small set of domains.
Once we had a completed fetch and index, we began trying to work with the
retrieved text, and found that the cached text is just that, flat text. Is
the original HTML cached anywhere that it can be accessed after the initial
fetch? It would be a shame to have to recrawl all those pages. We are using
Nutch 0.8.

Thanks for any help.

-Charlie