Posted to dev@nutch.apache.org by Hasan Diwan <ha...@gmail.com> on 2005/04/25 18:10:52 UTC

Re: [Nutch-dev] Getting HTML source

On 23/04/05, rajat swarup <ra...@gmail.com> wrote:
> Most of the methods in the code return Page objects. But
> looking at the Page class definitions I found that there were no
> fields in the Page class that would give me access to the actual HTML
> source code or the parsed data inside the HTML page.
> 
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or maybe the Page object?

org.apache.nutch.db.Page.write writes the object out to a
DataOutputStream, and the Page object has accessors, but I do not see
a method that returns the page source. There also appears to be a
getPage method in org.apache.nutch.db.DBSectionReader. Hope this helps...
-- 
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: [Nutch-dev] Getting HTML source

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,

The Page object does not contain the HTML page content. To access
fetched page content, you have to iterate over the segment data and
extract it from there. Please have a look at the SegmentReader class;
it gives you a simple API to access all segment data.
Regards
Piotr
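
The approach Piotr describes (the page database holds only metadata,
while the fetched content lives in segments and is found by scanning
them) can be sketched roughly as below. This is only an illustration of
the lookup pattern, not Nutch's actual API: FetchedRecord, findHtml and
the in-memory list are hypothetical stand-ins for whatever SegmentReader
really returns, and decoding the bytes as UTF-8 is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class SegmentLookupSketch {

    /** Hypothetical stand-in for one fetched entry in a segment: URL plus raw bytes. */
    static final class FetchedRecord {
        final String url;
        final byte[] content;

        FetchedRecord(String url, byte[] content) {
            this.url = url;
            this.content = content;
        }
    }

    /**
     * Scans the segment records in order and returns the content of the
     * first record whose URL matches, decoded as UTF-8 (an assumption;
     * real pages may use other encodings).
     */
    static Optional<String> findHtml(Iterable<FetchedRecord> segment, String url) {
        for (FetchedRecord record : segment) {
            if (record.url.equals(url)) {
                return Optional.of(new String(record.content, StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // An in-memory list stands in for the records read from a segment on disk.
        List<FetchedRecord> segment = new ArrayList<>();
        segment.add(new FetchedRecord("http://example.org/",
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8)));
        segment.add(new FetchedRecord("http://example.org/about",
                "<html><body>About</body></html>".getBytes(StandardCharsets.UTF_8)));

        System.out.println(findHtml(segment, "http://example.org/about")
                .orElse("(not in this segment)"));
    }
}
```

A linear scan like this is the simplest form of the idea; with real
segment data the loop body would read each entry through SegmentReader
instead of a list.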
