Posted to dev@nutch.apache.org by Hasan Diwan <ha...@gmail.com> on 2005/04/25 18:10:52 UTC
Re: [Nutch-dev] Getting HTML source
On 23/04/05, rajat swarup <ra...@gmail.com> wrote:
> Most of the methods in the code return Page objects. But looking at
> the Page class definition, I found no fields that would give me
> access to the actual HTML source code or the parsed data inside the
> HTML page.
>
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or maybe the Page object?
org.apache.nutch.db.Page.write writes everything in the Page out to a
DataOutputStream, and the Page object has accessors, but I do not see a
method to get the page source. It looks like there is also a getPage
method in org.apache.nutch.db.DBSectionReader. Hope this helps...
--
Cheers,
Hasan Diwan <ha...@gmail.com>
Re: [Nutch-dev] Getting HTML source
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
The Page object does not contain the HTML page content. To access the
fetched page content, you have to iterate over the segment data and
extract it from there. Please have a look at the SegmentReader class;
it gives you a simple API to access all segment data.
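
A minimal sketch of that "iterate and extract" pattern, in plain Java.
Note that FetchedRecord and findHtml are hypothetical stand-ins written
for illustration only; they are not Nutch classes, and the real
SegmentReader method names and signatures should be checked in the
Nutch source. Decoding the bytes as UTF-8 is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class SegmentScanSketch {

    /** Hypothetical stand-in for one fetched entry in a segment. */
    static final class FetchedRecord {
        final String url;      // URL the content was fetched from
        final byte[] content;  // raw bytes of the fetched page

        FetchedRecord(String url, byte[] content) {
            this.url = url;
            this.content = content;
        }
    }

    /**
     * Walks the records in order and returns the content of the first
     * one whose URL matches, decoded as UTF-8 (an assumption).
     */
    static Optional<String> findHtml(Iterable<FetchedRecord> segment, String url) {
        for (FetchedRecord record : segment) {
            if (record.url.equals(url)) {
                return Optional.of(new String(record.content, StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up in-memory data standing in for a segment on disk.
        List<FetchedRecord> segment = Arrays.asList(
            new FetchedRecord("http://example.org/a",
                "<html>A</html>".getBytes(StandardCharsets.UTF_8)),
            new FetchedRecord("http://example.org/b",
                "<html>B</html>".getBytes(StandardCharsets.UTF_8)));

        System.out.println(findHtml(segment, "http://example.org/b").orElse("(not found)"));
    }
}
```

The lookup is a linear scan, so it visits every record until it finds
a match. If you need many lookups by URL, it may be worth building a
map from URL to record position in a single pass first.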
Regards
Piotr