Posted to dev@nutch.apache.org by rajat swarup <ra...@gmail.com> on 2005/04/24 02:56:25 UTC

Getting HTML source

Hi,
We are working on a project in which the actual text content of pages
is used to decide their topical relevance (implementing "focused
crawling" in Nutch).
Most of the methods in the code return Page objects. But looking at
the Page class definition, I found no fields that would give me access
to the actual HTML source or to the parsed data inside the HTML page.

Is there a place in the Nutch source code where we can get the HTML
source (or maybe just the textual content of a page), given the URL or
maybe the Page object?

Thanks for any forthcoming help!

-Rajat
http://www-scf.usc.edu/~swarup/

Re: [Nutch-dev] Getting HTML source

Posted by Piotr Kosiorowski <pk...@gmail.com>.

Hello,

The Page object does not contain the HTML page content. To access the
fetched page content you have to iterate over the segment data and
extract it from there.
Please have a look at the SegmentReader class; it gives you a simple
API to access all segment data.
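
To make the "iterate over the segment data" idea concrete, here is a
minimal, self-contained sketch of the loop a focused crawler might run.
It does not use the Nutch API: SegmentRecord, relevance and relevantUrls
are hypothetical names made up for illustration, and the in-memory list
only stands in for whatever records a segment reader would return. The
scoring rule (fraction of topic terms found in the text) is likewise
just an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/**
 * Sketch only. Nothing here is a Nutch class; the types below are
 * hypothetical stand-ins for the records stored in a segment.
 */
public class FocusedCrawlSketch {

    /** Hypothetical stand-in for one fetched entry in a segment. */
    static final class SegmentRecord {
        final String url;
        final String rawHtml;    // fetched page source
        final String parsedText; // text extracted from the page

        SegmentRecord(String url, String rawHtml, String parsedText) {
            this.url = url;
            this.rawHtml = rawHtml;
            this.parsedText = parsedText;
        }
    }

    /** Fraction of topic terms that occur in the text, ignoring case. */
    static double relevance(String text, List<String> topicTerms) {
        if (topicTerms.isEmpty()) {
            return 0.0;
        }
        String lower = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String term : topicTerms) {
            if (lower.contains(term.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (double) hits / topicTerms.size();
    }

    /** Walks every record and keeps the URLs whose text scores at or above the threshold. */
    static List<String> relevantUrls(Iterable<SegmentRecord> segment,
                                     List<String> topicTerms,
                                     double threshold) {
        List<String> keep = new ArrayList<>();
        for (SegmentRecord record : segment) {
            if (relevance(record.parsedText, topicTerms) >= threshold) {
                keep.add(record.url);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Made-up sample data in place of a real segment.
        List<SegmentRecord> segment = Arrays.asList(
            new SegmentRecord("http://example.org/a",
                "<html><body>Apache Nutch crawler</body></html>",
                "Apache Nutch crawler"),
            new SegmentRecord("http://example.org/b",
                "<html><body>Cooking recipes</body></html>",
                "Cooking recipes"));
        List<String> topic = Arrays.asList("nutch", "crawler");
        System.out.println(relevantUrls(segment, topic, 0.5));
    }
}
```

In a real setup the list would be replaced by records read from the
segment, and the URL in each record would be the link back to the
corresponding Page entry.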
Regards
Piotr

Hasan Diwan wrote:
> On 23/04/05, rajat swarup <ra...@gmail.com> wrote:
> 
>>Most of the methods in the code return Page objects in the code. But
>>looking at the Page class definitions I found that there were no
>>fields in the Page class that would give me access to the actual HTML
>>source code or the parsed data inside the HTML page.
>>
>>Is there a place in the Nutch source code where we can get the HTML
>>source code (or maybe just the textual content of the pages) given the
>>URL or may be the Page object?
> 
> 
> org.apache.nutch.db.Page.write writes everything out to a
> DataOutputStream. Also, the Page object has accessors. I do not see a
> method to get the page source. It looks like there is also a getPage
> method in org.apache.nutch.db.DBSectionReader. Hope this helps...

Re: [Nutch-dev] Getting HTML source

Posted by Hasan Diwan <ha...@gmail.com>.

On 23/04/05, rajat swarup <ra...@gmail.com> wrote:
> Most of the methods in the code return Page objects in the code. But
> looking at the Page class definitions I found that there were no
> fields in the Page class that would give me access to the actual HTML
> source code or the parsed data inside the HTML page.
> 
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or may be the Page object?

org.apache.nutch.db.Page.write writes everything out to a
DataOutputStream. Also, the Page object has accessors. I do not see a
method to get the page source. It looks like there is also a getPage
method in org.apache.nutch.db.DBSectionReader. Hope this helps...
-- 
Cheers,
Hasan Diwan <ha...@gmail.com>