Posted to user@nutch.apache.org by alw37 <al...@gmail.com> on 2012/12/12 03:12:17 UTC

Best way to extract content from a web page

I'm brand new to Nutch, so please bear with me on this one.

My goal is simply to extract some content from a web page and be able to
retrieve the resulting data. For example, let's say I'm crawling pages on an
e-commerce site and intend to store product information (e.g., name,
category, price) in JSON format. I'd like to be able to quickly and easily
retrieve this data so it can be inserted into a database.
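
(For concreteness, the kind of value I'd store for a page might look
something like this -- the fields here are made-up examples:)

{"name": "Widget", "category": "Tools", "price": 9.99}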

Currently, my code for parsing page content resides in a custom plugin that
implements the HtmlParseFilter extension point; I'm not sure whether parsing
logic belongs in an HtmlParseFilter or an HtmlParser extension, so please
let me know. Here's a snippet of code to help clarify what I'm looking for:

@Override
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {

  // ... scrape the page and build my JSON object (jsonValue) ...

  Parse parse = parseResult.get(content.getUrl());
  // <-- this is the part I actually want back out of the crawl
  parse.getData().getContentMeta().set("product", jsonValue);

  return parseResult;
}
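
(In case it matters, the plugin is wired up the usual way; the ids and
class names below are made up for illustration, and "parse-product" is
added to plugin.includes in my nutch-site.xml:)

<plugin id="parse-product" name="Product Parse Filter" version="1.0.0">
   <runtime>
      <library name="parse-product.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="com.example.parse.product"
              name="Product Parse Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="ProductParseFilter"
                      class="com.example.parse.ProductParseFilter"/>
   </extension>
</plugin>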

I've looked into some of Nutch's parsing utilities (SegmentReader and
ParseData, for example) but haven't found a convenient means of acquiring
ONLY the ParseResult. Even when employing all of the available flags in
SegmentReader (-nocontent, -noparse, etc.), I still end up with far more
data than I need in the resulting dump file. I figure there is very likely
a better approach than hacking SegmentReader.reduce() to fit my needs; what
approach should I be taking here?
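
(For what it's worth, the closest I've come to isolating my field is a
throwaway reader over a segment's parse_data -- assuming, from what I can
tell in SegmentReader, that its part files hold Text keys and ParseData
values. Just a sketch; the part-file path is what I see on disk:)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseData;

public class ProductDumper {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl/segments/20121212031217/parse_data/part-00000/data
    Path data = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      ParseData parseData = new ParseData();
      while (reader.next(url, parseData)) {
        // print only my custom field, nothing else from the segment
        String product = parseData.getContentMeta().get("product");
        if (product != null) {
          System.out.println(url + "\t" + product);
        }
      }
    } finally {
      reader.close();
    }
  }
}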

Don't hesitate to let me know if you need any clarification.




Re: Best way to extract content from a web page

Posted by alw37 <al...@gmail.com>.
Hi Lewis,

Thanks for the info! That class definitely points me in the right
direction, though some of my confusion still remains.

The use case I mentioned earlier focuses on the parsed output, i.e., the
org.apache.nutch.parse.ParseResult objects produced by the
org.apache.nutch.parse.HtmlParseFilter extensions that apply to the URLs
specified in seed.txt. I'm expecting to do two things:

1. Start a crawl, based on the URLs in my seed.txt file and the filters in
regex-urlfilter.txt
2. Once the crawl is complete, retrieve the results of *each*
parse/filtration; in my case, just the product information I've pulled from
each filtered page in the crawl results. 

For example, suppose my seed.txt file contains www.site.com and I start a
crawl. Assume my HtmlParseFilters are set up to correctly parse product
information from www.site.com, and that this crawl will parse both
www.site.com and www.site.com/link. If www.site.com contains product A with
id 1, and www.site.com/link contains a product B with id 2, I'm expecting
I'll be able to use some sort of predefined utility to give me those results
/alone/:

A 1
B 2
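
(Concretely, I picture my filter doing something like this for each page;
scrapeName()/scrapeId() are placeholders for my real DOM-walking code:)

// inside my HtmlParseFilter.filter(...) implementation
String name = scrapeName(doc);  // e.g. "A" for www.site.com
String id = scrapeId(doc);      // e.g. "1"
Parse parse = parseResult.get(content.getUrl());
// store exactly the pair I want back, under one well-known key
parse.getData().getContentMeta().set("product", name + " " + id);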

At the moment, I'm finding myself modifying
org.apache.nutch.segment.SegmentReader and the toString() methods of
org.apache.nutch.parse.ParseResult AND org.apache.nutch.metadata.Metadata
to isolate my extracted results, which makes me feel like I'm missing
something... I don't want to see the outlinks, recno, URL, parse metadata,
or any of the other dump fields; I just want the results of my HTML
filtration, as described above.

In short, I'm really trying to understand how to use Nutch to *scrape* a
group of sites and give me ONLY what I have scraped.

I really appreciate any input you can provide on this matter; please don't
hesitate to let me know if any clarification is needed.





Re: Best way to extract content from a web page

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

You can take a look around line 102 of the ParserChecker tool [0]
for an example of how to find the desired fields and display them.
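
In essence it fetches the URL with the configured protocol, runs the
parse chain, and prints each resulting ParseData -- which includes the
contentMeta your filter populates. Roughly (a condensed sketch from
memory, not the exact source):

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.util.NutchConfiguration;

public class ParseCheck {
  public static void main(String[] args) throws Exception {
    String url = args[0];
    Configuration conf = NutchConfiguration.create();
    // fetch the page with whatever protocol plugin applies
    Protocol protocol = new ProtocolFactory(conf).getProtocol(url);
    Content content = protocol.getProtocolOutput(new Text(url),
        new CrawlDatum()).getContent();
    // run the configured parsers and parse filters
    ParseResult parseResult = new ParseUtil(conf).parse(content);
    for (Map.Entry<Text, Parse> entry : parseResult) {
      // ParseData.toString() prints the content metadata, i.e. the
      // fields a custom HtmlParseFilter has set
      System.out.println(entry.getValue().getData().toString());
    }
  }
}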

hth

Lewis

[0] https://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java?view=markup

-- 
Lewis