Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/06 18:10:24 UTC

custom extractor

Hello,

Previously I built a primitive crawler in Java, extracting
certain information per HTML page using XPaths. Then I discovered
Nutch, and now I want to be able to extract certain elements from
the DOM through XPath, with multiple XPaths per site.

I am crawling a number of web sites, let's say 16, and I would like
to be able to write multiple XPaths per site and then index the
output of each extraction in Solr as a different field.
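
To make the idea concrete, here is a rough sketch of the kind of
per-site mapping I have in mind (the class and field names are made
up, not from any existing library):

import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the per-site configuration: each host maps
// to a set of (Solr field name -> XPath expression) pairs.
public class SiteXPaths {
    static final Map<String, Map<String, String>> SITE_XPATHS =
            new HashMap<String, Map<String, String>>();
    static {
        Map<String, String> siteA = new HashMap<String, String>();
        siteA.put("product_name", "//h1[@class='title']/text()");
        siteA.put("spec_rows", "//table[@id='specs']//td/text()");
        SITE_XPATHS.put("www.site-a.example", siteA);
        // ... one entry per site, 16 in total
    }
}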

I have googled for a while, and I understand that a custom plugin can
be developed that acts as a custom HTML parser. I understand that
another path is using Tika.
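
As far as I can tell, the plugin route means implementing the
HtmlParseFilter extension point. A minimal sketch, assuming the Nutch
1.x interface (I have not verified the exact signature against every
version, so please check it against your Nutch source):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Sketch of a parse filter that would run the per-site XPaths over
// the DOM that Nutch's HTML parser already built, and stash the
// results in the parse metadata for an indexing filter to map onto
// Solr fields.
public class XPathExtractorFilter implements HtmlParseFilter {

    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        // Look up the XPaths for this URL's host, evaluate them
        // against `doc`, then store the results, e.g.:
        // parse.getData().getParseMeta().set("product_name", value);
        return parseResult;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
}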

I have also experimented with the boilerpipe library, and it was
insufficient to extract the data I want. (I am extracting
specifications of certain products, usually in tables, and
fragmented.)

One difficulty with my HtmlCleaner-based XPath evaluator was that
real-world HTML was sometimes broken, and even when I cleaned it,
HtmlCleaner would not find XPaths taken from Firebug.
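
For what it is worth, my evaluator was roughly the following. One
thing I suspect is biting me: Firebug copies paths from the browser's
live DOM, which contains tbody elements the browser inserts on its
own, so those paths often do not match the raw HTML that HtmlCleaner
sees (and TagNode.evaluateXPath only supports a subset of XPath
anyway):

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class CleanerXPathDemo {
    public static void main(String[] args) throws XPatherException {
        String html = "<table><tr><td>500 GB</td></tr></table>";
        // clean() repairs the markup and returns the root TagNode
        TagNode root = new HtmlCleaner().clean(html);
        // A Firebug-style path like /html/body/table/tbody/tr/td
        // finds nothing here, because the raw markup has no tbody.
        Object[] hits = root.evaluateXPath("//table//td");
        for (Object hit : hits) {
            System.out.println(((TagNode) hit).getText());
        }
    }
}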

Which way should I start?

Any ideas / help / recommendations greatly appreciated,

Best Regards,
C.B.

Re: custom extractor

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi C.B.,

Your description gets slightly cloudy towards the end, e.g. around
"One difficulty with my HtmlCleaner ... taken from Firebug".

Are you trying to say that some of the URLs serve bad HTML, and you
know this because it is flagged up by Firebug? If this is the case,
are you able to edit the HTML and make it well-formed, so to speak?

It would also be of great help if you could post a small example of
the type of XPath extraction you are looking to do; if anyone has
built plugins implementing XPath (which I have not), they may be able
to comment further.




On Wed, Jul 6, 2011 at 5:10 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> Previously I built a primitive crawler in Java, extracting
> certain information per HTML page using XPaths. Then I discovered
> Nutch, and now I want to be able to extract certain elements from
> the DOM through XPath, with multiple XPaths per site.
>
> I am crawling a number of web sites, let's say 16, and I would like
> to be able to write multiple XPaths per site and then index the
> output of each extraction in Solr as a different field.
>
> I have googled for a while, and I understand that a custom plugin can
> be developed that acts as a custom HTML parser. I understand that
> another path is using Tika.
>
> I have also experimented with the boilerpipe library, and it was
> insufficient to extract the data I want. (I am extracting
> specifications of certain products, usually in tables, and
> fragmented.)
>
> One difficulty with my HtmlCleaner-based XPath evaluator was that
> real-world HTML was sometimes broken, and even when I cleaned it,
> HtmlCleaner would not find XPaths taken from Firebug.
>
> Which way should I start?
>
> Any ideas / help / recommendations greatly appreciated,
>
> Best Regards,
> C.B.
>



-- 
*Lewis*