Posted to user@nutch.apache.org by Simone Fonda <si...@gmail.com> on 2011/09/29 17:09:34 UTC

Parse and index tags from crawled HTML documents

Hello everybody,
we are trying to set up Nutch + Solr to crawl and index an entire website.

We would like to extract, index and store some information kept in the
document <meta> tags (with some fixed 'name' attribute) for our own
convenience.


We (mistakenly) tried the urlmeta plugin, but it seems that it only
propagates the values you enter in the initial seed URLs file, and
that's not what we need.

We tried to use the parse-metatags plugin found at
https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
Ant to build it properly. First, 'ivy.xml' was missing, so we created
one by copying it from another plugin. Then we updated Ant to the latest
release, but the compiler still complained that it could not find some
libraries imported in the Java source files (some org.apache.lucene.*
stuff).

Since Nutch's 'title' field is something VERY similar to what we need
(in the end it just extracts the content of the <title>), we tried to
find out whether the parse-html plugin could somehow fit our needs, with
no success yet.

We tried to find more information on how to use an XPath-driven
approach, with no luck.
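
To make the XPath idea concrete, here is a minimal standalone sketch of
the kind of lookup we have in mind. It uses only the JDK and is not Nutch
code. The class name MetaTagXPathSketch, the method extractMeta and the
tag name "dc.subject" are made up for illustration. It also assumes the
input is well-formed XHTML; real crawled HTML would first need a lenient
HTML parser to produce a DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or of the NUTCH-809 patch.
public class MetaTagXPathSketch {

    // Returns the 'content' value of every <meta> element whose 'name'
    // attribute equals metaName. Assumes xhtml is well-formed XML.
    public static List<String> extractMeta(String xhtml, String metaName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Pass the tag name as an XPath variable so it needs no manual quoting.
        xpath.setXPathVariableResolver(
                var -> "name".equals(var.getLocalPart()) ? metaName : null);

        NodeList nodes = (NodeList) xpath.evaluate(
                "//meta[@name=$name]/@content", doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample page invented for this example.
        String page = "<html><head><title>Example</title>"
                + "<meta name=\"dc.subject\" content=\"web crawling\"/>"
                + "<meta name=\"author\" content=\"someone\"/>"
                + "</head><body><p>Hello</p></body></html>";
        System.out.println(extractMeta(page, "dc.subject"));
    }
}
```

What we are missing is the equivalent step inside a Nutch parse filter,
plus an indexing filter that copies the extracted values into the
document sent to Solr.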



Has anybody ever indexed and stored the content of <meta> tags?

Thanks a lot,
Simone

Re: Parse and index tags from crawled HTML documents

Posted by Elisabeth Adler <el...@gmail.com>.

Hi,
I had the same requirement and changed the plugin to get it working
under Nutch 1.3. I uploaded a patch to the JIRA issue.
Best,
Elisabeth

On 29.09.2011 17:41, Julien Nioche wrote:
> Hi Simone,
>
>> We tried to use the parse-metatags plugin found at
>> https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
>> Ant to build it properly. First, 'ivy.xml' was missing, so we created
>> one by copying it from another plugin. Then we updated Ant to the latest
>> release, but the compiler still complained that it could not find some
>> libraries imported in the Java source files (some org.apache.lucene.*
>> stuff).
>>
> The patch in NUTCH-809 is a bit old :-)
>
> It should be a matter of removing the class in
> src/plugin/parse-metatags/src/java/org/apache/nutch/searcher, modifying
> plugin.xml accordingly, removing the method addIndexBackendOptions from
> the indexing filter, and getting rid of the references to Lucene in the
> imports.
>
> Feel free to attach a new version of the patch if you manage to get it to
> work
>
> Julien
>
>

Re: Parse and index tags from crawled HTML documents

Posted by Julien Nioche <li...@gmail.com>.

Hi Simone,

> We tried to use the parse-metatags plugin found at
> https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
> Ant to build it properly. First, 'ivy.xml' was missing, so we created
> one by copying it from another plugin. Then we updated Ant to the latest
> release, but the compiler still complained that it could not find some
> libraries imported in the Java source files (some org.apache.lucene.*
> stuff).
>

The patch in NUTCH-809 is a bit old :-)

It should be a matter of removing the class in
src/plugin/parse-metatags/src/java/org/apache/nutch/searcher, modifying
plugin.xml accordingly, removing the method addIndexBackendOptions from
the indexing filter, and getting rid of the references to Lucene in the
imports.

Feel free to attach a new version of the patch if you manage to get it to
work.

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com