Posted to user@nutch.apache.org by Simone Fonda <si...@gmail.com> on 2011/09/29 17:09:34 UTC
Parse and index tags from crawled HTML documents
Hello everybody,
We are trying to set up Nutch + Solr to crawl and index an entire website.
We would like to extract, index and store some information kept in the
document <meta> tags (with some fixed 'name' attribute) for our own
convenience.
We (mistakenly) tried the urlmeta plugin, but it seems that it only
propagates the values you enter in the initial seed URLs file, and
that's not what we need.
We tried to use the parse-metatags plugin found at
https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
Ant to build it properly: first, 'ivy.xml' was missing, so we created one
by copying it from another plugin. Then we updated Ant to the latest
release, but the compiler still complained that it couldn't find some
libraries imported in the Java source files (some org.apache.lucene.*
stuff).
Since Nutch's 'title' field is very similar to what we need
(in the end it just extracts the content of the <title>), we tried to
find out whether the parse-html plugin could fit our needs in some way,
with no success yet.
We tried to find more information on how to use an XPath-driven
approach, with no luck.
Has anybody ever indexed and stored the content of <meta> tags?
Thanks a lot,
Simone
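For readers wondering what the XPath-driven approach mentioned above could
look like, here is a minimal sketch in plain Java. It is not Nutch plugin
code and does not use any Nutch API: the class name MetaTagXPathSketch and
the method metaContents are hypothetical, made up for illustration. It also
assumes the page is well-formed XHTML with no default namespace; real-world
HTML usually is not, so a lenient HTML parser would be needed in front of
this step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch.
final class MetaTagXPathSketch {

    // Returns the 'content' attribute of every <meta> element whose
    // 'name' attribute equals metaName (the comparison is case-sensitive).
    // Assumption: 'xhtml' is well-formed XML without a default namespace.
    static List<String> metaContents(String xhtml, String metaName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Pass the tag name as an XPath variable instead of
            // concatenating it into the expression string.
            xpath.setXPathVariableResolver(
                qname -> "name".equals(qname.getLocalPart()) ? metaName : null);

            NodeList attrs = (NodeList) xpath.evaluate(
                "//meta[@name=$name]/@content", doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                values.add(attrs.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Parsing or XPath failure: report the input as unusable.
            throw new IllegalArgumentException("Could not extract meta tags", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><head><title>Example</title>"
            + "<meta name=\"keywords\" content=\"nutch, solr\"/>"
            + "<meta name=\"author\" content=\"Example Author\"/>"
            + "</head><body><p>Hello</p></body></html>";
        System.out.println(metaContents(page, "keywords"));
    }
}
```

In a real crawl the same idea would have to run inside a parse plugin and
hand the extracted values on to the indexing step, which is what the
parse-metatags patch discussed in the replies is meant to do.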
Re: Parse and index tags from crawled HTML documents
Posted by Elisabeth Adler <el...@gmail.com>.
Hi,
I had the same requirement and changed the plugin to get it working
under Nutch 1.3. I uploaded a patch to the JIRA issue.
Best,
Elisabeth
On 29.09.2011 17:41, Julien Nioche wrote:
> Hi Simone,
>
>> We tried to use the parse-metatags plugin found at
>> https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
>> Ant to build it properly: first, 'ivy.xml' was missing, so we created one
>> by copying it from another plugin. Then we updated Ant to the latest
>> release, but the compiler still complained that it couldn't find some
>> libraries imported in the Java source files (some org.apache.lucene.*
>> stuff).
>>
> The patch in NUTCH-809 is a bit old :-)
>
> It should be a matter of removing the class in
> src/plugin/parse-metatags/src/java/org/apache/nutch/searcher, modifying the
> plugin.xml accordingly, removing the method addIndexBackendOptions from the
> indexing filter, and getting rid of the Lucene references in the imports.
>
> Feel free to attach a new version of the patch if you manage to get it to
> work
>
> Julien
>
>
Re: Parse and index tags from crawled HTML documents
Posted by Julien Nioche <li...@gmail.com>.
Hi Simone,
> We tried to use the parse-metatags plugin found at
> https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
> Ant to build it properly: first, 'ivy.xml' was missing, so we created one
> by copying it from another plugin. Then we updated Ant to the latest
> release, but the compiler still complained that it couldn't find some
> libraries imported in the Java source files (some org.apache.lucene.*
> stuff).
>
The patch in NUTCH-809 is a bit old :-)
It should be a matter of removing the class in
src/plugin/parse-metatags/src/java/org/apache/nutch/searcher, modifying the
plugin.xml accordingly, removing the method addIndexBackendOptions from the
indexing filter, and getting rid of the Lucene references in the imports.
Feel free to attach a new version of the patch if you manage to get it to
work.
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com