Posted to user@nutch.apache.org by Praveen Adivi <pr...@yaskawa.com> on 2011/11/01 21:55:56 UTC
Question regarding meta tags
Hi Guys,
I am new to Nutch and am trying to understand whether we can
crawl a website and index the content of the <meta> tags in the <head>
section, and whether there is a way to pass this to Solr for indexing.
--
Thanks and regards,
Praveen Adivi
Java Developer
Yaskawa America
Ext: 7232
IMPORTANT: The information contained in this transmission may be privileged,
proprietary and confidential and protected from disclosure. It is intended only for
the intended recipient. If you are not the intended recipient or a person responsible
for delivering this transmission to the intended recipient, you may not disclose, copy
or distribute this transmission or take any action in reliance on it. If you received this
transmission in error, please notify us immediately by replying to this message and
please dispose of and delete this transmission.
Thank you.
Yaskawa America, Inc.
Re: Question regarding meta tags
Posted by Praveen Adivi <pr...@yaskawa.com>.
Thank you Elisabeth
On Wed, Nov 2, 2011 at 3:44 AM, Elisabeth Adler <el...@gmail.com> wrote:
> Hi,
> Yes, you can index the content of the meta tags and put them into a Solr
> index, but you need a plugin for this, see
> https://issues.apache.org/jira/browse/NUTCH-809
> Best,
> Elisabeth
>
>
> On 01.11.2011 21:55, Praveen Adivi wrote:
>
>> Hi Guys,
>> I am new to Nutch and I am trying to understand if we can
>> crawl a website and index the content of the <meta> tag in the <head>
>> section and, if there was a way to pass this to SOLR for indexing.
>>
>>
--
Thanks and regards,
Praveen Adivi
Java Developer
Yaskawa America
Ext: 7232
Re: Question regarding meta tags
Posted by Elisabeth Adler <el...@gmail.com>.
Hi,
Yes, you can index the content of the meta tags and put them into a Solr
index, but you need a plugin for this, see
https://issues.apache.org/jira/browse/NUTCH-809
Best,
Elisabeth
On 01.11.2011 21:55, Praveen Adivi wrote:
> Hi Guys,
> I am new to Nutch and I am trying to understand if we can
> crawl a website and index the content of the <meta> tag in the <head>
> section and, if there was a way to pass this to SOLR for indexing.
>
Re: Question regarding meta tags
Posted by jotta <so...@gmail.com>.
Hi!
Yes, it is possible. HTMLMetaTags (Nutch 1.2 API) lets you get data from
the <head> section:
http://nutch.apache.org/apidocs-1.2/org/apache/nutch/parse/HTMLMetaTags.html
You can write a new Nutch plugin with your own HtmlParseFilter
implementation; this interface provides access to the HTMLMetaTags object:
http://nutch.apache.org/apidocs-1.2/org/apache/nutch/parse/HtmlParseFilter.html
You can send the crawled data to Solr using this command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
-----
Regards,
Jotta
PS. Sorry for my English :)
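As a rough illustration of the kind of extraction a meta-tag plugin performs, the sketch below collects name/content pairs from <meta> tags and turns them into candidate index fields. It is stand-alone Java using only the JDK and does not call the Nutch API; in a real plugin, the HtmlParseFilter implementation would be handed an HTMLMetaTags object by Nutch rather than parsing the HTML itself. The class name MetaTagSketch, the method extractMetaTags, the "meta_" field prefix, and the sample HTML are all assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-alone sketch; not part of Nutch or Solr. */
final class MetaTagSketch {

    // Matches a whole <meta ...> tag, case-insensitively.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches one attr="value" or attr='value' pair inside a tag.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    /**
     * Returns a map of field name to value for every <meta> tag that has
     * both a name and a content attribute. Field names are lower-cased and
     * prefixed with "meta_" (an assumed naming convention).
     */
    static Map<String, String> extractMetaTags(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (key.equals("name")) {
                    name = value.toLowerCase(Locale.ROOT);
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            // Tags without a name (for example http-equiv tags) are skipped.
            if (name != null && content != null) {
                fields.put("meta_" + name, content);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample page for demonstration only.
        String html = "<html><head>"
            + "<meta name=\"description\" content=\"Sample page description\">"
            + "<meta name='keywords' content='nutch, solr'>"
            + "<meta http-equiv=\"refresh\" content=\"30\">"
            + "</head><body></body></html>";
        extractMetaTags(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Each resulting field name would also need a matching field definition in the Solr schema before documents carrying it can be indexed. A regular expression is used here only to keep the sketch short; it is not a robust HTML parser.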