Posted to user@nutch.apache.org by Praveen Adivi <pr...@yaskawa.com> on 2011/11/01 21:55:56 UTC

Question regarding meta tags

Hi Guys,
I am new to Nutch and I am trying to understand if we can
crawl a website and index the content of the <meta> tag in the <head>
section, and if there is a way to pass this to Solr for indexing.

-- 
Thanks and regards,

Praveen Adivi
Java Developer
Yaskawa America
Ext: 7232

IMPORTANT: The information contained in this transmission may be privileged, 
proprietary and confidential and protected from disclosure. It is intended only for 
the intended recipient. If you are not the intended recipient or a person responsible 
for delivering this transmission to the intended recipient, you may not disclose, copy 
or distribute this transmission or take any action in reliance on it. If you received this 
transmission in error, please notify us immediately by replying to this message and 
please dispose of and delete this transmission.

Thank you.

Yaskawa America, Inc.

Re: Question regarding meta tags

Posted by Praveen Adivi <pr...@yaskawa.com>.
Thank you, Elisabeth.

On Wed, Nov 2, 2011 at 3:44 AM, Elisabeth Adler
<el...@gmail.com> wrote:

> Hi,
> Yes, you can index the content of the meta tags and put them into a Solr
> index, but you need a plugin for this, see
> https://issues.apache.org/jira/browse/NUTCH-809
> Best,
> Elisabeth
>
>
> On 01.11.2011 21:55, Praveen Adivi wrote:
>
>> Hi Guys,
>> I am new to Nutch and I am trying to understand if we can
>> crawl a website and index the content of the <meta> tag in the <head>
>> section, and if there is a way to pass this to Solr for indexing.
>>
>>



Re: Question regarding meta tags

Posted by Elisabeth Adler <el...@gmail.com>.
Hi,
Yes, you can index the content of the meta tags and put them into a Solr 
index, but you need a plugin for this, see 
https://issues.apache.org/jira/browse/NUTCH-809
Best,
Elisabeth

On 01.11.2011 21:55, Praveen Adivi wrote:
> Hi Guys,
> I am new to Nutch and I am trying to understand if we can
> crawl a website and index the content of the <meta> tag in the <head>
> section, and if there is a way to pass this to Solr for indexing.
>

Re: Question regarding meta tags

Posted by jotta <so...@gmail.com>.
Hi!

Yes, it is possible. HTMLMetaTags (Nutch 1.2 API) allows you to get data from
the <head> section:
http://nutch.apache.org/apidocs-1.2/org/apache/nutch/parse/HTMLMetaTags.html

You can write a new Nutch plugin with your own HtmlParseFilter implementation;
this interface provides access to the HTMLMetaTags object:
http://nutch.apache.org/apidocs-1.2/org/apache/nutch/parse/HtmlParseFilter.html
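
To make the idea concrete, here is a minimal standalone sketch of the kind of data such a parse filter would collect: the name/content pairs of the <meta> tags. It deliberately does not use the Nutch API. The class name MetaTagSketch, the method extractMetaTags, and the "meta_" field prefix are all hypothetical, and the regular expressions stand in for a real HTML parser, so treat it as an illustration under those assumptions rather than as plugin code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration only: this is NOT Nutch's HTMLMetaTags or
// HtmlParseFilter API. All names here are hypothetical.
class MetaTagSketch {

    // Matches one <meta ...> tag and captures its attribute text.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches attr="value" or attr='value' pairs inside a tag.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*([\"'])(.*?)\\2");

    // Returns the name -> content pairs of all <meta name=... content=...> tags.
    // Tags without a name attribute (for example http-equiv tags) are skipped.
    static Map<String, String> extractMetaTags(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                if (key.equals("name")) {
                    name = attr.group(3).toLowerCase(Locale.ROOT);
                } else if (key.equals("content")) {
                    content = attr.group(3);
                }
            }
            if (name != null && content != null) {
                tags.put(name, content);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"description\" content=\"Servo drives and motion control\">"
            + "<meta name='keywords' content='servo, drives'>"
            + "<meta http-equiv=\"refresh\" content=\"30\">"
            + "</head><body></body></html>";
        // "meta_" is a hypothetical prefix for the index field names.
        extractMetaTags(html).forEach(
            (name, content) -> System.out.println("meta_" + name + " = " + content));
    }
}
```

In a real plugin the pairs would come from the HTMLMetaTags object handed to the filter instead of from regular expressions, and a matching indexing step would still be needed to turn them into Solr fields.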

You can send the crawled data to Solr using this command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

-----
Regards,
Jotta

PS. Sorry for my English :)