Posted to user@nutch.apache.org by ahammad <ah...@gmail.com> on 2009/01/13 16:38:53 UTC

Indexing HTML meta tags

Hello,

I have been using Nutch for a few days now, and it seems to be working
great. One thing that I do need is the ability to index HTML meta tags from
pages. I'm using Nutch to search some articles, so there are tags like
"author" in the HTML pages. From searching the mailing list, I saw that
there were a few requests made last year for this, but that there was no
built-in functionality. Is this accurate?

A few people suggested writing plug-ins, while others claimed that you
could modify certain files to do the job. Is there a simple way to do this
or do I have no choice but to write a plug-in for it? 

I read http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 but it seems
somewhat overwhelming at this point. Any suggestions would be helpful.

Thanks.

Cheers
-- 
View this message in context: http://www.nabble.com/Indexing-HTML-meta-tags-tp21438171p21438171.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Indexing HTML meta tags

Posted by ahammad <ah...@gmail.com>.
Thanks for the reply. I will start a new thread about writing plug-ins since
it is technically a new topic.

If any of the others have suggestions, please add them. I read somewhere
that we can copy the existing index-more plugin and add a few lines so that
it reads meta tags and indexes them. Any ideas about that?

Cheers,



Re: Indexing HTML meta tags

Posted by Doğacan Güney <do...@gmail.com>.
On Tue, Jan 13, 2009 at 5:38 PM, ahammad <ah...@gmail.com> wrote:
>
> Hello,
>
> I have been using Nutch for a few days now, and it seems to be working
> great. One thing that I do need is the ability to index HTML meta tags from
> pages. I'm using Nutch to search some articles, so there are tags like
> "author" in the HTML pages. From searching the mailing list, I saw that
> there were a few requests made last year for this, but that there was no
> built-in functionality. Is this accurate?
>
> A few people suggested writing plug-ins, while others claimed that you
> could modify certain files to do the job. Is there a simple way to do this
> or do I have no choice but to write a plug-in for it?
>

No, unfortunately you will have to write a plug-in for it. I have something
in mind that will make extracting data from HTML pages easier, but that's
for post-1.0.
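
For anyone landing on this thread: the extraction step such a plug-in needs
can be sketched in plain Java, independent of Nutch. The sketch below is only
an illustration under stated assumptions. The class name MetaTagSketch and the
method extractMetaTags are hypothetical, it uses no Nutch API, and it does not
show how to wire the result into a parse or indexing plug-in. It also uses
regular expressions rather than a real HTML parser, so it only handles quoted
attribute values and would be confused by a ">" inside an attribute value.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects <meta name="..." content="...">
// pairs from an HTML string.
public class MetaTagSketch {

    // Matches one <meta ...> tag and captures its attribute text.
    // The \b keeps tags such as <metadata> from matching.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches attr="value" or attr='value'. Unquoted values are not handled.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns a map from lower-cased meta name to content, in document order.
    // If a name occurs more than once, the first occurrence wins.
    public static Map<String, String> extractMetaTags(String html) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(2) != null ? attr.group(2) : attr.group(3);
                if (key.equals("name")) {
                    name = value.toLowerCase(Locale.ROOT);
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            // Tags without both attributes (e.g. http-equiv or charset) are skipped.
            if (name != null && content != null && !result.containsKey(name)) {
                result.put(name, content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"author\" content=\"Jane Doe\">"
            + "<meta name='keywords' content='nutch, search'>"
            + "</head><body></body></html>";
        // Look up the "author" tag mentioned in the original question.
        System.out.println(extractMetaTags(html).get("author"));
    }
}
```

In a real plug-in the map entries would presumably be copied into whatever
metadata and index fields the plug-in defines; those details depend on the
Nutch version and are left out here on purpose.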

> I read http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 but it seems
> somewhat overwhelming at this point. Any suggestions would be helpful.

-- 
Doğacan Güney