Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2012/09/06 22:25:02 UTC

How to configure nutch so that apache tika can extract all the tags ?

Hi,

I have worked with Solr and Tika. When Solr indexes an HTML document using
Apache Tika, it can extract all the tags in the page and put them under the
'attr_*' dynamic field. I have not seen that happen with Nutch.

Can Nutch use Tika as the parser, get all the tags in the HTML page, and
then send them to Solr under 'attr_*'?

Is this possible using Nutch? I think this would also solve the multivalued
tags problem that I posted here earlier (
https://issues.apache.org/jira/browse/NUTCH-1467)

Are there any configuration settings for using Tika as the parser and
fetching all the tags in the page, as it does with Solr?

Many Thanks for your reply.
-- 
Kiran Chitturi

Re: How to configure nutch so that apache tika can extract all the tags ?

Posted by kiran chitturi <ch...@gmail.com>.

Hi Julien,

Thank you for your response and work on this.

Yesterday I tried going into the Java files to find where tags with the
same name are getting overwritten. When the tags are being extracted, the
code should check whether a tag with the same name has already been
extracted and, if so, keep both contents in an array. Solr might then accept
this directly. That was the idea I had yesterday when I wanted to solve this
problem.
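
To make the idea concrete, here is a minimal sketch of keeping every value
for a repeated tag name instead of overwriting the earlier one. It is plain
Java and does not use any Nutch class; MultiValuedTags and its methods are
hypothetical names made up for illustration, not what HTMLMetaTags.java
actually does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Nutch class): stores a list of values per tag
// name so that a second tag with the same name is appended, not overwritten.
class MultiValuedTags {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append the value to the list for this name, creating the list on first use.
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Return all values seen for the name, or an empty list if there were none.
    public List<String> get(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultiValuedTags tags = new MultiValuedTags();
        tags.add("keywords", "nutch");
        tags.add("keywords", "tika");   // same name again: kept alongside the first
        tags.add("author", "kiran");
        System.out.println(tags.get("keywords")); // prints [nutch, tika]
    }
}
```

With a plain Map<String, String> and put(), the second "keywords" value
would replace the first, which is the overwriting behaviour described above.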

I got into the file src/java/org/apache/nutch/parse/HTMLMetaTags.java,
where I thought the tags are being extracted. I might be wrong about this,
too.

I got stuck on HTMLMetaTags.java yesterday.

If you could solve this, that would be great.

Many Thanks,
Kiran.

On Fri, Sep 7, 2012 at 5:20 AM, Julien Nioche <lists.digitalpebble@gmail.com
> wrote:

> Hi Kiran
>
> You should be able to do that with either parse-html and parse-tika by
> implementing an extension of HtmlParseFilter and store the attr_* values in
> the parse metadata then write a modified version of the MetadataIndexer to
> generate the fields to index + of course modify the SOLR schema. Look at
> the existing plugins for examples of how to do it
>
> I will have a look at NUTCH-1467
>
> Thanks
> Julien
>
> On 6 September 2012 21:25, kiran chitturi <ch...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have worked with solr and tika. When solr indexes an html document
> using
> > apache tika, it can extract all the tags in the page and put them under
> > 'attr_' dynamic field. I have not see that happen with nutch ?
> >
> > Can nutch use tika as the parser and get all the tags in the html page
> and
> > then send them to solr under 'attr_*'  ?
> >
> > Is this possible using nutch ? I think in this way, multivalued tags
> > problem can be solved which i posted here earlier (
> > https://issues.apache.org/jira/browse/NUTCH-1467)
> >
> > Are there any configuration settings for using tika as a parse and fetch
> > all the tags in the page like it does with solr ?
> >
> > Many Thanks for your reply.
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi

Re: How to configure nutch so that apache tika can extract all the tags ?

Posted by Julien Nioche <li...@gmail.com>.

Hi Kiran

You should be able to do that with either parse-html or parse-tika by
implementing an extension of HtmlParseFilter that stores the attr_* values
in the parse metadata, then writing a modified version of the
MetadataIndexer to generate the fields to index and, of course, modifying
the Solr schema. Look at the existing plugins for examples of how to do it.
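
As a rough illustration of the parse-side step, the sketch below walks a DOM
tree and collects the text of each element under an "attr_" + tag name key,
keeping multiple values per key. It is a stand-alone approximation using only
the JDK's XML parser on well-formed XHTML; it does not implement
HtmlParseFilter or touch the Nutch parse metadata, and AttrFieldCollector and
its method names are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for the work a parse filter would do on the DOM.
class AttrFieldCollector {

    // Walk the tree under root and return attr_<tag> -> list of text values.
    public static Map<String, List<String>> collect(Node root) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        walk(root, fields);
        return fields;
    }

    private static void walk(Node node, Map<String, List<String>> fields) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String text = directText(node);
            if (!text.isEmpty()) {
                String key = "attr_" + node.getNodeName().toLowerCase(Locale.ROOT);
                fields.computeIfAbsent(key, k -> new ArrayList<>()).add(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), fields);
        }
    }

    // Concatenate only the element's own text nodes, not its descendants' text.
    private static String directText(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Convenience for the demo: parse a well-formed XHTML string and collect.
    public static Map<String, List<String>> collectFromXhtml(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return collect(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>First</h1><p>One</p><h1>Second</h1></body></html>";
        System.out.println(collectFromXhtml(page));
        // prints {attr_h1=[First, Second], attr_p=[One]}
    }
}
```

In a real plugin the collected values would go into the parse metadata and
be turned into index fields by the indexing side, with a matching multivalued
attr_* dynamic field in the Solr schema; the existing plugins remain the
reference for how that wiring is actually done.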

I will have a look at NUTCH-1467

Thanks
Julien

On 6 September 2012 21:25, kiran chitturi <ch...@gmail.com> wrote:

> Hi,
>
> I have worked with solr and tika. When solr indexes an html document using
> apache tika, it can extract all the tags in the page and put them under
> 'attr_' dynamic field. I have not see that happen with nutch ?
>
> Can nutch use tika as the parser and get all the tags in the html page and
> then send them to solr under 'attr_*'  ?
>
> Is this possible using nutch ? I think in this way, multivalued tags
> problem can be solved which i posted here earlier (
> https://issues.apache.org/jira/browse/NUTCH-1467)
>
> Are there any configuration settings for using tika as a parse and fetch
> all the tags in the page like it does with solr ?
>
> Many Thanks for your reply.
> --
> Kiran Chitturi
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble