Posted to user@nutch.apache.org by Vishal Sharma <vi...@grazitti.com> on 2014/11/27 17:58:13 UTC

How to parse specific html tag in nutch+solr while crawling

I searched Google for this but found nothing useful. I would appreciate any help.

Is there a way to parse specific HTML tags while crawling with Nutch and
then index them into Solr?

For example, I don't want the whole HTML page to go into the content node. I
would like to parse the h1 and h2 tags into separate nodes.



Vishal Sharma
TL, SFDC
T: +1 650 288 6711
E: vishals@grazitti.com
www.grazitti.com

Re: How to parse specific html tag in nutch+solr while crawling

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.

Building such a plugin is not complicated. Our team built a similar one that lets us store a list of tags specified in nutch-site.xml. The plugin is not published right now, but it could be. In our case, we store each extracted tag in a Solr field and prefix each field name with "custom-".
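
As a rough illustration of the idea described above (a configured list of tags, each stored in a field whose name starts with "custom-"), here is a minimal Python sketch using only the standard library. It is not the unpublished plugin and not Nutch code, since a real Nutch plugin would be written in Java against the Nutch plugin interfaces. The names TagFieldExtractor and extract_fields are hypothetical, and the tag list is passed as an argument instead of being read from nutch-site.xml.

```python
from html.parser import HTMLParser


class TagFieldExtractor(HTMLParser):
    """Collect the text of selected tags into prefixed field names."""

    def __init__(self, tags, prefix="custom-"):
        super().__init__()
        self.tags = set(tags)   # tag names to capture, e.g. {"h1", "h2"}
        self.prefix = prefix    # prefix for the output field names (assumed)
        self.fields = {}        # field name -> list of extracted strings
        self._open = []         # stack of (tag, text chunks) being captured

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every captured tag that is currently open.
        for _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        # Only close a capture when the innermost captured tag ends.
        if self._open and self._open[-1][0] == tag:
            _, chunks = self._open.pop()
            text = " ".join("".join(chunks).split())  # normalize whitespace
            if text:
                self.fields.setdefault(self.prefix + tag, []).append(text)


def extract_fields(html, tags, prefix="custom-"):
    """Return a dict mapping prefixed field names to lists of tag texts."""
    parser = TagFieldExtractor(tags, prefix)
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    page = "<h1>Title</h1><p>Body text</p><h2>First <em>part</em></h2>"
    print(extract_fields(page, ["h1", "h2"]))
```

Each field holds a list because a page can contain several h2 elements; in Solr terms that would correspond to a multivalued field. The sketch assumes the captured tags are properly closed in the input.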

Regards,

----- Original Message -----
From: "Vishal Sharma" <vi...@grazitti.com>
To: "user" <us...@nutch.apache.org>
Sent: Friday, November 28, 2014 12:11:33 AM
Subject: Re: How to parse specific html tag in nutch+solr while crawling

Thanks for replying, Markus. I'll check that.


Re: How to parse specific html tag in nutch+solr while crawling

Posted by Vishal Sharma <vi...@grazitti.com>.

Thanks for replying, Markus. I'll check that.

On Thu, Nov 27, 2014 at 11:00 PM, Markus Jelsma <ma...@openindex.io>
wrote:

> The plugin works on headings only, but if you check the sources, you can
> quickly adapt it to any element/attribute section.

Re: How to parse specific html tag in nutch+solr while crawling

Posted by Vishal Sharma <vi...@grazitti.com>.

Hi Markus,

Thank you so much for your reply.

Quick question: does this parse only hN tags, or can we configure it for
other HTML tags as well, such as <div class="test">?

On Thu, Nov 27, 2014 at 10:33 PM, Markus Jelsma <ma...@openindex.io>
wrote:

> You may want to check the headings plugin. It reads content from those
> elements and writes it to some field. Very basic.

RE: How to parse specific html tag in nutch+solr while crawling

Posted by Markus Jelsma <ma...@openindex.io>.

You may want to check the headings plugin. It reads content from those elements and writes it to some field. Very basic.

 
 

RE: How to parse specific html tag in nutch+solr while crawling

Posted by Markus Jelsma <ma...@openindex.io>.

The plugin works on headings only, but if you check the sources, you can quickly adapt it to any element/attribute section. 
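
To make "any element/attribute" concrete, the sketch below shows one way such a match could work for the <div class="test"> case asked about in this thread. It is a standalone Python illustration built on the standard library, not the headings plugin's source and not Nutch code. The names ElementTextExtractor and extract_element_text are hypothetical, and treating the attribute value as a space-separated token list is an assumption that fits class attributes.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collect the text of elements matching a tag name and attribute token."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0      # nesting depth of `tag` inside a match; 0 = outside
        self.chunks = []    # text pieces of the element being captured
        self.matches = []   # finished texts, one per matching element

    def _matches(self, attrs):
        for name, val in attrs:
            if name == self.attr and val is not None:
                # Assumption: the attribute holds space-separated tokens.
                return self.value in val.split()
        return False

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1   # same tag nested inside the match
        elif tag == self.tag and self._matches(attrs):
            self.depth = 1
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:   # the matching element itself just closed
                text = " ".join("".join(self.chunks).split())
                if text:
                    self.matches.append(text)


def extract_element_text(html, tag, attr, value):
    """Return the text of every `tag` element whose `attr` contains `value`."""
    parser = ElementTextExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    page = '<div class="nav">Menu</div><div class="test">Hello world</div>'
    print(extract_element_text(page, "div", "class", "test"))
```

Counting the depth of the same tag name is what lets a nested <div> inside the matched <div> be included without ending the capture too early.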
 
-----Original message-----
> From:Vishal Sharma <vi...@grazitti.com>
> Sent: Thursday 27th November 2014 18:25
> To: user <us...@nutch.apache.org>
> Subject: Re: How to parse specific html tag in nutch+solr while crawling
> 
> Hi Markus,
> 
> Thank you so much for your reply.
> 
> Quick question: does this parse only hN tags, or can we configure it for
> other HTML tags as well, such as <div class="test">?