Posted to user@nutch.apache.org by "Reyes, Mark" <Ma...@bpiedu.com> on 2013/11/20 19:21:07 UTC

Nutch 1.7: Crawling Specific Content for One Page That's Deep-linked

I have a question about crawling specific content on a single page that is reached through deep links.

- On Nutch 1.7, my crawl targets a single page that is deep-linked through fragment identifiers, such as:
http://www.mywebsite.com/1761.htm#catalog1762
http://www.mywebsite.com/1761.htm#catalog1986
http://www.mywebsite.com/1761.htm#catalog1987

- Currently, the entire document of that page is parsed, so the JSON returned from Solr carries the same content for each of the three URLs:
'content' : 'Everything at the header. Stuff about catalog 1762. Stuff about catalog 1986. Stuff about catalog 1987. Everything at the footer.'
'content' : 'Everything at the header. Stuff about catalog 1762. Stuff about catalog 1986. Stuff about catalog 1987. Everything at the footer.'
'content' : 'Everything at the header. Stuff about catalog 1762. Stuff about catalog 1986. Stuff about catalog 1987. Everything at the footer.'

- That said, I want the information returned for each URL to be based on the section that its fragment points to. The HTML that those links point to is the following:

<a id="catalog1762"></a>
<h2 class="catalog-section-headline">Catalog 1762</h2>
<span class="catalog-section-text">
Stuff about catalog 1762.
</span>

<a id="catalog1986"></a>
<h2 class="catalog-section-headline">Catalog 1986</h2>
<span class="catalog-section-text">
Stuff about catalog 1986.
</span>

<a id="catalog1987"></a>
<h2 class="catalog-section-headline">Catalog 1987</h2>
<span class="catalog-section-text">
Stuff about catalog 1987.
</span>

What would you recommend so that the JSON I get back from my Solr instance contains only the specific h2 and span content for each link, instead of the whole page?
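
To make the desired result concrete, here is a rough standalone sketch in plain Python. It is not Nutch or Solr code, and I do not know where this logic would hook into Nutch. It assumes each section starts at an <a id="..."> anchor and that only the h2 and span text after it matters; the names SectionSplitter and content_for are made up for this illustration.

```python
from html.parser import HTMLParser

# Sample markup copied from the page described above.
PAGE = """
<a id="catalog1762"></a>
<h2 class="catalog-section-headline">Catalog 1762</h2>
<span class="catalog-section-text">
Stuff about catalog 1762.
</span>

<a id="catalog1986"></a>
<h2 class="catalog-section-headline">Catalog 1986</h2>
<span class="catalog-section-text">
Stuff about catalog 1986.
</span>

<a id="catalog1987"></a>
<h2 class="catalog-section-headline">Catalog 1987</h2>
<span class="catalog-section-text">
Stuff about catalog 1987.
</span>
"""


class SectionSplitter(HTMLParser):
    """Groups h2/span text under the id of the preceding <a id="..."> anchor."""

    def __init__(self):
        super().__init__()
        self.sections = {}      # anchor id -> list of text fragments
        self.current_id = None  # id of the most recent anchor seen
        self.capture_depth = 0  # > 0 while inside an <h2> or <span>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and "id" in attr_map:
            # A new anchor starts a new section.
            self.current_id = attr_map["id"]
            self.sections[self.current_id] = []
        elif tag in ("h2", "span") and self.current_id is not None:
            self.capture_depth += 1

    def handle_endtag(self, tag):
        if tag in ("h2", "span") and self.capture_depth > 0:
            self.capture_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.capture_depth > 0 and text:
            self.sections[self.current_id].append(text)


def content_for(url, html):
    """Return the text of the section the URL fragment points at, or None."""
    fragment = url.partition("#")[2]
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    parts = parser.sections.get(fragment)
    return " ".join(parts) if parts else None


print(content_for("http://www.mywebsite.com/1761.htm#catalog1762", PAGE))
# prints: Catalog 1762 Stuff about catalog 1762.
```

In other words, for the first URL I would like 'content' to be only 'Catalog 1762 Stuff about catalog 1762.', without the header, the footer, or the other two catalog sections. The sketch would also pick up any h2 or span that follows the last anchor (in the footer, for example), so it only shows the mapping I am after.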

Thank you,
Mark
