Posted to user@nutch.apache.org by Michael Kelleher <mj...@gmail.com> on 2011/11/18 17:04:10 UTC

Crawling question

I have content that is not linked from anywhere on a site.  This
content is only reachable via a search page.

Is it possible via some type of connector or custom plugin to index this 
content?

Thanks,

--mike

Re: Crawling question

Posted by Peyman Mohajerian <mo...@gmail.com>.

Mike,

I had a similar issue. I dealt with it by changing the code in
org.apache.nutch.parse.html.HtmlParser (around line 178) to add my own
URLs to the list of outlinks, based on some info from 'content.getUrl()'.
The good thing is that this class is referenced in the 'parse-plugins.xml'
configuration, so you can create your own HTML parser and update that
config and one more conf instead of patching the stock class. In that
sense it is extensible.
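
A rough sketch of the idea in plain Java, not the actual HtmlParser code.
It only shows the "derive extra URLs from the page URL" step. The class
name, the example.com URLs, and the list of item IDs are all made up for
illustration; in a real parser the returned strings would still have to be
turned into Nutch outlinks and appended to the list the parser builds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: none of these names come from Nutch itself.
class ExtraOutlinks {

    // Assumption: the search page through which the hidden content is reached.
    static final String SEARCH_PAGE = "http://example.com/search";

    // Assumption: IDs of the unlinked items, known in advance from some other source.
    static final int[] ITEM_IDS = {101, 102, 103};

    // Returns the extra URLs to add when the page being parsed is the search
    // page, and an empty list for every other page.
    static List<String> extraUrlsFor(String pageUrl) {
        List<String> urls = new ArrayList<>();
        if (!SEARCH_PAGE.equals(pageUrl)) {
            return urls;
        }
        for (int id : ITEM_IDS) {
            urls.add("http://example.com/item?id=" + id);
        }
        return urls;
    }

    public static void main(String[] args) {
        // In a custom parser the argument would come from content.getUrl().
        for (String url : extraUrlsFor(SEARCH_PAGE)) {
            System.out.println(url);
        }
    }
}
```

Keeping the URL generation in a small helper like this means the custom
parser only needs one extra call at the point where it collects links.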

Peyman
