Posted to user@nutch.apache.org by Christopher Gross <co...@gmail.com> on 2011/12/15 21:13:12 UTC

Crawling Sharepoint

I can start crawling a SharePoint site, but every page it finds comes
back with this as its body:

" You may be trying to access this site from a secured browser on the
server. Please enable scripts and reload this page. Turn on more
accessible mode Turn off more accessible mode Skip Ribbon Commands
Skip to main content To navigate through the Ribbon, use standard
browser navigation ...."

Is this something I have to fix on the SharePoint side, or in Nutch?
I suspect that getting Nutch's authentication settings right might
fix it, but I'm not sure what needs to go in there either.

Is anyone willing to share experience or configuration files for
crawling SharePoint content with Nutch?

Nutch 1.4, SharePoint 2010, Java 6

-- Chris

Re: Crawling Sharepoint

Posted by Christopher Gross <co...@gmail.com>.
Digging further: that text is on every SharePoint page, inside a div
with class="NOINDEX". I guess the MS FAST indexer skips over it. Is
there a way for Nutch to do the same?
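
To make the idea concrete, here is a minimal standalone sketch (plain
Python, not Nutch code) of what "skipping the NOINDEX div" would mean:
extract the page text while ignoring anything nested inside an element
whose class list contains "noindex". The names NoIndexTextExtractor and
extract_indexable_text are made up for illustration. It assumes
reasonably well-formed markup (every non-void tag is closed) and does
not treat script or style content specially. Inside Nutch, equivalent
logic would presumably have to live in a custom parse plugin; that
wiring is not shown here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoIndexTextExtractor(HTMLParser):
    """Collects page text, skipping anything nested inside a 'noindex' element."""

    def __init__(self, skip_class="noindex"):
        super().__init__(convert_charrefs=True)
        self.skip_class = skip_class.lower()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped element: track nesting only.
            self.skip_depth += 1
            return
        # Class matching is case-insensitive, so class="NOINDEX" also matches.
        classes = (dict(attrs).get("class") or "").lower().split()
        if self.skip_class in classes:
            self.skip_depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close at once: nothing to track.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_indexable_text(html):
    """Return the text of `html` with 'noindex' regions left out."""
    parser = NoIndexTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><body><div class="NOINDEX">Please enable scripts<br>'
            '<span>Skip Ribbon Commands</span></div>'
            '<p>Real content</p></body></html>')
    print(extract_indexable_text(page))
```

Counting nesting depth, rather than stopping at the first closing div,
is what keeps nested elements inside the NOINDEX block from ending the
skipped region early.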

Now I'm trying to determine why I'm not getting some of the files.  On
the main page, I have a link to:

"http://url/Shared%20Documents/vi.pdf"

I have successfully run:
nutch org.apache.nutch.indexer.IndexingFiltersChecker <url>
nutch parseChecker -dumpText <url>

Both return successfully, which suggests the file can be parsed and
indexed, yet the crawl still doesn't pick it up. Any idea where to
start with the config files?

-- Chris


