Posted to user@manifoldcf.apache.org by "Silvia, Daniel [USA]" <Si...@bah.com> on 2012/02/08 14:24:26 UTC

Web Crawl using ManifoldCF

Hi Karl



Thank you for your help with the SharePoint-to-Solr connections. Everything seems to be working properly now that our SharePoint admins have set the permissions for the Viewers and Home Owners groups correctly. However, I have another question: we want to pull site content from the SharePoint instance, not the files stored on it.



When creating a repository connection, would you use the "Web" connection type to pull site content? If so, when creating the job, do you just enter the URL of the site you want to crawl in the "Seed" tab? Are we using the correct repository connection type? Is there a repository type meant for crawling websites for their content rather than their files?



I hope I have explained myself clearly: we are trying to crawl only the site content.



Thanks



Dan

RE: Web Crawl using ManifoldCF

Posted by "Silvia, Daniel [USA]" <Si...@bah.com>.

Thanks Karl

________________________________________
From: Karl Wright [daddywri@gmail.com]
Sent: Wednesday, February 08, 2012 8:40 AM
To: Silvia, Daniel [USA]
Cc: connectors-user@incubator.apache.org
Subject: Re: Web Crawl using ManifoldCF

Re: Web Crawl using ManifoldCF

Posted by Karl Wright <da...@gmail.com>.

On Wed, Feb 8, 2012 at 8:24 AM, Silvia, Daniel [USA]
<Si...@bah.com> wrote:
> Hi Karl
>
>
>
> Thank you for your help with the SharePoint-to-Solr connections.
> Everything seems to be working properly now that our SharePoint admins
> have set the permissions for the Viewers and Home Owners groups
> correctly.

That's great news!  Thanks for sticking with it. ;-)

> However, I have another question: we want to pull site content from the
> SharePoint instance, not the files stored on it.
>
>
>
> When creating a repository connection, would you use the "Web" connection
> type to pull site content? If so, when creating the job, do you just enter
> the URL of the site you want to crawl in the "Seed" tab? Are we using the
> correct repository connection type? Is there a repository type meant for
> crawling websites for their content rather than their files?
>
>

I think that's the right approach, provided there is a document you can
crawl that links to the other documents, or the documents all link to
each other.  You need such a document or documents at the root of a
document web; otherwise a web crawler has no way of locating the
documents in question.  That root is your "seed" document.  For typical
(non-SharePoint) sites, that's usually the main URL of the site.  So,
for example, if you wanted to crawl cnn.com you'd probably use a seed of
http://www.cnn.com, because that's a good place to start to get to all
of CNN's content.

If no such document(s) exist, then web crawling is not going to do it.
If this "site" is served by SharePoint, then some kind of enhancement
to the SharePoint connector would be a better approach.
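
To make the point about seeds concrete, here is a minimal sketch in
Python.  It is not ManifoldCF code and does not reflect how the Web
connector is implemented; the link graph, the example.com URLs, and the
function name reachable_from_seeds are all made up for illustration.  It
only shows that a link-following crawler can find a page only if some
chain of links leads to it from a seed.

```python
from collections import deque


def reachable_from_seeds(links, seeds):
    """Return every page a link-following crawler could discover.

    links maps a page URL to the list of URLs that page links to.
    seeds is the list of starting URLs.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Follow each outgoing link that has not been visited yet.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: these URLs and links are invented for the example.
links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
    # Nothing links to this page, so no crawl from the home page finds it.
    "http://example.com/orphan": ["http://example.com/a"],
}

found = reachable_from_seeds(links, ["http://example.com/"])
print(sorted(found))
```

In this made-up graph the "orphan" page is never discovered from the
home page seed, which is the situation where web crawling alone will
not work.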

Thanks,
Karl

>
> As you can see, I hope I have explained myself properly, we are trying to
> just crawl site content.
>
>
>
> Thanks
>
>
>
> Dan