Posted to user@nutch.apache.org by Ashish Saini <as...@gmail.com> on 2018/10/29 17:55:54 UTC

Getting Nutch To Crawl Sharepoint Online

We are looking at solutions for crawling and indexing documents in
SharePoint Online (Office 365) into Elasticsearch. We already use Nutch
1.14 for crawling websites and are looking to extend the solution to crawl
SharePoint as well.

Looking around on the Wiki, it seems adding a custom authentication scheme
and implementing an AuthScheme interface is a path available for Nutch
users.

I just wanted to see if anyone has recently crawled Sharepoint content and
if there are any caveats or tips to keep in mind.

Thanks.

Re: Getting Nutch To Crawl Sharepoint Online

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Ashish,

This is related to ManifoldCF, as Markus mentioned. You can check the
related issue here: CONNECTORS-1133
<https://jira.apache.org/jira/browse/CONNECTORS-1133>. For further
information, you can ask on the ManifoldCF user mailing list.

Kind Regards,
Furkan KAMACI

On Tue, Oct 30, 2018 at 8:17 AM Ashish Saini <as...@gmail.com> wrote:

> Markus,
>
> We looked at Apache ManifoldCF earlier. While it provides connectors for
> source systems of Sharepoint 2003 to 2016, it does not provide one for
> Sharepoint Online.
>
> Our source system is Sharepoint Online (Azure AD working with our own ADFS
> Identity Provider).
>
> The authentication would involve getting an authentication token from Azure
> AD (after providing on-premise AD account and password), which is then
> presented to Sharepoint Online for validation, and two cookies FedAuth and
> rtFA are obtained.
>
> Thanks
> -Ashish

Re: Getting Nutch To Crawl Sharepoint Online

Posted by Ashish Saini <as...@gmail.com>.
Markus,

We looked at Apache ManifoldCF earlier. While it provides connectors for
SharePoint 2003 through 2016, it does not provide one for SharePoint
Online.

Our source system is SharePoint Online (Azure AD working with our own ADFS
Identity Provider).

Authentication would involve getting a token from Azure AD (after
providing the on-premises AD account and password) and presenting it to
SharePoint Online for validation, which returns two cookies, FedAuth and
rtFA.
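
As a rough sketch of the last step only: assuming the FedAuth and rtFA
values have already been obtained from the sign-in flow, they would be sent
back on each request as a single Cookie header. The class name, method name
and placeholder values below are hypothetical, and this is plain Java, not
Nutch or SharePoint API code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SharePointCookieHeader {

    /**
     * Joins cookie name/value pairs into one Cookie request header value,
     * with pairs separated by "; ".
     */
    static String buildCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Placeholder values (assumption): the real ones come from the
        // SharePoint Online response after the Azure AD token is validated.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("FedAuth", "<FedAuth value>");
        cookies.put("rtFA", "<rtFA value>");

        System.out.println("Cookie: " + buildCookieHeader(cookies));
    }
}
```

How such a header would be wired into the Nutch fetcher, and how the
cookies get refreshed when they expire, is not covered by this sketch.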

Thanks
-Ashish

On Mon, Oct 29, 2018 at 5:16 PM Markus Jelsma <ma...@openindex.io>
wrote:

> Hello Ashish,
>
> You might want to check out Apache ManifoldCF.
>
> Regards.
> Markus
>
> http://manifoldcf.apache.org/

RE: Getting Nutch To Crawl Sharepoint Online

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Ashish,

You might want to check out Apache ManifoldCF.

Regards.
Markus

http://manifoldcf.apache.org/

 
 