Posted to user@nutch.apache.org by rashmi maheshwari <ma...@gmail.com> on 2014/01/26 16:25:25 UTC

Fwd: Search Engine Framework decision

Hi,

I want to create a POC to search our INTRANET along with the documents
uploaded on it. Documents (PDF, Excel, Word documents, text files, images,
videos) also exist on SHAREPOINT. SharePoint has authentication at the
module level (folder level).

My intranet website is http://myintranet/ <http://sparsh/> and the
SharePoint URL is different. Documents also exist in file folders.

I have the following queries:
A) Which crawler framework should I use along with Solr for this POC, "Nutch"
or "Apache ManifoldCF"?

B) Is it possible to crawl SharePoint documents using Nutch? If yes, would
a configuration-level change alone make this possible, or do I have to write
code to parse the documents and send them to Solr?

C) Which versions of Solr + Nutch + MCF should be used, given that the Nutch
version depends on the Solr version? Would Nutch 1.7 work properly with Solr
4.6.0?
-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org

Re: Search Engine Framework decision

Posted by Tejas Patil <te...@gmail.com>.
On Sun, Jan 26, 2014 at 9:45 PM, rashmi maheshwari <
maheshwari.rashmi@gmail.com> wrote:

> Thanks, Tejas and Ahmet, for the quick response.
>
>
> Could I put the intranet web URL, the file folder path and the SharePoint URL
> in the same seed.txt file under the urls folder in Nutch?
>

Yes. Please see this tutorial for crawling:
http://wiki.apache.org/nutch/NutchTutorial
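
A minimal sketch of the single seed file asked about above. Nutch reads its
seed list as plain text with one URL per line, so the three kinds of location
can sit side by side. Only http://myintranet/ comes from this thread; the
SharePoint and file URLs are hypothetical placeholders, and write_seed_file is
an illustrative helper, not part of Nutch. Fetching file: URLs is assumed to
need extra Nutch configuration (a file protocol plugin and relaxed URL
filters), which this sketch does not attempt.

```python
from pathlib import Path

# Hypothetical seeds: only http://myintranet/ comes from the thread.
SEEDS = [
    "http://myintranet/",
    "http://sharepoint.example/sites/docs/",  # assumed SharePoint URL
    "file:///srv/shared/docs/",               # assumed file folder path
]


def write_seed_file(urls, directory="urls", name="seed.txt"):
    """Write one URL per line to <directory>/<name> and return the path."""
    # Drop blank entries and surrounding whitespace so every line is a URL.
    cleaned = [u.strip() for u in urls if u.strip()]
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    seed_path = target_dir / name
    seed_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return seed_path
```

Calling write_seed_file(SEEDS) would create urls/seed.txt in the current
directory, which is the layout the question describes.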


Re: Search Engine Framework decision

Posted by rashmi maheshwari <ma...@gmail.com>.
Thanks, Tejas and Ahmet, for the quick response.


Could I put the intranet web URL, the file folder path and the SharePoint URL
in the same seed.txt file under the urls folder in Nutch?


Regards,
Rashmi


Re: Search Engine Framework decision

Posted by Tejas Patil <te...@gmail.com>.
On Sun, Jan 26, 2014 at 8:55 PM, rashmi maheshwari <
maheshwari.rashmi@gmail.com> wrote:

> Hi,
>
> I want to create a POC to search our INTRANET along with the documents
> uploaded on it. Documents (PDF, Excel, Word documents, text files, images,
> videos) also exist on SHAREPOINT. SharePoint has authentication at the
> module level (folder level).
>
> My intranet website is http://myintranet/ <http://sparsh/> and the
> SharePoint URL is different. Documents also exist in file folders.
>
> I have the following queries:
> A) Which crawler framework should I use along with Solr for this POC, "Nutch"
> or "Apache ManifoldCF"?
>
> B) Is it possible to crawl SharePoint documents using Nutch? If yes, would
> a configuration-level change alone make this possible, or do I have to write
> code to parse the documents and send them to Solr?
>

As long as you have http:// or https:// URLs for the documents (and there is
no weird IIS authentication), Nutch should be able to do this. Nutch delegates
parsing of documents to Apache Tika, so you need to check whether Tika
supports parsing of the document types you need.
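
The http:// or https:// condition above can be checked up front. A minimal
sketch, assuming the candidate URLs are already known: it only inspects each
URL's scheme and says nothing about authentication or about whether Tika can
parse the content. The example URLs other than http://myintranet/ are
hypothetical, and split_by_http is an illustrative helper, not a Nutch API.

```python
from urllib.parse import urlsplit

HTTP_SCHEMES = {"http", "https"}


def split_by_http(urls):
    """Return (http_urls, other_urls), decided by each URL's scheme."""
    http_urls, other_urls = [], []
    for url in urls:
        scheme = urlsplit(url.strip()).scheme.lower()
        if scheme in HTTP_SCHEMES:
            http_urls.append(url)
        else:
            other_urls.append(url)
    return http_urls, other_urls


if __name__ == "__main__":
    candidates = [
        "http://myintranet/",
        "https://sharepoint.example/docs/report.pdf",  # hypothetical
        "file:///srv/shared/docs/report.pdf",          # hypothetical
    ]
    reachable, needs_review = split_by_http(candidates)
    print("http/https:", reachable)
    print("other schemes:", needs_review)
```

URLs that land in the second list are the ones that fall outside the case
described in the reply and would need a closer look.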

>
> C) Which versions of Solr + Nutch + MCF should be used, given that the Nutch
> version depends on the Solr version? Would Nutch 1.7 work properly with Solr
> 4.6.0?
>

I have not used MCF, so I will leave that part of the question for others.
Nutch 1.7 + Solr 4.5 works fine.

