Posted to user@nutch.apache.org by "Stevenson, Kerry" <Ke...@gwl.ca> on 2006/08/10 23:28:01 UTC

Nutch vs. Google Appliance

Hello all - I have been taking a look at Nutch for purposes of indexing
a large pile of internal LAN files at our company, and so far it looks
quite impressive. I believe it could substitute for the Google Mini
appliance. However, the bigger Google boxes add more features that I am
not sure can be duplicated in Nutch. Specifically I am interested in the
indexing and searching for secured files. Apparently Google will index
all files, including those that are secure (given appropriate authority)
- but will only show search results based on the security and
credentials of the searcher. In other words, if you don't have access to
a document, Google won't show you that it even exists. Can something
like that be done in Nutch? Are there other differences between Nutch
and Google? 

-----
The contents of this communication, including any attachment(s), are
confidential and may be privileged. If you are not the intended
recipient (or are not receiving this communication on behalf of the
intended recipient), please notify the sender immediately and delete or
destroy this communication without reading it, and without making,
forwarding, or retaining any copy or record of it or its contents. Thank
you. Note: We have taken precautions against viruses, but take no
responsibility for loss or damage caused by any virus present.


Re: Nutch vs. Google Appliance

Posted by jian chen <ch...@gmail.com>.
What I don't like about commercial products like the Google Mini is that
they charge you based on the number of documents you are allowed to
index, while at its core the software is probably the same, with just
some switches turned on or off.

I know that you can use httpclient and java to do NTLM authentication. I
also did some experiments and successfully used a modified version of the
httpclient library to do NTLM v2 authentication.

Someone posted Java code that patches httpclient to do that. If you
would like, I can send you the link.

To use Nutch to index intranet https pages, it is just a matter of
modifying the http plugin code or the httpclient plugin code.

I am happy to provide you with some more info if you want to know details...

Jian Chen
Lead Developer
www.destinationlighting.com

On 8/11/06, Sami Siren <ss...@gmail.com> wrote:
>
> Stevenson, Kerry wrote:
> > Hello all - I have been taking a look at Nutch for purposes of indexing
> > a large pile of internal LAN files at our company, and so far it looks
> > quite impressive. I believe it could substitute for the Google Mini
> > appliance. However, the bigger Google boxes add more features that I am
> > not sure can be duplicated in Nutch. Specifically I am interested in the
> > indexing and searching for secured files. Apparently Google will index
> > all files, including those that are secure (given appropriate authority)
>
> According to its javadocs, HttpClient supports NTLM, so I guess the
> fetching part can be tackled with a little custom coding.
>
> > - but will only show search results based on the security and
> > credentials of the searcher. In other words, if you don't have access to
>
> We can get the Windows identity of the IE user (NTLM) requesting the
> search page, but I don't see where one could automatically get the
> authorization data for arbitrary objects fetched from the intranet.
>
> Once one figures out how to get that data, it's again quite easy to
> implement dynamic filtering of search results based on user identity.
>
> > a document, Google won't show you that it even exists. Can something
> > like that be done in Nutch? Are there other differences between Nutch
> > and Google?
>
> There might be some features not available, but then again the source
> is open and nothing is stopping anyone from creating such features and
> optimally (and optionally) contributing them back to the community.
>
> --
>   Sami Siren
>

Re: Nutch vs. Google Appliance

Posted by Sami Siren <ss...@gmail.com>.
Stevenson, Kerry wrote:
> Hello all - I have been taking a look at Nutch for purposes of indexing
> a large pile of internal LAN files at our company, and so far it looks
> quite impressive. I believe it could substitute for the Google Mini
> appliance. However, the bigger Google boxes add more features that I am
> not sure can be duplicated in Nutch. Specifically I am interested in the
> indexing and searching for secured files. Apparently Google will index
> all files, including those that are secure (given appropriate authority)

According to its javadocs, HttpClient supports NTLM, so I guess the
fetching part can be tackled with a little custom coding.

> - but will only show search results based on the security and
> credentials of the searcher. In other words, if you don't have access to

We can get the Windows identity of the IE user (NTLM) requesting the
search page, but I don't see where one could automatically get the
authorization data for arbitrary objects fetched from the intranet.

Once one figures out how to get that data, it's again quite easy to
implement dynamic filtering of search results based on user identity.
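
A minimal sketch of what such dynamic filtering could look like, in
plain Java and independent of Nutch. Everything here is hypothetical:
the class name AclResultFilter, the URL-to-principals map, and the
example URLs and group names are illustrations only, not Nutch APIs.
It also assumes the authorization data was somehow captured at crawl
time, which is exactly the part left open above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical post-search filter; none of these names come from Nutch. */
final class AclResultFilter {

    // URL -> principals (users or groups) allowed to read it.
    // Assumption: this map was populated at crawl time by some other means.
    private final Map<String, Set<String>> allowedByUrl;

    AclResultFilter(Map<String, Set<String>> allowedByUrl) {
        this.allowedByUrl = allowedByUrl;
    }

    /**
     * Returns the hits the user may see, in their original order.
     * A hit with no recorded ACL is dropped, so the filter fails closed.
     */
    List<String> filter(List<String> hitUrls, Set<String> userPrincipals) {
        List<String> visible = new ArrayList<>();
        for (String url : hitUrls) {
            Set<String> allowed = allowedByUrl.get(url);
            // Keep the hit only if the user shares at least one principal with the ACL.
            if (allowed != null && !Collections.disjoint(allowed, userPrincipals)) {
                visible.add(url);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> acl = new HashMap<>();
        acl.put("http://intranet/hr/salaries.doc",
                new HashSet<>(Arrays.asList("CORP\\hr")));
        acl.put("http://intranet/handbook.pdf",
                new HashSet<>(Arrays.asList("CORP\\hr", "CORP\\staff")));

        AclResultFilter filter = new AclResultFilter(acl);
        List<String> hits = Arrays.asList(
                "http://intranet/hr/salaries.doc",
                "http://intranet/handbook.pdf",
                "http://intranet/unknown.txt");

        // A user who belongs only to the (made-up) CORP\staff group.
        Set<String> user = new HashSet<>(Arrays.asList("CORP\\staff"));
        System.out.println(filter.filter(hits, user));
    }
}
```

Filtering after the search keeps the index itself unchanged, but a
page of results can shrink once hidden hits are removed, so the caller
may need to fetch more hits than it intends to display.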

> a document, Google won't show you that it even exists. Can something
> like that be done in Nutch? Are there other differences between Nutch
> and Google? 

There might be some features not available, but then again the source
is open and nothing is stopping anyone from creating such features and
optimally (and optionally) contributing them back to the community.

--
  Sami Siren

