
Posted to solr-dev@lucene.apache.org by David Thibault <da...@itstrategypartners.com> on 2007/11/23 16:28:10 UTC

Solr for enterprise search

Hello all,

I'm new to Solr.  From what little I have seen, Solr has made great strides
in open source search, but is lacking some significant features that would
really allow it to become a viable alternative to things like FAST and
Autonomy for enterprise search.  I am sure these issues have been discussed
on the list before, but I would like to help push these issues forward if I
can:

1) Crawling--ShareHound does Windows shares, but it ignores document-level
permissions.  A modular approach to crawling file systems, websites,
intranet sites, etc, would be huge.  Also, I realize Nutch has a crawler but
Solr looks much more full-featured in terms of things like faceted search,
etc, so I'd rather help push Solr forward.

2) ACLs and document-level security--The lack of doc-level security is a
real deal-breaker in terms of indexing enterprise fileshares.  I could
envision this type of functionality to be embedded in the various crawlers
above, on an OS-dependent or web app-dependent basis.  For example, when
indexing a file from a share, the ACL should be indexed as well, that way a
results list can be brought back and the permissions would not need to be
re-checked against the original file server.  Also, this implies that ACL
changes need to be monitored and updated as well as file content changes.
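
To make point 2 concrete, here is a rough sketch of how a crawler might carry a file's ACL into the index as a multi-valued field of a Solr XML update message. It is an illustration only: the field names "id", "content", and "acl", the helper build_add_message, and the sample path and principals are all hypothetical and would have to match whatever schema is actually defined.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc_id, content, allowed_principals):
    """Build a Solr XML <add> message that carries the file's ACL.

    The field names "id", "content" and "acl" are assumptions; they
    must exist in the schema ("acl" as a multi-valued string field).
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="content").text = content
    # One <field> element per user or group allowed to read the file.
    for principal in sorted(allowed_principals):
        ET.SubElement(doc, "field", name="acl").text = principal
    return ET.tostring(add, encoding="unicode")


# Hypothetical file and principals, for illustration only.
message = build_add_message(
    "//fileserver/share/report.doc",
    "Quarterly report text",
    {"group:finance", "user:alice"},
)
print(message)
```

Because the ACL is stored with the document in this sketch, a permission change on the file would mean sending the document again, which is why ACL changes would need to be monitored alongside content changes.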

There are other differences, obviously, between the leading commercial
products and Solr, but those two features alone would make a huge difference
in the power of Solr, in my opinion. I have little Java experience, but I
could easily prototype this functionality in other languages and work with
others to integrate them into the code base in Java.  Also, I headed up an
enterprise search request for information for a large pharmaceutical company
in the past, so I am familiar with the feature sets of FAST and Autonomy,
and I could help manage the project in terms of competing feature sets.

Best,
Dave

Re: Solr for enterprise search

Posted by Ken Krugler <kk...@transpac.com>.

Hi Dave,

>I'm new to Solr.  From what little I have seen, Solr has made great strides
>in open source search, but is lacking some significant features that would
>really allow it to become a viable alternative to things like FAST and
>Autonomy for enterprise search.  I am sure these issues have been discussed
>on the list before, but I would like to help push these issues forward if I
>can:
>
>1) Crawling--ShareHound does Windows shares, but it ignores document-level
>permissions.  A modular approach to crawling file systems, websites,
>intranet sites, etc, would be huge.  Also, I realize Nutch has a crawler but
>Solr looks much more full-featured in terms of things like faceted search,
>etc, so I'd rather help push Solr forward.

I think pushing Solr into the crawler space is probably going to be a 
non-starter. Solr's focus is on the index management and serving side 
of things, which is very different from the multitude of issues faced 
by a crawler.

>2) ACLs and document-level security--The lack of doc-level security is a
>real deal-breaker in terms of indexing enterprise fileshares.  I could
>envision this type of functionality to be embedded in the various crawlers
>above, on an OS-dependent or web app-dependent basis.  For example, when
>indexing a file from a share, the ACL should be indexed as well, that way a
>results list can be brought back and the permissions would not need to be
>re-checked against the original file server.  Also, this implies that ACL
>changes need to be monitored and updated as well as file content changes.
>
>There are other differences, obviously, between the leading commercial
>products and Solr, but those two features alone would make a huge difference
>in the power of Solr, in my opinion. I have little Java experience, but I
>could easily prototype this functionality in other languages and work with
>others to integrate them into the code base in Java.  Also, I headed up an
>enterprise search request for information for a large pharmaceutical company
>in the past, so I am familiar with the feature sets of FAST and Autonomy,
>and I could help manage the project in terms of competing feature sets.

If you're looking for a good, free crawler then I'd try IBM's 
OmniFind Yahoo! Edition. Handles up to 500K documents, Java-based, 
uses Lucene under the hood. In fact, I wonder if you could 
reverse-engineer the index and create a Solr schema for it? :)

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Solr for enterprise search

Posted by Mike Klaas <mi...@gmail.com>.

On 23-Nov-07, at 7:28 AM, David Thibault wrote:

> Hello all,
>
> I'm new to Solr.  From what little I have seen, Solr has made great strides
> in open source search, but is lacking some significant features that would
> really allow it to become a viable alternative to things like FAST and
> Autonomy for enterprise search.  I am sure these issues have been discussed
> on the list before, but I would like to help push these issues forward if I
> can:

It sounds to me like you are describing an application that can be  
built with Solr rather than what Solr aims to provide.  That said, I  
see no reason that there couldn't exist some add-on modules providing  
this functionality.

> 1) Crawling--ShareHound does Windows shares, but it ignores document-level
> permissions.  A modular approach to crawling file systems, websites,
> intranet sites, etc, would be huge.  Also, I realize Nutch has a crawler but
> Solr looks much more full-featured in terms of things like faceted search,
> etc, so I'd rather help push Solr forward.

It seems to me that every domain would require a different schema and  
have different requirements.  I'm not sure that the solution to this  
problem belongs in Solr.

> 2) ACLs and document-level security--The lack of doc-level security is a
> real deal-breaker in terms of indexing enterprise fileshares.  I could
> envision this type of functionality to be embedded in the various crawlers
> above, on an OS-dependent or web app-dependent basis.  For example, when
> indexing a file from a share, the ACL should be indexed as well, that way a
> results list can be brought back and the permissions would not need to be
> re-checked against the original file server.  Also, this implies that ACL
> changes need to be monitored and updated as well as file content changes.

Again, I don't see this as within the purview of Solr.  Solr provides  
lots of functionality to help implement access control (namely, rich  
filtering and faceting support), and may provide more once updateable  
documents are implemented.  However, it has no concept of users,  
files, permissions, monitoring OS-level changes, etc.  Growing such  
awareness seems somewhat outside what Solr should provide.
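
As a minimal sketch of the filtering approach, an application sitting in front of Solr could restrict every search to documents whose ACL field contains one of the current user's principals by adding a filter query (the fq parameter). The field name "acl", the helper names, and the sample principals are assumptions for illustration; resolving a user to a list of principals is left to the application, since Solr has no notion of users.

```python
from urllib.parse import urlencode


def escape_term(term):
    """Backslash-escape characters that are special in Lucene query syntax."""
    special = set('\\+-!(){}[]^"~*?:/&| ')
    return "".join("\\" + ch if ch in special else ch for ch in term)


def build_select_params(user_query, principals, acl_field="acl"):
    """Return URL-encoded parameters for a Solr select request.

    acl_field is a hypothetical multi-valued field holding the users
    and groups allowed to read each document.
    """
    clauses = " OR ".join(escape_term(p) for p in sorted(principals))
    return urlencode({"q": user_query, "fq": f"{acl_field}:({clauses})"})


# Hypothetical principals for the user running the search.
params = build_select_params("budget", ["user:alice", "group:finance"])
print(params)
```

The resulting string would be appended to the select handler URL. Keeping the restriction in fq rather than in q leaves the user's query and relevance scoring untouched, and the filter must be added server-side so that a client cannot simply omit it.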

> There are other differences, obviously, between the leading commercial
> products and Solr, but those two features alone would make a huge difference
> in the power of Solr, in my opinion. I have little Java experience, but I
> could easily prototype this functionality in other languages and work with
> others to integrate them into the code base in Java.  Also, I headed up an
> enterprise search request for information for a large pharmaceutical company
> in the past, so I am familiar with the feature sets of FAST and Autonomy,
> and I could help manage the project in terms of competing feature sets.

Again, this feels more like an application to me.  I could see  
someone putting together a solution to these problems in one package,  
perhaps by distributing a separate webapp along with solr, complete  
with a pre-defined schema, a nicer admin console, and automatic  
crawling/indexing tools.  In fact, I suspect such a product would be  
very cool and garner lots of attention.  I don't see Solr becoming  
that product, though.  Besides being outside the scope of the  
project, I think there might be a lack of interest among the core  
devs to develop and maintain that direction.  Mightn't it be better  
to start a separate project, where a different set of people (with  
different priorities and interests) could have full control?

This situation is analogous to Solr and Lucene: they are tightly  
integrated, and several people contribute to both, but they are  
different layers, and can proceed somewhat independently.  And that  
is a Good Thing.

-Mike