Posted to solr-user@lucene.apache.org by Cathy Hemsley <ca...@converteam.com> on 2011/01/14 13:05:28 UTC

Solr: using it to recursively index large folders containing lots of different documents, and querying over the web

Hi Solr users,

I hope you can help.  We are migrating our intranet web site management
system to Windows 2008 and need a replacement for Index Server to do the
text searching.  I am trying to establish whether Lucene and Solr are a
feasible replacement, but I cannot find the answers to these questions:

1. Can Solr be set up to recursively index a folder containing a large and
varying number of subfolders, holding files of all types: XML, HTML, PDF,
DOC, spreadsheets, PowerPoint presentations, text files, etc.?  If so, how?
2. Can Solr be queried over the web to return a list of files that match a
search query entered by a user, along with abstracts for those files and
'hit highlighting'?  If so, how?
3. Can Solr be run as a service (like Index Server) that automatically
detects changes to the files within the indexed folder and updates the
index?  If so, how?

Thanks for your help

Cathy Hemsley



Re: Solr: using it to recursively index large folders containing lots of different documents, and querying over the web

Posted by Lance Norskog <go...@gmail.com>.
Solr itself does all three things. There is no need for Nutch; that is
needed for crawling web sites, not file systems (as the original
question specifies).
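
For question 1 specifically, one way to get recursive indexing of a mixed
folder tree with Solr alone is the DataImportHandler contrib: a
FileListEntityProcessor walks the directory tree and hands each file to a
TikaEntityProcessor for text extraction (TikaEntityProcessor only exists in
the newer releases/trunk, so check your version). The data-config.xml below
is only a sketch; the baseDir and the field names (id, last_modified,
content, title) are assumptions that would have to match your own schema.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Walk the folder tree recursively; each file becomes one document -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:/intranet/docs" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <field column="fileAbsolutePath" name="id"/>
      <field column="fileLastModified" name="last_modified"/>
      <!-- Tika parses PDF, DOC, XLS, PPT, HTML, XML, plain text, ... -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="content"/>
        <field column="title" name="title"/>
      </entity>
    </entity>
  </document>
</dataConfig>

With that registered under a /dataimport handler in solrconfig.xml, a full
(re)index is just http://localhost:8983/solr/dataimport?command=full-import.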

Solr operates as a web service, running in any Java servlet container.
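
For question 2, result lists, abstracts and hit highlighting are just
parameters on an HTTP GET against that web service, so any browser or
intranet page can issue the query. A sketch, where the query term and the
field names are only examples and would need to match your schema:

http://localhost:8983/solr/select
    ?q=maintenance+schedule
    &fl=id,title,score
    &hl=true
    &hl.fl=content
    &hl.snippets=2
    &rows=20
    &wt=xml

The response carries a highlighting section with snippets of the matching
text for each hit, which you can render next to the file links.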

Detecting changes to files is trickier: there is no ready-made
real-time update mechanism for Windows, so you would have to implement
that yourself. Otherwise you can poll the file system and re-index
altered files.
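
A rough sketch of what such a poller could look like, using SolrJ and the
/update/extract (Solr Cell) handler from the extraction contrib. Everything
here is an assumption used to illustrate the idea rather than something Solr
ships: the class is hypothetical, the schema is assumed to use the file path
as its uniqueKey "id", and deletion of vanished files is left out.

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

/** Hypothetical poller: re-posts any file whose timestamp has changed. */
public class FolderPoller {

    private final SolrServer solr;
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();

    public FolderPoller(String solrUrl) throws Exception {
        this.solr = new CommonsHttpSolrServer(solrUrl); // SolrJ 1.4/3.x client
    }

    /** Run this from a scheduled task or Windows service every few minutes. */
    public void pollOnce(File dir) throws Exception {
        File[] children = dir.listFiles();
        if (children == null) return;
        for (File f : children) {
            if (f.isDirectory()) {
                pollOnce(f);                          // recurse into subfolders
            } else {
                Long seen = lastSeen.get(f.getAbsolutePath());
                if (seen == null || f.lastModified() > seen) {
                    reindex(f);
                    lastSeen.put(f.getAbsolutePath(), f.lastModified());
                }
            }
        }
    }

    private void reindex(File f) throws Exception {
        // Solr Cell lets Tika parse PDF, DOC, PPT, HTML, XML, plain text, ...
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(f);
        req.setParam("literal.id", f.getAbsolutePath()); // path as unique key
        solr.request(req);
        solr.commit();
    }
}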

On Fri, Jan 14, 2011 at 4:54 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> Nutch can crawl the file system as well. Nutch 1.x can also provide search but
> this is delegated to Solr in Nutch 2.x. Solr can provide the search and Nutch
> can provide Solr with content from your intranet.
>
> On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote:
>> Hi,
>> Thanks for suggesting this.
>> However, I'm not sure a 'crawler' will work:  the various pages are not
>> necessarily linked (it's complicated:  basically our intranet is a dynamic
>> and managed collection of independently published web sites, and users
>> find information using categorisation and/or text searching), so we need
>> something that will index all the files in a given folder, rather than
>> follow links like a crawler. Can Nutch do this, as well as meet the other
>> requirements below?
>> Regards
>> Cathy
>>
>> On 14 January 2011 12:09, Markus Jelsma <ma...@openindex.io> wrote:
>> > Please visit the Nutch project. It is a powerful crawler and can
>> > integrate with Solr.
>> >
>> > http://nutch.apache.org/
>> >
>> > > Hi Solr users,
>> > >
>> > > I hope you can help.  We are migrating our intranet web site management
>> > > system to Windows 2008 and need a replacement for Index Server to do
>> > > the text searching.  I am trying to establish whether Lucene and Solr
>> > > are a feasible replacement, but I cannot find the answers to these
>> > > questions:
>> > >
>> > > 1. Can Solr be set up to recursively index a folder containing a large
>> > > and varying number of subfolders, holding files of all types: XML,
>> > > HTML, PDF, DOC, spreadsheets, PowerPoint presentations, text files,
>> > > etc.?  If so, how?
>> > > 2. Can Solr be queried over the web to return a list of files that
>> > > match a search query entered by a user, along with abstracts for those
>> > > files and 'hit highlighting'?  If so, how?
>> > > 3. Can Solr be run as a service (like Index Server) that automatically
>> > > detects changes to the files within the indexed folder and updates the
>> > > index?  If so, how?
>> > >
>> > > Thanks for your help
>> > >
>> > > Cathy Hemsley
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Lance Norskog
goksron@gmail.com

Re: Solr: using it to recursively index large folders containing lots of different documents, and querying over the web

Posted by Markus Jelsma <ma...@openindex.io>.
Nutch can crawl the file system as well. Nutch 1.x can also provide search but 
this is delegated to Solr in Nutch 2.x. Solr can provide the search and Nutch 
can provide Solr with content from your intranet.
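
For the file-system case the usual Nutch recipe (details vary a little
between versions, so treat this as a sketch) is to enable the protocol-file
plugin, let the URL filters accept file: URLs, and seed the crawl with the
folder:

# urls/seed.txt -- crawl starting point(s)
file:///D:/intranet/docs/

# conf/regex-urlfilter.txt -- the default rules skip file: URLs
# -^(file|ftp|mailto):          <- comment this exclusion out
+^file://

# conf/nutch-site.xml -- swap protocol-http for protocol-file
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|tika)|index-basic|scoring-opic</value>
</property>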

On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote:
> Hi,
> Thanks for suggesting this.
> However, I'm not sure a 'crawler' will work:  the various pages are not
> necessarily linked (it's complicated:  basically our intranet is a dynamic
> and managed collection of independently published web sites, and users
> find information using categorisation and/or text searching), so we need
> something that will index all the files in a given folder, rather than
> follow links like a crawler. Can Nutch do this, as well as meet the other
> requirements below?
> Regards
> Cathy
> 
> On 14 January 2011 12:09, Markus Jelsma <ma...@openindex.io> wrote:
> > Please visit the Nutch project. It is a powerful crawler and can
> > integrate with Solr.
> > 
> > http://nutch.apache.org/
> > 
> > > Hi Solr users,
> > > 
> > > I hope you can help.  We are migrating our intranet web site management
> > > system to Windows 2008 and need a replacement for Index Server to do
> > > the text searching.  I am trying to establish whether Lucene and Solr
> > > are a feasible replacement, but I cannot find the answers to these
> > > questions:
> > > 
> > > 1. Can Solr be set up to recursively index a folder containing a large
> > > and varying number of subfolders, holding files of all types: XML,
> > > HTML, PDF, DOC, spreadsheets, PowerPoint presentations, text files,
> > > etc.?  If so, how?
> > > 2. Can Solr be queried over the web to return a list of files that
> > > match a search query entered by a user, along with abstracts for those
> > > files and 'hit highlighting'?  If so, how?
> > > 3. Can Solr be run as a service (like Index Server) that automatically
> > > detects changes to the files within the indexed folder and updates the
> > > index?  If so, how?
> > > 
> > > Thanks for your help
> > > 
> > > Cathy Hemsley

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Solr: using it to recursively index large folders containing lots of different documents, and querying over the web

Posted by Markus Jelsma <ma...@openindex.io>.
Please visit the Nutch project. It is a powerful crawler and can integrate 
with Solr.

http://nutch.apache.org/
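
For reference, the Nutch 1.x / Solr integration boils down to running a crawl
and then pushing the parsed segments into Solr with the solrindex job. The
exact arguments differ between releases, so treat this as a sketch:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50000
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

After that, the documents are searchable through Solr's normal /select handler.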

> Hi Solr users,
> 
> I hope you can help.  We are migrating our intranet web site management
> system to Windows 2008 and need a replacement for Index Server to do the
> text searching.  I am trying to establish whether Lucene and Solr are a
> feasible replacement, but I cannot find the answers to these questions:
> 
> 1. Can Solr be set up to recursively index a folder containing a large and
> varying number of subfolders, holding files of all types: XML, HTML, PDF,
> DOC, spreadsheets, PowerPoint presentations, text files, etc.?  If so, how?
> 2. Can Solr be queried over the web to return a list of files that match a
> search query entered by a user, along with abstracts for those files and
> 'hit highlighting'?  If so, how?
> 3. Can Solr be run as a service (like Index Server) that automatically
> detects changes to the files within the indexed folder and updates the
> index?  If so, how?
> 
> Thanks for your help
> 
> Cathy Hemsley

Re: Solr: using it to recursively index large folders containing lots of different documents, and querying over the web

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2011-01-14 at 13:05 +0100, Cathy Hemsley wrote:
> I hope you can help.  We are migrating our intranet web site management
> system to Windows 2008 and need a replacement for Index Server to do the
> text searching.  I am trying to establish whether Lucene and Solr are a
> feasible replacement, but I cannot find the answers to these questions:

The answer to each of your questions is both yes and no. Solr does not
do what you ask out of the box, but it can certainly be done by
extending Solr or by using it at the core of another system.

Some time ago I stumbled upon http://www.constellio.com/, which seems to
be exactly what you're looking for.