Posted to solr-user@lucene.apache.org by Developer In London <eb...@gmail.com> on 2009/04/24 12:18:44 UTC

Is solr right for this scenario?

Hi All,

I am new to the whole Solr/Lucene community. But I think this might be the
solution to what I am looking to do. I would appreciate any feedback on how
I can go about doing this with Solr:

I am looking to make a system where -
a) mainly lots of different blog sites, web journals, articles are indexed
on a regular basis. Data that has already been indexed needs to be revisited
to see if there are any changes.
b) The end users have very fixed search terms, e.g. 'Lloyds TSB' and 'Corporate
Banking'. All the documents that are found matching this are presented to a
human to analyse.
c) Once the human analyses the document he gives it a rating of 1, 0 or -1.
This rating needs to be saved somewhere and be linked with the specific
document and also with the search term (e.g. 'Lloyds TSB' & 'Corporate
Banking' in this case).
d) End users can then see these documents with the ratings next to them.

What would be the best approach to this?

Should I set up a different database to save the rating and relevant
mappings, or is there any way to put it into Solr?

My second question is: can a Solr index be saved in a database in any way?
What's the backup and recovery method for Solr?

Thanks in advance.

Nayeem

Re: Is solr right for this scenario?

Posted by Eric Pugh <ep...@opensourceconnections.com>.

On Apr 24, 2009, at 7:54 AM, Developer In London wrote:

> Thanks for the fast reply. Wow this seems a very active community.
>
> I have a few more questions in that case:
>
> 1) If Solr is going to be file-based, is it then preferable to run
> multiple Solrs with shards? How can I determine what capacity one Solr
> instance can cope with?
It depends!  Solr can manage up to X records easily in a single index,
however your mileage may vary.  One of the nice things about Solr is that
it is very scalable and offers you many options.  I would go with the
simplest Solr setup for now, and then, as your development progresses
and you load data, investigate sharding etc.  Solr, properly managed,
won't be your bottleneck; it'll be your data loading scripts or
something elsewhere.
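
If sharding does become necessary later, Solr's distributed search takes a
`shards` request parameter that lists the shard addresses. The following is
only a minimal Python sketch of building such a query URL. The host names,
the port, and the helper name build_sharded_query_url are assumptions made
for illustration and are not taken from this thread.

```python
from urllib.parse import urlencode


def build_sharded_query_url(base_url, query, shards):
    """Return a select URL asking one Solr node to fan the query out to all shards."""
    params = {
        "q": query,
        # Comma-separated host:port/path entries, without the http:// scheme.
        "shards": ",".join(shards),
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical hosts; replace with the real shard addresses.
url = build_sharded_query_url(
    "http://solr1.example.com:8983/solr",
    '"Lloyds TSB"',
    ["solr1.example.com:8983/solr", "solr2.example.com:8983/solr"],
)
print(url)
```

Starting with a single index and adding the `shards` parameter only once one
node is no longer enough keeps the client code almost unchanged.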
>
>
> 2) I am presuming there are already tokenizers for hypertext and XML
> in Solr so that it can extract the right information out?
There are a number of different options available out there for  
indexing content.
>
>
> 3) I need to also get the 'author' information out for things like
> blogs. I guess there's no universal way of doing it and I have to have
> someone manually go through the documents and feed the Solr index with
> the author information?
Your loading script will be bespoke to your situation; however, any
competent developer can put together scripts to load from your various
data sources.
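
As one illustration of what such a bespoke loader step might look like, the
Python sketch below pulls an author name out of an HTML page. It assumes the
page carries a <meta name="author"> tag, which many blogs do not, so a real
loader would likely need per-source rules. The names AuthorMetaParser and
extract_author are hypothetical.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collects the content of the first <meta name="author"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once an author has been found.
        if tag != "meta" or self.author is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "author":
            self.author = attr_map.get("content")


def extract_author(html_text):
    """Return the author string, or None when the page does not declare one."""
    parser = AuthorMetaParser()
    parser.feed(html_text)
    return parser.author


page = '<html><head><meta name="author" content="Jane Doe"></head><body>Post</body></html>'
print(extract_author(page))
```

Pages for which this returns None could be queued for the manual review
mentioned in the question.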
>
>
> When you mention 'write a loader script...', do you mean I should
> incorporate the date checking in the loader script? Solr has no
> internal way of checking the timestamp in a document and updating?
Solr makes no assumptions about your data sources; it isn't a document
management system, it is just a search engine.  Well, that isn't
totally true: the new DataImportHandler architecture does allow you to
preserve some information about "when did I last run an update, what
has been updated since"; however, it's pretty new stuff.
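
For the loader-script route, the date check can be as small as comparing each
source's last-modified time with the time it was last indexed. The Python
sketch below shows that comparison only. The document ids, the timestamps,
and the function names are made up for illustration, and fetching the real
timestamps (from an RSS feed, a file, or Solr) is left out.

```python
from datetime import datetime, timezone


def needs_reindex(source_last_modified, indexed_at):
    """True when the source changed after it was last indexed, or was never indexed."""
    return indexed_at is None or source_last_modified > indexed_at


def select_stale(sources, indexed_times):
    """sources: {doc_id: last_modified}; indexed_times: {doc_id: indexed_at}."""
    return sorted(
        doc_id
        for doc_id, modified in sources.items()
        if needs_reindex(modified, indexed_times.get(doc_id))
    )


utc = timezone.utc
sources = {
    "blog-1": datetime(2009, 4, 24, 10, 0, tzinfo=utc),
    "blog-2": datetime(2009, 4, 20, 9, 0, tzinfo=utc),
    "blog-3": datetime(2009, 4, 23, 8, 0, tzinfo=utc),
}
indexed_times = {
    "blog-1": datetime(2009, 4, 23, 0, 0, tzinfo=utc),  # older than the source: stale
    "blog-2": datetime(2009, 4, 21, 0, 0, tzinfo=utc),  # newer than the source: fresh
    # blog-3 has never been indexed
}
stale = select_stale(sources, indexed_times)
print(stale)
```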


Eric

-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal

Re: Is solr right for this scenario?

Posted by Developer In London <eb...@gmail.com>.

Thanks for the fast reply. Wow this seems a very active community.

I have a few more questions in that case:

1) If Solr is going to be file-based, is it then preferable to run multiple
Solrs with shards? How can I determine what capacity one Solr instance can
cope with?

2) I am presuming there are already tokenizers for hypertext and XML in Solr
so that it can extract the right information out?

3) I need to also get the 'author' information out for things like blogs. I
guess there's no universal way of doing it and I have to have someone
manually go through the documents and feed the Solr index with the author
information?

When you mention 'write a loader script...', do you mean I should
incorporate the date checking in the loader script? Solr has no internal way
of checking the timestamp in a document and updating?



Thanks,

Nayeem


-- 
cashflowclublondon.co.uk

                      ("`-''-/").___..--''"`-._
                       `6_ 6  )   `-.  (     ).`-.__.`)
                       (_Y_.)'  ._   )  `._ `. ``-..-'
                     _..`--'_..-_/  /--'_.' ,'
                    (il),-''  (li),'  ((!.-'
.

Re: Is solr right for this scenario?

Posted by Eric Pugh <ep...@opensourceconnections.com>.

It seems like you have three components to your system:

1) Data indexing from multiple sources

2) Search for specific words in documents

3) Preserve rating and search term.

I think that Solr comes into play on #1 and #2.  You can index content
in any number of ways, either via the new DataImportHandler
architecture or the more traditional approach of writing a loader script
that puts the documents in Solr.  You can store in Solr when a document
was indexed, and use that to check against the original documents to see
if they changed.  Check a last published tag on an RSS feed, or the
last updated time on a physical file.  This is a very common use case
for Solr.
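
To make the "store when a document was indexed" idea concrete, here is a
minimal Python sketch that builds a Solr XML <add> message carrying an
indexing timestamp. The field names (id, title, text, indexed_at) are
assumptions and would have to match whatever is declared in your schema.xml.
Posting the message to the update handler and committing is not shown.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_add_message(doc_id, title, body, indexed_at):
    """Return a Solr XML <add> message for one document as a string."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": doc_id,
        "title": title,
        "text": body,
        # Solr date fields expect UTC in this ISO 8601 form.
        "indexed_at": indexed_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    "blog-1",
    "Corporate banking news",
    "Example body text.",
    datetime(2009, 4, 24, 12, 18, 44, tzinfo=timezone.utc),
)
print(message)
```

On the next run, the loader can read indexed_at back from Solr and compare it
with the source's last published or last updated time.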

For #2, you could have users issue queries and make them "favorites",
storing them in the DB.  Assuming they like the results, they mark the
documents with the ratings, which you could store in Solr, but I would
put them in a DB.  It is easier to manage "User A says 1, User B says 0"
there.

Then for the UI, just issue the search based on queries stored in the
DB, and match the ids up with the ratings in the DB.  Simple!
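
A minimal sketch of that DB side, using Python's built-in sqlite3 module. The
table layout, the column names, and the attach_ratings helper are assumptions
for illustration, and the list of document ids stands in for what a real Solr
query would return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ratings (
        search_term TEXT NOT NULL,
        doc_id      TEXT NOT NULL,
        user_name   TEXT NOT NULL,
        rating      INTEGER NOT NULL CHECK (rating IN (-1, 0, 1)),
        PRIMARY KEY (search_term, doc_id, user_name)
    )
    """
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [
        ("Lloyds TSB", "doc-1", "user_a", 1),
        ("Lloyds TSB", "doc-1", "user_b", 0),
        ("Lloyds TSB", "doc-2", "user_a", -1),
    ],
)


def attach_ratings(conn, search_term, doc_ids):
    """Map each doc id from a search result to its {user: rating} dict."""
    merged = {doc_id: {} for doc_id in doc_ids}
    rows = conn.execute(
        "SELECT doc_id, user_name, rating FROM ratings WHERE search_term = ?",
        (search_term,),
    )
    for doc_id, user_name, rating in rows:
        if doc_id in merged:
            merged[doc_id][user_name] = rating
    return merged


# Stand-in for the ids returned by a Solr query for the stored search term.
solr_doc_ids = ["doc-1", "doc-2", "doc-3"]
merged = attach_ratings(conn, "Lloyds TSB", solr_doc_ids)
print(merged)
```

Keying the table on search term, document id, and user keeps each rating tied
to both the document and the query that surfaced it.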

As for the last part, Solr works best on the filesystem; that is part
of why it is so fast, with no clunky SQL.  There are scripts for backing
up and restoring indexes that you can use; check the wiki:
http://wiki.apache.org/solr/SolrOperationsTools
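
Because the index is just files on disk, a backup is at heart a consistent
copy of the index directory. The Python sketch below illustrates only that
idea. It is not the scripts from the wiki, and it assumes nothing is writing
to the index while the copy runs. The directory layout, the snapshot naming,
and the function names are hypothetical.

```python
import shutil
import tempfile
import time
from pathlib import Path


def snapshot_index(index_dir, backup_root):
    """Copy index_dir to backup_root/snapshot.<timestamp> and return the new path."""
    stamp = time.strftime("%Y%m%d%H%M%S")
    target = Path(backup_root) / f"snapshot.{stamp}"
    shutil.copytree(index_dir, target)
    return target


def restore_index(snapshot_dir, index_dir):
    """Replace index_dir with the contents of snapshot_dir."""
    shutil.rmtree(index_dir, ignore_errors=True)
    shutil.copytree(snapshot_dir, index_dir)


# Demonstration on a throwaway directory rather than a real index.
work = Path(tempfile.mkdtemp())
index_dir = work / "index"
index_dir.mkdir()
(index_dir / "segments_1").write_text("dummy segment data")

snap = snapshot_index(index_dir, work / "backups")
(index_dir / "segments_1").write_text("corrupted")
restore_index(snap, index_dir)
print((index_dir / "segments_1").read_text())
```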

Eric
