Posted to solr-user@lucene.apache.org by Darx Oman <da...@gmail.com> on 2011/04/11 08:20:13 UTC

Indexing Best Practice

Hi guys

I'm wondering how best to configure Solr to fulfill my requirements.

I'm indexing data from 2 data sources:
1- Database
2- PDF files (password encrypted)

Every file has related information stored in the database.  Both the file
content and the related database fields must be indexed as one document in
Solr.  The DB data includes *per-user* permissions for every document.

The file contents almost never change.  The DB data, on the other hand, and
especially the permissions, change very frequently, which requires me to
re-index everything for every modified document.

My problem is that decrypting the PDF files before re-indexing them takes
too much time for a large number of documents; a full re-index could take
days.

What I'm trying to accomplish is to avoid re-indexing the PDF content when
it has not changed, even if the DB data has.  I know this is not possible in
Solr, because Solr doesn't update documents in place.

So how can I best accomplish this?

Can I use 2 indexes, one for the PDF contents and the other for the DB data,
with a common id field as a link between them, *and have the results treated
as one Document*?

Re: Indexing Best Practice

Posted by Darx Oman <da...@gmail.com>.
Hi Lance

Thanks for your reply, but I have a question:
is this patch committed to trunk?

Re: Indexing Best Practice

Posted by Lance Norskog <go...@gmail.com>.
SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource.
This means that you can read the database and PDFs separately. You
could index all of the PDF content in one DIH script. Then, when
there's a database update, you have a separate DIH script that reads
the old row from Solr, pulls out the stripped PDF text, and then
re-indexes the whole thing. This would cut out the need to reparse
the PDF.
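
As a rough illustration of that merge step (not the SOLR-1499 plug-in
itself, and not DIH configuration), the idea can be sketched in plain
Python. The field names "id", "content", "title" and "allowed_users" are
made up for the example, and the sketch assumes the extracted PDF text
was indexed as a stored field so it can be read back from Solr.

```python
from typing import Any, Dict


def rebuild_document(old_doc: Dict[str, Any],
                     db_row: Dict[str, Any],
                     text_field: str = "content") -> Dict[str, Any]:
    """Build a replacement document from fresh DB fields plus the
    PDF text that was already stored with the old Solr document.

    old_doc    -- the document as previously read back from Solr
    db_row     -- the current database row for the same id
    text_field -- hypothetical name of the stored PDF-text field
    """
    if text_field not in old_doc:
        # Without a stored copy of the text, the PDF has to be parsed again.
        raise KeyError(f"old document has no stored '{text_field}' field")
    if old_doc.get("id") != db_row.get("id"):
        raise ValueError("old document and DB row refer to different ids")

    new_doc = dict(db_row)                      # all DB fields come from the fresh row
    new_doc[text_field] = old_doc[text_field]   # PDF text is reused, not re-extracted
    return new_doc


# Example with made-up data: only the permissions changed in the database.
old = {"id": "42", "content": "extracted pdf text", "allowed_users": ["alice"]}
row = {"id": "42", "title": "Report", "allowed_users": ["bob", "carol"]}
print(rebuild_document(old, row))
```

The resulting document would then be sent to Solr as a normal add, which
replaces the old document with the same id.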

Lance

On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell
<ca...@gmail.com> wrote:
> If it's of any help I've split the processing of PDF files from the
> indexing. I put the PDF content into a text file (but I guess you could load
> it into a database) and use that as part of the indexing.  My processing of
> the PDF files also compares timestamps on the document and the text file so
> that I'm only processing documents that have changed.
>
> I am a newbie so perhaps there are more sophisticated approaches.
>
> Hope that helps.
> Shaun

-- 
Lance Norskog
goksron@gmail.com

Re: Indexing Best Practice

Posted by Shaun Campbell <ca...@gmail.com>.
If it's of any help I've split the processing of PDF files from the
indexing. I put the PDF content into a text file (but I guess you could load
it into a database) and use that as part of the indexing.  My processing of
the PDF files also compares timestamps on the document and the text file so
that I'm only processing documents that have changed.
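
A minimal sketch of that timestamp check, in Python, might look like the
following. It is only an illustration of the idea, not my actual script:
the function names are invented, and extract_text stands in for whatever
decrypts the PDF and pulls the text out of it.

```python
import os
from pathlib import Path
from typing import Callable


def needs_extraction(pdf_path: Path, text_path: Path) -> bool:
    """Return True if the cached text file is missing or older than the PDF."""
    if not text_path.exists():
        return True
    return pdf_path.stat().st_mtime > text_path.stat().st_mtime


def refresh_text_cache(pdf_path: Path,
                       text_path: Path,
                       extract_text: Callable[[Path], str]) -> bool:
    """Re-extract the PDF text only when the PDF changed.

    extract_text is a hypothetical callback that does the expensive
    decrypt-and-parse step. Returns True if extraction was performed.
    """
    if not needs_extraction(pdf_path, text_path):
        return False  # cached text is still current; skip the slow step
    text_path.write_text(extract_text(pdf_path), encoding="utf-8")
    return True
```

The indexing step then reads only the cached text files, so a change in
the database never triggers a new decryption of an unchanged PDF.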

I am a newbie so perhaps there are more sophisticated approaches.

Hope that helps.
Shaun
