Posted to solr-user@lucene.apache.org by Veselin Kantsev <ve...@campbell-lange.net> on 2009/04/06 11:24:44 UTC

How could I avoid reindexing same files?

Hello,
apologies for the basic question.

How can I avoid double indexing files?

All my files are in one folder which is scanned frequently. Is there a
Solr feature that checks whether a file has already been indexed and is
unchanged, and skips it if so?


Thank you.

Regards,
Veselin K


Re: How could I avoid reindexing same files?

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>Hi Fergus,
>
>On Tue, Apr 07, 2009 at 05:06:23PM +0100, Fergus McMenemie wrote:
>> >Thank you much Fergus,
>> >
>> >I was considering implementing a database which would hold a path name
>> >and an MD5 sum of each file.
>> Snap. That is close to what we did. However, due to our previous
>> duff full text search engine we had to hold this information in
>> a separate checksums file. Solr is much better at allowing you
>> to add extra meta information as the document is being submitted
>> for indexing.
>> 
>> curl http://localhost...update/extract \
>>    -F "myfile=@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=XXXXX"
>
>- Great idea, simpler and cleaner!
>
> 
>> >Then as a part of Solr indexing, one could check against the DB if a
>> >file path exists, if Yes, then compare MD5 and only index if different.
>> Using solr you could hold the checksum and pathname as solr fields,
>> then rather than looking up a DB you would look up solr. Having every
>> thing in the one place is better for consistency and quality. You
>> could also dump all checksums and pathnames from solr if/when you wanted
>> to validate your folder structure and or indexes.
>
>- What kind of query could I use with Solr, to check for a specific
>  filename/checksum and get an answer as close to "TRUE or FALSE" as possible?

Some thought needs to be given to this to make sure that
the performance is adequate. But at its simplest:-

curl "http://localhost.../select?q=id:file.pdf&fl=id,chksum"
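In Python terms, the client-side decision that lookup enables might look like
the sketch below. This is illustrative only: the "chksum" field name follows
the curl examples in this thread, and the Solr query itself is left out.

```python
import hashlib

def file_checksum(path):
    """MD5 of the file contents, matching the chksum stored at index time."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

def needs_reindex(path, solr_doc):
    """Decide whether to resubmit a file.

    solr_doc is the document returned by the id/chksum lookup above,
    or None when the id was not found in the index."""
    if solr_doc is None:
        return True  # never indexed before
    return solr_doc.get("chksum") != file_checksum(path)
```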
-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: How could I avoid reindexing same files?

Posted by Veselin Kantsev <ve...@campbell-lange.net>.
Hi Fergus,

On Tue, Apr 07, 2009 at 05:06:23PM +0100, Fergus McMenemie wrote:
> >Thank you much Fergus,
> >
> >I was considering implementing a database which would hold a path name
> >and an MD5 sum of each file.
> Snap. That is close to what we did. However, due to our previous
> duff full text search engine we had to hold this information in
> a separate checksums file. Solr is much better at allowing you
> to add extra meta information as the document is being submitted
> for indexing.
> 
> curl http://localhost...update/extract \
>    -F "myfile=@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=XXXXX"

- Great idea, simpler and cleaner!

 
> >Then as a part of Solr indexing, one could check against the DB if a
> >file path exists, if Yes, then compare MD5 and only index if different.
> Using solr you could hold the checksum and pathname as solr fields,
> then rather than looking up a DB you would look up solr. Having every
> thing in the one place is better for consistency and quality. You
> could also dump all checksums and pathnames from solr if/when you wanted
> to validate your folder structure and or indexes.

- What kind of query could I use with Solr, to check for a specific
  filename/checksum and get an answer as close to "TRUE or FALSE" as possible?

Regards,
Veselin K

Re: How could I avoid reindexing same files?

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>Thank you much Fergus,
>
>I was considering implementing a database which would hold a path name
>and an MD5 sum of each file.
Snap. That is close to what we did. However, due to our previous
duff full text search engine we had to hold this information in
a separate checksums file. Solr is much better at allowing you
to add extra meta information as the document is being submitted
for indexing.

curl http://localhost...update/extract \
   -F "myfile=@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=XXXXX"

>Then as a part of Solr indexing, one could check against the DB if a
>file path exists, if Yes, then compare MD5 and only index if different.
Using solr you could hold the checksum and pathname as solr fields,
then rather than looking up a DB you would look up solr. Having every
thing in the one place is better for consistency and quality. You
could also dump all checksums and pathnames from solr if/when you wanted
to validate your folder structure and or indexes.

>Regards,
>Veselin K
>
>On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>> Veselin,
>> 
>> Well, as far as solr is concerned, there are two issues here:-
>> 
>> 1) To stop the same document ending up in the indexes twice, use the document
>>    pathname as the unique ID. Then if you do index it twice, the previous index
>>    information will be discarded. Not very efficient, but it may be tolerable.
>>    IMHO using pathname as the unique ID is often best practice.
>> 
>> 2) To stop a document even being submitted to solr, you need to implement some
>>    middleware that either performs a search/lookup using a document's pathname
>>    to see if it is already indexed, or, after examining timestamps, only submits
>>    documents which have changed since the last folder scan.
>> 
>> Fergus.
>> >Hello Paul,
>> >I'm indexing with "curl http://localhost... -F myfile=@file.pdf" 
>> >
>> >Regards,
>> >Veselin K
>> >
>> >
>> >On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>> >> how are you indexing?
>> >> 
>> >> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
>> >> <ve...@campbell-lange.net> wrote:
>> >> > Hello,
>> >> > apologies for the basic question.
>> >> >
>> >> > How can I avoid double indexing files?
>> >> >
>> >> > In case all my files are in one folder which is scanned frequently, is
>> >> > there a Solr feature of checking and skipping a file if it has already been indexed
>> >> > and not changed since?
>> >> >
>> >> >
>> >> > Thank you.
>> >> >
>> >> > Regards,
>> >> > Veselin K

>> >> --Noble Paul
-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: How could I avoid reindexing same files?

Posted by Veselin K <ve...@campbell-lange.net>.
Useful tip Erik, this will save a lot of hassle.

Thank you much.

Regards,
Veselin K


On Tue, Apr 07, 2009 at 11:29:38AM -0400, Erik Hatcher wrote:
> Note that Solr (trunk, soon to be 1.4) has a duplicate detection feature 
> that may work for your need. See 
> http://wiki.apache.org/solr/Deduplication (looks like docs need updating 
> to say 1.4 here) and http://issues.apache.org/jira/browse/SOLR-799
>
> 	Erik
>
>
> On Apr 7, 2009, at 11:25 AM, Veselin K wrote:
>
>> Thank you much Fergus,
>>
>> I was considering implementing a database which would hold a path name
>> and an MD5 sum of each file.
>>
>> Then as a part of Solr indexing, one could check against the DB if a
>> file path exists, if Yes, then compare MD5 and only index if  
>> different.
>>
>>
>> Regards,
>> Veselin K
>>
>> On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>>> Veselin,
>>>
>>> Well, as far as solr is concerned, there are two issues here:-
>>>
>>> 1) To stop the same document ending up in the indexes twice, use the
>>>    document pathname as the unique ID. Then if you do index it twice, the
>>>    previous index information will be discarded. Not very efficient, but it
>>>    may be tolerable. IMHO using pathname as the unique ID is often best
>>>    practice.
>>>
>>> 2) To stop a document even being submitted to solr, you need to implement
>>>    some middleware that either performs a search/lookup using a document's
>>>    pathname to see if it is already indexed, or, after examining timestamps,
>>>    only submits documents which have changed since the last folder scan.
>>>
>>> Fergus.

Re: How could I avoid reindexing same files?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Note that Solr (trunk, soon to be 1.4) has a duplicate detection  
feature that may work for your need. See http://wiki.apache.org/solr/Deduplication 
  (looks like docs need updating to say 1.4 here) and http://issues.apache.org/jira/browse/SOLR-799
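For reference, that feature is configured through an update-processor chain in
solrconfig.xml. The wiki's example looks roughly like the following sketch; the
field names, the "fields" list, and the class names are taken from the wiki
page at the time, so check them against your Solr version:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```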

	Erik


On Apr 7, 2009, at 11:25 AM, Veselin K wrote:

> Thank you much Fergus,
>
> I was considering implementing a database which would hold a path name
> and an MD5 sum of each file.
>
> Then as a part of Solr indexing, one could check against the DB if a
> file path exists, if Yes, then compare MD5 and only index if  
> different.
>
>
> Regards,
> Veselin K
>
> On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>> Veselin,
>>
>> Well, as far as solr is concerned, there are two issues here:-
>>
>> 1) To stop the same document ending up in the indexes twice, use the
>>    document pathname as the unique ID. Then if you do index it twice, the
>>    previous index information will be discarded. Not very efficient, but it
>>    may be tolerable. IMHO using pathname as the unique ID is often best
>>    practice.
>>
>> 2) To stop a document even being submitted to solr, you need to implement
>>    some middleware that either performs a search/lookup using a document's
>>    pathname to see if it is already indexed, or, after examining timestamps,
>>    only submits documents which have changed since the last folder scan.
>>
>> Fergus.
>>> Hello Paul,
>>> I'm indexing with "curl http://localhost... -F myfile=@file.pdf"
>>>
>>> Regards,
>>> Veselin K
>>>
>>>
>>> On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>>> how are you indexing?
>>>>
>>>> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
>>>> <ve...@campbell-lange.net> wrote:
>>>>> Hello,
>>>>> apologies for the basic question.
>>>>>
>>>>> How can I avoid double indexing files?
>>>>>
>>>>> In case all my files are in one folder which is scanned  
>>>>> frequently, is
>>>>> there a Solr feature of checking and skipping a file if it has  
>>>>> already been indexed
>>>>> and not changed since?
>>>>>
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Veselin K
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> --Noble Paul
>>
>> -- 
>>
>> ===============================================================
>> Fergus McMenemie               Email:fergus@twig.me.uk
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===============================================================


Re: How could I avoid reindexing same files?

Posted by Veselin K <ve...@campbell-lange.net>.
Thank you much Fergus,

I was considering implementing a database which would hold a path name
and an MD5 sum of each file.

Then as a part of Solr indexing, one could check against the DB if a
file path exists, if Yes, then compare MD5 and only index if different.
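The plan above can be sketched as a scan loop in Python. This is illustrative
only; here `known` stands in for the path-to-MD5 database, whatever form it
takes.

```python
import hashlib
import os

def md5_of(path):
    """MD5 sum of a file, read in chunks to handle large files."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

def files_to_index(folder, known):
    """Yield (path, md5) for files that are new or have changed.

    known maps path -> MD5 recorded at the last indexing run, i.e. the
    database described above; unchanged files are skipped."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        digest = md5_of(path)
        if known.get(path) != digest:
            yield path, digest
```

After each successful submission to Solr, the new digest would be written back
into the database so the next scan skips the file.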


Regards,
Veselin K

On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
> Veselin,
> 
> Well, as far as solr is concerned, there are two issues here:-
> 
> 1) To stop the same document ending up in the indexes twice, use the document
>    pathname as the unique ID. Then if you do index it twice, the previous index
>    information will be discarded. Not very efficient, but it may be tolerable.
>    IMHO using pathname as the unique ID is often best practice.
> 
> 2) To stop a document even being submitted to solr, you need to implement some
>    middleware that either performs a search/lookup using a document's pathname
>    to see if it is already indexed, or, after examining timestamps, only submits
>    documents which have changed since the last folder scan.
> 
> Fergus.
> >Hello Paul,
> >I'm indexing with "curl http://localhost... -F myfile=@file.pdf" 
> >
> >Regards,
> >Veselin K
> >
> >
> >On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >> how are you indexing?
> >> 
> >> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
> >> <ve...@campbell-lange.net> wrote:
> >> > Hello,
> >> > apologies for the basic question.
> >> >
> >> > How can I avoid double indexing files?
> >> >
> >> > In case all my files are in one folder which is scanned frequently, is
> >> > there a Solr feature of checking and skipping a file if it has already been indexed
> >> > and not changed since?
> >> >
> >> >
> >> > Thank you.
> >> >
> >> > Regards,
> >> > Veselin K
> >> >
> >> >
> >> 
> >> 
> >> 
> >> -- 
> >> --Noble Paul
> 
> -- 
> 
> ===============================================================
> Fergus McMenemie               Email:fergus@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
> 
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================

Re: How could I avoid reindexing same files?

Posted by Fergus McMenemie <fe...@twig.me.uk>.
Veselin,

Well, as far as solr is concerned, there are two issues here:-

1) To stop the same document ending up in the indexes twice, use the document
   pathname as the unique ID. Then if you do index it twice, the previous index
   information will be discarded. Not very efficient, but it may be tolerable.
   IMHO using pathname as the unique ID is often best practice.

2) To stop a document even being submitted to solr, you need to implement some
   middleware that either performs a search/lookup using a document's pathname
   to see if it is already indexed, or, after examining timestamps, only submits
   documents which have changed since the last folder scan.
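The timestamp variant of option 2 could be as simple as the sketch below
(illustrative Python, not part of Solr). Note that mtime comparisons can miss
edits that preserve the modification time, which is why a checksum check is
more robust.

```python
import os

def changed_since(folder, last_scan):
    """Files in folder modified after last_scan (seconds since the epoch).

    Returns the paths to submit for (re)indexing on this scan; the caller
    records the scan time and passes it in on the next run."""
    changed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and os.path.getmtime(path) > last_scan:
            changed.append(path)
    return changed
```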

Fergus.
>Hello Paul,
>I'm indexing with "curl http://localhost... -F myfile=@file.pdf" 
>
>Regards,
>Veselin K
>
>
>On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>> how are you indexing?
>> 
>> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
>> <ve...@campbell-lange.net> wrote:
>> > Hello,
>> > apologies for the basic question.
>> >
>> > How can I avoid double indexing files?
>> >
>> > In case all my files are in one folder which is scanned frequently, is
>> > there a Solr feature of checking and skipping a file if it has already been indexed
>> > and not changed since?
>> >
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Veselin K
>> >
>> >
>> 
>> 
>> 
>> -- 
>> --Noble Paul

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: How could I avoid reindexing same files?

Posted by Veselin K <ve...@campbell-lange.net>.
Hello Paul,
I'm indexing with "curl http://localhost... -F myfile=@file.pdf" 

Regards,
Veselin K


On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> how are you indexing?
> 
> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
> <ve...@campbell-lange.net> wrote:
> > Hello,
> > apologies for the basic question.
> >
> > How can I avoid double indexing files?
> >
> > In case all my files are in one folder which is scanned frequently, is
> > there a Solr feature of checking and skipping a file if it has already been indexed
> > and not changed since?
> >
> >
> > Thank you.
> >
> > Regards,
> > Veselin K
> >
> >
> 
> 
> 
> -- 
> --Noble Paul

Re: How could I avoid reindexing same files?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
how are you indexing?

On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
<ve...@campbell-lange.net> wrote:
> Hello,
> apologies for the basic question.
>
> How can I avoid double indexing files?
>
> In case all my files are in one folder which is scanned frequently, is
> there a Solr feature of checking and skipping a file if it has already been indexed
> and not changed since?
>
>
> Thank you.
>
> Regards,
> Veselin K
>
>



-- 
--Noble Paul