Posted to solr-user@lucene.apache.org by ku...@mail.org on 2012/04/20 16:07:31 UTC

Storing the md5 hash of pdf files as a field in the index

Hi,

 I want to build an index of quite a number of pdf and msword files using the Data Import Request Handler and the Tika Entity Processor. It works very well. Now I would like to use the md5 digest of the binary (pdf/word) file as the unique key in the index. But I do not know how to implement this. In the data-config.xml configuring the FileListEntityProcessor I have access to the absolute file name of a pdf to be indexed. I'm sitting on a Linux box and so there is an easy way to calculate the md5 hash using the operating system command md5sum. But how can I trigger this calculation and store the result as a field in my index?

 Any tips or ideas are really appreciated.

 Thanks.
 Joe

Re: Storing the md5 hash of pdf files as a field in the index

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Joe,

You could write a custom Update Request Processor (URP). This URP would take the value from one SolrDocument field (say, the one that holds the full path to your PDF and is thus unique), compute the MD5 using the Java API for that, and store the result in a field that you've defined as a string.
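
As a minimal sketch of the hashing step only, assuming Java 7 or later and the standard java.security.MessageDigest API: the class name Md5Util and its method names are hypothetical, not part of Solr. It hashes the file contents (what md5sum does) rather than the path string, so two copies of the same PDF would get the same key.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Solr class): computes MD5 digests as lowercase hex.
public class Md5Util {

    // Streams the file through the digest so large PDFs are not loaded into memory.
    public static String md5Hex(Path file) throws IOException {
        MessageDigest md = newMd5();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest());
    }

    // Convenience overload for data that is already in memory.
    public static String md5Hex(byte[] data) {
        return toHex(newMd5().digest(data));
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Formats each byte as two lowercase hex digits, matching md5sum output.
    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

A custom URP could call Md5Util.md5Hex(...) from its processAdd method with the path read from the document, then set the result on the string field used as the unique key. Which field holds the path, and what the key field is called, depends on your data-config.xml and schema.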

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html


