Posted to solr-user@lucene.apache.org by sabman <sa...@gmail.com> on 2011/07/15 22:41:14 UTC
Indexing PDF documents with no UniqueKey
I want to index PDF (and other rich) documents. I am using the
DataImportHandler.
Here is how my schema.xml looks:
.........
.........
<field name="title" type="text" indexed="true" stored="true"
multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true"
multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true"
multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true"
multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true"
multiValued="false"/>
........
........
<uniqueKey>link</uniqueKey>
As you can see, I have set link as the unique key so that documents are not
duplicated when indexing runs again. The file paths are stored in a database,
and I have set up the DataImportHandler to fetch the list of all file paths
and index each document. To test it I used the tutorial.pdf file that comes
with the example docs in Solr. The problem, of course, is that this PDF
document has no 'link' field. I am looking for a way to manually set the
file path as link when indexing these documents. I tried the data-config
settings below:
<entity name="fileItems" rootEntity="false" dataSource="dbSource"
        query="select path from file_paths">
  <entity name="tika-test" processor="TikaEntityProcessor"
          url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <entity name="filePath" dataSource="dbSource" query="SELECT path FROM
            file_paths as link where path = '${fileItems.path}'">
      <field column="link" name="link"/>
    </entity>
  </entity>
</entity>
Here I create a sub-entity that queries for the path and is meant to return
the result in a column named 'link'. But I still see this error:
WARNING: Error creating document :
SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z},
title=title(1.0)={Solr tutorial}}]
org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: link
Is there any way for me to create a field called link for the PDF documents?
This was asked here before:
http://lucene.472066.n3.nabble.com/Trouble-with-exception-Document-Null-missing-required-field-DocID-td1641048.html
but the solution provided there uses ExtractingRequestHandler, whereas I want
to do this through the DataImportHandler.
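
For illustration, one direction that might be worth trying is sketched
below. It is untested and rests on an assumption: that the DataImportHandler's
TemplateTransformer can be attached to the Tika entity and can resolve
${fileItems.path} from the parent row. The entity and data source names are
the ones from the config above; the 'link' field with a template attribute
is the only addition, and the data source definitions are left out because
they are not shown in this post.

```xml
<!-- Untested sketch: populate 'link' from the parent entity's path column. -->
<entity name="fileItems" rootEntity="false" dataSource="dbSource"
        query="select path from file_paths">
  <!-- Assumption: TemplateTransformer runs on each row Tika produces. -->
  <entity name="tika-test" processor="TikaEntityProcessor"
          transformer="TemplateTransformer"
          url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <!-- Hypothetical addition: copy the file path into the uniqueKey field. -->
    <field column="link" template="${fileItems.path}"/>
  </entity>
</entity>
```

Separately, in the sub-entity query above, "file_paths as link" aliases the
table rather than the column, so the returned column is still named path.
A query of the form "SELECT path AS link FROM file_paths WHERE ..." would be
needed for the result to contain a column called link.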