Posted to solr-user@lucene.apache.org by sabman <sa...@gmail.com> on 2011/07/29 23:47:24 UTC
Error with Extracting PDF metadata

I am using Solr 3.3 and I am trying to extract and index metadata from PDF
files. I am using the DataImportHandler with the TikaEntityProcessor to add
the documents. Here are the fields as defined in my schema.xml file:


   <field name="title" type="text" indexed="true" stored="true"
          multiValued="false"/>
   <field name="description" type="text" indexed="true" stored="true"
          multiValued="false"/>
   <field name="date_published" type="string" indexed="false" stored="true"
          multiValued="false"/>
   <field name="link" type="string" indexed="true" stored="true"
          multiValued="false" required="false"/>
   <field name="imgName" type="string" indexed="false" stored="true"
          multiValued="false" required="false"/>
   <dynamicField name="attr_*" type="textgen" indexed="true" stored="true"
                 multiValued="false"/>

So I suppose the metadata should be indexed and stored in fields prefixed
with "attr_".

Here is what my data config file looks like. It takes a source directory path
from a database and passes it to a FileListEntityProcessor, which passes each
of the PDF files found in the directory to the TikaEntityProcessor to extract
and index the content.

<entity onError="skip" name="fileSourcePaths" rootEntity="false"
        dataSource="dbSource" fileName=".*pdf"
        query="select path from file_sources">
  <entity name="fileSource" processor="FileListEntityProcessor"
          transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}"
          recursive="true" rootEntity="false">
    <field name="link" column="fileAbsolutePath" thumbnail="true"/>
    <field name="imgName" column="imgName"/>
    <entity rootEntity="true" onError="abort" name="file"
            processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}"
            dataSource="fileSource" format="text">
      <field column="resourceName" name="title" meta="true"/>
      <field column="Creation-Date" name="date_published" meta="true"/>
      <field column="text" name="description"/>
    </entity>
  </entity>
</entity>
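
To make the intended mapping explicit, here is a rough sketch that lists which
Tika metadata columns the innermost entity maps to which Solr fields. It is
plain Python using only the standard library, not anything Solr runs. The
CONFIG string is a trimmed copy of the TikaEntityProcessor entity above, and
the function name meta_mappings is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the TikaEntityProcessor entity from my data config.
CONFIG = """
<entity rootEntity="true" onError="abort" name="file"
        processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}"
        dataSource="fileSource" format="text">
  <field column="resourceName" name="title" meta="true"/>
  <field column="Creation-Date" name="date_published" meta="true"/>
  <field column="text" name="description"/>
</entity>
"""


def meta_mappings(xml_text):
    """Return {tika_column: solr_field} for fields flagged meta="true"."""
    root = ET.fromstring(xml_text)
    return {
        field.get("column"): field.get("name")
        for field in root.iter("field")
        if field.get("meta") == "true"
    }


if __name__ == "__main__":
    for column, solr_field in meta_mappings(CONFIG).items():
        print(f"{column} -> {solr_field}")
```

The "text" column is left out on purpose because it is the extracted body
rather than a metadata key.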

It extracts the description and Creation-Date just fine, but it does not seem
to extract resourceName, so there is no title field for the documents when I
query the index. This is weird because both Creation-Date and resourceName are
metadata. Also, none of the other available metadata is being stored under the
attr_ fields. I came across some threads which said there are known problems
with Tika 0.8, so I downloaded Tika 0.9 and replaced 0.8 with it. I also
upgraded pdfbox, jempbox and fontbox from 1.3 to 1.4.

I tested one of the PDFs separately with just Tika to see what metadata is
stored with the file. This is what I found:

Content-Length: 546459
Content-Type: application/pdf
Creation-Date: 2010-06-09T12:11:12Z
Last-Modified: 2010-06-09T14:53:38Z
created: Wed Jun 09 08:11:12 EDT 2010
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
producer: Antenna House PDF Output Library 2.6.0 (Windows)
resourceName: Argentina.pdf
trapped: False
xmpTPg:NPages: 2
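
For comparison against the config, the following sketch parses the dump above
and lists the metadata keys that have no explicit field entry in my data
config. It is only an illustration in standard-library Python; the names
parse_metadata, unmapped_keys and MAPPED_COLUMNS are invented here, and the
script says nothing about how Solr itself treats those keys.

```python
# Metadata dump copied from the standalone Tika run.
TIKA_OUTPUT = """\
Content-Length: 546459
Content-Type: application/pdf
Creation-Date: 2010-06-09T12:11:12Z
Last-Modified: 2010-06-09T14:53:38Z
created: Wed Jun 09 08:11:12 EDT 2010
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
producer: Antenna House PDF Output Library 2.6.0 (Windows)
resourceName: Argentina.pdf
trapped: False
xmpTPg:NPages: 2
"""

# Columns flagged meta="true" in my data config.
MAPPED_COLUMNS = {"resourceName", "Creation-Date"}


def parse_metadata(dump):
    """Split each 'key: value' line on its first ': ' separator."""
    metadata = {}
    for line in dump.splitlines():
        key, separator, value = line.partition(": ")
        if separator:  # skip lines that are not key/value pairs
            metadata[key] = value
    return metadata


def unmapped_keys(metadata, mapped):
    """Return the metadata keys that have no explicit mapping, sorted."""
    return sorted(key for key in metadata if key not in mapped)


if __name__ == "__main__":
    metadata = parse_metadata(TIKA_OUTPUT)
    print("resourceName:", metadata.get("resourceName"))
    for key in unmapped_keys(metadata, MAPPED_COLUMNS):
        print("no explicit mapping:", key)
```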


As you can see, it does have resourceName metadata. I tried indexing again
but I got the same result: Creation-Date is extracted and indexed just fine,
but resourceName is not. Also, the rest of the attributes are not being
indexed under the attr_ fields.

What's going wrong?

