Posted to solr-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2012/02/09 19:41:44 UTC

Re: Help:Solr can't put all pdf files into index

Tika is not guaranteed to parse every PDF file that a PDF reader can display.
PDF files produced by different "compatible" vendors differ significantly in
how they are constructed, and readers are quite forgiving about displaying
them anyway.

Sometimes you can get around this by re-writing the PDF with an application
whose output Tika is able to handle.

Also, you haven't said what version of Solr you're using. Tika has been
upgraded to 1.0 in the 3.6 build, which has not been released yet. You might
try that build, which you can get from:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/
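
To see which files are being skipped, it may help to compare the file names
on disk against the ids in the index. Below is a minimal sketch in Python,
not a tested tool. It assumes you have exported the indexed ids to a text
file, one per line (for example from a q=*:*&fl=id query). The file name
indexed_ids.txt and the function names are hypothetical. Since your config
maps the file name to id, the comparison is on bare file names:

```python
import os
import re

# Case-insensitive match on a ".pdf" extension, anchored at the end of the name.
PDF_NAME = re.compile(r".*\.pdf$", re.IGNORECASE)


def list_pdfs(base_dir):
    """Return the set of PDF file names found under base_dir, recursively."""
    names = set()
    for _root, _dirs, files in os.walk(base_dir):
        for name in files:
            if PDF_NAME.match(name):
                names.add(name)
    return names


def missing_from_index(on_disk, indexed_ids):
    """Return, sorted, the names present on disk that have no id in the index."""
    return sorted(set(on_disk) - set(indexed_ids))


def report(base_dir, ids_file):
    """Print every PDF under base_dir whose name is absent from ids_file.

    ids_file is a hypothetical export of the index: one id per line.
    """
    with open(ids_file, encoding="utf-8") as fh:
        indexed = {line.strip() for line in fh if line.strip()}
    for name in missing_from_index(list_pdfs(base_dir), indexed):
        print(name)
```

Calling report("H:/pdf/cls_1_16800_OCRed/1", "indexed_ids.txt") would then
list the candidates to try parsing with Tika on their own.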

Best
Erick

2012/2/9 Vivek Shrivastava <vs...@shopzilla.com>:
> I think you need to figure out which files are not making it into the index, and see if you can find a common pattern in those files. Since these are PDF files, please make sure each file's security settings allow content extraction, etc.
>
> Regards,
>
> Vivek
>
> -----Original Message-----
> From: 荣康 [mailto:whuiss_cs2011@163.com]
> Sent: Wednesday, February 08, 2012 11:30 PM
> To: solr-user@lucene.apache.org
> Subject: Help:Solr can't put all pdf files into index
>
> Hey,
> I am using Solr as my search engine to search my PDF files. I have 18219 files (with different file names), all in the same directory. But when I import the files into the index using the DataImport method, Solr reports that it imported only 17233 files. It's very strange. This problem has stalled our project for a few days, and I can't resolve it.
>
>
> Please help me!
>
>
> Schema.xml
>
>
> <fields>
>   <field name="text" type="text" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
>   <field name="filename" type="filenametext" indexed="true" required="true" termVectors="true" termPositions="true" termOffsets="true"/>
>   <field name="id" type="string" stored="true"/>
> </fields>
> <uniqueKey>id</uniqueKey>
> <copyField source="filename" dest="text"/>
>
>
> and
> <dataConfig>
>   <dataSource type="BinFileDataSource" name="bin"/>
>   <document>
>     <entity name="f" processor="FileListEntityProcessor" recursive="true"
>             rootEntity="false" dataSource="null"
>             baseDir="H:/pdf/cls_1_16800_OCRed/1"
>             fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">
>       <entity name="tika-test" processor="TikaEntityProcessor"
>               url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
>         <field column="text" name="text"/>
>       </entity>
>       <field column="file" name="id"/>
>       <field column="file" name="filename"/>
>     </entity>
>   </document>
> </dataConfig>
>
>
>
>
> Sincerely,
> Rong Kang
>
>
>