Posted to solr-user@lucene.apache.org by 荣康 <wh...@163.com> on 2012/02/09 08:30:09 UTC

Help:Solr can't put all pdf files into index

Hey,
I am using Solr as my search engine to search my PDF files. I have 18219 files (all with different file names), and they are all in the same directory. But when I import the files into the index using the DataImport method, Solr reports that it imported only 17233 files. It's very strange. This problem has stopped our project for a few days, and I can't resolve it.

Please help me!


Schema.xml


<fields>
  <field name="text" type="text" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="filename" type="filenametext" indexed="true" required="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="id" type="string" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="filename" dest="text"/>


and the data import configuration:
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" recursive="true"
            rootEntity="false"
            dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1"
            fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
        <field column="text" name="text"/>
      </entity>
      <field column="file" name="id"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
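
A side note on the fileName attribute above: because `|` binds more loosely than concatenation, the pattern reads as `.*\.(PDF)` or `(pdf)` or `(Pdf)` and so on, so only the first alternative is tied to the dot. The Python sketch below illustrates that precedence and a simpler case-insensitive alternative. It assumes the pattern is applied as a substring search, which this thread does not confirm, and the file names are made up. Under that assumption the posted pattern is too permissive rather than too strict, so it is probably not what drops the missing documents.

```python
import re

# Pattern as posted in the thread.
POSTED = re.compile(r".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)")
# Hypothetical replacement: one case-insensitive, end-anchored extension test.
SIMPLER = re.compile(r".*\.pdf$", re.IGNORECASE)


def accepted(pattern, name):
    """Return True if the pattern is found anywhere in the file name."""
    return pattern.search(name) is not None


# Made-up file names for illustration.
names = ["report.pdf", "REPORT.PDF", "my_pdf_notes.txt"]
posted_hits = [n for n in names if accepted(POSTED, n)]
simpler_hits = [n for n in names if accepted(SIMPLER, n)]
print(posted_hits)
print(simpler_hits)
```

Java's regex syntax accepts an inline `(?i)` flag, so the equivalent attribute value would be something like `(?i).*\.pdf$`; treat that as a suggestion to test, not a verified fix.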




Sincerely,
Rong Kang

Re: Help:Solr can't put all pdf files into index

Posted by François Schiettecatte <fs...@gmail.com>.

Have you tried checking any logs?

Have you tried identifying a file which did not make it in and submitting just that one and seeing what happens?
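
One way to follow this suggestion is to compare the file names on disk with the ids that actually reached the index. A minimal Python sketch, assuming the indexed ids have already been exported into a set (for example from a query that returns only the id field); the function name and the sample files are hypothetical.

```python
import os
import tempfile


def missing_from_index(base_dir, indexed_ids):
    """Return sorted names of PDF files under base_dir whose name is not an indexed id."""
    missing = []
    for _root, _dirs, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith(".pdf") and name not in indexed_ids:
                missing.append(name)
    return sorted(missing)


# Tiny self-contained demonstration with made-up files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.pdf", "b.pdf", "c.PDF", "notes.txt"):
        open(os.path.join(demo_dir, name), "wb").close()
    demo_missing = missing_from_index(demo_dir, {"a.pdf", "c.PDF"})
print(demo_missing)
```

Any name this returns is a candidate for a single-file import, where the log is much easier to read.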

François

Re: Help:Solr can't put all pdf files into index

Posted by Michael Kuhlmann <ku...@solarier.de>.

I don't know much about Tika, but this seems to be a bug in PDFBox.

See: https://issues.apache.org/jira/browse/PDFBOX-797

You might also have a look at this:
http://stackoverflow.com/questions/7489206/error-while-parsing-binary-files-mostly-pdf

At least that's what I found when I googled the NPE.

Greetings,
Kuli

Re:Re: Help:Solr can't put all pdf files into index

Posted by Rong Kang <wh...@163.com>.

I tested one file that is missing from the Solr index. Solr responded as below:

...
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-02-10 00:03:23</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>
..

I looked at Tomcat's log file and found this:

Exception in entity : tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:617)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@190725e
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
... 8 more
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 10 more
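
With many failing files, it can help to reduce each logged trace to its innermost cause and see which ones recur. A rough Python sketch, assuming the log is plain text in the format shown above; the helper name is hypothetical and the sample is abbreviated from the trace in this message.

```python
SAMPLE_TRACE = """\
Exception in entity : tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@190725e
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
"""


def root_cause(trace):
    """Return (exception class, first frame) of the last 'Caused by:' section, or None."""
    lines = [line.strip() for line in trace.splitlines()]
    cause_idx = [i for i, line in enumerate(lines) if line.startswith("Caused by:")]
    if not cause_idx:
        return None
    last = cause_idx[-1]
    # The class name is the text after 'Caused by:' up to an optional ': message'.
    exc_class = lines[last][len("Caused by:"):].strip().split(":", 1)[0]
    # The first 'at ...' line after the cause names the method that threw.
    frame = next((l[3:] for l in lines[last + 1:] if l.startswith("at ")), None)
    return exc_class, frame


cause = root_cause(SAMPLE_TRACE)
print(cause)
```

Here the innermost cause is the NullPointerException raised inside PDFBox, not in Solr or Tika code.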

I think this is because Tika can't read the PDF file, or because this PDF file's format has some error. But I can read this PDF file in Adobe Reader.
Regards,

Rong Kang

Re: Help:Solr can't put all pdf files into index

Posted by Michael Kuhlmann <ku...@solarier.de>.

I'd suggest that you check *exactly* which documents are missing from the Solr
index. Or find at least one that's missing, and try to figure out how
this document differs from the ones that can be found in Solr.

Maybe we can then find out what the exact problem is.

Greetings,
-Kuli

Re:Re: Help:Solr can't put all pdf files into index

Posted by Rong Kang <wh...@163.com>.

Yes, I put all the files in one directory, and I have checked the file names with code.

Re: Help:Solr can't put all pdf files into index

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Are you 100% sure that the filename is globally unique, since you use it as the uniqueKey?
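
Because the config uses recursive="true" and maps the `file` column (the bare file name, as far as I know) to `id`, two files with the same name in different subdirectories would overwrite each other in the index. A small Python sketch for checking this, assuming read access to the directory; the function name and the demo layout are hypothetical.

```python
import os
import tempfile
from collections import defaultdict


def duplicate_names(base_dir):
    """Map each file name that occurs more than once under base_dir to its directories."""
    seen = defaultdict(list)
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            seen[name].append(root)
    return {name: sorted(dirs) for name, dirs in seen.items() if len(dirs) > 1}


# Made-up layout: x.pdf exists in two subdirectories, y.pdf in one.
with tempfile.TemporaryDirectory() as demo_dir:
    for sub, name in (("a", "x.pdf"), ("b", "x.pdf"), ("b", "y.pdf")):
        os.makedirs(os.path.join(demo_dir, sub), exist_ok=True)
        open(os.path.join(demo_dir, sub, name), "wb").close()
    dupes = duplicate_names(demo_dir)
    dupe_names = sorted(dupes)
    dupe_counts = {name: len(dirs) for name, dirs in dupes.items()}
print(dupe_names, dupe_counts)
```

An empty result on the real directory would rule out id collisions as the cause of the missing documents.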

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

Re: Help:Solr can't put all pdf files into index

Posted by Erick Erickson <er...@gmail.com>.

Tika is not guaranteed to be able to parse every PDF file that a reader can display. There
are significant differences in how PDF files are constructed by different
"compatible" vendors, and the reader is quite forgiving about still displaying
them.

Sometimes you can get around this by re-writing the PDF with an app whose
output Tika is able to handle.

Also, you haven't said what version of Solr you're using. Tika has been
upgraded to 1.0 in the 3.6 build, which has not been released yet. You might
try using that; you can get the build from:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/

Best
Erick

RE: Help:Solr can't put all pdf files into index

Posted by Vivek Shrivastava <vs...@Shopzilla.com>.

I think you need to figure out which files are not making it into the index, and see if you can find a common pattern in those files. Since these are PDF files, please make sure each file's security settings allow content extraction, etc.
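
A crude way to screen for protected files is to look for an /Encrypt entry, which encrypted PDFs carry in their trailer dictionary. The Python sketch below is only a byte-level heuristic under that assumption: it does not parse the PDF, and it can both miss cases and give false positives. The function name and the sample bytes are made up.

```python
import os
import tempfile


def looks_encrypted(path):
    """Heuristic: True if the raw bytes of the file contain an /Encrypt key."""
    with open(path, "rb") as handle:
        return b"/Encrypt" in handle.read()


# Made-up minimal byte strings standing in for real PDFs.
samples = {
    "plain.pdf": b"%PDF-1.4\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n",
    "locked.pdf": b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF\n",
}
with tempfile.TemporaryDirectory() as demo_dir:
    flagged = []
    for name, data in samples.items():
        path = os.path.join(demo_dir, name)
        with open(path, "wb") as handle:
            handle.write(data)
        if looks_encrypted(path):
            flagged.append(name)
print(flagged)
```

If the flagged files overlap heavily with the files missing from the index, the security settings are worth a closer look.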

Regards,

Vivek
