Posted to solr-user@lucene.apache.org by Srinivas Kashyap <sr...@bamboorose.com.INVALID> on 2020/04/20 08:14:06 UTC

TikaEntityProcessor with DIH

Hi,

We were on Solr 5.2.1 and used TikaEntityProcessor to index PDF documents through DIH, and it was working fine. The jars were tika-core-1.4.jar and tika-parsers-1.4.jar.

Below is my schema.xml (P.S. all field types have been defined):

<fields>
   <field name="title" type="string" indexed="true" stored="true"/>
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="text" type="exact_text" indexed="true" stored="true" />
   <field name="path" type="string" indexed="true" stored="true" />
   <field name="size" type="string" indexed="true" stored="true" />
   <field name="lastmodified" type="string" indexed="true" stored="true" />
   <field name="fileName" type="string" indexed="true" stored="true" />
</fields>

And my tika-data-config.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
        <entity baseDir="C:\help" dataSource="null" fileName=".*\.(PDF)|(pdf)|(doc)|(docx)|(DOC)|(DOCX)|(txt)|(ppt)|(xls)|(csv)" name="f" onError="skip" processor="FileListEntityProcessor" recursive="true" rootEntity="false">
            <field column="fileAbsolutePath" name="path"/>
            <field column="fileSize" name="size"/>
            <field column="fileLastModified" name="lastmodified"/>
            <field column="file" name="fileName"/>
            <entity format="text" name="tika-test" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}">
                <field column="Author" meta="true" name="author"/>
                <field column="title" meta="true" name="title"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
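
For context, the solrconfig.xml side is not shown above. Below is a minimal sketch of how the DIH and Tika jars and the import handler are commonly wired in, modelled on the DIH Tika example that ships with Solr. The directory paths, the jar regexes, and the handler name /dataimport are assumptions for illustration and are not copied from my actual configuration:

```xml
<!-- Assumed locations: the Tika/extraction jars and the DIH jars of the Solr distribution itself. -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<!-- Handler name is an assumption; the config file name is the one shown above. -->
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
```

With a layout like this, the Tika jars on the classpath would be whichever ones sit in the referenced lib directory, so manually copied older jars could end up alongside them.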

We have now upgraded to Solr 8.4.1. When I add the above jars and run the import, only the following fields get indexed:

{
        "fileName":"01 - System-Wide Functions.pdf",
        "size":"2524884",
        "lastmodified":"Mon Jul 15 06:26:52 UTC 2019",
        "path":"D:\\tssindex\\server\\solr\\help\\help\\01 - System-Wide Functions.pdf",
        "text":"",
        "_version_":1664474933885927424},
{

As you can see, the text field is empty, the author and title fields are not indexed at all, and searches on the text field do not return these documents.

Please help me in this regard.


Thanks,
Srinivas

