Posted to dev@manifoldcf.apache.org by "Subasini Rath (JIRA)" <ji...@apache.org> on 2019/02/19 06:29:00 UTC
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

    [ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771607#comment-16771607 ] 

Subasini Rath commented on CONNECTORS-1563:
-------------------------------------------

Hi Karl,
Could you please tell me which field ManifoldCF writes the actual textual content of the document to?

Currently I am using the _text_ field, but I have found that _text_ does not contain only the document's content: some extra values are added in front of the actual content.

In my managed-schema:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="true"/>

After indexing in Solr, the value looks like this (the first 4 lines are prepended before the content of the file):

"title":["NETWORK PLANNING\u0000"],
        "_text_":[" \n \n stream_size 34070  \n X-Parsed-By org.apache.tika.parser.DefaultParser  \n X-Parsed-By org.apache.tika.parser.txt.TXTParser  \n stream_content_type application/pdf  \n stream_name cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n stream_source_info cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n Content-Encoding UTF-8  \n resourceName cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n Content-Type text/plain; charset=UTF-8  \n  \n \n  9.2.1 UNCONTROLLED IF PRINTED Page 1 of 13\nCompany Policy\nNETWORK\nDocument No Amendment No Approved By Approval Date Review Date\n: : : : :\n9.2.1 9 CEO 23/05/2016 23/05/2019\n9.2.1 NETWORK PLANNING\n1.0 POLICY STATEMENT\nThe company will plan the expansion and augmentation of its electrical network to achieve levels of safety, reliability and quality of supply commensurate with community, regulator, customer and shareholder expectations.\nThe company will coordinate its planning with the NSW transmission utility Transgrid and neighbouring distribution utilities to develop effective solutions to satisfy load growth within the company’s supply area and in adjacent franchise areas where the company’s network has influence.\n2.0 PURPOSE\nTo provide principles for planning network

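To make the symptom easier to reproduce, here is a minimal Python sketch that separates the metadata-like lines at the start of the _text_ value from the real document text. It is only an illustration of what I am seeing, not a fix and not anything from ManifoldCF or Solr: the function name split_metadata_prefix and the list of key prefixes are my own assumptions, taken from the sample value above.

```python
# Hypothetical helper: key prefixes are guessed from the sample value above,
# not taken from any Solr or Tika documentation.
KNOWN_KEY_PREFIXES = ("stream_", "X-Parsed-By", "Content-", "resourceName")


def split_metadata_prefix(text):
    """Split a _text_ value into (metadata_lines, body).

    Heuristic (an assumption): the value starts with optional blank lines,
    then a run of non-blank "key value" lines, then at least one blank line,
    then the real document text.
    """
    lines = text.split("\n")
    i = 0
    # Skip leading blank lines.
    while i < len(lines) and not lines[i].strip():
        i += 1
    # If the first non-blank line is not a known metadata key, assume there
    # is no metadata block and return the text unchanged (apart from trimming).
    if i == len(lines) or not lines[i].strip().startswith(KNOWN_KEY_PREFIXES):
        return [], text.strip()
    # Collect the run of non-blank lines as metadata.
    metadata = []
    while i < len(lines) and lines[i].strip():
        metadata.append(lines[i].strip())
        i += 1
    # Skip the blank separator lines before the body.
    while i < len(lines) and not lines[i].strip():
        i += 1
    body = "\n".join(lines[i:]).strip()
    return metadata, body


if __name__ == "__main__":
    # Shortened version of the indexed value shown above.
    sample = (
        " \n \n stream_size 34070  \n"
        " X-Parsed-By org.apache.tika.parser.DefaultParser  \n"
        " stream_content_type application/pdf  \n  \n \n"
        "  9.2.1 UNCONTROLLED IF PRINTED Page 1 of 13\nCompany Policy"
    )
    metadata, body = split_metadata_prefix(sample)
    print(metadata)
    print(body)
```

Running this on a stored _text_ value shows how much of the field is metadata and how much is actual file content.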


Thanks & Regards,
Subasini Rath
O: +91-33 6636-8889 
M: +91 983-1234-341
Email: Subasini.Rath@endeavourenergy.com.au



> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" parameter, and I am then getting an error in Solr, i.e. null:org.apache.solr.common.SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> If I ignore the Tika exception, my documents get indexed but don't have a content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor in the ManifoldCF pipeline.
> Please help me with this problem
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)