Posted to solr-user@lucene.apache.org by Bisonti Mario <Ma...@vimar.com> on 2018/10/11 15:06:16 UTC

Tika and Solr : rejected document due to mime type restrictions

Hello.
I start the Tika server from the command line:
java -jar /opt/tika/tika-server-1.19.1.jar

I configured a connector to Solr with ManifoldCF.

When I start ingesting PDF and .xls documents, I see the following in the Tika server log:

INFO  Setting the server's publish address to be http://localhost:9998/
INFO  Logging initialized @1053ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 10.0.2+13-Ubuntu-1ubuntu0.18.04.2
INFO  Started ServerConnector@f74e835{HTTP/1.1,[http/1.1]}{localhost:9998}
INFO  Started @1134ms
WARN  Empty contextPath
INFO  Started o.e.j.s.h.ContextHandler@68d6972f{/,null,AVAILABLE}
INFO  Started Apache Tika server at http://localhost:9998/
INFO  meta (application/pdf)
INFO  meta (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-Black'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'ArialMT'
WARN  Using fallback font 'LiberationSans' for 'CourierNewPSMT'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
INFO  tika (application/pdf)
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-Black'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN  Using fallback font 'LiberationSans' for 'Arial-BoldMT'
WARN  Using fallback font 'LiberationSans' for 'ArialMT'
WARN  Using fallback font 'LiberationSans' for 'CourierNewPSMT'
WARN  Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
INFO  tika (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)

So it seems that the Tika server processes the documents, but the Solr server doesn't ingest them.

I get these errors:
Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
Solr connector rejected document due to mime type restrictions: (application/pdf)

My understanding was that Tika converts all documents to text so that they can be indexed in Solr. Or are there restrictions on MIME types in the Tika server?
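
One way to check the Tika side independently of ManifoldCF is to send a file straight to the server and look at what comes back. The following is a minimal sketch, assuming the server is reachable at the publish address shown in the log (http://localhost:9998/) and using the Tika server's PUT /tika and PUT /meta endpoints. The helper names and the example file path are hypothetical.

```python
import urllib.request

# Assumption: publish address taken from the Tika server log above.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data, endpoint="tika", accept="text/plain"):
    """Build a PUT request for the Tika server.

    endpoint="tika" asks for the extracted text;
    endpoint="meta" with accept="application/json" asks for metadata.
    """
    return urllib.request.Request(
        f"{TIKA_URL}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract_text(path):
    """Send the file at `path` to Tika and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (hypothetical path, requires a running Tika server):
#   print(extract_text("/tmp/sample.pdf"))
```

If this returns text for the same PDF and spreadsheet files, the extraction step works, and the rejection is happening elsewhere in the pipeline.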

Thanks a lot

Mario

Re: Tika and Solr : rejected document due to mime type restrictions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/11/2018 9:06 AM, Bisonti Mario wrote:
> I start the Tika server from the command line:
> java -jar /opt/tika/tika-server-1.19.1.jar
>
> I configured a connector to Solr with ManifoldCF.
>
> When I start ingesting PDF and .xls documents, I see the following in the Tika server log:
<snip>
> So it seems that the Tika server processes the documents, but the Solr server doesn't ingest them.
>
> I get these errors:
> Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
> Solr connector rejected document due to mime type restrictions: (application/pdf)

Those errors are not coming from Solr.  Do you see any errors in 
solr.log?  If you do, then we can help you with those.

Since ManifoldCF calls its components connectors, I am betting the 
errors are being generated by ManifoldCF, and that for those documents, 
nothing has actually been sent to Solr, so you won't see errors in the 
solr.log for those files.  ManifoldCF is a separate project within 
Apache, which has its own support infrastructure.

https://manifoldcf.apache.org/en_US/mail.html
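
To confirm whether Solr itself logged anything for these documents, solr.log can be scanned for ERROR and WARN entries. This is only a rough sketch: it assumes each log entry starts with a date and time followed by the level, which depends on the logging layout configured for your Solr install, and the log path in the usage comment is hypothetical.

```python
import re

# Assumption: entries look like
#   "2018-10-11 15:06:16.123 ERROR (thread) [core] o.a.s.SomeClass message"
# Adjust the pattern if your logging layout differs.
LEVEL_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \S+ (ERROR|WARN)\b")


def find_problems(lines):
    """Return the log lines that start an ERROR or WARN entry."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.match(line)]


# Usage (hypothetical path):
#   with open("/var/solr/logs/solr.log", encoding="utf-8") as log:
#       for entry in find_problems(log):
#           print(entry)
```

If nothing turns up around the time of the crawl, that supports the idea that the rejected documents never reached Solr.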

Thanks,
Shawn