Posted to user@manifoldcf.apache.org by msaunier <ms...@citya.com> on 2018/01/09 09:32:42 UTC

Document connector excluding mime type and size - Tika Parser error

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

In *Output connectors*, with the Solr connector, I have set a « Maximum document
length » and « Excluded mime types » (5 mime types), but it does not work. I have attached a screenshot.

 

----------

Second, I get a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,


RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
Ok. The aim of putting it in the connector was mainly not to have to repeat the operation for the 300 jobs in production.

 

Regards,

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:44
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

- Microsoft Word 97-2003
- application/msword

I can't get more information: they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index (both ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika 1.17.

I can't send a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, it uses a newer version of Tika (1.17), which might (or might not) address this problem.

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl
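
One way to narrow down the "missing jar or classloader" question is to list which jar files under the installation actually contain the class named in the NoSuchMethodError. The following is only a minimal sketch, not part of ManifoldCF: the function name jars_containing is made up for illustration, and the installation root you pass in is your own assumption. It relies on nothing more than the fact that a jar is a zip archive. If more than one jar turns up, a version conflict is a plausible suspect.

```python
import os
import zipfile


def jars_containing(root, class_name):
    """Return sorted paths of .jar files under root that contain class_name.

    class_name is a dotted Java class name, for example
    "org.apache.poi.hwmf.record.HwmfFont".
    """
    # Inside a jar, a class is stored as a path with slashes and a .class suffix.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        hits.append(path)
            except zipfile.BadZipFile:
                # Skip files named .jar that are not valid zip archives.
                continue
    return sorted(hits)
```

Calling jars_containing with the ManifoldCF installation directory and "org.apache.poi.hwmf.record.HwmfFont" returns the candidate jars; which of them the classloader actually picks is not something this sketch can tell you.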

 

 


RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
The problem is:

- If I check « Extracting Update Handler », Tika does not send the content field.
- If I uncheck it, I get the content but not the « Exclude mime type » filtering.

So, for the moment, I prefer to leave it unchecked and use « Allowed documents ».

I'm going to write a script that uses the API to modify all the crawlers at once and add the « Allowed documents » condition.

 

Thanks.
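
The core of such a script is placing the new stage ahead of the Tika stage in every job, without adding it twice when the script is re-run. Below is a rough sketch of that step only. It works on a simplified, hypothetical in-memory model (a list of dicts with a "connection" key, in pipeline order); the real job definitions returned by the ManifoldCF API are structured differently and would need to be mapped to and from this form. The function name and the connection names are placeholders.

```python
def add_stage_before(stages, new_stage, before_connection):
    """Return a new pipeline with new_stage inserted before before_connection.

    stages is a list of dicts in pipeline order, each with a "connection" key
    (a simplified, hypothetical model, not the real ManifoldCF job format).
    If a stage with the same connection name is already present anywhere in
    the pipeline, an unchanged copy is returned, so re-running is harmless.
    """
    names = [stage["connection"] for stage in stages]
    if new_stage["connection"] in names:
        return list(stages)
    if before_connection not in names:
        raise ValueError("no stage uses connection %r" % before_connection)
    index = names.index(before_connection)
    # Build a new list instead of mutating the caller's pipeline.
    return stages[:index] + [new_stage] + stages[index:]
```

For example, a pipeline of Tika followed by Solr becomes Allowed documents, Tika, Solr, and a second call leaves it as is. Applying this across 300 jobs would then be a loop that reads each job, transforms its pipeline, and writes it back.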

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Tuesday, January 9, 2018 16:09
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

 

File info:
- MIME type: application/zip
- Endianness: Little endian

File "[Content_Types].xml":
- File name: [Content_Types].xml
- File size: 2696 bytes
- Compressed file size: 495 bytes
- Compression rate: 5.4x
- Creation date: 1980-01-01 00:00:00
- Compression: Deflate

File "_rels/.rels":
- File name: _rels/.rels
- File size: 590 bytes
- Compressed file size: 243 bytes
- Compression rate: 2.4x
- Creation date: 1980-01-01 00:00:00
- Compression: Deflate

File "word/_rels/document.xml.rels":
- File name: word/_rels/document.xml.rels
- File size: 2295 bytes
- Compressed file size: 431 bytes
- Compression rate: 5.3x
- Creation date: 1980-01-01 00:00:00
- Compression: Deflate

File "word/document.xml":
- File name: word/document.xml
- File size: 47.2 KB
- Compressed file size: 5833 bytes
- Compression rate: 8.3x
- Creation date: 1980-01-01 00:00:00
- Compression: Deflate

File "word/_rels/header1.xml.rels":
- File name: word/_rels/header1.xml.rels
- File size: 421 bytes
- Compressed file size: 197 bytes
- Compression rate: 2.1x
- Creation date: 1980-01-01 00:00:00
- Compression: Deflate

File "word/_rels/footer2.xml.rels":
- File name: word/_rels/footer2.xml.rels
- File size: 290 bytes
- Compressed file size: 186 bytes
- Compression rate: 1.6x
- Creation date: 1980-01-01 00:00:00
- Compression: Deflate

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Tuesday, January 9, 2018 16:02
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

The 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.

 

Karl
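
The directory check described above can be automated with a short script. This is only a sketch under stated assumptions: the function names are invented for illustration, the installation root is whatever path you supply, and the only fact taken from the message is that all poi* jars are expected in a directory named connector-common-lib.

```python
import fnmatch
import os


def find_poi_jars(root):
    """Return {directory: [jar names]} for every poi*.jar found under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        jars = sorted(f for f in filenames if fnmatch.fnmatch(f, "poi*.jar"))
        if jars:
            found[dirpath] = jars
    return found


def misplaced_poi_jars(root, expected_dir="connector-common-lib"):
    """Return only the entries whose directory is not named expected_dir."""
    return {
        directory: jars
        for directory, jars in find_poi_jars(root).items()
        if os.path.basename(os.path.normpath(directory)) != expected_dir
    }
```

An empty result from misplaced_poi_jars means every poi* jar sits in a directory with the expected name; any other entry is a location worth inspecting by hand.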

 

 


RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
File info :

- MIME type: application/zip

- Endianness: Little endian

File "[Content_Types].xml":

- File name: [Content_Types].xml

- File size: 2696 bytes

- Compressed file size: 495 bytes

- Compression rate: 5.4x

- Creation date: 1980-01-01 00:00:00

- Compression: Deflate

File "_rels/.rels":

- File name: _rels/.rels

- File size: 590 bytes

- Compressed file size: 243 bytes

- Compression rate: 2.4x

- Creation date: 1980-01-01 00:00:00

- Compression: Deflate

File "word/_rels/document.xml.rels":

- File name: word/_rels/document.xml.rels

- File size: 2295 bytes

- Compressed file size: 431 bytes

- Compression rate: 5.3x

- Creation date: 1980-01-01 00:00:00

- Compression: Deflate

File "word/document.xml":

- File name: word/document.xml

- File size: 47.2 KB

- Compressed file size: 5833 bytes

- Compression rate: 8.3x

- Creation date: 1980-01-01 00:00:00

- Compression: Deflate

File "word/_rels/header1.xml.rels":

- File name: word/_rels/header1.xml.rels

- File size: 421 bytes

- Compressed file size: 197 bytes

- Compression rate: 2.1x

- Creation date: 1980-01-01 00:00:00

- Compression: Deflate

File "word/_rels/footer2.xml.rels":

- File name: word/_rels/footer2.xml.rels

- File size: 290 bytes

- Compressed file size: 186 bytes

- Compression rate: 1.6x

- Creation date: 1980-01-01 00:00:00

- Compression: Deflate

 

 

 

De : msaunier [mailto:msaunier@citya.com] 
Envoyé : mardi 9 janvier 2018 16:02
À : user@manifoldcf.apache.org
Objet : RE: Document connector excluding mime type and size - Tika Parser error

 

They 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 differents servers.

 



 

 

 

De : Karl Wright [mailto:daddywri@gmail.com] 
Envoyé : mardi 9 janvier 2018 15:54
À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

They document for Tika are :

·        Microsoft Word 97-2003

·        Application/msword

 

I can’t have more informations, they are in SCO servers and SCO do not have ls –lisan or stat command.

 

For SolR connecting, I seem to have emptied the index before the last indexation. (ManifoldCF and Solr) I do it again to be sure.

 

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 15:26


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

Okay good. I look if I can test 1.17 Tika version.

 

I can’t transfert a document with this error, they are privates. Sorry.

 

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 15:12


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

It’s a 2.9 version.

 

I have a 2.8.1 in an other server with same job and same documents. I will test on this other server and make you a return.

 

Thanks for your help.

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 13:15
À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

In *Outputs connectors* with Solr connector. I have add a « Maximum document length and I have « Excluded 5 mime types » but it not work. I join capture.

 

----------

And in second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked :

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create a ticket for this?

 

----------

 

Thanks for your help.

 

Regards,

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
 

Ok. I'll confirm that tomorrow.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 18:09
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

No Tika error is good, but have a look at Simple History to be sure documents were actually processed.  If you can confirm that, I'll kick off the patch process.
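As a complement to Simple History, the Solr side can be checked by asking the index how many documents it holds. The sketch below is only an illustration: it assumes Solr's standard `select` handler with a JSON response, and the host and collection names in the usage note are hypothetical, not taken from this thread.

```python
import json
import urllib.parse
import urllib.request


def build_count_url(base_url: str, collection: str) -> str:
    """Build a select URL that matches every document but returns no rows."""
    query = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{collection}/select?{query}"


def parse_num_found(body: str) -> int:
    """Extract the total hit count from a Solr JSON select response."""
    return int(json.loads(body)["response"]["numFound"])


def count_documents(base_url: str, collection: str) -> int:
    """Ask Solr how many documents the collection currently holds."""
    url = build_count_url(base_url, collection)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_num_found(resp.read().decode("utf-8"))
```

Calling `count_documents("http://localhost:8983/solr", "mycollection")` before and after a crawl (both arguments are placeholders) shows whether anything was actually committed. A count that does not move while Simple History reports ingestions would point at the Solr side or the commit settings rather than at Tika.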

 

Karl

 

 

On Thu, Jan 11, 2018 at 11:26 AM, msaunier <msaunier@citya.com> wrote:

Ok, so.

With the same configuration but with Tika 1.17:

·        No Tika error

·        But no documents are sent to Solr. I don't understand why. I am investigating.

 

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, January 11, 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

 

The crawl is running at the moment. I think it will be finished in 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the WindowsShare connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any idea?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

 

Good!

I will configure and test that.

I will report back as soon as the run is finished.

400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I get an error with the maven build, so I tested with an external Tika 1.17 server, but POI is not included. If you manage to run mvn package successfully with Tika 1.17, I am interested.

I have not had much time to deal with it today.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

Test by checking out and building trunk, with POI 3.17 and Tika 1.17?

That is possible.

I will test that once I have finished another project.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.
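To see why such a mismatch surfaces as a NoSuchMethodError, it helps to read the error text as a JVM method reference: class, method name, parameter descriptor and return-type descriptor. The JVM links calls on the full descriptor, so a method that was renamed or whose return type changed between library versions is "missing" for callers compiled against the older jar. The sketch below only decodes the message; the helper names are made up for illustration, and it makes no claim about what exactly changed in POI.

```python
import re

# "<class>.<method>(<parameter descriptors>)<return descriptor>",
# the form in which NoSuchMethodError prints the unresolved method.
_ERROR_RE = re.compile(
    r"^(?P<cls>[\w.$]+)\.(?P<method>\w+)\((?P<params>[^)]*)\)(?P<ret>.+)$"
)

_PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "C": "char", "S": "short",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def type_name(descriptor: str) -> str:
    """Turn a JVM type descriptor such as 'Lfoo/Bar$Baz;' into 'foo.Bar$Baz'."""
    dims = 0
    while descriptor.startswith("["):  # each leading '[' is one array dimension
        descriptor = descriptor[1:]
        dims += 1
    if descriptor.startswith("L") and descriptor.endswith(";"):
        name = descriptor[1:-1].replace("/", ".")
    else:
        name = _PRIMITIVES[descriptor]
    return name + "[]" * dims


def parse_missing_method(message: str) -> dict:
    """Split a NoSuchMethodError message into its parts.

    'params' is left as the raw descriptor string; only the return
    type is decoded into a readable class name.
    """
    match = _ERROR_RE.match(message.strip())
    if match is None:
        raise ValueError(f"not a method reference: {message!r}")
    return {
        "class": match["cls"],
        "method": match["method"],
        "params": match["params"],
        "returns": type_name(match["ret"]),
    }
```

Applied to the message in this thread, the result names `org.apache.poi.hwmf.record.HwmfFont` as the class, `getCharSet` as the method, no parameters, and a return type of `HwmfFont$WmfCharset`. That is the exact signature the Tika parser was compiled against and that the POI jar on the classpath does not provide.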

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 different servers.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
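The directory check described above can be scripted. This is a minimal sketch, assuming only what the message states: every poi* jar should sit in a directory named connector-common-lib. The function name and the installation path in the usage note are hypothetical.

```python
import os
from pathlib import Path


def find_misplaced_poi_jars(install_root: str,
                            expected_dir: str = "connector-common-lib") -> list:
    """Return poi*.jar paths whose immediate parent directory is not expected_dir."""
    misplaced = []
    for dirpath, _dirnames, filenames in os.walk(install_root):
        for name in filenames:
            if name.startswith("poi") and name.endswith(".jar"):
                # Only the immediate parent directory name is compared.
                if Path(dirpath).name != expected_dir:
                    misplaced.append(str(Path(dirpath) / name))
    return sorted(misplaced)
```

Running `find_misplaced_poi_jars("/opt/manifoldcf")` (the path is a placeholder for the real installation root) should return an empty list on a clean install. Any path it reports is a candidate for the stale-jar problem described above.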

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.
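Why the order of the pipeline stages matters can be shown with a toy model. This is not ManifoldCF code: the classes and stage functions are invented for illustration, and the assumption that stages after extraction see plain text rather than the original mime type is a simplification of the behaviour described above.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Doc:
    name: str
    mime_type: str


# A stage returns the (possibly modified) document, or None to reject it.
Stage = Callable[[Doc], Optional[Doc]]


def allowed_documents(excluded: set) -> Stage:
    """Toy filter: reject documents whose mime type is in the excluded set."""
    def stage(doc: Doc) -> Optional[Doc]:
        return None if doc.mime_type in excluded else doc
    return stage


def text_extractor(doc: Doc) -> Doc:
    """Toy stand-in for an extractor: downstream stages only see plain text."""
    return replace(doc, mime_type="text/plain")


def run_pipeline(doc: Doc, stages: List[Stage]) -> Optional[Doc]:
    """Pass the document through each stage in order, stopping on rejection."""
    current: Optional[Doc] = doc
    for stage in stages:
        current = stage(current)
        if current is None:
            return None
    return current
```

With the filter placed first, `run_pipeline(Doc("a.zip", "application/zip"), [allowed_documents({"application/zip"}), text_extractor])` rejects the document. With the two stages swapped, the filter only ever sees `text/plain` and lets everything through, which is the symptom reported for the exclusion list.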

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information: they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

As for the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

 

Thanks for your help.

 


Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Thanks for the test!
Karl


On Fri, Jan 12, 2018 at 12:50 PM, msaunier <ms...@citya.com> wrote:

> Ok. 2.9.1 works fine. No bugs, no problems. 400k documents indexed.
>
>
>
> Thanks.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, January 12, 2018 17:33
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika Parser error
>
>
>
> Can't help you with Solr exceptions -- sorry!
>
> Karl
>
>
>
> On Fri, Jan 12, 2018 at 11:20 AM, msaunier <ms...@citya.com> wrote:
>
> With the binary release, it works well. I think I will finish the test at 18:00.
>
> PS: Do you know this error in Solr?
>
>
>
> 2018-01-12 16:19:50.175 WARN  (qtp754666084-28157) [   ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_xwd]
> java.nio.file.NoSuchFileException: /var/solr/data/citya_shard5_replica1/data/index/segments_xwd
>         at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>         at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>         at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>         at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
>         at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
>         at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
>         at java.nio.file.Files.readAttributes(Files.java:1737)
>         at java.nio.file.Files.size(Files.java:2332)
>         at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
>         at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128)
>         at org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
>         at org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
>         at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:348)
>         at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:48)
>         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)
>         at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
>         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
>         at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:748)
>         at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:729)
>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:510)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.server.Server.handle(Server.java:534)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
>         at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, January 12, 2018 16:52
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika Parser error
>
>
>
> You can download the bin version here:
>
>
>
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.9.1
>
>
>
> Karl
>
>
>
>
>
> On Fri, Jan 12, 2018 at 10:50 AM, msaunier <ms...@citya.com> wrote:
>
> Solr has no unusual log entries (file attached).
>
> I ran tail -f on the Solr and ManifoldCF logs, but there are no errors and no warnings, only info messages, and nothing really important.
>
> The configuration is the same: same properties.xml, same logging.xml, same start-options.env.unix, same start script, same database. But in 2.9.0 it works, and in '2.9.1' there is no commit.
>
> I don't understand why.
>
> Without the Tika transformation, it does not work either.
>
> So, if you have only changed the Tika connector, I don't understand why this version has this bug.
>
> Could you release a binary version with this hotfix?
>
>
>
> Thanks
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, January 12, 2018 16:13
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika Parser error
>
>
>
> What does the solr log look like?  There should be [INFO] statements you
> can inspect to see what's being sent, and its size.
>
>
>
> The change to go to Tika 1.17 will not have affected the Solr connector in
> any way.  So one way to decide if this change might be the problem is to
> run a job that doesn't use Tika -- like a simple file directory crawl that
> has only text documents in it, and see if that makes it to Solr.  If not,
> then you must have a configuration difference somewhere.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Jan 12, 2018 at 9:44 AM, msaunier <ms...@citya.com> wrote:
>
> I finished the tests.
>
> I no longer get any errors with Tika.
>
> However, the data is not committed to Solr. I have no error in the logs. The configuration is the same. I tried with 2 different databases and the same configuration, and I tried with the same database, but no data arrives in Solr. Is that normal?
>
> On 2.9.0, the commit works well.
>
>
>
> Regards,
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Friday, January 12, 2018 11:52
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika Parser error
>
>
>
> Hello Karl,
>
>
>
> I have one last test to do. I will run it between noon and two o'clock and let you know whether this patch works.
>
>
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 15:26
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
> Okay good. I look if I can test 1.17 Tika version.
>
>
>
> I can’t transfert a document with this error, they are privates. Sorry.
>
>
>
> If I encounter the error again on a non-private document, I'll come back
> to you.
>
>
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 15:12
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket, I'd greatly appreciate attaching one
> document to it that causes the failure.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>
> It’s a 2.9 version.
>
>
>
> I have a 2.8.1 in an other server with same job and same documents. I will
> test on this other server and make you a return.
>
>
>
> Thanks for your help.
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 13:15
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Outputs connectors** with Solr connector. I have add a « Maximum
> document length and I have « Excluded 5 mime types » but it not work. I
> join capture.
>
>
>
> ----------
>
> And in second, I have a **Tika exception** in ManifoldCF. 3 documents are
> blocked :
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed:
> org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/
> apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.
> HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> ~[?:?]
>
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> ~[?:?]
>
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtract
> or.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.
> OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.
> TikaParser.parse(TikaParser.java:74) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.
> addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithExcept
> ion(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineObjectWithVersions.
> addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$
> ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$
> ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.
> SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
>
>
> I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Cordialement,
>
>
>
> [image: msaunier]
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
OK, 2.9.1 works fine. No bugs, no problems; 400k documents indexed.

 

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 12, 2018 17:33
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Can't help you with Solr exceptions -- sorry!

Karl

 

On Fri, Jan 12, 2018 at 11:20 AM, msaunier <msaunier@citya.com> wrote:

With the binary release, it works well. I think the test will finish at 18:00.

PS: Do you know this error in Solr?

 

2018-01-12 16:19:50.175 WARN  (qtp754666084-28157) [   ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_xwd]

java.nio.file.NoSuchFileException: /var/solr/data/citya_shard5_replica1/data/index/segments_xwd

        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)

        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)

        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)

       at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)

        at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)

        at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)

        at java.nio.file.Files.readAttributes(Files.java:1737)

        at java.nio.file.Files.size(Files.java:2332)

        at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)

        at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128)

        at org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)

        at org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)

        at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:348)

        at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:48)

        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)

        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)

        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)

        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)

        at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:748)

        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:729)

        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:510)

        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)

        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)

        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)

        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)

        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)

        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)

        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)

        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)

        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)

        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)

        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)

        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)

        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)

        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)

        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)

        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)

        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)

        at org.eclipse.jetty.server.Server.handle(Server.java:534)

        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)

        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)

        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)

        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)

        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)

        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)

        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)

        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)

        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)

        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)

        at java.lang.Thread.run(Thread.java:748)

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 12, 2018 16:52
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

You can download the bin version here:

 

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.9.1

 

Karl

 

 

On Fri, Jan 12, 2018 at 10:50 AM, msaunier <msaunier@citya.com> wrote:

Solr has no notable log entries (file attached).

I ran tail -f on the Solr and ManifoldCF logs, but there are no errors and no warnings, only INFO entries that are not really significant.

The configuration is the same: same properties.xml, same logging.xml, same start-options.env.unix, same start script, same database. But with 2.9.0 it works, and with '2.9.1' nothing is committed.

I don't understand why.

It does not work without the Tika transformation either.

So if you have only changed the Tika connector, I don't understand why this version has this bug.

Could you release a binary version with this hotfix?

Thanks

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 12, 2018 16:13
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

What does the solr log look like?  There should be [INFO] statements you can inspect to see what's being sent, and its size.

 

The change to go to Tika 1.17 will not have affected the Solr connector in any way.  So one way to decide if this change might be the problem is to run a job that doesn't use Tika -- like a simple file directory crawl that has only text documents in it, and see if that makes it to Solr.  If not, then you must have a configuration difference somewhere.
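
One way to prepare input for such a Tika-free test crawl is a small directory of plain-text files, sketched below. The function name, the file count, and the file contents are arbitrary choices made for illustration; pointing a file system repository connection at the directory is still done in the ManifoldCF UI.

```shell
# Hypothetical helper (not part of ManifoldCF): create N small plain-text
# files that a simple file directory crawl can pick up.
# Usage: make_text_corpus /tmp/mcf-text-test 5
make_text_corpus() {
  dir="$1"
  count="${2:-5}"          # default to 5 files when no count is given
  mkdir -p "$dir" || return 1
  i=1
  while [ "$i" -le "$count" ]; do
    printf 'Test document %s for a Tika-free crawl.\n' "$i" > "$dir/doc$i.txt"
    i=$((i + 1))
  done
}
```

If these files reach Solr from a job whose pipeline has no Tika transformation, the output connection itself is working and the difference lies elsewhere in the configuration.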

 

Karl

 

 

On Fri, Jan 12, 2018 at 9:44 AM, msaunier <msaunier@citya.com> wrote:

I finished the tests.

I no longer get any errors with Tika.

However, the data cannot be committed to Solr, and I have no error log. The configuration is the same. I tried with two different databases and the same configuration, and I also tried with the same database, but no data arrives in Solr. Is that normal?

With 2.9.0, the commit works well.

Regards,

From: msaunier [mailto:msaunier@citya.com]
Sent: Friday, January 12, 2018 11:52
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Hello Karl,

I have one last test to run. I will do it between noon and two o'clock and let you know whether the patch works.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 18:09
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

No Tika error is good, but have a look at Simple History to be sure documents were actually processed.  If you can confirm that, I'll kick off the patch process.

 

Karl

 

 

On Thu, Jan 11, 2018 at 11:26 AM, msaunier <msaunier@citya.com> wrote:

With the same configuration but Tika 1.17:

·        No Tika error

·        But no documents are sent to Solr. I don't understand why; I am investigating.

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, January 11, 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

The crawl is running at the moment. I think it will finish in 30 minutes.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the Windows Share connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is not offered in the ManifoldCF interface.

Any idea?

Thanks.

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Good!

I will configure and test that.

I will report back as soon as the crawl has finished: 400k documents.

If it works, I will test on a few million documents.

Thanks.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory
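
The steps above can be collected into one shell function, sketched here under some assumptions: `svn` and `ant` are on the PATH, and the checkout goes into a directory named `trunk`. The function name `build_mcf_trunk` and its `yes`/`no` argument are illustrative and not part of ManifoldCF. The optional `ant make-deps` call is the extra step mentioned elsewhere in this thread for the jcifs connector.

```shell
# Hypothetical wrapper around the build steps listed above.
# The body runs in a subshell, so the caller's working directory is unchanged.
# Usage: build_mcf_trunk        (core dependencies only)
#        build_mcf_trunk yes    (also run "ant make-deps", e.g. for jcifs)
build_mcf_trunk() (
  with_all_deps="${1:-no}"
  # Check out trunk into ./trunk
  svn co https://svn.apache.org/repos/asf/manifoldcf/trunk trunk || exit 1
  cd trunk || exit 1
  # Download core dependencies
  ant make-core-deps || exit 1
  # Optionally download the remaining dependencies
  if [ "$with_all_deps" = "yes" ]; then
    ant make-deps || exit 1
  fi
  # Build with ant (not maven)
  ant build || exit 1
  echo "Deliverable is in $(pwd)/dist"
)
```

Nothing runs until the function is called, and it stops at the first step that fails.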

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I get an error with the Maven build, so I tested with an external Tika Server 1.17, but POI is not included. If you manage to run mvn package with Tika 1.17, I am interested.

Today, I have not had much time to deal with it.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

Check out and build with POI 3.17 and Tika 1.17 to test?

That is possible.

I will finish a project first and then test it.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions (2.8.1 and 2.9) of ManifoldCF are on two different servers.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
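
A minimal shell sketch of the check described above, assuming a Unix-like host with `find`. The function name `find_stray_poi_jars` and the install-root argument are hypothetical; only the `connector-common-lib` directory name comes from the message.

```shell
# Hypothetical helper (not part of ManifoldCF): list poi*.jar files that
# live anywhere other than a connector-common-lib directory.
# Usage: find_stray_poi_jars /path/to/manifoldcf/install
find_stray_poi_jars() {
  root="${1:-.}"
  # Print every poi*.jar under the install root, skipping those located
  # under a connector-common-lib directory.
  find "$root" -type f -name 'poi*.jar' ! -path '*/connector-common-lib/*'
}
```

Any path it prints is a candidate for the misplaced jar; empty output means every poi jar found sits under connector-common-lib.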

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information; they are on SCO servers, and SCO does not have the ls -lisan or stat command.

For the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

OK, good. I will see whether I can test Tika version 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It is version 2.9.

I have a 2.8.1 installation on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector, I have set a « Maximum document length » and « Excluded mime types » (5 of them), but they do not work. I attach a screenshot.

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Can't help you with Solr exceptions -- sorry!
Karl

On Fri, Jan 12, 2018 at 11:20 AM, msaunier <ms...@citya.com> wrote:

> With the bin version, it works well. I think I will finish the test at 18:00.
>
>
>
> PS: Do you know this error in Solr?
>
>
>
> 2018-01-12 16:19:50.175 WARN  (qtp754666084-28157) [   ]
> o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_xwd]
>
> java.nio.file.NoSuchFileException: /var/solr/data/citya_shard5_
> replica1/data/index/segments_xwd
>
>         at sun.nio.fs.UnixException.translateToIOException(
> UnixException.java:86)
>
>         at sun.nio.fs.UnixException.rethrowAsIOException(
> UnixException.java:102)
>
>         at sun.nio.fs.UnixException.rethrowAsIOException(
> UnixException.java:107)
>
>        at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(
> UnixFileAttributeViews.java:55)
>
>         at sun.nio.fs.UnixFileSystemProvider.readAttributes(
> UnixFileSystemProvider.java:144)
>
>         at sun.nio.fs.LinuxFileSystemProvider.readAttributes(
> LinuxFileSystemProvider.java:99)
>
>         at java.nio.file.Files.readAttributes(Files.java:1737)
>
>         at java.nio.file.Files.size(Files.java:2332)
>
>         at org.apache.lucene.store.FSDirectory.fileLength(
> FSDirectory.java:243)
>
>         at org.apache.lucene.store.NRTCachingDirectory.fileLength(
> NRTCachingDirectory.java:128)
>
>         at org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(
> LukeRequestHandler.java:615)
>
>         at org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(
> LukeRequestHandler.java:588)
>
>         at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(
> CoreAdminOperation.java:348)
>
>         at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.
> java:48)
>
>         at org.apache.solr.handler.admin.CoreAdminOperation.execute(
> CoreAdminOperation.java:384)
>
>         at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.
> call(CoreAdminHandler.java:388)
>
>         at org.apache.solr.handler.admin.CoreAdminHandler.
> handleRequestBody(CoreAdminHandler.java:174)
>
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(
> RequestHandlerBase.java:173)
>
>         at org.apache.solr.servlet.HttpSolrCall.handleAdmin(
> HttpSolrCall.java:748)
>
>         at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(
> HttpSolrCall.java:729)
>
>         at org.apache.solr.servlet.HttpSolrCall.call(
> HttpSolrCall.java:510)
>
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:361)
>
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:305)
>
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1691)
>
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(
> ServletHandler.java:582)
>
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:143)
>
>         at org.eclipse.jetty.security.SecurityHandler.handle(
> SecurityHandler.java:548)
>
>         at org.eclipse.jetty.server.session.SessionHandler.
> doHandle(SessionHandler.java:226)
>
>         at org.eclipse.jetty.server.handler.ContextHandler.
> doHandle(ContextHandler.java:1180)
>
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(
> ServletHandler.java:512)
>
>         at org.eclipse.jetty.server.session.SessionHandler.
> doScope(SessionHandler.java:185)
>
>         at org.eclipse.jetty.server.handler.ContextHandler.
> doScope(ContextHandler.java:1112)
>
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:141)
>
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.
> handle(ContextHandlerCollection.java:213)
>
>         at org.eclipse.jetty.server.handler.HandlerCollection.
> handle(HandlerCollection.java:119)
>
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> HandlerWrapper.java:134)
>
>         at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(
> RewriteHandler.java:335)
>
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> HandlerWrapper.java:134)
>
>         at org.eclipse.jetty.server.Server.handle(Server.java:534)
>
>         at org.eclipse.jetty.server.HttpChannel.handle(
> HttpChannel.java:320)
>
>         at org.eclipse.jetty.server.HttpConnection.onFillable(
> HttpConnection.java:251)
>
>         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(
> AbstractConnection.java:273)
>
>         at org.eclipse.jetty.io.FillInterest.fillable(
> FillInterest.java:95)
>
>         at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(
> SelectChannelEndPoint.java:93)
>
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.
> executeProduceConsume(ExecuteProduceConsume.java:303)
>
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.
> produceConsume(ExecuteProduceConsume.java:148)
>
>         at org.eclipse.jetty.util.thread.strategy.
> ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
> QueuedThreadPool.java:671)
>
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
> QueuedThreadPool.java:589)
>
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, January 12, 2018 16:52
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> You can download the bin version here:
>
>
>
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.9.1
>
>
>
> Karl
>
>
>
>
>
> On Fri, Jan 12, 2018 at 10:50 AM, msaunier <ms...@citya.com> wrote:
>
> Solr has no special log entries. (File attached.)
>
> I ran tail -f on the Solr and ManifoldCF logs: no errors, no warnings, only
> INFO messages, and nothing really important.
>
> The configuration is the same. Same properties.xml, same logging.xml, same
> start-options.env.unix, same start script, same database. But with 2.9.0 it
> works; with '2.9.1' nothing is committed.
>
> I don't understand why.
>
> Without the Tika transformation, it does not work either.
>
> So, if you have only changed the Tika connector, I don't understand why this
> version has this bug.
>
>
>
> Would you release a bin version with this hotfix?
>
>
>
> Thanks
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, January 12, 2018 16:13
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> What does the solr log look like?  There should be [INFO] statements you
> can inspect to see what's being sent, and its size.
>
>
>
> The change to go to Tika 1.17 will not have affected the Solr connector in
> any way.  So one way to decide if this change might be the problem is to
> run a job that doesn't use Tika -- like a simple file directory crawl that
> has only text documents in it, and see if that makes it to Solr.  If not,
> then you must have a configuration difference somewhere.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Jan 12, 2018 at 9:44 AM, msaunier <ms...@citya.com> wrote:
>
> I finished the tests.
>
>
>
> I no longer get any errors with Tika.
>
> However, the data is not committed to Solr. I have no error log. The
> configuration is the same. I tried with 2 different databases and the same
> configuration, and I tried with the same database, but no data arrives in
> Solr. Is that normal?
>
> On 2.9.0, the commit works well.
>
>
>
> Regards,
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Friday, January 12, 2018 11:52
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Hello Karl,
>
>
>
> I have one last test to run. I will do it between noon and two o'clock and
> let you know whether the patch works.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Thursday, January 11, 2018 18:09
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> No Tika error is good, but have a look at Simple History to be sure
> documents were actually processed.  If you can confirm that, I'll kick off
> the patch process.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jan 11, 2018 at 11:26 AM, msaunier <ms...@citya.com> wrote:
>
> Ok. So.
>
>
>
> With the same configuration but Tika 1.17:
>
> ·        No Tika error
>
> ·        But no documents are sent to Solr. I don't understand why. I am
> investigating.
>
>
>
>
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Thursday, January 11, 2018 15:32
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> The crawl is running at the moment. I think it will be finished in 30 minutes.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Thursday, January 11, 2018 15:05
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Did this work for you?
>
> Karl
>
>
>
> On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you need the jcifs connector, run "ant make-deps" too.  Then run "ant
> build" again.
>
>
>
> Karl
>
>
>
> On Thu, Jan 11, 2018 at 4:30 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
>
>
> I have built and configured it, but the WindowsShare connector does not
> appear in the list of repository connectors.
>
> ·        I added jcifs.jar to the connectors/jcifs/lib-proprietary
> directory
>
> ·        I ran ant make-core-deps
>
> ·        I ran ant build
>
> ·        I uncommented Windows share in the connectors-proprietary.xml file
> in the dist folder
>
> ·        I added jcifs.jar to connector-lib-proprietary
>
> But the connector is not offered in the ManifoldCF interface.
>
> Any idea?
>
> Thanks.
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Wednesday, January 10, 2018 18:15
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Good!
>
> I will configure and test that.
>
> I will report back as soon as the crawl has finished (400k documents).
>
> If it works, I will test on a few million documents.
>
> Thanks.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, January 10, 2018 17:45
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> The build you should be using is the ant build.  Do not use the maven
> build for this purpose.
>
>
>
> - Check out trunk:
>
>
>
> svn co https://svn.apache.org/repos/asf/manifoldcf/trunk
>
>
>
> - Download dependencies:
>
>
>
> ant make-core-deps
>
>
>
> - Build:
>
>
>
> ant build
>
>
>
> - Your deliverable is in the "dist" directory
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jan 10, 2018 at 11:37 AM, msaunier <ms...@citya.com> wrote:
>
> I get an error with the Maven build, so I tested with an external Tika 1.17
> server, but POI is not included. If you manage to get a mvn package build
> working with Tika 1.17, I am interested.
>
> Today, I have not had much time to deal with it.
>
> I found some bugs that I will report tomorrow if they have not been reported
> already. They concern log4j2, local_fr and a bug with the web interface and
> the keyboard input key.
>
> I am continuing my investigation.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, January 10, 2018 17:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Any news?
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <da...@gmail.com> wrote:
>
> Let me know what happens.
> If it works for you, I'll see if we can put together a patch release of
> 2.9 with the fix.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 11:07 AM, msaunier <ms...@citya.com> wrote:
>
> Test checking out and building with POI 3.17 and Tika 1.17?
>
> That is possible.
>
> I will finish a project and then test that.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 16:57
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to
> deal with the classloader issue present in POI 3.15, and because POI 3.16
> has a severe security issue that made it impossible to ship with.
>
>
>
> Unfortunately that doesn't quite work; POI 3.17 is not backwards
> compatible with 3.16 completely and therefore problems occur with this
> combination.
>
>
>
> The probable solution is to check out and build trunk and see if that
> works for you.  It very well might.  The question then is what to do next,
> because we are not scheduled to release again until April.  We might have
> to do a point release to deal with this.
>
>
>
> Please give it a try and let me know what happens.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, never mind that last email.  We patched it in part in 2.9 by including
> the latest POI.  So clearly it's still an existing problem in POI.  I'll
> have to open a ticket there and await a patch from them.
>
>
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>
> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
> for the 2.9 release.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>
> The two versions (2.8.1 and 2.9) of ManifoldCF are on two different servers.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:54
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> As for the Tika issue, we explicitly tested documents of that type when
> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
> also tested this.
>
>
>
> One of the potential issues is that if you are dropping down different
> versions of ManifoldCF into the same directories you *might* have a poi*
> jar in the wrong place because of the way we had to do the patch.  Please
> have a look at where the poi* jars are in your directory structure; they
> should all be in one directory (connector-common-lib).  If you see any
> anywhere else, that's the cause of the issue.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>
> Since the Tika extractor essentially filters out the content mime type
> (other than presenting it as metadata), you need to put an "allowed
> documents" transformation connection into your job pipeline BEFORE the Tika
> connector:
>
>
>
> https://manifoldcf.apache.org/release/release-2.9/en_US/end-
> user-documentation.html#alloweddocuments
>
>
>
> In fact, mime type exclusion is actually disabled in the Solr output
> connector *unless* you are using the extracting update handler.  That
> should resolve the one problem for you.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>
> The documents that fail in Tika are:
>
> ·        Microsoft Word 97-2003
>
> ·        application/msword
>
> I can't get more information; they are on SCO servers, and SCO does not
> have the ls -lisan or stat commands.
>
> For the Solr connection, I believe I emptied the index (ManifoldCF and Solr)
> before the last indexing run. I will do it again to be sure.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:26
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
> Okay, good. I will see whether I can test Tika 1.17.
>
> I can't send a document that triggers this error; they are private. Sorry.
>
>
>
> If I encounter the error again on a non-private document, I'll come back
> to you.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:12
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket, I'd greatly appreciate it if you attached one
> document that causes the failure.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>
> It's version 2.9.
>
> I have a 2.8.1 on another server with the same job and the same documents.
> I will test on that server and report back.
>
>
>
> Thanks for your help.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 13:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Output connectors**, with the Solr connector: I have set a « Maximum
> document length » and I have « Excluded 5 mime types », but it does not
> work. I attach a screenshot.
>
> ----------
>
> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
> blocked:
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed:
> org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/
> apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.
> HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> ~[?:?]
>
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> ~[?:?]
>
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtract
> or.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.
> OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.
> TikaParser.parse(TikaParser.java:74) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.
> addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithExcept
> ion(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineObjectWithVersions.
> addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$
> ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$
> ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.
> SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
>
>
> Do I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Regards,

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
With the bin version, it works well. I think I will finish the test at 18:00.

 

PS: Do you know this error in Solr?

 

2018-01-12 16:19:50.175 WARN  (qtp754666084-28157) [   ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_xwd]

java.nio.file.NoSuchFileException: /var/solr/data/citya_shard5_replica1/data/index/segments_xwd

        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)

        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)

        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)

       at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)

        at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)

        at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)

        at java.nio.file.Files.readAttributes(Files.java:1737)

        at java.nio.file.Files.size(Files.java:2332)

        at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)

        at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128)

        at org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)

        at org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)

        at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:348)

        at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:48)

        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)

        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)

        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)

        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)

        at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:748)

        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:729)

        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:510)

        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)

        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)

        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)

        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)

        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)

        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)

        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)

        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)

        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)

        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)

        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)

        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)

        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)

        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)

        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)

        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)

        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)

        at org.eclipse.jetty.server.Server.handle(Server.java:534)

        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)

        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)

        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)

        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)

        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)

        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)

        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)

        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)

        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)

        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)

        at java.lang.Thread.run(Thread.java:748)

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 12, 2018 16:52
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

You can download the bin version here:

 

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.9.1

 

Karl

 

 

On Fri, Jan 12, 2018 at 10:50 AM, msaunier <msaunier@citya.com> wrote:

Solr has no special log entries. (File attached.)

I ran tail -f on the Solr and ManifoldCF logs: no errors, no warnings, only INFO messages, and nothing really important.

The configuration is the same. Same properties.xml, same logging.xml, same start-options.env.unix, same start script, same database. But with 2.9.0 it works; with '2.9.1' nothing is committed.

I don't understand why.

Without the Tika transformation, it does not work either.

So, if you have only changed the Tika connector, I don't understand why this version has this bug.

 

Would you release a bin version with this hotfix?

 

Thanks

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 12, 2018 16:13
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

What does the solr log look like?  There should be [INFO] statements you can inspect to see what's being sent, and its size.
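
A minimal sketch of one way to look for those INFO statements from the command line. It assumes that Solr records each update request on a line containing "path=/update", which is the usual shape of its request log but depends on the logging configuration. The function names are hypothetical, and the log file location is installation-specific, so it is passed in as an argument.

```shell
#!/bin/sh
# Count the update requests recorded in a Solr log file.
# Assumption: update requests are logged at INFO level on lines that
# contain "path=/update"; adjust the pattern if your log format differs.
count_solr_updates() {
  # $1: path to the Solr log file (location depends on the installation)
  grep -c 'path=/update' "$1"
}

# Print the most recent update requests so their parameters can be inspected.
last_solr_updates() {
  # $1: path to the Solr log file, $2: number of lines to show (default 5)
  grep 'path=/update' "$1" | tail -n "${2:-5}"
}
```

If the count stays at zero while a job is running, nothing is reaching Solr's update handler, which points at the ManifoldCF side rather than at Solr.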

 

The change to go to Tika 1.17 will not have affected the Solr connector in any way.  So one way to decide if this change might be the problem is to run a job that doesn't use Tika -- like a simple file directory crawl that has only text documents in it, and see if that makes it to Solr.  If not, then you must have a configuration difference somewhere.

 

Karl

 

 

On Fri, Jan 12, 2018 at 9:44 AM, msaunier <msaunier@citya.com> wrote:

I finished the tests.

I no longer get any errors with Tika.

However, the data is not committed to Solr. I have no error log. The configuration is the same. I tried with 2 different databases and the same configuration, and I tried with the same database, but no data arrives in Solr. Is that normal?

On 2.9.0, the commit works well.

 

Regards,

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Friday, January 12, 2018 11:52
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Hello Karl,

I have one last test to run. I will do it between noon and two o'clock and let you know whether the patch works.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 18:09
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

No Tika error is good, but have a look at Simple History to be sure documents were actually processed.  If you can confirm that, I'll kick off the patch process.

 

Karl

 

 

On Thu, Jan 11, 2018 at 11:26 AM, msaunier <msaunier@citya.com> wrote:

Ok. So.

With the same configuration but Tika 1.17:

·        No Tika error

·        But no documents are sent to Solr. I don't understand why. I am investigating.

 

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, January 11, 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

The crawl is running at the moment. I think it will be finished in 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the Windows Share connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows Share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any idea?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Good!

I will configure and test that.

I will report back as soon as the run is finished (400k documents).

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I get an error with the Maven build, so I tested with an external Tika 1.17 server, but POI is not included. If you manage to run mvn package successfully with Tika 1.17, I am interested.

I have not had much time to deal with it today.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug in the web interface with the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

You mean testing by checking out and building with POI 3.17 and Tika 1.17?

That is possible.

I will finish a project first and then test that.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.
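For readers unfamiliar with the error text: the string after the method name in a NoSuchMethodError is a JVM method descriptor, so the message itself says which signature the caller was compiled against. The sketch below is illustrative only; the helper names decode_missing_method and _read_type are made up here and are not part of Tika, POI, or ManifoldCF. It decodes a descriptor such as the one in the log line quoted in this thread.

```python
import re

# JVM primitive type codes and the Java type names they stand for.
PRIMITIVES = {"B": "byte", "C": "char", "D": "double", "F": "float",
              "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void"}


def _read_type(desc, pos):
    """Read one type from a descriptor starting at pos; return (name, next_pos)."""
    dims = 0
    while desc[pos] == "[":          # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":             # object type: L<internal/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".").replace("$", ".")
        pos = end + 1
    else:                            # single-letter primitive or void
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_missing_method(message):
    """Split 'pkg.Class.method(args)ret' into owner, method, args and return type."""
    match = re.fullmatch(r"([\w.$]+)\.(\w+)\((.*)\)(.+)", message.strip())
    if match is None:
        raise ValueError("not a JVM method reference: " + message)
    owner, method, arg_desc, ret_desc = match.groups()
    args, pos = [], 0
    while pos < len(arg_desc):
        name, pos = _read_type(arg_desc, pos)
        args.append(name)
    ret, _ = _read_type(ret_desc, 0)
    return {"owner": owner, "method": method, "args": args, "returns": ret}


info = decode_missing_method(
    "org.apache.poi.hwmf.record.HwmfFont.getCharSet()"
    "Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;")
print(info)
```

Applied to the message in this thread, it reports that the caller expected HwmfFont.getCharSet() with no arguments and a return type of HwmfFont.WmfCharset; the POI jar that was actually loaded has no method with exactly that signature, which is what a version mismatch between Tika and POI would produce.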

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions of ManifoldCF (2.8.1 and 2.9) are on two different servers.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
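One way to carry out this check without browsing the directories by hand is a small script. This is only a sketch: the function name find_stray_poi_jars is made up for illustration, and the only thing taken from the message above is that the poi* jars are expected in a directory named connector-common-lib. A jar in a subdirectory of connector-common-lib would also be reported.

```python
import os


def find_stray_poi_jars(install_root, expected_dir="connector-common-lib"):
    """Return paths of poi*.jar files that are not directly inside expected_dir."""
    stray = []
    for dirpath, _dirnames, filenames in os.walk(install_root):
        in_expected = os.path.basename(dirpath) == expected_dir
        for name in filenames:
            if name.startswith("poi") and name.endswith(".jar") and not in_expected:
                stray.append(os.path.join(dirpath, name))
    return sorted(stray)


if __name__ == "__main__":
    # Run from the ManifoldCF installation directory; prints nothing if all is well.
    for path in find_stray_poi_jars("."):
        print("unexpected POI jar:", path)
```

An empty result means every poi* jar sits where the message says it should; any printed path is a candidate for the misplaced jar described above.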

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.
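The ordering point can be illustrated with a toy model. This is not ManifoldCF code: the stage functions, the document dictionaries, and the example mime type are all invented for the sketch, and it assumes the extractor hands on plain text while keeping the original mime type only as metadata, as the message describes.

```python
def allowed_documents(excluded_mime_types):
    """Build a filter stage that rejects documents whose mime type is excluded."""
    def stage(doc):
        return None if doc["mime"] in excluded_mime_types else doc
    return stage


def tika_extract(doc):
    """Toy extractor: sets the mime type to text/plain, keeps the original as metadata."""
    return {"name": doc["name"], "mime": "text/plain",
            "metadata": {"original_mime": doc["mime"]}}


def run_pipeline(doc, stages):
    """Pass doc through each stage in order; a stage returning None rejects it."""
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


doc = {"name": "drawing.dwg", "mime": "image/vnd.dwg"}
exclude = allowed_documents({"image/vnd.dwg"})

filtered_first = run_pipeline(dict(doc), [exclude, tika_extract])
filtered_last = run_pipeline(dict(doc), [tika_extract, exclude])
print(filtered_first)         # rejected before extraction
print(filtered_last["mime"])  # the filter only ever saw text/plain
```

With the filter placed before the extractor the document is rejected; placed after it, the filter sees only text/plain and lets everything through, which matches the behaviour reported at the start of the thread.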

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information; they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

OK, good. I will see whether I can test Tika 1.17.

I can't send a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector: I have set a « Maximum document length » and « Excluded mime types » (5 mime types), but it does not work. I attach a screenshot.

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Should I create an incident ticket?

----------

Thanks for your help.

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
You can download the bin version here:

https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.9.1

Karl


On Fri, Jan 12, 2018 at 10:50 AM, msaunier <ms...@citya.com> wrote:

> The Solr log shows nothing special (file attached).
>
> I ran tail -f on the Solr and ManifoldCF logs, but there is no error and no
> warning, only INFO messages of no real importance.
>
> The configuration is the same: same properties.xml, same logging.xml, same
> start-options.env.unix, same start script, same database. But in 2.9.0 it
> works, and in '2.9.1' nothing is committed.
>
> I don't understand why.
>
> Without the Tika transformation, it does not work either.
>
> So, if you have only changed the Tika connector, I don't understand why this
> version has this bug.
>
> Would you release a bin version with this hotfix?
>
> Thanks

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
The Solr log shows nothing special (file attached).

I ran tail -f on the Solr and ManifoldCF logs, but there is no error and no warning, only INFO messages of no real importance.

The configuration is the same: same properties.xml, same logging.xml, same start-options.env.unix, same start script, same database. But in 2.9.0 it works, and in '2.9.1' nothing is committed.

I don't understand why.

Without the Tika transformation, it does not work either.

So, if you have only changed the Tika connector, I don't understand why this version has this bug.

Would you release a bin version with this hotfix?

Thanks

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 12, 2018 16:13
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

What does the solr log look like?  There should be [INFO] statements you can inspect to see what's being sent, and its size.

 

The change to go to Tika 1.17 will not have affected the Solr connector in any way.  So one way to decide if this change might be the problem is to run a job that doesn't use Tika -- like a simple file directory crawl that has only text documents in it, and see if that makes it to Solr.  If not, then you must have a configuration difference somewhere.
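One quick way to see whether such a test job reached Solr is to compare the index's document count before and after the crawl. The following is only a sketch: the base URL http://localhost:8983/solr and the core name mycore are placeholders, not values from this thread. It queries Solr's standard select handler with rows=0 and reads numFound from the JSON response.

```shell
#!/bin/sh
# Extract the numFound value from a Solr JSON response read on stdin.
extract_num_found() {
  sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the number of documents currently in a core.
# $1 = Solr base URL (placeholder, e.g. http://localhost:8983/solr)
# $2 = core or collection name (placeholder, e.g. mycore)
solr_doc_count() {
  curl -s "$1/$2/select?q=*:*&rows=0&wt=json" | extract_num_found
}

# Example call (placeholder values):
#   solr_doc_count http://localhost:8983/solr mycore
```

If the count does not change after a crawl of plain text files, the problem is probably not in the Tika step.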

 

Karl

 

 

On Fri, Jan 12, 2018 at 9:44 AM, msaunier <msaunier@citya.com> wrote:

I finished the tests.

I no longer get any Tika errors.

However, the data is not committed to Solr, and there is nothing in the error log. The configuration is the same. I tried with two different databases and the same configuration, and I also tried with the same database, but no data arrives in Solr. Is that normal?

On 2.9.0, the commit works well.

 

Regards,

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Friday, January 12, 2018 11:52
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Hello Karl,

I have one last test to run. I will do it between noon and two o'clock and will let you know whether the patch works.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 18:09
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

No Tika error is good, but have a look at Simple History to be sure documents were actually processed.  If you can confirm that, I'll kick off the patch process.

 

Karl

 

 

On Thu, Jan 11, 2018 at 11:26 AM, msaunier <msaunier@citya.com> wrote:

OK, so.

With the same configuration but Tika 1.17:

·        No Tika error

·        But no documents are sent to Solr. I don't understand why. I am looking into it.

 

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, January 11, 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

The crawl is running at the moment. I think it will finish in about 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.
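Put together with the steps reported in this thread, the rebuild could look roughly like the sketch below. The function name and both arguments (the checkout directory and the location of your own jcifs.jar) are assumptions for illustration; the ant targets and the lib-proprietary directory are the ones mentioned in the thread.

```shell
#!/bin/sh
# Rebuild ManifoldCF with the proprietary jcifs connector included.
# $1 = root of the source checkout (placeholder)
# $2 = path to a jcifs.jar you obtained yourself (placeholder)
build_with_jcifs() {
  src="$1"
  jar="$2"
  [ -f "$jar" ] || { echo "jcifs jar not found: $jar" >&2; return 1; }
  # Place the jar where the build expects proprietary libraries.
  cp "$jar" "$src/connectors/jcifs/lib-proprietary/" || return 1
  # Fetch core and connector dependencies, then build again.
  ( cd "$src" && ant make-core-deps && ant make-deps && ant build )
}

# Example call (placeholder paths):
#   build_with_jcifs ./trunk ./downloads/jcifs.jar
```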

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the Windows Share connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows Share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any ideas?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Good!

I will configure and test that.

I will report back as soon as the crawl has finished.

400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory
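The steps above can be sketched as one shell function. The function name and the default checkout directory (manifoldcf-trunk) are assumptions for illustration; the svn URL and the ant targets are the ones given above.

```shell
#!/bin/sh
# Check out ManifoldCF trunk and build it with ant.
# $1 = target directory for the checkout (assumed default: manifoldcf-trunk)
build_mcf_trunk() {
  dest="${1:-manifoldcf-trunk}"
  svn co https://svn.apache.org/repos/asf/manifoldcf/trunk "$dest" || return 1
  (
    cd "$dest" || exit 1
    ant make-core-deps || exit 1   # download dependencies
    ant build || exit 1            # build; the deliverable lands in dist/
  )
}

# Example call:
#   build_mcf_trunk my-checkout
```

Running the ant steps in a subshell keeps the caller's working directory unchanged, and any failing step stops the sequence.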

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I get an error with the Maven build, so I tested with an external Tika 1.17 server, but POI is not included. If you manage to run mvn package with Tika 1.17, I am interested.

Today, I have not had much time to deal with it.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

You mean test a checkout and build with POI 3.17 and Tika 1.17?

That is possible.

I will finish a project first and then test that.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions of ManifoldCF (2.8.1 and 2.9) are on two different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
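A small script can make that check easier. The sketch below is illustrative only: the function name is made up, and the install root you pass in is a placeholder. It lists every poi*.jar under the root and flags any that is not inside a connector-common-lib directory.

```shell
#!/bin/sh
# List every poi*.jar under an install root and flag those that are
# not inside a connector-common-lib directory.
# $1 = ManifoldCF install root (placeholder path)
check_poi_jars() {
  root="$1"
  find "$root" -type f -name 'poi*.jar' | sort | while IFS= read -r jar; do
    case "$jar" in
      */connector-common-lib/*) echo "OK        $jar" ;;
      *)                        echo "MISPLACED $jar" ;;
    esac
  done
}

# Example call (placeholder path):
#   check_poi_jars /opt/manifoldcf
```

Any line marked MISPLACED points at a jar that may shadow or conflict with the intended POI version.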

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information: they are on SCO servers, and SCO has no ls -lisan or stat command.

For the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector, I have set a « Maximum document length » and « Excluded mime types » (5 types), but neither has any effect. I attach a screenshot.

----------

Second, I get a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,


Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
What does the solr log look like?  There should be [INFO] statements you
can inspect to see what's being sent, and its size.

The change to go to Tika 1.17 will not have affected the Solr connector in
any way.  So one way to decide if this change might be the problem is to
run a job that doesn't use Tika -- like a simple file directory crawl that
has only text documents in it, and see if that makes it to Solr.  If not,
then you must have a configuration difference somewhere.

Karl


On Fri, Jan 12, 2018 at 9:44 AM, msaunier <ms...@citya.com> wrote:

> I finished the tests.
>
> I no longer get any Tika errors.
>
> However, the data is not committed to Solr, and there is nothing in the
> error log. The configuration is the same. I tried with two different
> databases and the same configuration, and I also tried with the same
> database, but no data arrives in Solr. Is that normal?
>
> On 2.9.0, the commit works well.
>
>
>
> Regards,
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Friday, January 12, 2018 11:52
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika
> Parser error
>
> Hello Karl,
>
> I have one last test to run. I will do it between noon and two o'clock and
> will let you know whether the patch works.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Thursday, January 11, 2018 18:09
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> No Tika error is good, but have a look at Simple History to be sure
> documents were actually processed.  If you can confirm that, I'll kick off
> the patch process.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Jan 11, 2018 at 11:26 AM, msaunier <ms...@citya.com> wrote:
>
> OK, so.
>
> With the same configuration but Tika 1.17:
>
> ·        No Tika error
>
> ·        But no documents are sent to Solr. I don't understand why. I am
> looking into it.
>
>
>
>
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Thursday, January 11, 2018 15:32
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika
> Parser error
>
> The crawl is running at the moment. I think it will finish in about 30
> minutes.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Thursday, January 11, 2018 15:05
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Did this work for you?
>
> Karl
>
>
>
> On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you need the jcifs connector, run "ant make-deps" too.  Then run "ant
> build" again.
>
>
>
> Karl
>
>
>
> On Thu, Jan 11, 2018 at 4:30 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I have built and configured it, but the Windows Share connector does not
> appear in the list of repository connectors.
>
> ·        I added jcifs.jar to the connectors/jcifs/lib-proprietary
> directory
>
> ·        I ran ant make-core-deps
>
> ·        I ran ant build
>
> ·        I uncommented the Windows Share entry in the
> connectors-proprietary.xml file in the dist folder
>
> ·        I added jcifs.jar to connector-lib-proprietary
>
> But the connector is still not offered in the ManifoldCF interface.
>
> Any ideas?
>
> Thanks.
>
>
>
>
>
> *From:* msaunier [mailto:msaunier@citya.com]
> *Sent:* Wednesday, January 10, 2018 18:15
> *To:* user@manifoldcf.apache.org
> *Subject:* RE: Document connector excluding mime type and size - Tika
> Parser error
>
> Good!
>
> I will configure and test that.
>
> I will report back as soon as the crawl has finished.
>
> 400k documents.
>
> If it works, I will test on a few million documents.
>
> Thanks.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, January 10, 2018 17:45
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> The build you should be using is the ant build.  Do not use the maven
> build for this purpose.
>
>
>
> - Check out trunk:
>
>
>
> svn co https://svn.apache.org/repos/asf/manifoldcf/trunk
>
>
>
> - Download dependencies:
>
>
>
> ant make-core-deps
>
>
>
> - Build:
>
>
>
> ant build
>
>
>
> - Your deliverable is in the "dist" directory
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jan 10, 2018 at 11:37 AM, msaunier <ms...@citya.com> wrote:
>
> I get an error with the Maven build, so I tested with an external Tika
> 1.17 server, but POI is not included. If you manage to run mvn package
> with Tika 1.17, I am interested.
>
> Today, I have not had much time to deal with it.
>
> I found some bugs that I will report tomorrow if they have not been
> reported already. They concern log4j2, local_fr, and a bug with the web
> interface and the keyboard input key.
>
> I am continuing my investigation.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, January 10, 2018 17:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Any news?
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <da...@gmail.com> wrote:
>
> Let me know what happens.
> If it works for you, I'll see if we can put together a patch release of
> 2.9 with the fix.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 11:07 AM, msaunier <ms...@citya.com> wrote:
>
> You mean test a checkout and build with POI 3.17 and Tika 1.17?
>
> That is possible.
>
> I will finish a project first and then test that.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 16:57
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order to
> deal with the classloader issue present in POI 3.15, and because POI 3.16
> has a severe security issue that made it impossible to ship with.
>
>
>
> Unfortunately that doesn't quite work; POI 3.17 is not backwards
> compatible with 3.16 completely and therefore problems occur with this
> combination.
>
>
>
> The probable solution is to check out and build trunk and see if that
> works for you.  It very well might.  The question then is what to do next,
> because we are not scheduled to release again until April.  We might have
> to do a point release to deal with this.
>
>
>
> Please give it a try and let me know what happens.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, never mind that last email.  We patched it in part in 2.9 by including
> the latest POI.  So clearly it's still an existing problem in POI.  I'll
> have to open a ticket there and await a patch from them.
>
>
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>
> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
> for the 2.9 release.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>
> The two versions of ManifoldCF (2.8.1 and 2.9) are on two different servers.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:54
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> As for the Tika issue, we explicitly tested documents of that type when
> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
> also tested this.
>
>
>
> One of the potential issues is that if you are dropping down different
> versions of ManifoldCF into the same directories you *might* have a poi*
> jar in the wrong place because of the way we had to do the patch.  Please
> have a look at where the poi* jars are in your directory structure; they
> should all be in one directory (connector-common-lib).  If you see any
> anywhere else, that's the cause of the issue.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>
> Since the Tika extractor essentially filters out the content mime type
> (other than presenting it as metadata), you need to put an "allowed
> documents" transformation connection into your job pipeline BEFORE the Tika
> connector:
>
>
>
> https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments
>
>
>
> In fact, mime type exclusion is actually disabled in the Solr output
> connector *unless* you are using the extracting update handler.  That
> should resolve the one problem for you.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>
> The documents that fail in Tika are:
>
> ·        Microsoft Word 97-2003
>
> ·        application/msword
>
> I can't get more information: they are on SCO servers, and SCO has no
> ls -lisan or stat command.
>
> For the Solr connection, I believe I emptied the index (ManifoldCF and
> Solr) before the last indexing run. I will do it again to be sure.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:26
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
> Okay, good. I will see whether I can test Tika version 1.17.
>
> I can't send you a document that triggers this error; they are private.
> Sorry.
>
> If I encounter the error again on a non-private document, I'll come back
> to you.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:12
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket, I'd greatly appreciate attaching one
> document to it that causes the failure.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>
> It's version 2.9.
>
> I have a 2.8.1 on another server with the same job and the same documents.
> I will test on that server and report back.
>
>
>
> Thanks for your help.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 13:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
> I have 2 problems with ManifoldCF.
>
> -----------
>
> In **Output connectors**, with the Solr connector, I have set a « Maximum
> document length » and « Excluded mime types » (5 types), but neither has
> any effect. I attach a screenshot.
>
> ----------
>
> Second, I get a **Tika exception** in ManifoldCF. 3 documents are
> blocked:
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>
>
>
> Do I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Regards,
>

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
I finished the tests.

I no longer get any errors from Tika.

However, the data cannot be committed to Solr, and there is no error in the logs. The configuration is the same. I tried with 2 different databases and the same configuration, and I also tried with the same database, but no data arrives in Solr. Is that normal?

On 2.9.0, the commit works well.

Regards,

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Friday, 12 January 2018 11:52
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Hello Karl,

I have one last test to run. I will do it between noon and two o'clock and will let you know whether the patch works.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, 11 January 2018 18:09
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

No Tika error is good, but have a look at Simple History to be sure documents were actually processed.  If you can confirm that, I'll kick off the patch process.

 

Karl

 

 

On Thu, Jan 11, 2018 at 11:26 AM, msaunier <msaunier@citya.com> wrote:

Ok. So.

With the same configuration but Tika 1.17:

·        No Tika error

·        But no documents are sent to Solr. I don't understand why; I am investigating.

 

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, 11 January 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

The crawl is running at the moment. I think it will be finished in 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, 11 January 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the WindowsShare connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any ideas?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, 10 January 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Good!

I will configure and test that.

I will report back as soon as the crawl has finished.

400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, 10 January 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory
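
The steps above, together with the "ant make-deps" step that the 11 January reply in this thread recommends when the jcifs connector is needed, can be sketched as a small shell function. This is only an illustration: the `run` wrapper, the `DRY_RUN` variable and the `build_mcf_trunk` name are hypothetical and not part of ManifoldCF; the svn and ant commands are the ones quoted in the thread.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 it only prints each command.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper. Pass "yes" to also run "ant make-deps", which the
# thread says is needed for the jcifs (Windows shares) connector.
# The body runs in a subshell, so the caller's working directory is untouched.
build_mcf_trunk() (
    with_all_deps="${1:-no}"
    run svn co https://svn.apache.org/repos/asf/manifoldcf/trunk || exit 1
    run cd trunk || exit 1
    run ant make-core-deps || exit 1
    if [ "$with_all_deps" = yes ]; then
        run ant make-deps || exit 1
    fi
    run ant build || exit 1
)
```

Running `DRY_RUN=1 build_mcf_trunk yes` prints the command sequence without executing anything, which is a cheap way to review it before a real build.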

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I get an error with the Maven build, so I tested with an external Tika Server 1.17, but POI is not included. If you manage to build a mvn package with Tika 1.17, I am interested.

I have not had much time to deal with it today.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug in the web interface with the Enter key on the keyboard.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, 10 January 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

Check out and build trunk to test with POI 3.17 and Tika 1.17?

That's possible.

I will finish a project first and then test it.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 9 January 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.
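
One way to see which POI and Tika versions an installation actually carries is to list the jar file names. The sketch below assumes the jars follow the usual artifact-version.jar naming (for example poi-3.17.jar); the function name `list_poi_tika_versions` and its directory argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "artifact version" for every poi*/tika* jar
# found under the given directory (default: current directory).
# Assumes file names of the form <artifact>-<version>.jar.
list_poi_tika_versions() {
    dir="${1:-.}"
    find "$dir" -type f \( -name 'poi*.jar' -o -name 'tika*.jar' \) |
        sed -n 's|.*/\(.*\)-\([0-9][0-9.]*\)\.jar$|\1 \2|p' |
        sort -u
}
```

If the listing shows a POI version that the bundled Tika release was not built against, a NoSuchMethodError like the one reported in this thread is a plausible symptom.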

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The 2 versions of ManifoldCF (2.8.1 and 2.9) are on 2 different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 9 January 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
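
The check described above can be scripted. A minimal sketch, assuming a Unix-like shell and that the installation root is passed as the first argument; the function name `find_stray_poi_jars` is hypothetical, and connector-common-lib is the directory named in the message.

```shell
#!/bin/sh
# Hypothetical helper: list poi*.jar files that are NOT inside a
# connector-common-lib directory under the given installation root.
# An empty result means every POI jar is where the message says it should be.
find_stray_poi_jars() {
    root="${1:-.}"
    find "$root" -type f -name 'poi*.jar' ! -path '*/connector-common-lib/*' | sort
}
```

Any path it prints is a candidate for the misplaced jar described above.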

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        Application/msword

I can't get more information: they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index (in both ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 9 January 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 9 January 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

If you could create a ticket, I'd greatly appreciate it if you attached to it one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 9 January 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector: I have set a "Maximum document length" and I have excluded 5 mime types, but it does not work. I attach a screenshot.

----------

Second, I get a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

----------

Thanks for your help.

Regards,

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
Hello Karl,

I have one last test to run. I will do it between noon and two o'clock and will let you know whether the patch works.

 

 

 


Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
No Tika error is good, but have a look at Simple History to be sure
documents were actually processed.  If you can confirm that, I'll kick off
the patch process.
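
As a complement to Simple History, a quick check from the Solr side is to ask the index how many documents it currently holds. The sketch below is an editorial illustration, not part of the original reply. It assumes the standard Solr select handler with JSON responses; the base URL and the core name are placeholders to be replaced with real values.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(base_url, core):
    """Build a select URL that matches every document but returns no rows."""
    query = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{query}"


def parse_num_found(body):
    """Extract response.numFound from a Solr JSON select response."""
    return int(json.loads(body)["response"]["numFound"])


def indexed_document_count(base_url, core):
    """Ask Solr how many documents the given core currently holds."""
    with urlopen(count_url(base_url, core), timeout=10) as resp:
        return parse_num_found(resp.read().decode("utf-8"))
```

Calling `indexed_document_count("http://localhost:8983/solr", "mycore")` before and after a crawl (both arguments are placeholders) would show whether anything reached the index.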

Karl


On Thu, Jan 11, 2018 at 11:26 AM, msaunier <ms...@citya.com> wrote:

> Ok. So.
>
> With the same configuration but with Tika 1.17:
>
> ·        No Tika error
>
> ·        But no documents were sent to Solr. I don't understand why; I am
> investigating.

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
Ok. So.

 

With the same configuration but with Tika 1.17:

·        No Tika error

·        But no documents were sent to Solr. I don't understand why; I am investigating.

 

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, January 11, 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

I am crawling at the moment. I think it will be finished in about 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the WindowsShare connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any idea?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Good!

I will configure and test that.

I will report back as soon as the crawl is finished.

400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory
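
The steps above, together with the `ant make-deps` step mentioned elsewhere in this thread for the jcifs connector, can be collected into a small script. This is an editorial sketch only: the checkout directory name `mcf-trunk` and the `with_jcifs` flag are arbitrary choices for the illustration, and it assumes `svn` and `ant` are on the PATH.

```python
import subprocess

TRUNK_URL = "https://svn.apache.org/repos/asf/manifoldcf/trunk"


def build_steps(workdir="mcf-trunk", with_jcifs=False):
    """Return (command, working directory) pairs for an ant build of trunk."""
    steps = [
        (["svn", "co", TRUNK_URL, workdir], None),
        (["ant", "make-core-deps"], workdir),
    ]
    if with_jcifs:
        # Per the thread, this extra step is needed for the jcifs connector.
        steps.append((["ant", "make-deps"], workdir))
    steps.append((["ant", "build"], workdir))
    return steps


def run_build(workdir="mcf-trunk", with_jcifs=False):
    """Run each step in order and stop at the first failing command."""
    for command, cwd in build_steps(workdir, with_jcifs):
        subprocess.run(command, cwd=cwd, check=True)
```

Keeping the command list separate from the runner makes it easy to print the steps for review before anything is executed.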

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I got an error with the maven build, so I tested with an external Tika Server 1.17, but POI is not included. If you manage to run mvn package with Tika 1.17, I am interested.

Today, I have not had much time to deal with it.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

Test by checking out and building with POI 3.17 and Tika 1.17?

That is possible.

I will finish a project first and then test that.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.
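
A `NoSuchMethodError` like the one in this thread means the class was found but does not declare the method with the exact descriptor the caller was compiled against. One rough way to see which POI jar on disk would satisfy Tika is to scan the class file for the method name and descriptor strings. The sketch below is an editorial illustration: it does a plain byte search rather than parsing the class file, so a hit only shows that the strings occur in the class, and the jar to inspect is whichever one contains `HwmfFont` in your installation.

```python
import zipfile

CLASS_ENTRY = "org/apache/poi/hwmf/record/HwmfFont.class"
METHOD_NAME = b"getCharSet"
# Descriptor copied from the NoSuchMethodError message in the log.
DESCRIPTOR = b"()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;"


def jar_mentions_method(jar_path, entry=CLASS_ENTRY,
                        name=METHOD_NAME, descriptor=DESCRIPTOR):
    """Return True or False if the class is in the jar, None if it is absent.

    Heuristic only: method names and descriptors are stored as strings in
    the class file's constant pool, so a byte search can find them, but it
    cannot tell a declaration from a mere reference.
    """
    with zipfile.ZipFile(jar_path) as jar:
        if entry not in jar.namelist():
            return None
        data = jar.read(entry)
    return name in data and descriptor in data
```

A `False` result for the jar that is actually loaded would be consistent with the version mismatch described above; a `True` result does not prove the method is declared, only that the strings are present.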

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions (2.8.1 and 2.9) of ManifoldCF are on two different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
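
The directory check described above can be automated. The following is an editorial sketch, not an official ManifoldCF tool: it walks an installation directory (the path you pass in is up to you) and lists every `poi*.jar` that is not directly inside a directory named `connector-common-lib`.

```python
import os


def find_stray_poi_jars(install_root, expected_dir="connector-common-lib"):
    """List poi*.jar files that live outside the expected directory."""
    stray = []
    for dirpath, _dirnames, filenames in os.walk(install_root):
        in_expected = os.path.basename(dirpath) == expected_dir
        for filename in filenames:
            is_poi_jar = filename.startswith("poi") and filename.endswith(".jar")
            if is_poi_jar and not in_expected:
                stray.append(os.path.join(dirpath, filename))
    return sorted(stray)
```

An empty list means every POI jar is where it is expected; any path returned is a candidate for the misplaced jar described above.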

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.
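
The ordering point can be illustrated with a toy model. This is not ManifoldCF code; the stage functions and the dictionary layout are invented for the illustration. It assumes, as described above, that the Tika stage replaces the original content with extracted text and keeps the original mime type only as metadata, so a mime-type filter placed after it never sees the original type.

```python
def mime_filter(excluded_mime_types):
    """Build a filter stage that drops documents by their current mime type."""
    def stage(doc):
        if doc is None or doc["mime_type"] in excluded_mime_types:
            return None  # document rejected
        return doc
    return stage


def tika_extract(doc):
    """Stand-in for extraction: plain text out, original type kept as metadata."""
    if doc is None:
        return None
    return {
        "mime_type": "text/plain",  # assumed output type for this toy model
        "content": "extracted text",
        "metadata": {"original_mime_type": doc["mime_type"]},
    }


def run_pipeline(doc, stages):
    """Pass a document through the stages in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


word_doc = {"mime_type": "application/msword", "content": b"...", "metadata": {}}
exclude_word = mime_filter({"application/msword"})

# Filter before extraction: the Word document is rejected.
filtered_first = run_pipeline(word_doc, [exclude_word, tika_extract])
# Filter after extraction: the filter only sees text/plain and lets it through.
filtered_last = run_pipeline(word_doc, [tika_extract, exclude_word])
```

In the toy model only the first ordering rejects the document, which mirrors why the filter has to sit before the Tika connector in the job pipeline.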

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that trigger the Tika error are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information; they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll get back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that has a newer version of Tika (1.17) which might (or might not) address this problem.

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It is version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 


RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
I am crawling at the moment. I think it will be finished in about 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I have built and configured it, but the WindowsShare connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any idea?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

Good!

I will configure and test that.

I will report back as soon as the crawl is finished.

400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I got an error with the maven build, so I tested with an external Tika Server 1.17, but POI is not included. If you manage to run mvn package with Tika 1.17, I am interested.

Today, I have not had much time to deal with it.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

You mean test by checking out and building with POI 3.17 and Tika 1.17?

That is possible.

I will finish a project first and then test that.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: in 2.9 we used POI 3.17 with Tika 1.16, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

Unfortunately that doesn't quite work; POI 3.17 is not completely backwards compatible with 3.16, and therefore problems occur with this combination.
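
One way to see which POI and Tika versions an installation actually carries is to read them off the jar file names. The sketch below assumes the jars follow the usual <artifact>-<version>.jar naming (a version that itself contains a hyphen would be split incorrectly); the function name and the directory argument are illustrative, and where the Tika jars live in a given installation is an assumption to verify.

```shell
# Sketch only: list_parser_jar_versions is an illustrative helper.
# Prints "<artifact> <version>" for each poi*/tika* jar in the given
# directory, assuming files are named <artifact>-<version>.jar.
list_parser_jar_versions() {
  for jar in "$1"/poi*.jar "$1"/tika*.jar; do
    # Skip patterns that matched nothing.
    [ -e "$jar" ] || continue
    name=$(basename "$jar" .jar)
    # Split at the last hyphen: artifact before it, version after it.
    echo "${name%-*} ${name##*-}"
  done
}
```

Running it against the directory that holds the parser jars, for example list_parser_jar_versions connector-common-lib, makes a POI/Tika pairing like the one described above easy to spot.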

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The 2 versions of ManifoldCF (2.8.1 and 2.9) are on 2 different servers.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
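
The check described above can be sketched with find. The function name is illustrative, and the argument is assumed to be the root of the ManifoldCF installation; only the connector-common-lib directory name comes from this thread.

```shell
# Sketch only: find_stray_poi_jars is an illustrative helper.
# Lists every poi*.jar under the given root that is NOT inside a
# connector-common-lib directory. No output means nothing is misplaced.
find_stray_poi_jars() {
  find "$1" -type f -name 'poi*.jar' ! -path '*/connector-common-lib/*' | sort
}
```

Any path printed by find_stray_poi_jars is a candidate for the misplaced jar described above.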

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information: they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connector, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector: I have set a "Maximum document length" and "Excluded mime types" (5 mime types), but they do not work. I attach a screenshot.

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create a ticket?

 

----------

 

Thanks for your help.

 

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Did this work for you?
Karl


Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
If you need the jcifs connector, run "ant make-deps" too.  Then run "ant
build" again.

Karl

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <ms...@citya.com> wrote:

> Hello Karl,
>
>
>
> I have build and configured but WindowsShare connector do not appear in
> the list of repository connectors.
>
>
>
> ·        I have add jcifs.jar into the connectors/jcifs/lib-proprietary
> directory
>
> ·        I have ant make-core-deps
>
> ·        Ant build
>
> ·        Uncomment windows share into the connectors-proprietary.xml file
> in the dist folder
>
> ·        I have add jcifs.jar in connector-lib-proprietary
>
>
>
> But not have the proposition on the manifold interface.
>
>
>
> Any idea ?
>
> Thanks.
>
>
>
>
>
> *De :* msaunier [mailto:msaunier@citya.com]
> *Envoyé :* mercredi 10 janvier 2018 18:15
> *À :* user@manifoldcf.apache.org
> *Objet :* RE: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Good !
>
>
>
> I configure and test that.
>
> I give you a return as soon as the reading is finished.
>
> 400k documents.
>
>
>
> If it works, I test on few million of documents.
>
>
>
> Thank.
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com <da...@gmail.com>]
> *Envoyé :* mercredi 10 janvier 2018 17:45
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> The build you should be using is the ant build.  Do not use the maven
> build for this purpose.
>
>
>
> - Check out trunk:
>
>
>
> svn co https://svn.apache.org/repos/asf/manifoldcf/trunk
>
>
>
> - Download dependencies:
>
>
>
> ant make-core-deps
>
>
>
> - Build:
>
>
>
> ant build
>
>
>
> - Your deliverable is in the "dist" directory
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jan 10, 2018 at 11:37 AM, msaunier <ms...@citya.com> wrote:
>
> I have an error with the maven build, so I have test with an external 1.17
> Tika Server but, POI not included. If you success a mvn package with 1.17
> Tika, I am interested.
>
>
>
> Today, I have not had much time to deal with it.
>
>
>
> I found some bugs that I would declare tomorrow if they are not already.
> They concern log4j2, local_fr and a bug with the web interface and the
> keyboard input key.
>
>
>
> I continu my investigation.
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mercredi 10 janvier 2018 17:15
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Any news?
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <da...@gmail.com> wrote:
>
> Let me know what happens.
> If it works for you, I'll see if we can put together a patch release of
> 2.9 with the fix.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 11:07 AM, msaunier <ms...@citya.com> wrote:
>
> Test check out and building with POI 3.17 and Tika 1.17?
>
>
>
> It’s possible.
>
>
>
> I finish a project and I test that.
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 16:57
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> So here's the problem; we used POI 3.17 with Tika 3.16 in 2.9, in order to
> deal with the classloader issue present in POI 3.15, and because POI 3.16
> has a severe security issue that made it impossible to ship with.
>
>
>
> Unfortunately that doesn't quite work; POI 3.17 is not backwards
> compatible with 3.16 completely and therefore problems occur with this
> combination.
>
>
>
> The probable solution is to check out and build trunk and see if that
> works for you.  It very well might.  The question then is what to do next,
> because we are not scheduled to release again until April.  We might have
> to do a point release to deal with this.
>
>
>
> Please give it a try and let me know what happens.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, never mind that last email.  We patched it in part in 2.9 by including
> the latest POI.  So clearly it's still an existing problem in POI.  I'll
> have to open a ticket there and await a patch from them.
>
>
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>
> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
> for the 2.9 release.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>
> They 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 differents servers.
>
>
>
>
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 15:54
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> As for the Tika issue, we explicitly tested documents of that type when
> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
> also tested this.
>
>
>
> One of the potential issues is that if you are dropping down different
> versions of ManifoldCF into the same directories you *might* have a poi*
> jar in the wrong place because of the way we had to do the patch.  Please
> have a look at where the poi* jars are in your directory structure; they
> should all be in one directory (connector-common-lib).  If you see any
> anywhere else, that's the cause of the issue.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>
> Since the Tika extractor essentially filters out the content mime type
> (other than presenting it as metadata), you need to put an "allowed
> documents" transformation connection into your job pipeline BEFORE the Tika
> connector:
>
>
>
> https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments
>
>
>
> In fact, mime type exclusion is actually disabled in the Solr output
> connector *unless* you are using the extracting update handler.  That
> should resolve the one problem for you.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>
> The documents that fail in Tika are:
>
> ·        Microsoft Word 97-2003
>
> ·        application/msword
>
>
>
> I can't get more information: they are on SCO servers, and SCO does not
> have the ls -lisan or stat commands.
>
>
>
> For the Solr connection, I believe I emptied the index (ManifoldCF and
> Solr) before the last indexing run. I will do it again to be sure.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:26
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
> Okay, good. I will see whether I can test Tika version 1.17.
>
>
>
> I can't send a document that triggers this error: they are private. Sorry.
>
>
>
> If I encounter the error again on a non-private document, I'll come back
> to you.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:12
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket, I'd greatly appreciate attaching one
> document to it that causes the failure.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>
> It's version 2.9.
>
>
>
> I have a 2.8.1 on another server with the same job and the same documents.
> I will test on that server and report back.
>
>
>
> Thanks for your help.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 13:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Output connectors**, with the Solr connector, I have set a "Maximum
> document length" and excluded 5 mime types, but the settings have no
> effect. I have attached a screenshot.
>
>
>
> ----------
>
> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
> blocked:
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>
>
>
> Do I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Regards,

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
Hello Karl,

 

I have built and configured it, but the WindowsShare connector does not appear in the list of repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

However, the connector is still not offered in the ManifoldCF interface.

Any idea?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

 

Good!

I will configure and test that.

I will report back as soon as the crawl has finished reading the 400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory
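
The steps above can be strung together in a small script. This is only a sketch in Python, not something shipped with ManifoldCF: the function names `build_steps` and `run_build` and the checkout directory name are assumptions made for illustration, while the three commands are the ones listed above. By default it only reports what it would run.

```python
import subprocess

# Trunk URL exactly as given in the steps above.
TRUNK_URL = "https://svn.apache.org/repos/asf/manifoldcf/trunk"


def build_steps(checkout_dir="trunk"):
    """Return (command, working_directory) pairs in the order they must run.

    checkout_dir is a hypothetical local directory name chosen for this sketch.
    """
    return [
        (["svn", "co", TRUNK_URL, checkout_dir], None),  # check out trunk
        (["ant", "make-core-deps"], checkout_dir),       # download dependencies
        (["ant", "build"], checkout_dir),                # build; output lands in dist
    ]


def run_build(checkout_dir="trunk", dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True nothing is executed; the commands are only collected.
    """
    executed = []
    for command, cwd in build_steps(checkout_dir):
        executed.append(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(command, cwd=cwd, check=True)
    return executed
```

Calling `run_build(dry_run=False)` would actually run the checkout and the two ant targets, so svn and ant need to be on the PATH.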

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I got an error with the Maven build, so I tested with an external Tika 1.17 server instead, but POI is not included in it. If you manage to run mvn package with Tika 1.17, I am interested.

I have not had much time to deal with it today.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug in the web interface involving the keyboard input key.

I will continue my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

You mean testing by checking out and building with POI 3.17 and Tika 1.17?

That's possible.

I will finish a project first and then test that.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions (2.8.1 and 2.9) of ManifoldCF are on two different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
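
The directory check described above can be scripted. The following is a minimal sketch in Python, assuming the ManifoldCF installation root is supplied by the caller; the function name `find_misplaced_poi_jars` is hypothetical and not part of ManifoldCF. The only fact taken from the thread is that all poi* jars are expected in connector-common-lib.

```python
from pathlib import Path


def find_misplaced_poi_jars(install_root, expected_dir="connector-common-lib"):
    """Return every poi*.jar under install_root whose parent directory
    is not named expected_dir.

    install_root is an assumption: point it at your own installation.
    """
    root = Path(install_root)
    return sorted(
        jar
        for jar in root.rglob("poi*.jar")  # recursive search for poi jars
        if jar.is_file() and jar.parent.name != expected_dir
    )
```

An empty result means no stray poi jar was found by this check; any path it returns is a candidate for the misplaced jar mentioned above.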

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.
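
Why the filter has to come before Tika can be illustrated with a toy model. This is not ManifoldCF code: the `Document` class, the stage functions, and the `original_mime_type` metadata key are all invented for this sketch, and it simply assumes that the extractor's output is presented downstream as text/plain, with the source type kept only as metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    mime_type: str
    metadata: dict = field(default_factory=dict)


def tika_extract(doc):
    # Toy extractor: the output is plain text, and the source mime type
    # survives only as metadata (hypothetical key name).
    return Document("text/plain", {**doc.metadata, "original_mime_type": doc.mime_type})


def allowed_documents(allowed_mime_types):
    # Toy filter stage: pass the document on only if its mime type is allowed.
    def stage(doc):
        return doc if doc.mime_type in allowed_mime_types else None
    return stage


def run_pipeline(doc, stages):
    # Apply each stage in order; a stage returning None drops the document.
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


word_doc = Document("application/msword")
keep = allowed_documents({"application/pdf", "text/plain"})

# Filter first: the Word document is rejected before extraction.
filtered_first = run_pipeline(word_doc, [keep, tika_extract])

# Filter last: the filter only sees text/plain, so the document gets through.
filtered_last = run_pipeline(word_doc, [tika_extract, keep])
```

In this model `filtered_first` is `None`, while `filtered_last` is a text/plain document, which is the behaviour the ordering advice above is meant to avoid.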

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information: they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send a document that triggers this error: they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.
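
One way to narrow down the "missing jar" possibility is to list which jars actually contain the class named in the NoSuchMethodError. The sketch below is in Python and rests on assumptions: the installation root is supplied by the caller, and the helper name `jars_providing_class` is hypothetical. It only inspects jar contents; it does not reproduce how ManifoldCF or Tika load classes.

```python
import zipfile
from pathlib import Path


def jars_providing_class(install_root, class_name):
    """Return the jars under install_root that contain class_name.

    class_name is a fully qualified name such as
    "org.apache.poi.hwmf.record.HwmfFont" (taken from the stack trace).
    """
    entry = class_name.replace(".", "/") + ".class"
    providers = []
    for jar in sorted(Path(install_root).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    providers.append(jar)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives.
            continue
    return providers
```

No result would point towards a missing jar, while more than one result would suggest that two copies of the class, possibly from different POI versions, are present in the installation.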

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector, I have set a "Maximum document length" and excluded 5 mime types, but the settings have no effect. I have attached a screenshot.

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
Good!

I will configure and test that.

I will report back as soon as the crawl has finished reading the 400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

I have an error with the maven build, so I have test with an external 1.17 Tika Server but, POI not included. If you success a mvn package with 1.17 Tika, I am interested.

 

Today, I have not had much time to deal with it.

 

I found some bugs that I would declare tomorrow if they are not already. They concern log4j2, local_fr and a bug with the web interface and the keyboard input key.

 

I continu my investigation.

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mercredi 10 janvier 2018 17:15


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

Test check out and building with POI 3.17 and Tika 1.17? 

 

It’s possible.

 

I finish a project and I test that.

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 16:57


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem; we used POI 3.17 with Tika 3.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

They 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 differents servers.

 



 

 

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 15:54


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

They document for Tika are :

·        Microsoft Word 97-2003

·        Application/msword

 

I can’t have more informations, they are in SCO servers and SCO do not have ls –lisan or stat command.

 

For SolR connecting, I seem to have emptied the index before the last indexation. (ManifoldCF and Solr) I do it again to be sure.

 

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 15:26


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

Okay good. I look if I can test 1.17 Tika version.

 

I can’t transfert a document with this error, they are privates. Sorry.

 

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 15:12


À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

It’s a 2.9 version.

 

I have a 2.8.1 in an other server with same job and same documents. I will test on this other server and make you a return.

 

Thanks for your help.

 

De : Karl Wright [mailto:daddywri@gmail.com <ma...@gmail.com> ] 
Envoyé : mardi 9 janvier 2018 13:15
À : user@manifoldcf.apache.org <ma...@manifoldcf.apache.org> 
Objet : Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com <ma...@gmail.com> > wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com <ma...@citya.com> > wrote:

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

In *Outputs connectors* with Solr connector. I have add a « Maximum document length and I have « Excluded 5 mime types » but it not work. I join capture.

 

----------

And in second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked :

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
The build you should be using is the ant build.  Do not use the maven build
for this purpose.

- Check out trunk:

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

- Download dependencies:

ant make-core-deps

- Build:

ant build

- Your deliverable is in the "dist" directory
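
Once the build finishes, it can be worth confirming which POI and Tika jars actually ended up in the deliverable. The following is a minimal sketch, assuming a POSIX shell with find, sed and sort; the function name list_parser_jars is illustrative and is not part of the ManifoldCF build, and the exact jar names depend on the build.

```shell
# List the POI and Tika jars found under a directory (default: dist),
# printing only the file names, sorted.
# Hypothetical helper for illustration; not shipped with ManifoldCF.
list_parser_jars() {
  dir="${1:-dist}"
  find "$dir" -type f \( -name 'poi*.jar' -o -name 'tika*.jar' \) \
    | sed 's|.*/||' \
    | sort
}
```

Running list_parser_jars dist from the checkout root should then show the version numbers embedded in the jar file names.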

Karl


On Wed, Jan 10, 2018 at 11:37 AM, msaunier <ms...@citya.com> wrote:

> I get an error with the Maven build, so I tested with an external Tika
> Server 1.17 instead, but POI is not included. If you manage to run a
> successful mvn package with Tika 1.17, I would be interested.
>
> Today, I have not had much time to deal with it.
>
> I found some bugs that I will report tomorrow if they have not been
> reported already. They concern log4j2, local_fr, and a bug with the web
> interface and the keyboard input key.
>
> I am continuing my investigation.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, January 10, 2018 17:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> Any news?
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <da...@gmail.com> wrote:
>
> Let me know what happens.
> If it works for you, I'll see if we can put together a patch release of
> 2.9 with the fix.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 11:07 AM, msaunier <ms...@citya.com> wrote:
>
> You mean test by checking out and building with POI 3.17 and Tika 1.17?
>
> That is possible.
>
> I will finish a project first and then test that.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 16:57
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order to
> deal with the classloader issue present in POI 3.15, and because POI 3.16
> has a severe security issue that made it impossible to ship with.
>
>
>
> Unfortunately that doesn't quite work; POI 3.17 is not backwards
> compatible with 3.16 completely and therefore problems occur with this
> combination.
>
>
>
> The probable solution is to check out and build trunk and see if that
> works for you.  It very well might.  The question then is what to do next,
> because we are not scheduled to release again until April.  We might have
> to do a point release to deal with this.
>
>
>
> Please give it a try and let me know what happens.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, never mind that last email.  We patched it in part in 2.9 by including
> the latest POI.  So clearly it's still an existing problem in POI.  I'll
> have to open a ticket there and await a patch from them.
>
>
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>
> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
> for the 2.9 release.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>
> The 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 different servers.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:54
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> As for the Tika issue, we explicitly tested documents of that type when
> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
> also tested this.
>
>
>
> One of the potential issues is that if you are dropping down different
> versions of ManifoldCF into the same directories you *might* have a poi*
> jar in the wrong place because of the way we had to do the patch.  Please
> have a look at where the poi* jars are in your directory structure; they
> should all be in one directory (connector-common-lib).  If you see any
> anywhere else, that's the cause of the issue.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>
> Since the Tika extractor essentially filters out the content mime type
> (other than presenting it as metadata), you need to put an "allowed
> documents" transformation connection into your job pipeline BEFORE the Tika
> connector:
>
>
>
> https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments
>
>
>
> In fact, mime type exclusion is actually disabled in the Solr output
> connector *unless* you are using the extracting update handler.  That
> should resolve the one problem for you.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>
> The documents that fail in Tika are:
>
> ·        Microsoft Word 97-2003
>
> ·        Application/msword
>
> I cannot get more information; they are on SCO servers, and SCO does not
> have the ls -lisan or stat commands.
>
> For the Solr connection, I believe I emptied the index (both ManifoldCF and
> Solr) before the last indexing run. I will do it again to be sure.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:26
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
> Okay, good. I will see whether I can test Tika version 1.17.
>
> I cannot send a document that triggers this error; they are private. Sorry.
>
> If I encounter the error again on a non-private document, I will get back
> to you.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:12
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket, I'd greatly appreciate attaching one
> document to it that causes the failure.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>
> It is version 2.9.
>
> I have a 2.8.1 on another server with the same job and the same documents.
> I will test on that server and report back.
>
>
>
> Thanks for your help.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 13:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Output connectors**, with the Solr connector, I have set a "Maximum
> document length" and 5 "Excluded mime types", but it does not work. I have
> attached a screenshot.
>
> ----------
>
> Second, I get a **Tika exception** in ManifoldCF. 3 documents are blocked:
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>
>
>
> Do I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Regards,
>
> [image: msaunier]

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
I get an error with the Maven build, so I tested with an external Tika Server 1.17 instead, but POI is not included. If you manage to run a successful mvn package with Tika 1.17, I would be interested.

Today, I have not had much time to deal with it.

I found some bugs that I will report tomorrow if they have not been reported already. They concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

You mean test by checking out and building with POI 3.17 and Tika 1.17?

That is possible.

I will finish a project first and then test that.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.
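
The NoSuchMethodError in the log names the exact method descriptor the Tika parser was compiled against: HwmfFont.getCharSet() returning HwmfFont$WmfCharset. One way to see what the POI jar on disk actually offers is to dump the class with javap and compare. This is a minimal sketch, assuming a JDK is installed and that HwmfFont lives in the jar you point it at (usually a POI scratchpad jar, but that is an assumption); the function name and the JAVAP override variable are illustrative.

```shell
# Print the charset accessor(s) that HwmfFont declares in the given jar,
# so the signature can be compared with the one in the NoSuchMethodError.
# JAVAP may be set to an alternative javap binary; by default the one on
# PATH is used. Hypothetical helper for illustration only.
show_hwmf_charset_methods() {
  jar="$1"
  "${JAVAP:-javap}" -cp "$jar" org.apache.poi.hwmf.record.HwmfFont 2>/dev/null \
    | grep -i 'getcharset'
}
```

If the printed signature does not return HwmfFont$WmfCharset, or nothing is printed at all, the jar does not match what the Tika parser expects.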

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
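
The check described here can be scripted. The following is a minimal sketch, assuming a POSIX shell with find; the function name find_misplaced_poi_jars is illustrative, and the install root you pass in depends on your own layout.

```shell
# Print every poi*.jar under the install root that is NOT inside a
# connector-common-lib directory. No output means all POI jars are
# where the message above says they should be.
# Hypothetical helper for illustration; not shipped with ManifoldCF.
find_misplaced_poi_jars() {
  root="${1:-.}"
  find "$root" -type f -name 'poi*.jar' ! -path '*/connector-common-lib/*' | sort
}
```

Calling find_misplaced_poi_jars with the ManifoldCF install directory lists any stray copies, which can then be compared against the ones in connector-common-lib.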

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        Application/msword

I cannot get more information; they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index (both ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I cannot send a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I will get back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate attaching one document to it that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It is version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

In *Output connectors*, with the Solr connector, I have set a "Maximum document length" and 5 "Excluded mime types", but it does not work. I have attached a screenshot.

----------

Second, I get a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Any news?
Karl

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <da...@gmail.com> wrote:

> Let me know what happens.
> If it works for you, I'll see if we can put together a patch release of
> 2.9 with the fix.
>
> Karl
>
>
> On Tue, Jan 9, 2018 at 11:07 AM, msaunier <ms...@citya.com> wrote:
>
>> You mean test by checking out and building with POI 3.17 and Tika 1.17?
>>
>> That is possible.
>>
>> I will finish a project first and then test that.
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, January 9, 2018 16:57
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>>
>>
>> So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order
>> to deal with the classloader issue present in POI 3.15, and because POI
>> 3.16 has a severe security issue that made it impossible to ship with.
>>
>>
>>
>> Unfortunately that doesn't quite work; POI 3.17 is not backwards
>> compatible with 3.16 completely and therefore problems occur with this
>> combination.
>>
>>
>>
>> The probable solution is to check out and build trunk and see if that
>> works for you.  It very well might.  The question then is what to do next,
>> because we are not scheduled to release again until April.  We might have
>> to do a point release to deal with this.
>>
>>
>>
>> Please give it a try and let me know what happens.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> Ok, never mind that last email.  We patched it in part in 2.9 by
>> including the latest POI.  So clearly it's still an existing problem in
>> POI.  I'll have to open a ticket there and await a patch from them.
>>
>>
>>
>> Karl
>>
>>
>>
>> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
>> for the 2.9 release.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>>
>> The 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 different servers.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, January 9, 2018 15:54
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>>
>>
>> As for the Tika issue, we explicitly tested documents of that type when
>> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
>> also tested this.
>>
>>
>>
>> One of the potential issues is that if you are dropping down different
>> versions of ManifoldCF into the same directories you *might* have a poi*
>> jar in the wrong place because of the way we had to do the patch.  Please
>> have a look at where the poi* jars are in your directory structure; they
>> should all be in one directory (connector-common-lib).  If you see any
>> anywhere else, that's the cause of the issue.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> Since the Tika extractor essentially filters out the content mime type
>> (other than presenting it as metadata), you need to put an "allowed
>> documents" transformation connection into your job pipeline BEFORE the Tika
>> connector:
>>
>>
>>
>> https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments
>>
>>
>>
>> In fact, mime type exclusion is actually disabled in the Solr output
>> connector *unless* you are using the extracting update handler.  That
>> should resolve the one problem for you.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>>
>> The documents that fail in Tika are:
>>
>> ·        Microsoft Word 97-2003
>>
>> ·        Application/msword
>>
>> I cannot get more information; they are on SCO servers, and SCO does not
>> have the ls -lisan or stat commands.
>>
>> For the Solr connection, I believe I emptied the index (both ManifoldCF
>> and Solr) before the last indexing run. I will do it again to be sure.
>>
>>
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, January 9, 2018 15:26
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>>
>>
>> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
>> When you changed these fields in the output connection, had you already
>> indexed any documents?  Those would only get cleaned up if you did a
>> subsequent full crawl, after you made the connection change.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> If you let me know what kind of file they are (extension and what
>> application created them) that is probably good enough.
>>
>> Karl
>>
>>
>>
>> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>>
>> Okay, good. I will see whether I can test Tika version 1.17.
>>
>> I cannot send a document that triggers this error; they are private. Sorry.
>>
>> If I encounter the error again on a non-private document, I will get back
>> to you.
>>
>>
>>
>>
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, January 9, 2018 15:12
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>>
>>
>> CONNECTORS-1481 is the ticket for the Tika problem.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> Ok, if you are in a position to build trunk, that's a newer version of
>> Tika (1.17) which might (or might not) address this problem.
>>
>>
>>
>> If you could create a ticket, I'd greatly appreciate attaching one
>> document to it that causes the failure.
>>
>>
>>
>> Thanks!
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>>
>> It’s a 2.9 version.
>>
>>
>>
>> I have a 2.8.1 in an other server with same job and same documents. I
>> will test on this other server and make you a return.
>>
>>
>>
>> Thanks for your help.
>>
>>
>>
>> *De :* Karl Wright [mailto:daddywri@gmail.com]
>> *Envoyé :* mardi 9 janvier 2018 13:15
>> *À :* user@manifoldcf.apache.org
>> *Objet :* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>>
>>
>> I looked at the history of this.  We had to release a patch (2.8.1) that
>> put various poi jars at root level in order to work around a Tika problem.
>> That patch may not have been entirely correct in that it looks like it may
>> have blocked access by one of the deeper jars to a higher level.
>>
>>
>>
>> Release 2.9 should fix this if I am correct.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> What version of MCF is this?  That's important to know since Tika has had
>> problems with this kind of thing in the past and this looks like something
>> similar.
>>
>>
>>
>> The problem you are reporting is due to either a missing jar, or a bug in
>> an internal tika classloader.  But I need to know whether this is a current
>> bug or not, since we just went to a new Tika version.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>>
>> Hello Karl,
>>
>> I hope you are well today.
>>
>>
>>
>> I have 2 problems with ManifoldCF.
>>
>>
>>
>> -----------
>>
>> In **Outputs connectors** with Solr connector. I have add a « Maximum
>> document length and I have « Excluded 5 mime types » but it not work. I
>> join capture.
>>
>>
>>
>> ----------
>>
>> And in second, I have a **Tika exception** in ManifoldCF. 3 documents
>> are blocked :
>>
>>
>>
>> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed:
>> org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/
>> poi/hwmf/record/HwmfFont$WmfCharset;
>>
>> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.Hwm
>> fFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>
>>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>> ~[?:?]
>>
>>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.
>> parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory
>> .parse(OOXMLExtractorFactory.java:142) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>> ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>> ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtract
>> or.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester$PipelineAddEntryPoint.addOrReplaceDocumentWithExcepti
>> on(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>> ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester$PipelineObjectWithVersions.addOrReplaceDocumentWithEx
>> ception(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessAct
>> ivity.ingestDocumentWithException(WorkerThread.java:1583)
>> ~[mcf-pull-agent.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessAct
>> ivity.ingestDocumentWithException(WorkerThread.java:1548)
>> ~[mcf-pull-agent.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDr
>> iveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> [mcf-pull-agent.jar:?]
>>
>>
>>
>> I need to create an incident ticket?
>>
>>
>>
>> ----------
>>
>>
>>
>> Thanks for your help.
>>
>>
>>
>> Cordialement,
>>
>>
>>

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9
with the fix.

Karl


On Tue, Jan 9, 2018 at 11:07 AM, msaunier <ms...@citya.com> wrote:

> Test check out and building with POI 3.17 and Tika 1.17?
>
>
>
> It’s possible.
>
>
>
> I finish a project and I test that.
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 16:57
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to
> deal with the classloader issue present in POI 3.15, and because POI 3.16
> has a severe security issue that made it impossible to ship with.
>
>
>
> Unfortunately that doesn't quite work; POI 3.17 is not backwards
> compatible with 3.16 completely and therefore problems occur with this
> combination.
>
>
>
> The probable solution is to check out and build trunk and see if that
> works for you.  It very well might.  The question then is what to do next,
> because we are not scheduled to release again until April.  We might have
> to do a point release to deal with this.
>
>
>
> Please give it a try and let me know what happens.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, never mind that last email.  We patched it in part in 2.9 by including
> the latest POI.  So clearly it's still an existing problem in POI.  I'll
> have to open a ticket there and await a patch from them.
>
>
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>
> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
> for the 2.9 release.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>
> They 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 differents servers.
>
>
>
>
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 15:54
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> As for the Tika issue, we explicitly tested documents of that type when
> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
> also tested this.
>
>
>
> One of the potential issues is that if you are dropping down different
> versions of ManifoldCF into the same directories you *might* have a poi*
> jar in the wrong place because of the way we had to do the patch.  Please
> have a look at where the poi* jars are in your directory structure; they
> should all be in one directory (connector-common-lib).  If you see any
> anywhere else, that's the cause of the issue.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>
> Since the Tika extractor essentially filters out the content mime type
> (other than presenting it as metadata), you need to put an "allowed
> documents" transformation connection into your job pipeline BEFORE the Tika
> connector:
>
>
>
> https://manifoldcf.apache.org/release/release-2.9/en_US/end-
> user-documentation.html#alloweddocuments
>
>
>
> In fact, mime type exclusion is actually disabled in the Solr output
> connector *unless* you are using the extracting update handler.  That
> should resolve the one problem for you.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>
> They document for Tika are :
>
> ·        Microsoft Word 97-2003
>
> ·        Application/msword
>
>
>
> I can’t have more informations, they are in SCO servers and SCO do not
> have ls –lisan or stat command.
>
>
>
> For SolR connecting, I seem to have emptied the index before the last
> indexation. (ManifoldCF and Solr) I do it again to be sure.
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 15:26
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
> Okay good. I look if I can test 1.17 Tika version.
>
>
>
> I can’t transfert a document with this error, they are privates. Sorry.
>
>
>
> If I encounter the error again on a non-private document, I'll come back
> to you.
>
>
>
>
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 15:12
>
>
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket, I'd greatly appreciate attaching one
> document to it that causes the failure.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>
> It’s a 2.9 version.
>
>
>
> I have a 2.8.1 in an other server with same job and same documents. I will
> test on this other server and make you a return.
>
>
>
> Thanks for your help.
>
>
>
> *De :* Karl Wright [mailto:daddywri@gmail.com]
> *Envoyé :* mardi 9 janvier 2018 13:15
> *À :* user@manifoldcf.apache.org
> *Objet :* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Outputs connectors** with Solr connector. I have add a « Maximum
> document length and I have « Excluded 5 mime types » but it not work. I
> join capture.
>
>
>
> ----------
>
> And in second, I have a **Tika exception** in ManifoldCF. 3 documents are
> blocked :
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed:
> org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/
> apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.
> HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> ~[?:?]
>
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> ~[?:?]
>
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtract
> or.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.
> getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.
> OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.
> TikaParser.parse(TikaParser.java:74) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.
> addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithExcept
> ion(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester$PipelineObjectWithVersions.
> addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.
> IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$
> ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$
> ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.
> SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
>
>
> I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Cordialement,
>
>
>

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
You mean test by checking out and building trunk, with POI 3.17 and Tika 1.17?

That's possible.

I will finish a project first, then test it.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible to ship with.

 

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible with 3.16 completely and therefore problems occur with this combination.

 

The probable solution is to check out and build trunk and see if that works for you.  It very well might.  The question then is what to do next, because we are not scheduled to release again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.  So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions of ManifoldCF (2.8.1 and 2.9) are on two different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
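
The directory check described above can be scripted. What follows is only a sketch in Python, not part of ManifoldCF: it assumes you pass in the root of your installation and that connector-common-lib is the only directory where poi* jars belong, as stated above. The function name find_poi_jars is made up for illustration.

```python
from pathlib import Path


def find_poi_jars(root, expected_dir="connector-common-lib"):
    """Split poi*.jar files under root into (expected, misplaced) lists.

    A jar counts as "expected" when its parent directory is named
    expected_dir; every other poi*.jar is reported as misplaced.
    """
    expected, misplaced = [], []
    # rglob walks the whole tree below root, at any depth.
    for jar in sorted(Path(root).rglob("poi*.jar")):
        if jar.parent.name == expected_dir:
            expected.append(jar)
        else:
            misplaced.append(jar)
    return expected, misplaced
```

Calling find_poi_jars on the installation root and printing the second list shows any poi* jar that sits outside connector-common-lib. An empty second list does not prove the installation is healthy; it only rules out this one cause.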

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        Application/msword

I can't get more information: they are on SCO servers, and SCO has no ls -lisan or stat command.

For the Solr connection, I believe I emptied the index (both ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika 1.17.

I can't send a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It is version 2.9.

I have 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

In *Output connectors*, with the Solr connector, I have set a « Maximum document length » and excluded 5 mime types, but it does not work. I have attached a screenshot.

 

----------

Second, I get a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create a ticket?

 

----------

 

Thanks for your help.

 

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to
deal with the classloader issue present in POI 3.15, and because POI 3.16
has a severe security issue that made it impossible to ship with.

Unfortunately that doesn't quite work; POI 3.17 is not backwards compatible
with 3.16 completely and therefore problems occur with this combination.
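
The NoSuchMethodError in the original report fits this kind of binary incompatibility: the JVM resolves a method by its name plus its descriptor (parameter types and return type), so a caller compiled against one library version fails if another version no longer has a method with exactly that signature. To read the log line more easily, here is a small sketch in Python that decodes the descriptor. The helper names are invented for illustration and are not part of POI, Tika, or ManifoldCF.

```python
# Single-letter JVM type codes used in method descriptors.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float", "I": "int",
    "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, i):
    """Decode one type starting at desc[i]; return (name, next_index)."""
    dims = 0
    while desc[i] == "[":          # each leading '[' is one array dimension
        dims += 1
        i += 1
    if desc[i] == "L":             # object type: L<internal/class/Name>;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        i = end + 1
    else:                          # primitive type or void
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i


def decode_missing_method(message):
    """Split 'pkg.Class.method(params)ret' into readable parts."""
    head, _, rest = message.partition("(")
    params_desc, _, return_desc = rest.partition(")")
    owner, _, method = head.rpartition(".")
    params = []
    i = 0
    while i < len(params_desc):
        name, i = _read_type(params_desc, i)
        params.append(name)
    return_type, _ = _read_type(return_desc, 0)
    return owner, method, params, return_type
```

Applied to the message from the log, this yields the class org.apache.poi.hwmf.record.HwmfFont, the method getCharSet with no parameters, and the return type org.apache.poi.hwmf.record.HwmfFont$WmfCharset. In other words, the Tika parser was compiled against a POI in which that exact method existed, and the POI jar loaded at runtime does not provide it.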

The probable solution is to check out and build trunk and see if that works
for you.  It very well might.  The question then is what to do next,
because we are not scheduled to release again until April.  We might have
to do a point release to deal with this.

Please give it a try and let me know what happens.

Thanks,
Karl


On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <da...@gmail.com> wrote:

> Ok, never mind that last email.  We patched it in part in 2.9 by including
> the latest POI.  So clearly it's still an existing problem in POI.  I'll
> have to open a ticket there and await a patch from them.
>
> Karl
>
> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:
>
>> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
>> for the 2.9 release.
>>
>> Karl
>>
>>
>> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:
>>
>>> They 2 versions (2.8.1 and 2.9) of ManifoldCF are on 2 differents
>>> servers.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *De :* Karl Wright [mailto:daddywri@gmail.com]
>>> *Envoyé :* mardi 9 janvier 2018 15:54
>>>
>>> *À :* user@manifoldcf.apache.org
>>> *Objet :* Re: Document connector excluding mime type and size - Tika
>>> Parser error
>>>
>>>
>>>
>>> As for the Tika issue, we explicitly tested documents of that type when
>>> rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
>>> also tested this.
>>>
>>>
>>>
>>> One of the potential issues is that if you are dropping down different
>>> versions of ManifoldCF into the same directories you *might* have a poi*
>>> jar in the wrong place because of the way we had to do the patch.  Please
>>> have a look at where the poi* jars are in your directory structure; they
>>> should all be in one directory (connector-common-lib).  If you see any
>>> anywhere else, that's the cause of the issue.
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> Since the Tika extractor essentially filters out the content mime type
>>> (other than presenting it as metadata), you need to put an "allowed
>>> documents" transformation connection into your job pipeline BEFORE the Tika
>>> connector:
>>>
>>>
>>>
>>> https://manifoldcf.apache.org/release/release-2.9/en_US/end-
>>> user-documentation.html#alloweddocuments
>>>
>>>
>>>
>>> In fact, mime type exclusion is actually disabled in the Solr output
>>> connector *unless* you are using the extracting update handler.  That
>>> should resolve the one problem for you.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:
>>>
>>> They document for Tika are :
>>>
>>> ·        Microsoft Word 97-2003
>>>
>>> ·        Application/msword
>>>
>>>
>>>
>>> I can’t have more informations, they are in SCO servers and SCO do not
>>> have ls –lisan or stat command.
>>>
>>>
>>>
>>> For SolR connecting, I seem to have emptied the index before the last
>>> indexation. (ManifoldCF and Solr) I do it again to be sure.
>>>
>>>
>>>
>>>
>>>
>>> *De :* Karl Wright [mailto:daddywri@gmail.com]
>>> *Envoyé :* mardi 9 janvier 2018 15:26
>>>
>>>
>>> *À :* user@manifoldcf.apache.org
>>> *Objet :* Re: Document connector excluding mime type and size - Tika
>>> Parser error
>>>
>>>
>>>
>>> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
>>> When you changed these fields in the output connection, had you already
>>> indexed any documents?  Those would only get cleaned up if you did a
>>> subsequent full crawl, after you made the connection change.
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> If you let me know what kind of file they are (extension and what
>>> application created them) that is probably good enough.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>>>
>>> Okay, good. I will see whether I can test Tika version 1.17.
>>>
>>>
>>>
>>> I can’t send a document that triggers this error; they are private. Sorry.
>>>
>>>
>>>
>>> If I encounter the error again on a non-private document, I'll come back
>>> to you.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>> *Sent:* Tuesday, January 9, 2018 15:12
>>> *To:* user@manifoldcf.apache.org
>>> *Subject:* Re: Document connector excluding mime type and size - Tika
>>> Parser error
>>>
>>>
>>>
>>> CONNECTORS-1481 is the ticket for the Tika problem.
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> Ok, if you are in a position to build trunk, that's a newer version of
>>> Tika (1.17) which might (or might not) address this problem.
>>>
>>>
>>>
>>> If you could create a ticket, I'd greatly appreciate it if you attached
>>> one document that causes the failure.
>>>
>>>
>>>
>>> Thanks!
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:
>>>
>>> It’s version 2.9.
>>>
>>>
>>>
>>> I have a 2.8.1 installation on another server with the same job and the
>>> same documents. I will test on that server and report back.
>>>
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>>
>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>> *Sent:* Tuesday, January 9, 2018 13:15
>>> *To:* user@manifoldcf.apache.org
>>> *Subject:* Re: Document connector excluding mime type and size - Tika
>>> Parser error
>>>
>>>
>>>
>>> I looked at the history of this.  We had to release a patch (2.8.1) that
>>> put various poi jars at root level in order to work around a Tika problem.
>>> That patch may not have been entirely correct in that it looks like it may
>>> have blocked access by one of the deeper jars to a higher level.
>>>
>>>
>>>
>>> Release 2.9 should fix this if I am correct.
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> What version of MCF is this?  That's important to know since Tika has
>>> had problems with this kind of thing in the past and this looks like
>>> something similar.
>>>
>>>
>>>
>>> The problem you are reporting is due to either a missing jar, or a bug
>>> in an internal tika classloader.  But I need to know whether this is a
>>> current bug or not, since we just went to a new Tika version.
>>>
>>>
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>>>
>>> Hello Karl,
>>>
>>> I hope you are well today.
>>>
>>>
>>>
>>> I have 2 problems with ManifoldCF.
>>>
>>>
>>>
>>> -----------
>>>
>>> In **Outputs connectors**, with the Solr connector, I have added a
>>> « Maximum document length » and « Excluded 5 mime types », but it does not
>>> work. I have attached a screenshot.
>>>
>>>
>>>
>>> ----------
>>>
>>> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
>>> blocked:
>>>
>>>
>>>
>>> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>>
>>> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>>>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>>>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>>>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>>>
>>>
>>>
>>> Do I need to create an incident ticket?
>>>
>>>
>>>
>>> ----------
>>>
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>>
>>> Regards,
>>>
>>>
>>>
>>> [image: msaunier]

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Ok, never mind that last email.  We patched it in part in 2.9 by including
the latest POI.  So clearly it's still an existing problem in POI.  I'll
have to open a ticket there and await a patch from them.

Karl

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <da...@gmail.com> wrote:

> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
> for the 2.9 release.
>
> Karl

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for
the 2.9 release.

Karl


On Tue, Jan 9, 2018 at 10:02 AM, msaunier <ms...@citya.com> wrote:

> The two versions (2.8.1 and 2.9) of ManifoldCF are on two different servers.

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
The two versions (2.8.1 and 2.9) of ManifoldCF are on two different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF into the same directories you *might* have a poi* jar in the wrong place because of the way we had to do the patch.  Please have a look at where the poi* jars are in your directory structure; they should all be in one directory (connector-common-lib).  If you see any anywhere else, that's the cause of the issue.
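
The directory check described above can be scripted. What follows is only a minimal Python sketch, not something shipped with ManifoldCF: it walks an installation directory and groups every poi*.jar by the directory that holds it, so that any jar outside connector-common-lib stands out. The function names find_poi_jars and misplaced_dirs are hypothetical, and the only layout assumption is the one stated above (all poi* jars in a directory named connector-common-lib).

```python
import os
from collections import defaultdict


def find_poi_jars(root):
    """Map each directory under root to the sorted poi*.jar files it contains."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lowered = name.lower()
            if lowered.startswith("poi") and lowered.endswith(".jar"):
                found[dirpath].append(name)
    return {directory: sorted(names) for directory, names in found.items()}


def misplaced_dirs(found, expected_dir_name="connector-common-lib"):
    """Return the directories holding poi jars that are not the expected one."""
    return sorted(
        directory
        for directory in found
        if os.path.basename(directory) != expected_dir_name
    )
```

Calling find_poi_jars on the ManifoldCF install directory and then misplaced_dirs on the result gives the list of directories to inspect; an empty list means every poi* jar sits in connector-common-lib.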

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting it as metadata), you need to put an "allowed documents" transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents involved in the Tika error are:

·        Microsoft Word 97-2003

·        Application/msword

 

I can’t get more information: they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

 

For the Solr connection, I believe I emptied the index (both ManifoldCF and Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

 

I can’t send a document that triggers this error; they are private. Sorry.

 

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It’s version 2.9.

 

I have a 2.8.1 installation on another server with the same job and the same documents. I will test on that server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl
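
Editorial note: a NoSuchMethodError generally means the JVM found the class but not a method with exactly the signature the caller was compiled against, which is consistent with the jar-version or classloader explanation above. The error text in this thread gives that signature as a JVM descriptor. The sketch below, with hypothetical helper names that are not part of ManifoldCF, Tika, or POI, turns such a descriptor into readable parts.

```python
import re

# JVM single-letter codes for primitive types and void.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, i):
    """Decode one JVM type starting at desc[i]; return (name, next_index)."""
    dims = 0
    while desc[i] == "[":          # each '[' adds one array dimension
        dims += 1
        i += 1
    if desc[i] == "L":             # object type: L<internal/name>;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        i = end + 1
    else:                          # primitive type or void
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i


def decode_missing_method(message):
    """Split 'pkg.Class.method(args)ret' into class, method, args, return type."""
    match = re.fullmatch(r"(.+)\.([^.(]+)\((.*)\)(.+)", message)
    if match is None:
        raise ValueError("not a method descriptor: " + message)
    owner, method, raw_args, raw_ret = match.groups()
    args, i = [], 0
    while i < len(raw_args):
        name, i = _read_type(raw_args, i)
        args.append(name)
    ret, _ = _read_type(raw_ret, 0)
    return {"class": owner, "method": method, "args": args, "returns": ret}


# The descriptor reported in this thread.
info = decode_missing_method(
    "org.apache.poi.hwmf.record.HwmfFont.getCharSet()"
    "Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;"
)
```

Read this way, the caller (WMFParser in the trace) expects HwmfFont to offer a no-argument getCharSet() returning the nested type HwmfFont$WmfCharset, and the HwmfFont class that was actually loaded does not offer it.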

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

I have 2 problems with ManifoldCF.

-----------

First, in *Output connectors*, with the Solr connector: I have set a « Maximum document length » and I have « Excluded 5 mime types », but it does not work. I have attached a screenshot.

----------

Second, I get a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
As for the Tika issue, we explicitly tested documents of that type when
rolling out 2.8.1.  When we updated 2.8.1 to a new Tika in 2.9 I believe we
also tested this.

One of the potential issues is that if you are dropping down different
versions of ManifoldCF into the same directories you *might* have a poi*
jar in the wrong place because of the way we had to do the patch.  Please
have a look at where the poi* jars are in your directory structure; they
should all be in one directory (connector-common-lib).  If you see any
anywhere else, that's the cause of the issue.

Karl
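
Editorial note: the directory check described above can be scripted. The following is a minimal sketch, assuming a ManifoldCF installation directory whose path you supply yourself; the function names are hypothetical and only the directory name connector-common-lib and the class name HwmfFont come from this thread. It lists poi* jars that sit outside connector-common-lib and reports every jar that contains the HwmfFont class, since more than one copy of that class would point to mixed versions.

```python
import zipfile
from pathlib import Path

# Class named in the NoSuchMethodError from this thread, as a path inside a jar.
HWMF_FONT = "org/apache/poi/hwmf/record/HwmfFont.class"
# Directory where, per the message above, all poi* jars should live.
EXPECTED_DIR = "connector-common-lib"


def find_poi_jars(install_root):
    """Return all poi*.jar files below install_root, sorted by path."""
    return sorted(Path(install_root).rglob("poi*.jar"))


def misplaced_poi_jars(install_root):
    """Return poi*.jar files whose parent directory is not EXPECTED_DIR."""
    return [p for p in find_poi_jars(install_root)
            if p.parent.name != EXPECTED_DIR]


def jars_containing(install_root, class_entry=HWMF_FONT):
    """Return every jar below install_root that contains class_entry."""
    hits = []
    for jar in sorted(Path(install_root).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if class_entry in zf.namelist():
                    hits.append(jar)
        except zipfile.BadZipFile:
            # Not a readable archive; skip it rather than abort the scan.
            continue
    return hits
```

Calling misplaced_poi_jars() with the installation root should return an empty list on a clean install, if the layout is as described above. A non-empty result, or more than one hit from jars_containing(), is worth investigating before retrying the crawl.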


On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <da...@gmail.com> wrote:

> Since the Tika extractor essentially filters out the content mime type
> (other than presenting it as metadata), you need to put an "allowed
> documents" transformation connection into your job pipeline BEFORE the Tika
> connector:
>
> https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments
>
> In fact, mime type exclusion is actually disabled in the Solr output
> connector *unless* you are using the extracting update handler.  That
> should resolve the one problem for you.
>
> Thanks,
> Karl
>
>

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Since the Tika extractor essentially filters out the content mime type
(other than presenting it as metadata), you need to put an "allowed
documents" transformation connection into your job pipeline BEFORE the Tika
connector:

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

In fact, mime type exclusion is actually disabled in the Solr output
connector *unless* you are using the extracting update handler.  That
should resolve the one problem for you.

Thanks,
Karl
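
Editorial note: the ordering point above can be illustrated with a toy model. This is not ManifoldCF code; the stage functions, the dictionary layout, and the assumption that the extractor emits text/plain are all hypothetical and chosen only to show why a mime type filter placed after text extraction no longer sees the original type.

```python
def allowed_documents(doc, excluded_mime_types, max_length):
    """Filter stage: return the document, or None to drop it."""
    if doc["mime_type"] in excluded_mime_types or doc["length"] > max_length:
        return None
    return doc


def extract_text(doc):
    """Stand-in for a text extractor (assumption: output is plain text and
    the original mime type survives only as metadata)."""
    metadata = dict(doc["metadata"], original_mime_type=doc["mime_type"])
    return {"name": doc["name"], "mime_type": "text/plain",
            "length": doc["length"], "metadata": metadata}


def run_pipeline(doc, stages):
    """Pass doc through each stage in order; stop as soon as one drops it."""
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


excluded = {"application/msword"}
limit = 1_000_000


def mime_filter(doc):
    return allowed_documents(doc, excluded, limit)


word_doc = {"name": "report.doc", "mime_type": "application/msword",
            "length": 20_000, "metadata": {}}

# Filter before extraction: the Word document is dropped.
filtered_first = run_pipeline(word_doc, [mime_filter, extract_text])
# Filter after extraction: the filter sees text/plain, so the document passes.
filtered_last = run_pipeline(word_doc, [extract_text, mime_filter])
```

In this model filtered_first is None, while filtered_last is a text/plain document that still reaches the output, which mirrors the behaviour reported in the original question.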


On Tue, Jan 9, 2018 at 9:35 AM, msaunier <ms...@citya.com> wrote:

> The documents that trigger the Tika error are:
>
> ·        Microsoft Word 97-2003
>
> ·        application/msword
>
>
>
> I can't get more information; they are on SCO servers, and SCO does not
> have the ls -lisan or stat commands.
>
>
>
> For the Solr connection, I believe I emptied the index before the last
> indexing run (ManifoldCF and Solr). I will do it again to be sure.

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
The documents that trigger the Tika error are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information; they are on SCO servers, and SCO does not have the ls -lisan or stat commands.

For the Solr connection, I believe I emptied the index before the last indexing run (ManifoldCF and Solr). I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 9 January 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these fields in the output connection, had you already indexed any documents?  Those would only get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 


Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
When you changed these fields in the output connection, had you already
indexed any documents?  Those would only get cleaned up if you did a
subsequent full crawl, after you made the connection change.

Karl



On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <da...@gmail.com> wrote:

> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
> Karl
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:
>
>> Okay, good. I will see whether I can test Tika version 1.17.
>>
>>
>>
>> I can't send you a document that triggers this error; they are private. Sorry.
>>
>>
>>
>> If I encounter the error again on a non-private document, I'll come back
>> to you.
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>>
>> Hello Karl,
>>
>> I hope you are well today.
>>
>>
>>
>> I have 2 problems with ManifoldCF.
>>
>>
>>
>> -----------
>>
>> In **Outputs connectors** with Solr connector. I have add a « Maximum
>> document length and I have « Excluded 5 mime types » but it not work. I
>> join capture.
>>
>>
>>
>> ----------
>>
>> And in second, I have a **Tika exception** in ManifoldCF. 3 documents
>> are blocked :
>>
>>
>>
>> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed:
>> org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/
>> poi/hwmf/record/HwmfFont$WmfCharset;
>>
>> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.Hwm
>> fFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>
>>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>> ~[?:?]
>>
>>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.
>> parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtracto
>> r.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory
>> .parse(OOXMLExtractorFactory.java:142) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ~[?:?]
>>
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>> ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>> ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtract
>> or.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester$PipelineAddEntryPoint.addOrReplaceDocumentWithExcepti
>> on(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>> ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester$PipelineObjectWithVersions.addOrReplaceDocumentWithEx
>> ception(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIn
>> gester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessAct
>> ivity.ingestDocumentWithException(WorkerThread.java:1583)
>> ~[mcf-pull-agent.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessAct
>> ivity.ingestDocumentWithException(WorkerThread.java:1548)
>> ~[mcf-pull-agent.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDr
>> iveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> [mcf-pull-agent.jar:?]
>>
>>
>>
>> I need to create an incident ticket?
>>
>>
>>
>> ----------
>>
>>
>>
>> Thanks for your help.
>>
>>
>>
>> Cordialement,
>>
>>
>>
>> [image: msaunier]
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
If you let me know what kind of file they are (extension and what
application created them) that is probably good enough.
Karl

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <ms...@citya.com> wrote:

> Okay, good. I will see whether I can test Tika 1.17.
>
>
>
> I can’t send you a document that triggers this error; they are private. Sorry.
>
>
>
> If I encounter the error again on a non-private document, I'll come back
> to you.

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
Okay, good. I will see whether I can test Tika 1.17.

I can’t send you a document that triggers this error; they are private. Sorry.

If I encounter the error again on a non-private document, I'll come back to you.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

CONNECTORS-1481 is the ticket for the Tika problem.

Karl


Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
CONNECTORS-1481 is the ticket for the Tika problem.

Karl


On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <da...@gmail.com> wrote:

> Ok. If you are in a position to build trunk, it includes a newer version of
> Tika (1.17), which might (or might not) address this problem.
>
> If you could create a ticket, I'd greatly appreciate it if you attached one
> document that causes the failure.
>
> Thanks!
> Karl

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
Ok. If you are in a position to build trunk, it includes a newer version of
Tika (1.17), which might (or might not) address this problem.

If you could create a ticket, I'd greatly appreciate it if you attached one
document that causes the failure.

Thanks!
Karl
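
A NoSuchMethodError like the one in this thread generally means the class was found, but in a version whose method signature differs from the one the caller was compiled against. As a minimal sketch (not part of ManifoldCF or Tika; the class name ClassOriginCheck and its helper methods are hypothetical), the following Java program reports where a class was loaded from and which methods of a given name it declares, including their return types. It only reflects the classpath it is run with, so it is assumed here that it would be run against the same jars ManifoldCF loads.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of ManifoldCF, Tika, or POI.
final class ClassOriginCheck {

    // Returns the jar or directory a class was loaded from, when the JVM reports one.
    static String describeOrigin(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown location)";
        }
        return source.getLocation().toString();
    }

    // Lists every declared method with the given name, including its return type,
    // because NoSuchMethodError also fires when only the return type differs.
    static List<String> signatures(Class<?> cls, String methodName) {
        List<String> found = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                found.add(m.toString());
            }
        }
        return found;
    }

    // Builds a one-line report, or explains why the class could not be loaded.
    static String report(String className, String methodName) {
        try {
            // 'false' avoids running static initializers of the inspected class.
            Class<?> cls = Class.forName(
                    className, false, ClassOriginCheck.class.getClassLoader());
            return className + " loaded from " + describeOrigin(cls)
                    + "; " + methodName + " -> " + signatures(cls, methodName);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + ": not loadable (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Defaults are taken from the stack trace in this thread.
        String className = args.length > 0 ? args[0] : "org.apache.poi.hwmf.record.HwmfFont";
        String methodName = args.length > 1 ? args[1] : "getCharSet";
        System.out.println(report(className, methodName));
    }
}
```

If the reported location is an unexpected jar, or getCharSet is listed with a return type other than HwmfFont$WmfCharset, that would be consistent with a POI version mismatch rather than a corrupt document.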


On Tue, Jan 9, 2018 at 8:02 AM, msaunier <ms...@citya.com> wrote:

> It’s version 2.9.
>
>
>
> I have 2.8.1 on another server with the same job and the same documents. I
> will test on that server and report back.
>
>
>
> Thanks for your help.

RE: Document connector excluding mime type and size - Tika Parser error

Posted by msaunier <ms...@citya.com>.
It’s version 2.9.

I have 2.8.1 on another server with the same job and the same documents. I will test on that server and report back.

Thanks for your help.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars at root level in order to work around a Tika problem.  That patch may not have been entirely correct in that it looks like it may have blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika classloader.  But I need to know whether this is a current bug or not, since we just went to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

Under *Output connectors*, with the Solr connector, I have set a « Maximum document length » and « Excluded 5 mime types », but it does not work. I have attached a screenshot.

 

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;

        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]

        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]

        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]

        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]

        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]

        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]

        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Best regards,

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
I looked at the history of this.  We had to release a patch (2.8.1) that
put various POI jars at root level in order to work around a Tika problem.
That patch may not have been entirely correct: it looks like it may have
blocked one of the more deeply nested jars from accessing classes at a
higher level.

Release 2.9 should fix this if I am correct.
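
To check whether more than one copy of the class is visible, it can help to
list every jar under the installation that contains it. What follows is only
a minimal sketch, not MCF tooling: the class name JarClassFinder, the method
jarsContaining, and the default arguments are hypothetical, and it assumes
the jars live somewhere under a directory you pass in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Stream;

/** Hypothetical helper: lists the jars under a directory that contain a given class entry. */
class JarClassFinder {

    /** Walks root recursively and returns every *.jar holding entryName (e.g. "a/b/C.class"). */
    static List<Path> jarsContaining(Path root, String entryName) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(p) || !p.toString().endsWith(".jar")) {
                    continue;
                }
                try (JarFile jar = new JarFile(p.toFile())) {
                    if (jar.getEntry(entryName) != null) {
                        hits.add(p);
                    }
                } catch (IOException e) {
                    // A corrupt or unreadable jar should not stop the scan.
                    System.err.println("Skipping unreadable jar: " + p);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Defaults are assumptions: current directory, and the class named in the stack trace.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String entry = args.length > 1 ? args[1] : "org/apache/poi/hwmf/record/HwmfFont.class";
        for (Path jar : jarsContaining(root, entry)) {
            System.out.println(jar);
        }
    }
}
```

If a run against the MCF install directory lists the class in two POI jars of
different versions, that would be consistent with the explanation above,
though it would not prove it.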

Karl


On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <da...@gmail.com> wrote:

> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
> The problem you are reporting is due to either a missing jar or a bug in
> an internal Tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
> Karl
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:
>
>> Hello Karl,
>>
>> I hope you are well today.
>>
>>
>>
>> I have 2 problems with ManifoldCF.
>>
>>
>>
>> -----------
>>
>> In **Outputs connectors**, with the Solr connector, I have set a « Maximum
>> document length » and I have excluded 5 MIME types, but it does not work.
>> I am attaching a screenshot.
>>
>>
>>
>> ----------
>>
>> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
>> blocked:
>>
>>
>>
>> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>
>> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>
>>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>>
>>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>>
>>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>>
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>>
>>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>>
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>>
>>
>>
>> Do I need to create an incident ticket?
>>
>>
>>
>> ----------
>>
>>
>>
>> Thanks for your help.
>>
>>
>>
>> Best regards,
>>
>>
>>
>> [image: msaunier]
>>
>>
>>
>>
>>
>>
>>
>
>

Re: Document connector excluding mime type and size - Tika Parser error

Posted by Karl Wright <da...@gmail.com>.
What version of MCF is this?  That's important to know since Tika has had
problems with this kind of thing in the past and this looks like something
similar.

The problem you are reporting is due to either a missing jar or a bug in
an internal Tika classloader.  But I need to know whether this is a current
bug or not, since we just went to a new Tika version.
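
One way to tell those two cases apart is to ask a JVM where it loaded the
class from and whether the method is really there. The following is only a
minimal sketch, not part of ManifoldCF or Tika: the class name MethodProbe and
its helpers are made up for illustration, and the result only means something
if it is run with the same classpath as the failing process (a standalone run
may not see what the worker thread sees). It prints the return type because
the JVM resolves a method by its full descriptor, return type included, so a
getCharSet() that returns some type other than HwmfFont$WmfCharset would also
raise NoSuchMethodError.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

/** Hypothetical diagnostic: where was a class loaded from, and does it have a given no-arg method? */
class MethodProbe {

    /** Returns the jar or directory a class was loaded from, or a placeholder if unknown. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(bootstrap or unknown)"
                : src.getLocation().toString();
    }

    /** Returns the return type name of the public no-arg method, or null if there is none. */
    static String returnTypeOf(Class<?> cls, String methodName) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName) && m.getParameterCount() == 0) {
                return m.getReturnType().getName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Defaults are taken from the stack trace in this thread.
        String className = args.length > 0 ? args[0] : "org.apache.poi.hwmf.record.HwmfFont";
        String methodName = args.length > 1 ? args[1] : "getCharSet";
        try {
            Class<?> cls = Class.forName(className);
            System.out.println("Loaded from:  " + locationOf(cls));
            System.out.println("Class loader: " + cls.getClassLoader());
            String returnType = returnTypeOf(cls, methodName);
            System.out.println(returnType == null
                    ? methodName + "() not found"
                    : methodName + "() returns " + returnType);
        } catch (ClassNotFoundException e) {
            // Corresponds to the "missing jar" case on this classpath.
            System.out.println(className + " is not on this classpath");
        }
    }
}
```

A "not on this classpath" result points toward a missing jar; a class that is
found but lacks the method, or has it with a different return type, points
toward a version mismatch between the Tika and POI jars being loaded.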

Karl


On Tue, Jan 9, 2018 at 4:32 AM, msaunier <ms...@citya.com> wrote:

> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Outputs connectors**, with the Solr connector, I have set a « Maximum
> document length » and I have excluded 5 MIME types, but it does not work.
> I am attaching a screenshot.
>
>
>
> ----------
>
> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
> blocked:
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>
>
>
> Do I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Best regards,
>
>
>
> [image: msaunier]
>
>
>
>
>
>
>