Posted to user@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2016/05/15 16:11:30 UTC

Re: Questing regarding Tika text extraction and elasticsearch

Hi Silvio,

This sounds like a problem with the way the Elastic Search connector is
forming JSON.  The spec is silent on control characters:

http://rfc7159.net/rfc7159#rfc.section.8.1

... so we just embed those in strings.  But it sounds like ElasticSearch's
JSON parser is not so happy with them.

If we can find an encoding that satisfies everyone, we can change the code
to do what is needed.  Maybe "\0" for null, etc?
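
On the encoding question: RFC 7159 does cover this, in section 7 (Strings).
Control characters U+0000 through U+001F are among the characters that must
be escaped inside a JSON string, and "\0" is not one of the escapes it
defines, so NUL has to be written as "\u0000". Below is a minimal sketch of
that encoding in Python. It is an illustration only, not the connector code
or the patch mentioned in this thread, and the function name
escape_json_string is made up for the example.

```python
import json  # only used to check that the result parses back


def escape_json_string(text: str) -> str:
    """Return text as a quoted JSON string with all mandatory escapes applied."""
    # Short escape forms defined by the JSON grammar.
    short_escapes = {
        '"': '\\"',
        "\\": "\\\\",
        "\b": "\\b",
        "\f": "\\f",
        "\n": "\\n",
        "\r": "\\r",
        "\t": "\\t",
    }
    out = ['"']
    for ch in text:
        if ch in short_escapes:
            out.append(short_escapes[ch])
        elif ord(ch) < 0x20:
            # Every other control character, including NUL, uses \u00XX.
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)


# Hypothetical sample: a string with an embedded NUL, as in the report.
sample = "M\u00e4use\x00Tastaturen"
encoded = escape_json_string(sample)
print(encoded)
# The escaped form parses back to the original string.
print(json.loads(encoded) == sample)
```

Non-ASCII characters such as "ä" are passed through unchanged here, which is
fine as long as the request body is sent as UTF-8.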

Karl


On Sun, May 15, 2016 at 10:21 AM, <si...@quantentunnel.de> wrote:

> Hi Apache ManifoldCF user list
>
> I’m experimenting with Apache ManifoldCF 2.3, which I use to index the
> Windows network shares of our company. I’m using Elasticsearch 1.7.4 and
> Apache ManifoldCF 2.3, with MS Active Directory as the authority source.
> I defined a job whose connection configuration comprises the following
> chain of transformations (the order in the list is the order in which
> they are applied):
>
> 1.    Repository connection (MS Network Share)
> 2.    Allowed documents
> 3.    Tika extractor
> 4.    Metadata adjuster
> 5.    Elasticsearch
>
> I do this because I don’t want to store the original document inside the
> Elasticsearch index, but only the extracted text of the document. This works
> so far. However, numerous documents cause an exception of the following
> kind when they are analyzed and sent to the indexer by Apache ManifoldCF.
> Note that the exception happens in the Elasticsearch analyzer:
>
> [2016-03-16 22:22:43,884][DEBUG][action.index             ] [Tefral the
> Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]:
> Failed to execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
> source[{"access_permission:extract_for_accessibility" : "true",
> "dcterms:created" : "2016-03-02T13:03:47Z",
> "access_permission:can_modify" : "true",
> "access_permission:modify_annotations" : "true",
> "Creation-Date" : "2016-03-02T13:03:47Z",
> "fileLastModified" : "2016-03-02T13:03:37.433Z",
> "access_permission:fill_in_form" : "true",
> "created" : "Wed Mar 02 14:03:47 CET 2016",
> "stream_size" : "52067",
> "dc:format" : "application\/pdf; version=1.4",
> "access_permission:can_print" : "true",
> "stream_name" : "MäuseTastaturen 2.3.16 - Kopie.pdf",
> "xmp:CreatorTool" : "Canon iR-ADV C5250  PDF",
> "resourceName" : "MäuseTastaturen 2.3.16 - Kopie.pdf",
> "fileCreatedOn" : "2016-03-16T21:22:24.085Z",
> "access_permission:assemble_document" : "true",
> "meta:creation-date" : "2016-03-02T13:03:47Z",
> "lastModified" : "Wed Mar 02 14:03:37 CET 2016",
> "pdf:PDFVersion" : "1.4",
> "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
> "shareName" : "AppDevData$",
> "access_permission:can_print_degraded" : "true",
> "xmpTPg:NPages" : "1",
> "createdOn" : "Wed Mar 16 22:22:24 CET 2016",
> "pdf:encrypted" : "false",
> "access_permission:extract_content" : "true",
> "producer" : "Adobe PSL 1.2e for Canon ",
> "attributes" : "32",
> "Content-Type" : "applica-tion\/pdf",
> "allow_token_document" : ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],
> "deny_token_document" : "LDAPConn:DEAD_AUTHORITY",
> "allow_token_share" : "__nosecurity__",
> "deny_token_share" : "__nosecurity__",
> "allow_token_parent" : "__nosecurity__",
> "deny_token_parent" : "__nosecurity__",
> "content" : ""}]}]
> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>         at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>         at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>         at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>         at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>         at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>         at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>         at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>         at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>         at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>         at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>         at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>         at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>         ... 11 more
> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>         at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>         at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>         at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>         at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>         at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>         at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>         at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>         at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>         at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>         at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>         at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>         at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>         at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>         ... 14 more
>
> This happens for documents of different types/extensions, such as PDF as
> well as XLSX, etc. It seems that Tika sometimes does not remove special
> characters such as the null character (0x0000). The presence of these
> characters causes Elasticsearch to reject the document, so it is not
> indexed at all, because special characters need to be escaped when handed
> over in a JSON request. Is there a way to work around the problem with the
> existing functionality of Apache ManifoldCF?
>
> Regards
> Silvio
>
>
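
The "Illegal unquoted character ((CTRL-CHAR, code 0))" message in the trace
above can be reproduced in isolation. The sketch below uses Python's standard
json module as a stand-in for the Jackson parser (an assumption made purely
for illustration; the two payloads and the try_parse helper are made up). In
its default strict mode it applies the same rule: a raw control character
inside a string is rejected, while the same character written as \u0000 is
accepted.

```python
import json

# Hypothetical request bodies: the same field, with NUL raw vs. escaped.
raw = '{"content" : "abc\x00def"}'         # NUL embedded as-is
escaped = '{"content" : "abc\\u0000def"}'  # NUL written as \u0000


def try_parse(payload):
    """Parse payload as JSON; return the result or a short rejection note."""
    try:
        return json.loads(payload)
    except json.JSONDecodeError as err:
        return "rejected: " + err.msg


print(try_parse(raw))      # the raw control character is refused
print(try_parse(escaped))  # the escaped form parses to a dict
```

So the fix has to happen on the sending side: whatever builds the request
body needs to escape these characters before the document is posted.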

Re: Questing regarding Tika text extraction and elasticsearch

Posted by Karl Wright <da...@gmail.com>.
Yes.
Karl


On Mon, May 16, 2016 at 1:14 PM, Silvio Meier <
silvio.r.meier@quantentunnel.de> wrote:

> Hi Karl
>
> Thanks for the fast response and the patch. I'll patch the version that I have. Will the patch be included in the next official release of Apache ManifoldCF?
>
> Regards
> Silvio
>
>
> On 15.05.2016 18:37, Karl Wright wrote:
>
> Here's the patch.  Relatively short.
>
> Karl
>
>
> On Sun, May 15, 2016 at 12:27 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Apparently there is an allowed way to encode this, and I have a patch,
>> but JIRA is down.  If it doesn't come back up soon I'll email you the
>> patch.
>>
>> Karl

Re: Questing regarding Tika text extraction and elasticsearch

Posted by Silvio Meier <si...@quantentunnel.de>.
Hi Karl

Thanks for the fast response and the patch. I'll patch the version that I have. Will the patch be included in the next official release of Apache ManifoldCF?

Regards
Silvio



Re: Questing regarding Tika text extraction and elasticsearch

Posted by Karl Wright <da...@gmail.com>.
Here's the patch.  Relatively short.

Karl


>>> lue
>>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>>         at
>>> org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>>         at
>>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>>         at
>>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>>         at
>>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>>         at
>>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>>         at
>>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>>         at
>>> org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>>         at
>>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>>         at
>>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>>         ... 14 more
>>>
>>> This happens for documents of different types/extension, such as pdfs as
>>> well as xlsx, etc. It seems that Tika sometimes does not remove special
>>> characters as the null character 0x0000. The presence of the special
>>> characters causes Elasticsearch to omit the indexing of the document. Thus
>>> the document is not indexed at all, as  special characters need to be
>>> escaped when handed over as a JSON request. Is there a way to work around
>>> the problem with the existing functionality of Apache ManifoldCF?
>>>
>>> Regards
>>> Silvio
>>>
>>>
>>
>>
>

Re: Questing regarding Tika text extraction and elasticsearch

Posted by Karl Wright <da...@gmail.com>.
Apparently there is a permitted way to encode this, and I have a patch,
but JIRA is down.  If it doesn't come back up soon, I'll email you the
patch.
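
For illustration only (this is not the patch itself): RFC 7159, section 7,
requires that control characters U+0000 through U+001F be escaped inside
JSON strings, and the \uXXXX form covers all of them, so a null character
can be sent as \u0000. Below is a minimal Java sketch of that idea. The
class and method names are hypothetical and are not taken from the
ManifoldCF code base; the real connector may well handle this differently.

```java
public class JsonControlCharEscaper {

    // Hypothetical helper, not part of ManifoldCF: escapes a raw string so it
    // can be placed between double quotes in a JSON document.
    public static String escapeJsonString(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                case '\r':
                    sb.append("\\r");
                    break;
                case '\t':
                    sb.append("\\t");
                    break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters (including the null
                        // character) become six-character hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        // Everything else, including non-ASCII letters such
                        // as umlauts, is passed through unchanged.
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulated extractor output containing a null character.
        String extracted = "text" + (char) 0 + "with null";
        System.out.println("\"" + escapeJsonString(extracted) + "\"");
    }
}
```

With escaping along these lines, the "content" field would no longer carry
a raw code-0 byte, which is what the Jackson parser inside Elasticsearch
rejects in the trace above.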

Karl


On Sun, May 15, 2016 at 12:11 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Silvio,
>
> This sounds like a problem with the way the Elastic Search connector is
> forming JSON.  The spec is silent on control characters:
>
> http://rfc7159.net/rfc7159#rfc.section.8.1
>
> ... so we just embed those in strings.  But it sounds like ElasticSearch's
> JSON parser is not so happy with them.
>
> If we can find an encoding that satisfies everyone, we can change the code
> to do what is needed.  Maybe "\0" for null, etc?
>
> Karl
>
>
> On Sun, May 15, 2016 at 10:21 AM, <si...@quantentunnel.de> wrote:
>
>> Hi Apache ManifoldCF user list
>>
>> I’m experimenting with Apache ManifoldCF 2.3 which I use to index the
>> network Windows shares of our company. I’m using Elasticsearch 1.7.4,
>> Apache ManifoldCF 2.3 with MS Active Directory as authority source.
>> I defined a job with the following connection configuration comprising
>> the following chain of transformations (order in the list indicates the
>> order of the transformations):
>>
>> 1.    Repository connection (MS Network Share)
>> 2.    Allowed documents
>> 3.    Tika extractor
>> 4.    Metadata adjuster
>> 5.    Elasticsearch
>>
>> I do this because I don’t want to store the original document inside the
>> elasticsearch index but only the extracted text of the document. This works
>> so far. However, there are numerous documents which cause an exception of
>> the following kind when being  analyzed and sent to the indexer by Apache
>> ManifoldCF. Note that the exceptions happens in the Elastic search analyzer:
>>
>> [2016-03-16 22:22:43,884][DEBUG][action.index             ] [Tefral the
>> Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]:
>> Failed to execute [index {[sharein
>> dex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>> source[{"access_permission:extract_for_access
>> ibility" : "true","dcterms:created" :
>> "2016-03-02T13:03:47Z","access_permission:can_modify" :
>> "true","access_permission:modify_annotations" : "true","Creation-Date" :
>> "2016-03-02T1
>> 3:03:47Z","fileLastModified" :
>> "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" :
>> "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" :
>> "52067","dc:format" :
>>  "application\/pdf; version=1.4","access_permission:can_print" :
>> "true","stream_name" : "MäuseTastaturen 2.3.16 -
>> Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250  PDF","resourc
>> eName" : "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" :
>> "2016-03-16T21:22:24.085Z","access_permission:assemble_document" :
>> "true","meta:creation-date" : "2016-03-02T13:03:
>> 47Z","lastModified" : "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" :
>> "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" :
>> "AppDevData$","access_permission:
>> can_print_degraded" : "true","xmpTPg:NPages" : "1","createdOn" : "Wed Mar
>> 16 22:22:24 CET 2016","pdf:encrypted" :
>> "false","access_permission:extract_content" : "true","producer" :
>> "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" :
>> "applica-tion\/pdf","allow_token_document" :
>> ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S
>> -1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document"
>> : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "
>> __nosecurity__","deny_token_share" :
>> "__nosecurity__","allow_token_parent" :
>> "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse
>> [_source]
>>         at
>> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>         at
>> org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>         at
>> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>         at
>> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>         at
>> org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>         at
>> org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>         at
>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>         at
>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>         at
>> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse
>> content to map
>>         at
>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>         at
>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>         at
>> org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>         at
>> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>         ... 11 more
>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException:
>> Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using
>> backslash to be included in string va
>> lue
>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>         at
>> org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>         at
>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>         at
>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>         at
>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>         at
>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>         at
>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>         at
>> org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>         at
>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>         ... 14 more
>>
>> This happens for documents of different types/extension, such as pdfs as
>> well as xlsx, etc. It seems that Tika sometimes does not remove special
>> characters as the null character 0x0000. The presence of the special
>> characters causes Elasticsearch to omit the indexing of the document. Thus
>> the document is not indexed at all, as  special characters need to be
>> escaped when handed over as a JSON request. Is there a way to work around
>> the problem with the existing functionality of Apache ManifoldCF?
>>
>> Regards
>> Silvio
>>
>>
>
>