Posted to server-dev@james.apache.org by "Trần Tiến Đức (Jira)" <se...@james.apache.org> on 2020/02/07 08:43:00 UTC

[jira] [Updated] (JAMES-3044) JsoupTextExtractor fails on parsing htmls

     [ https://issues.apache.org/jira/browse/JAMES-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trần Tiến Đức updated JAMES-3044:
---------------------------------
    Description: 
We suspect that Jsoup version 1.12.1 is linked to this issue: [https://github.com/jhy/jsoup/issues/1250]

 

Here is the stack trace:
{code:java}
 	java.io.IOException: Input is binary and unsupported
	at org.jsoup.UncheckedIOException.<init>(UncheckedIOException.java:11)
	at org.jsoup.parser.CharacterReader.<init>(CharacterReader.java:38)
	at org.jsoup.parser.CharacterReader.<init>(CharacterReader.java:43)
	at org.jsoup.parser.TreeBuilder.initialiseParse(TreeBuilder.java:38)
	at org.jsoup.parser.HtmlTreeBuilder.initialiseParse(HtmlTreeBuilder.java:65)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:46)
	at org.jsoup.parser.Parser.parseInput(Parser.java:35)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
	at org.jsoup.helper.DataUtil.load(DataUtil.java:66)
	at org.jsoup.Jsoup.parse(Jsoup.java:118)
	at org.apache.james.mailbox.store.extractor.JsoupTextExtractor.parseHtmlContent(JsoupTextExtractor.java:61)
	at org.apache.james.mailbox.store.extractor.JsoupTextExtractor.extractContent(JsoupTextExtractor.java:48)
	at org.apache.james.mailbox.elasticsearch.json.MimePart$Builder.extractText(MimePart.java:155)
	at org.apache.james.mailbox.elasticsearch.json.MimePart$Builder.parseContent(MimePart.java:145)
	at org.apache.james.mailbox.elasticsearch.json.MimePart$Builder.build(MimePart.java:130)
	at org.apache.james.mailbox.elasticsearch.json.MimePartParser.closeMimePart(MimePartParser.java:102)
	at org.apache.james.mailbox.elasticsearch.json.MimePartParser.processMimePart(MimePartParser.java:80)
	at org.apache.james.mailbox.elasticsearch.json.MimePartParser.parse(MimePartParser.java:61)
	at org.apache.james.mailbox.elasticsearch.json.IndexableMessage$Builder.instantiateIndexedMessage(IndexableMessage.java:109)
	at org.apache.james.mailbox.elasticsearch.json.IndexableMessage$Builder.build(IndexableMessage.java:75)
	at org.apache.james.mailbox.elasticsearch.json.MessageToElasticSearchJson.convertToJson(MessageToElasticSearchJson.java:69)
	at org.apache.james.mailbox.elasticsearch.events.ElasticSearchListeningMessageSearchIndex.generateIndexedJson(ElasticSearchListeningMessageSearchIndex.java:152)
	at org.apache.james.mailbox.elasticsearch.events.ElasticSearchListeningMessageSearchIndex.add(ElasticSearchListeningMessageSearchIndex.java:145)
	at org.apache.james.mailbox.store.search.ListeningMessageSearchIndex.lambda$handleAdded$1(ListeningMessageSearchIndex.java:100)
	at com.github.fge.lambdas.consumers.ConsumerChainer.lambda$sneakyThrow$9(ConsumerChainer.java:73)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at org.apache.james.mailbox.store.search.ListeningMessageSearchIndex.handleAdded(ListeningMessageSearchIndex.java:100)
	at org.apache.james.mailbox.store.search.ListeningMessageSearchIndex.handleMailboxEvent(ListeningMessageSearchIndex.java:82)
	at org.apache.james.mailbox.store.search.ListeningMessageSearchIndex.event(ListeningMessageSearchIndex.java:72)
	at org.apache.james.mailbox.events.MailboxListenerExecutor.execute(MailboxListenerExecutor.java:41)
	at org.apache.james.mailbox.events.GroupRegistration.runListener(GroupRegistration.java:152)
	at org.apache.james.mailbox.events.GroupRegistration.lambda$deliver$2(GroupRegistration.java:142)
	at com.github.fge.lambdas.runnable.RunnableChainer.doRun(RunnableChainer.java:18)
	at com.github.fge.lambdas.runnable.ThrowingRunnable.run(ThrowingRunnable.java:16)
	at reactor.core.publisher.MonoRunnable.call(MonoRunnable.java:73)
	at reactor.core.publisher.MonoRunnable.call(MonoRunnable.java:32)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:132)
	at reactor.core.publisher.FluxSubscribeOnValue$ScheduledScalar.run(FluxSubscribeOnValue.java:178)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

{code}
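The trace above shows the unchecked exception from the HTML parser propagating all the way up through the indexing listener. As a possible stop-gap, the extraction call could be guarded so that a part the parser rejects yields no text instead of failing the whole indexing event. The sketch below only illustrates that idea with the standard library: SafeTextExtraction, extractOrEmpty and standInParse are hypothetical names, and standInParse is a stand-in for the real parser call, not the actual James or Jsoup code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

public class SafeTextExtraction {

    // Hypothetical helper: runs the given parser and turns any unchecked
    // failure into an empty result so the caller can keep indexing.
    static Optional<String> extractOrEmpty(Function<InputStream, String> parser, InputStream input) {
        try {
            return Optional.ofNullable(parser.apply(input));
        } catch (RuntimeException e) {
            // For example an unchecked wrapper around
            // "Input is binary and unsupported", as in the trace above.
            return Optional.empty();
        }
    }

    // Stand-in for the real HTML parser (assumption): rejects input that
    // contains a NUL character, otherwise returns the decoded text as-is.
    static String standInParse(InputStream in) {
        try {
            String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            if (text.indexOf('\0') >= 0) {
                throw new UncheckedIOException(new IOException("Input is binary and unsupported"));
            }
            return text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream html = new ByteArrayInputStream("<p>hello</p>".getBytes(StandardCharsets.UTF_8));
        InputStream binary = new ByteArrayInputStream(new byte[] {0, 1, 2, 0});

        System.out.println(extractOrEmpty(SafeTextExtraction::standInParse, html).isPresent());
        System.out.println(extractOrEmpty(SafeTextExtraction::standInParse, binary).isPresent());
    }
}
```

Swallowing the exception hides the underlying parser problem, so a real fix would presumably also log the failure and address the Jsoup version itself.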

  was:
We suspect that Jsoup version 1.12.1 is linked to this issue: https://github.com/jhy/jsoup/issues/1250

 



> JsoupTextExtractor fails on parsing htmls
> -----------------------------------------
>
>                 Key: JAMES-3044
>                 URL: https://issues.apache.org/jira/browse/JAMES-3044
>             Project: James Server
>          Issue Type: Bug
>            Reporter: Trần Tiến Đức
>            Priority: Major
>
> We suspect that Jsoup version 1.12.1 is linked to this issue: [https://github.com/jhy/jsoup/issues/1250]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org