Posted to dev@tika.apache.org by "Luis Filipe Nassif (JIRA)" <ji...@apache.org> on 2018/12/03 12:22:00 UTC
[jira] [Commented] (TIKA-2550) ToTextHandler includes <style/> element content
[ https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707113#comment-16707113 ]
Luis Filipe Nassif commented on TIKA-2550:
------------------------------------------
Sorry for the late reply, [~tallison@apache.org]. Will this change the behaviour of HTML text extraction with ToTextContentHandler? It is important to us (in the forensic field) to index the text contained in script elements so that we can look for malicious HTML files. I think this may not be a backward-compatible change...
But if I remember correctly, HTML script elements are handled as embedded docs? So I am not sure whether this change will drop scripts from HTML. Could you clarify?
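To make the question concrete, here is a minimal sketch of the distinction I am asking about. It is not Tika's ToTextContentHandler and not the actual TIKA-2550 patch; the class name StyleSkippingTextSketch, its extractText helper, and the configurable set of skipped element names are all hypothetical, and it uses only the JDK SAX parser on well-formed XHTML. It shows that a text handler can drop <style/> content while still keeping <script/> content:

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is NOT Tika's ToTextContentHandler.
class StyleSkippingTextSketch extends DefaultHandler {
    // Lower-case element names whose character content is dropped.
    private final Set<String> skipped;
    private final StringBuilder text = new StringBuilder();
    // Greater than zero while the parser is inside a skipped element.
    private int skipDepth = 0;

    StyleSkippingTextSketch(Set<String> skipped) {
        this.skipped = skipped;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Track nesting so children of a skipped element are skipped too.
        if (skipDepth > 0 || skipped.contains(qName.toLowerCase(Locale.ROOT))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text outside skipped elements reaches the output.
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses well-formed XHTML and returns the text outside the skipped elements.
    static String extractText(String xhtml, Set<String> skipped) {
        StyleSkippingTextSketch handler = new StyleSkippingTextSketch(skipped);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text.toString();
    }
}
```

With this sketch, extractText(xhtml, Set.of("style")) keeps the script text, while extractText(xhtml, Set.of("style", "script")) drops it as well. For our use case the first behaviour is the one we would need to remain available.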
> ToTextHandler includes <style/> element content
> -----------------------------------------------
>
> Key: TIKA-2550
> URL: https://issues.apache.org/jira/browse/TIKA-2550
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 2.0.0, 1.20
>
>
> When using the ToTextHandler to process .java files, the <style/> element content is included, e.g.:
> {noformat}
> testFile
> code {
> color: rgb(0,0,0); font-family: monospace; font-size: 12px; white-space: nowrap;
> }
> .java_plain {
> color: rgb(0,0,0);
> }
> .java_keyword {
> color: rgb(0,0,0); font-weight: bold;
> }
> .java_javadoc_tag {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic; font-weight: bold;
> }
> h1 {
> font-family: sans-serif; font-size: 16pt; font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: solid 1px black; padding: 5px; text-align: center;
> }
> .java_type {
> color: rgb(0,44,221);
> }
> .java_literal {
> color: rgb(188,0,0);
> }
> .java_javadoc_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic;
> }
> .java_operator {
> color: rgb(0,124,31);
> }
> .java_separator {
> color: rgb(0,33,255);
> }
> .java_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247);
> }
> testFile/*************************************************************************
> * Compilation: javac HelloWorld.java
> * Execution: java HelloWorld
> *
> * Prints "Hello, World". By tradition, this is everyone's first program.
> *
> *************************************************************************/
> public class HelloWorld {
> public static void main(String[] args) {
> System.out.println("Hello, World");
> }
> }
> {noformat}
> Is this what we want as the default behavior?