
Posted to dev@tika.apache.org by "Raimund Merkert (JIRA)" <ji...@apache.org> on 2011/09/25 18:38:26 UTC

[jira] [Created] (TIKA-730) WriteOutContentHandler concatenates title tag and body text.

WriteOutContentHandler concatenates title tag and body text.
------------------------------------------------------------

                 Key: TIKA-730
                 URL: https://issues.apache.org/jira/browse/TIKA-730
             Project: Tika
          Issue Type: Bug
          Components: general, parser
    Affects Versions: 0.9
            Reporter: Raimund Merkert


I just noticed that WriteOutContentHandler concatenates strings that it should not concatenate. I saw this with a title tag whose text was joined to the first text in the body, e.g.: <head><title>a</title></head><body>b</body>
results in "ab" and not "a b" (or something else with a break). Interestingly, "<p>a</p><p>b</p>" does get broken into separate words.

I'm not aware of a better way to extract only the text with an out-of-the-box Tika.

I've added a small unit test here:
{code}
package tika;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.Charset;

import org.junit.Assert;

import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Test;

public class WriteOutContentHandler_JUnit {

	private static final String HTML = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
			+ "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>title</title></head>  <body>a</body></html>";

	public static String processStream(String str) throws Exception {

		InputStream in = new ByteArrayInputStream(str.getBytes(Charset
				.forName("UTF-8")));

		AutoDetectParser parser = new AutoDetectParser();
		ParseContext context = new ParseContext();
		org.apache.tika.metadata.Metadata m = new org.apache.tika.metadata.Metadata();
		StringWriter out = new StringWriter();
		WriteOutContentHandler ctHandler = new WriteOutContentHandler(out);

		try {
			parser.parse(in, ctHandler, m, context);
			return out.toString();
		} finally {
			out.flush();
		}
	}

	@Test
	public void testParse() throws Exception {
		String data = processStream(HTML);
		data = data.trim();
		System.err.println("Extracted:\n" + data);
		Assert.assertFalse("title and body text were concatenated", data.equals("titlea"));
	}
}
{code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-730) WriteOutContentHandler concatenates title tag and body text.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114295#comment-13114295 ] 

Uwe Schindler commented on TIKA-730:
------------------------------------

WriteOutContentHandler should be used behind a BodyContentHandler. The XHTMLContentHandler that produces its input only inserts ignorableWhitespace between elements in the body, not in the head. The <head> section is not intended to pass through WriteOutContentHandler, so put a BodyContentHandler in front of it.
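
To illustrate the point about the head section, here is a minimal, Tika-free sketch that uses only the JDK's SAX interfaces. It is not Tika's implementation: `WriteAllHandler`, `BodyOnlyFilter`, and the hand-written event stream in `emit` are hypothetical stand-ins, and the event sequence assumes (as the comment above describes) that no whitespace event is emitted between the title text and the body text. In Tika itself the corresponding wiring would be `new BodyContentHandler(new WriteOutContentHandler(out))`.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class BodyOnlySketch {

    // Hypothetical stand-in for a write-out handler: copies all text events to a Writer.
    static class WriteAllHandler extends DefaultHandler {
        private final Writer out;

        WriteAllHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            characters(ch, start, length);
        }
    }

    // Hypothetical stand-in for a body filter: forwards text only while inside <body>.
    static class BodyOnlyFilter extends DefaultHandler {
        private final ContentHandler delegate;
        private int depthInBody = 0;

        BodyOnlyFilter(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depthInBody > 0 || "body".equals(localName)) {
                depthInBody++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depthInBody > 0) {
                depthInBody--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depthInBody > 0) {
                delegate.characters(ch, start, length);
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            if (depthInBody > 0) {
                delegate.ignorableWhitespace(ch, start, length);
            }
        }
    }

    // Assumed event stream for <head><title>title</title></head><body>a</body>,
    // with no whitespace event between the head and the body.
    static void emit(ContentHandler h) throws SAXException {
        Attributes none = new AttributesImpl();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "head", "head", none);
        h.startElement("", "title", "title", none);
        h.characters("title".toCharArray(), 0, 5);
        h.endElement("", "title", "title");
        h.endElement("", "head", "head");
        h.startElement("", "body", "body", none);
        h.characters("a".toCharArray(), 0, 1);
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    static String run(boolean bodyOnly) throws SAXException {
        StringWriter out = new StringWriter();
        ContentHandler sink = new WriteAllHandler(out);
        emit(bodyOnly ? new BodyOnlyFilter(sink) : sink);
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(run(false)); // prints "titlea"
        System.out.println(run(true));  // prints "a"
    }
}
```

Fed directly, the write-all handler joins the title and the body text. With the body filter in front, only the body text reaches the writer.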


[jira] [Resolved] (TIKA-730) WriteOutContentHandler concatenates title tag and body text.

Posted by "Jukka Zitting (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-730.
--------------------------------

    Resolution: Won't Fix

Resolving as Won't Fix since in this case the WriteOutContentHandler class works exactly as designed and documented.

Have you looked at the [Tika facade class|http://tika.apache.org/0.10/api/org/apache/tika/Tika.html] that provides a simplified API for extracting just the text content of a document as a String or a Reader? That should be a better match for your use case than WriteOutContentHandler.
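
For reference, a minimal sketch of the facade approach, assuming a Tika release that ships `org.apache.tika.Tika` (such as the 0.10 API linked above) with its parsers on the classpath. The class name `TikaFacadeSketch` and the helper `extractText` are made up for illustration; `parseToString(InputStream)` is the facade method that returns the extracted text as a String. No expected output is shown because the exact whitespace in the result may vary between versions.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

class TikaFacadeSketch {

    // Hypothetical helper: detects the document type and returns its text content.
    static String extractText(String html) throws Exception {
        Tika tika = new Tika();
        try (InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            return tika.parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>title</title></head>  <body>a</body></html>";
        System.out.println("Extracted: [" + extractText(html).trim() + "]");
    }
}
```

The facade wires up detection, parsing, and a body-only text handler internally, so the caller does not have to choose or combine content handlers.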
                

[jira] [Commented] (TIKA-730) WriteOutContentHandler concatenates title tag and body text.

Posted by "Raimund Merkert (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114639#comment-13114639 ] 

Raimund Merkert commented on TIKA-730:
--------------------------------------

Maybe something like this should be documented. Or better, the handler should just ignore the head tag completely since it's not meant to support that particular tag.


