Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/07/11 23:39:50 UTC

[jira] Commented: (TIKA-394) Missing spaces on html parsing

    [ https://issues.apache.org/jira/browse/TIKA-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887235#action_12887235 ] 

Ken Krugler commented on TIKA-394:
----------------------------------

The issue here is that various content handlers (e.g. BodyContentHandler, WriteOutContentHandler) need to add whitespace at the end (and sometimes the beginning) of elements that a browser typically renders on separate lines.

Examples include <p>, <option>, <menu>, and others.

I think we need a TextContentHandler "helper" that everybody can use to convert a stream of XHTML events into a reasonable text approximation of the marked up content.
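
To make the idea concrete, here is a minimal sketch of what such a helper might look like. It is not existing Tika code: the class body, the BLOCK set, and the choice to emit a single space (rather than a newline) are all assumptions for illustration, and the element list is deliberately incomplete. It relies only on the SAX classes that ship with the JDK.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch (not part of Tika): flattens XHTML SAX events to text. */
final class TextContentHandler extends DefaultHandler {

    // Assumed, incomplete set of elements that browsers render on separate lines.
    private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "p", "div", "br", "li", "option", "menu", "tr", "h1", "h2", "h3"));

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        separate(localName, qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        separate(localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Append one space at a block boundary unless the text is empty
    // or already ends in whitespace.
    private void separate(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (BLOCK.contains(name.toLowerCase(Locale.ROOT))
                && text.length() > 0
                && !Character.isWhitespace(text.charAt(text.length() - 1))) {
            text.append(' ');
        }
    }

    /** Returns the collected text with leading and trailing whitespace removed. */
    @Override
    public String toString() {
        return text.toString().trim();
    }
}
```

A handler along these lines could be passed to a parser directly or wrapped by the existing handlers, so the spacing rules would live in one place instead of being repeated in each of them.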


> Missing spaces on html parsing
> ------------------------------
>
>                 Key: TIKA-394
>                 URL: https://issues.apache.org/jira/browse/TIKA-394
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>         Environment: Tomcat 6, Windows XP (russian locale)
>            Reporter: Andrey Barhatov
>            Assignee: Ken Krugler
>
> When parsing this HTML:
> text<p>more<br>yet<select><option>city1<option>city2</select>
> the resulting text is:
> textmore
> yetcity1city2
> but it should be:
> text
> more
> yet city1 city2
> Code sample:
> import java.io.*;
> import org.apache.tika.metadata.*;
> import org.apache.tika.parser.*;
> public class test {
>    public static void main(String[] args) throws Exception {
>       // Content type hint for AutoDetectParser.
>       Metadata metadata = new Metadata();
>       metadata.set(Metadata.CONTENT_TYPE, "text/html");
>       String content = "text<p>more<br>yet<select><option>city1<option>city2</select>";
>       InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>       AutoDetectParser parser = new AutoDetectParser();
>       Reader reader = new ParsingReader(parser, in, metadata, new ParseContext());
>       char[] buf = new char[10000];
>       int len;
>       StringBuilder text = new StringBuilder();
>       // read() returns -1 at end of stream.
>       while ((len = reader.read(buf)) != -1) {
>          text.append(buf, 0, len);
>       }
>       reader.close();
>       System.out.print(text);
>    }
> }
