Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/11/05 08:33:41 UTC

[jira] Updated: (LUCENE-589) Demo HTML parser doesn't work for international documents

     [ https://issues.apache.org/jira/browse/LUCENE-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-589:
-------------------------------

    Attachment: LUCENE-589.patch

Attached is a patch; it also fixes LUCENE-2246.

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-589.patch
>
>
> JavaCC assumes ASCII input, so the parser won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }
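
For illustration only, here is a minimal sketch of the "read the charset from the HTML markup" idea mentioned in the description. It is not taken from the attached patch. The class name CharsetSniffer, its methods, the 1024-byte look-ahead, and the UTF-8 fallback are all assumptions made for this example; it also assumes the parser can be constructed from a java.io.Reader, so that decoding happens before any tokenizing.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: guesses a document's charset
// from a <meta> tag and falls back to UTF-8 when none is usable.
final class CharsetSniffer {
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Number of leading bytes inspected; an arbitrary choice for this sketch.
    private static final int LOOKAHEAD = 1024;

    static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // survives regardless of the document's real encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return StandardCharsets.UTF_8;
    }

    static Reader open(byte[] html) {
        int n = Math.min(html.length, LOOKAHEAD);
        Charset cs = sniff(Arrays.copyOf(html, n));
        // The Reader yields decoded chars, which is what a parser
        // generated with UNICODE_INPUT = true would then consume.
        return new InputStreamReader(new ByteArrayInputStream(html), cs);
    }
}
```

A sketch like this does not handle byte-order marks, HTTP headers, or encodings such as UTF-16 whose markup is not ASCII-compatible, which is part of why sniffing "can be tricky".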

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

