Posted to dev@lucene.apache.org by "Jeff Nadler (JIRA)" <ji...@apache.org> on 2011/01/21 19:27:44 UTC

[jira] Created: (SOLR-2328) HTMLStripCharFilter Leaves Broken HTML Tags

HTMLStripCharFilter Leaves Broken HTML Tags
-------------------------------------------

                 Key: SOLR-2328
                 URL: https://issues.apache.org/jira/browse/SOLR-2328
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 1.4.1
            Reporter: Jeff Nadler
             Fix For: 1.5


Some kinds of 'bad' HTML are missed by HTMLStripCharFilter. For example, the following invalid HTML:
     <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>

is filtered to:
     <a href="http://www.twitter.com/ceonyc"@ceonyc

I understand the challenge here: without the closing > it's tough to know what to do. But real-world web pages are full of this kind of garbage HTML, and browsers (impressively!) seem to handle it quite gracefully.

Plus, as a result, users in my app can search for 'href' and find lots of matches that don't appear to contain 'href'.
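
To make the failure mode concrete, here is a minimal, self-contained sketch of one possible heuristic for the missing '>' case. It is not HTMLStripCharFilter's actual algorithm and not how any particular browser behaves; the class name LenientTagStripper and its methods are hypothetical, made up for illustration. The assumption is that a quoted attribute value followed by anything other than whitespace, '/' or '>' means the tag's closing '>' was lost, so the tag is treated as ending right after the closing quote.

```java
// Hypothetical illustration only; not part of Solr or Lucene.
final class LenientTagStripper {

    /** Removes tags from html, tolerating a missing '>' after a quoted attribute value. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<' && looksLikeTagStart(html, i)) {
                i = skipTag(html, i);   // jump past the tag, emit nothing
            } else {
                out.append(c);          // ordinary text is copied through
                i++;
            }
        }
        return out.toString();
    }

    // A tag starts with '<' + letter or "</" + letter; a bare '<' is plain text.
    private static boolean looksLikeTagStart(String s, int lt) {
        int j = lt + 1;
        if (j < s.length() && s.charAt(j) == '/') {
            j++;
        }
        return j < s.length() && Character.isLetter(s.charAt(j));
    }

    // Returns the index of the first character after the tag that starts at lt.
    private static int skipTag(String s, int lt) {
        int n = s.length();
        int i = lt + 1;
        while (i < n) {
            char c = s.charAt(i);
            if (c == '>') {
                return i + 1;           // normal end of tag
            }
            if (c == '"' || c == '\'') {
                int close = s.indexOf(c, i + 1);
                if (close < 0) {
                    return n;           // unterminated quote: assumption, drop the rest
                }
                i = close + 1;
                if (i < n) {
                    char next = s.charAt(i);
                    boolean legal = next == '>' || next == '/'
                            || Character.isWhitespace(next);
                    if (!legal) {
                        return i;       // assumed missing '>': end the tag here
                    }
                }
                continue;
            }
            i++;
        }
        return n;                       // no '>' at all: assumption, drop the rest
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
        System.out.println(strip(input)); // prints "@ceonyc"
    }
}
```

With this heuristic, 'href' never reaches the token stream for the example above. A real CharFilter would also have to keep offset corrections in sync with the removed characters, which this sketch ignores.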

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (SOLR-2328) HTMLStripCharFilter Leaves Broken HTML Tags

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2328:
---------------------------

    Fix Version/s:     (was: 4.0)

Removing fixVersion=4.0 because there is no patch, no assignee, and no evidence that anyone is currently working on this issue. (This can certainly be revisited if volunteers step forward.)

                
