Posted to dev@tika.apache.org by "Nicholas DiPiazza (JIRA)" <ji...@apache.org> on 2019/01/06 00:05:00 UTC

[jira] [Updated] (TIKA-2805) Should the HTML parser by default just ignore the <noscript> section?

     [ https://issues.apache.org/jira/browse/TIKA-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas DiPiazza updated TIKA-2805:
------------------------------------
    Description: 
Tika's HTML parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
and will emit its text in the parse output:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.{code}
Shouldn't it ignore those sections by default and leave them out of the parse output?

  was:
The tika parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
and will parse it:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.{code}
Shouldn't it just ignore those sections and leave those out of the parse output? 


> Should the HTML parser by default just ignore the <noscript> section?
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2805
>                 URL: https://issues.apache.org/jira/browse/TIKA-2805
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Tika's HTML parser will take this:
> {code:java}
> <noscript><div class='noindex'>You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
> and will emit its text in the parse output:
> {code:java}
> You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.{code}
> Shouldn't it ignore those sections by default and leave them out of the parse output?
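
To make the requested behavior concrete, here is a minimal sketch of a SAX handler that drops any text nested inside <noscript>. It is not Tika's implementation and does not use Tika's API: the class and method names (NoscriptSkipSketch, SkippingTextHandler, extractText) are hypothetical, and it assumes the input is well-formed XHTML so that the JDK's built-in SAX parser can read it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; none of these names exist in Tika.
class NoscriptSkipSketch {

    /** Collects character data, skipping anything nested inside <noscript>. */
    static class SkippingTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int noscriptDepth = 0; // > 0 while inside a <noscript> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("noscript".equalsIgnoreCase(qName)) {
                noscriptDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("noscript".equalsIgnoreCase(qName) && noscriptDepth > 0) {
                noscriptDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep text only when we are outside every <noscript> element.
            if (noscriptDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML and returns its text without <noscript> content. */
    static String extractText(String xhtml) {
        SkippingTextHandler handler = new SkippingTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        String input = "<body><p>Visible text.</p>"
                + "<noscript><div class='noindex'>Please enable scripts and reload this page.</div></noscript>"
                + "</body>";
        System.out.println(extractText(input)); // prints "Visible text."
    }
}
```

The depth counter, rather than a boolean flag, keeps the handler correct if <noscript> elements are ever nested. Whether such filtering should be the parser's default is the open question of this issue.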


