Posted to dev@tika.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/08/25 16:58:17 UTC

[jira] Created: (TIKA-498) HTML parser fails on turkish locale

HTML parser fails on turkish locale
-----------------------------------

                 Key: TIKA-498
                 URL: https://issues.apache.org/jira/browse/TIKA-498
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Robert Muir


To reproduce: mvn test -DargLine=-Duser.language=tr

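This is because the HTML parser calls toLowerCase() without an explicit Locale, so it uses the default Locale. Under a Turkish locale, "I" lower-cases to a dotless "ı" rather than "i".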
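
As a rough illustration of the mechanism (this is not the code from the attached patch, which is not shown in this thread), the sketch below lower-cases an ASCII string under a Turkish locale and under a fixed one. The class and method names are made up for the example, and Locale.ENGLISH is only one reasonable choice of fixed locale.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Tika.
final class TurkishLowerCaseDemo {
    private TurkishLowerCaseDemo() {}

    // Mimics what the no-argument toLowerCase() does when the JVM
    // default locale is Turkish (e.g. -Duser.language=tr).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Uses a fixed locale, so the result does not depend on the JVM default.
    static String lowerFixed(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        String name = "TITLE";
        // 'I' becomes dotless i (U+0131) under Turkish rules, so this is false.
        System.out.println(lowerTurkish(name).equals("title"));
        // With a fixed locale the ASCII comparison holds, so this is true.
        System.out.println(lowerFixed(name).equals("title"));
    }
}
```

Any code that compares the lower-cased value against an ASCII literal would behave like the first call when the tests run with -Duser.language=tr.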

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-498) HTML parser fails on turkish locale

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-498.
------------------------------------

    Fix Version/s: 0.8
       Resolution: Fixed

- patch applied to trunk in r989202. Thanks very much, Robert!




[jira] Commented: (TIKA-498) HTML parser fails on turkish locale

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902494#action_12902494 ] 

Robert Muir commented on TIKA-498:
----------------------------------

Thanks for taking a look, Chris.

There might be other problems too, but this one was detected easily by an existing test.
In general, if the value is not going to be displayed but used for comparisons and such,
it is safest to do the same (avoid the default Locale) for any toUpperCase/toLowerCase calls.
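
A minimal sketch of that general advice, assuming the values are internal keys rather than display text. The CaseKeys helper is hypothetical and not part of Tika; Locale.ROOT and String.equalsIgnoreCase are standard JDK APIs.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class CaseKeys {
    private CaseKeys() {}

    // Normalizes an internal key with Locale.ROOT so the result is the
    // same under any JVM default locale.
    static String normalize(String key) {
        return key.toUpperCase(Locale.ROOT);
    }

    // equalsIgnoreCase compares character by character and does not
    // consult the default locale.
    static boolean sameKey(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale-sensitive: 'i' becomes dotted capital I (U+0130) in Turkish.
        System.out.println("title".toUpperCase(turkish).equals("TITLE")); // false
        System.out.println(normalize("title").equals("TITLE"));           // true
        System.out.println(sameKey("TITLE", "title"));                    // true
    }
}
```

For text that is shown to users, the locale-sensitive behaviour is usually what you want, so this pattern only fits values used for matching or lookup.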

I found this while trying to debug why SOLR-2088 fails. There is no evidence that it's Tika's fault,
but the integration with Solr has a regression that appeared after we upgraded to 0.8-snapshot.





[jira] Updated: (TIKA-498) HTML parser fails on turkish locale

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated TIKA-498:
-----------------------------

    Attachment: TIKA-498.patch




[jira] Assigned: (TIKA-498) HTML parser fails on turkish locale

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-498:
--------------------------------------

    Assignee: Chris A. Mattmann




[jira] Commented: (TIKA-498) HTML parser fails on turkish locale

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902489#action_12902489 ] 

Chris A. Mattmann commented on TIKA-498:
----------------------------------------

Hi Robert:

Thanks mucho! I've verified that the test also breaks on my machine (Mac OS X 10.5.6, JDK 1.6.0, System JRE) with the command line you gave above:

{noformat}
Running org.apache.tika.parser.html.HtmlParserTest
Tests run: 25, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.778 sec <<< FAILURE!

Results :

Failed tests: 
  testLineBreak(org.apache.tika.parser.html.HtmlParserTest)

Tests run: 165, Failures: 1, Errors: 0, Skipped: 0

{noformat}

I'll now apply your patch and see if it fixes it...

Cheers,
Chris





[jira] Commented: (TIKA-498) HTML parser fails on turkish locale

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902496#action_12902496 ] 

Chris A. Mattmann commented on TIKA-498:
----------------------------------------

Okey dokey, I've confirmed Robert's patch fixes the issue:

{noformat}
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika
[INFO]    task-segment: [test]
[INFO] ------------------------------------------------------------------------
[INFO] [remote-resources:process {execution: default}]
[INFO] 
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Apache Tika parent .................................... SUCCESS [1.425s]
[INFO] Apache Tika core ...................................... SUCCESS [7.387s]
[INFO] Apache Tika parsers ................................... SUCCESS [1:11.272s]
[INFO] Apache Tika application ............................... SUCCESS [4.470s]
[INFO] Apache Tika OSGi bundle ............................... SUCCESS [0.872s]
[INFO] Apache Tika ........................................... SUCCESS [0.300s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1 minute 28 seconds
[INFO] Finished at: Wed Aug 25 09:05:01 PDT 2010
[INFO] Final Memory: 37M/81M
[INFO] ------------------------------------------------------------------------
{noformat}


