Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/01/16 07:01:57 UTC

[jira] Issue Comment Edited: (TIKA-357) Increase buffer size for meta tag sniffing

    [ https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801110#action_12801110 ] 

Chris A. Mattmann edited comment on TIKA-357 at 1/16/10 6:00 AM:
-----------------------------------------------------------------

Ken:

I applied your patch and ran it against your sample file, but the patch does not appear to fix the issue:

[chipotle:~/src/tika/trunk] mattmann% mvn -Dtest=MimeDetectionTest test
[INFO] Scanning for projects...
[INFO] Reactor build order: 
[INFO]   Apache Tika parent
[INFO]   Apache Tika core
[INFO]   Apache Tika parsers
[INFO]   Apache Tika application
[INFO]   Apache Tika OSGi bundle
[INFO]   Apache Tika
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika parent
[INFO]    task-segment: [test]
[INFO] ------------------------------------------------------------------------
[INFO] Setting property: classpath.resource.loader.class => 'org.codehaus.plexus.velocity.ContextClassLoaderResourceLoader'.
[INFO] Setting property: velocimacro.messages.on => 'false'.
[INFO] Setting property: resource.loader => 'classpath'.
[INFO] Setting property: resource.manager.logwhenfound => 'false'.
[INFO] [remote-resources:process {execution: default}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika core
[INFO]    task-segment: [test]
[INFO] ------------------------------------------------------------------------
[INFO] [remote-resources:process {execution: default}]
[INFO] [resources:resources]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 20 resources
[INFO] Copying 3 resources
[INFO] [compiler:compile]
[INFO] Nothing to compile - all classes are up to date
[INFO] [resources:testResources]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 26 resources
[INFO] Copying 3 resources
[INFO] [compiler:testCompile]
[INFO] Nothing to compile - all classes are up to date
[INFO] [surefire:test]
[INFO] Surefire report directory: /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.tika.mime.MimeDetectionTest
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.48 sec <<< FAILURE!

Results :

Failed tests: 
  testDetection(org.apache.tika.mime.MimeDetectionTest)

Tests run: 2, Failures: 1, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] There are test failures.

Please refer to /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports for the individual test results.
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5 seconds
[INFO] Finished at: Fri Jan 15 21:54:44 PST 2010
[INFO] Final Memory: 13M/23M
[INFO] ------------------------------------------------------------------------
[chipotle:~/src/tika/trunk] mattmann% more tika-core/target/surefire-reports/org.apache.tika.mime.MimeDetectionTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.tika.mime.MimeDetectionTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.482 sec <<< FAILURE!
testDetection(org.apache.tika.mime.MimeDetectionTest)  Time elapsed: 0.368 sec  <<< FAILURE!
junit.framework.ComparisonFailure: testlargerbuffer.html is not properly detected: detected. expected:<...html> but was:<...plain>
        at junit.framework.Assert.assertEquals(Assert.java:81)
        at org.apache.tika.mime.MimeDetectionTest.testStream(MimeDetectionTest.java:91)
        at org.apache.tika.mime.MimeDetectionTest.testFile(MimeDetectionTest.java:80)
        at org.apache.tika.mime.MimeDetectionTest.testDetection(MimeDetectionTest.java:61)

[chipotle:~/src/tika/trunk] mattmann% 

Did this patch work for you in terms of AutoDetection? I would have expected the MimeTypes detector to pick up the file with your patch applied, but the patch updates the HtmlParser rather than the detection code. Let me look into this more -- I'd like to get this into 0.6, for which I've been promising to cut an RC for the past few weeks (but haven't had time, sorry!) ;)

Cheers,
Chris


  
> Increase buffer size for meta tag sniffing
> ------------------------------------------
>
>                 Key: TIKA-357
>                 URL: https://issues.apache.org/jira/browse/TIKA-357
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: makler.html, TIKA-357.patch
>
>
> Some web pages (such as makler.su, see attached) have lots of script data before the body of the HTML.
> When this happens, the sniffing code fails to find the charset info in the meta tag, because it currently only sniffs the first 4K.
> Bumping it to 8K would cover all of the cases that I (Ken) have seen during a test crawl.
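
To illustrate the limit described in the issue, here is a minimal sketch of meta tag charset sniffing over a bounded prefix of the document. It is not the actual Tika HtmlParser code: the class name MetaCharsetSniffer, the sniff method, the regular expression, and the generated sample page are all hypothetical and chosen only for the example. It assumes the meta tag itself is plain ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Tika implementation.
class MetaCharsetSniffer {

    // Assumed pattern: a charset=... value anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset named in a meta tag found within the first
     * `limit` bytes of `content`, or null if none is found there.
     */
    static String sniff(byte[] content, int limit) {
        int length = Math.min(content.length, limit);
        // ISO-8859-1 maps every byte to exactly one char, so decoding
        // cannot fail before the real charset is known.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Build a page with more than 4K of script data before the meta tag.
        StringBuilder page = new StringBuilder("<html><head><script>");
        while (page.length() < 5000) {
            page.append("var x = 1; ");
        }
        page.append("</script><meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=windows-1251\"></head></html>");
        byte[] bytes = page.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(sniff(bytes, 4096)); // meta tag lies beyond the 4K window
        System.out.println(sniff(bytes, 8192)); // meta tag lies inside the 8K window
    }
}
```

With the sample page above, the meta tag starts after roughly 5000 bytes of script, so a 4K window misses it while an 8K window finds it. A fixed limit of any size can still be defeated by a page with enough leading script data.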
