Posted to user@tika.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/07/04 10:18:22 UTC

Tika content detection and crawled "remote" content

Hi,

I recently plugged Tika's content detection into Common Crawl's crawler (a modified Nutch), with
the aim of getting clean and correct MIME types: the HTTP Content-Type may contain garbage and isn't
always correct [1].

For the June 2017 crawl I've prepared a comparison of the content types sent by the server in the
HTTP header and those detected by Tika 1.15 [2].  It shows that the content types from Tika are
definitely cleaner (1,400 different content types vs. more than 6,000 content type "strings" from
HTTP headers).

A look at the "confusions", where Content-Type and Tika differ, shows a mixed picture. Some pairs
are plausible, e.g., when Tika changes the type to a more precise subtype or manages to detect a
MIME type at all:

            Tika-1.15                HTTP-Content-Type
1001968023  application/xhtml+xml    text/html
   2298146  application/rss+xml      text/xml
    617435  application/rss+xml      application/xml
    613525  text/html                unk
    361525  application/xhtml+xml    unk
    297707  application/rdf+xml      application/xml
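
A minimal sketch of how such confusion pairs could be counted. It assumes that raw Content-Type headers are first reduced to a bare lowercase `type/subtype`, and that a missing header is reported as "unk" (an assumption about what "unk" stands for in the table above). The function names and sample records are hypothetical and are not taken from the crawler.

```python
from collections import Counter


def normalize_content_type(header_value):
    """Reduce a raw Content-Type header to a bare lowercase 'type/subtype'."""
    if not header_value:
        return "unk"  # assumption: "unk" marks a missing or empty header
    # Drop parameters such as "; charset=utf-8" and surrounding whitespace.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type or "unk"


def count_confusions(records):
    """Count (detected, http) pairs where the two types disagree.

    `records` is an iterable of (detected_type, raw_http_header) tuples.
    """
    pairs = Counter()
    for detected, raw_header in records:
        http_type = normalize_content_type(raw_header)
        if detected != http_type:
            pairs[(detected, http_type)] += 1
    return pairs


# Hypothetical sample records, for illustration only.
sample = [
    ("application/xhtml+xml", "text/html; charset=UTF-8"),
    ("application/xhtml+xml", "Text/HTML"),
    ("text/html", None),
    ("text/html", "text/html"),
]
confusions = count_confusions(sample)
for (detected, http_type), n in confusions.most_common():
    print(f"{n:10d}  {detected:24s} {http_type}")
```

Normalizing before counting keeps variants such as "Text/HTML" and "text/html; charset=UTF-8" from showing up as separate pairs.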


However, there are a few dubious decisions, esp. for the group of server-side web scripting
languages (ASP, JSP, PHP, ColdFusion, etc.):

         Tika-1.15         HTTP-Content-Type
2047739  text/x-php        text/html
 681629  text/asp          text/html
 193095  text/x-coldfusion text/html
 172318  text/aspdotnet    text/html
 139033  text/x-jsp        text/html
  38415  text/x-cgi        text/html
  32092  text/x-php        text/xml
  18021  text/x-perl       text/html

Of course, some misconfigured servers may deliver script files unprocessed, but in general I
wouldn't expect this to happen for millions of pages.  I've checked some of the affected URLs:

- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
    https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
    http://www.privi.com/product-details.asp?cno=C10910011
    http://mental-ray.de/Root_alt/Default.asp
    http://ekyrs.org/support/index.php?action=profile
    http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
    http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
    http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
    https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
    https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("<?php?>") present:
    http://www.eco-ani-yao.org/shien/

- others are clearly HTML (this looks more like a bug; at least, there is no simple explanation)
    http://www.proedinc.com/customer/content.aspx?redid=9
    http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
    http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
    http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
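
The first two cases above can be illustrated with a small sketch. It assumes a detector that only looks for an HTML declaration inside a fixed-size prefix of the document; the window size of 1024 bytes and all names here are hypothetical and do not reflect Tika's actual rules.

```python
import re

# Markers that usually identify a full HTML document.
HTML_MARKERS = re.compile(rb"<!doctype\s+html|<html[\s>]", re.IGNORECASE)
# Non-greedy match of HTML comments, including multi-line ones.
COMMENT = re.compile(rb"<!--.*?-->", re.DOTALL)


def looks_like_html(body, window=1024, skip_comments=False):
    """Return True if an HTML declaration appears in the first `window` bytes."""
    if skip_comments:
        body = COMMENT.sub(b"", body)
    return HTML_MARKERS.search(body[:window]) is not None


# A long leading comment pushes the declaration out of the inspected prefix.
masked = b"<!-- " + b"x" * 5000 + b" -->\n<!DOCTYPE html><html><body>hi</body></html>"
# A fragment has no declaration at all, so no window size helps.
fragment = b"<div><p>just a fragment</p></div>"

print(looks_like_html(masked))                      # declaration is masked
print(looks_like_html(masked, skip_comments=True))  # found once comments are skipped
print(looks_like_html(fragment))                    # nothing to find
```

When such a content check fails, a detector has little left to go on besides the file suffix in the URL, which would explain why `.php` or `.asp` wins for these pages.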


Obviously, certain file suffixes (.php, .aspx) should get less weight than the Content-Type sent
by the responding server.
Now my question: where's the best place to fix this, in the crawler [3] or in Tika?
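
On the crawler side, one option could look like the following sketch: a post-processing rule that falls back to the HTTP Content-Type whenever the detector reports a server-side script type but the server announced markup. This is only an illustration under that assumption, not what Nutch or Tika currently do; the function name and the choice of "markup" types are hypothetical, while the script types are the ones from the table above.

```python
# Script types that showed up as dubious detections in the comparison.
SERVER_SIDE_SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/aspdotnet",
    "text/x-coldfusion", "text/x-jsp", "text/x-cgi", "text/x-perl",
}
# Types a server plausibly sends for rendered script output (an assumption).
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "text/xml", "application/xml"}


def resolve_type(detected, http_type):
    """Prefer the HTTP type when a script type was most likely guessed from the suffix."""
    if detected in SERVER_SIDE_SCRIPT_TYPES and http_type in MARKUP_TYPES:
        return http_type
    return detected


print(resolve_type("text/x-php", "text/html"))           # falls back to the HTTP type
print(resolve_type("application/rss+xml", "text/xml"))   # keeps the more precise subtype
print(resolve_type("text/x-php", "unk"))                 # no usable HTTP type, keep detection
```

The rule leaves the plausible refinements (e.g. RSS detected inside text/xml) untouched and only overrides the suffix-driven script types.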

If anyone is interested in using the detected MIME types or anything else from Common Crawl, I'm
happy to help!  The URL index [4] now contains a new field "mime-detected", which makes it easy to
search or grep for confusion pairs.


Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/


Re: Tika content detection and crawled "remote" content

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

a follow-up based on Tika 1.16 for the July crawl:

           #  Tika-1.16                   HTTP-Content-Type
     4580525  text/x-php                  text/html
      842698  text/x-coldfusion           text/html
      579128  text/asp                    text/html
      510323  text/aspdotnet              text/html
      255267  text/x-jsp                  text/html

The full list is available at
  s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.16-cc-main-2017-30.txt.xz
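
A small sketch for working with such a list, assuming the file uses the same three whitespace-separated columns as the excerpt above (count, Tika type, HTTP Content-Type). That layout is an assumption; rows whose HTTP string contains spaces would be skipped by this parser. The function name is hypothetical.

```python
def parse_rows(lines):
    """Parse 'count  detected  http' rows, skipping anything that does not fit."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # header, blank line, or unexpected layout
        count, detected, http_type = fields
        rows.append((int(count), detected, http_type))
    return rows


excerpt = """\
           #  Tika-1.16                   HTTP-Content-Type
     4580525  text/x-php                  text/html
      842698  text/x-coldfusion           text/html
      579128  text/asp                    text/html
      510323  text/aspdotnet              text/html
      255267  text/x-jsp                  text/html
"""
rows = parse_rows(excerpt.splitlines())
# Total number of pages in the excerpt that were sent as text/html
# but detected as a server-side script type.
script_as_html = sum(count for count, _, http_type in rows if http_type == "text/html")
print(script_as_html)
```

For the downloaded file itself, `lzma.open(path, "rt")` from the Python standard library returns a text file object whose lines can be passed to `parse_rows`.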

I hope to find some time in the next few weeks to try the WARC parser, have a closer look, and open
issues for the problems with HTML and scripting languages.

Thanks,
Sebastian




RE: FW: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> The initial intention is, of course, to help to improve the MIME detection in Tika core.
Absolutely agree.

> Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika:

           Tika-1.15                       HTTP-Content-Type
  12520    application/x-tika-msoffice     application/octet-stream
   6681    application/x-tika-ooxml        application/octet-stream
   3793    application/x-tika-msoffice     text/plain

Agreed, as I look at the numbers they aren't huge, but the improvement for our test corpus development is fantastic.  Even a few thousand extra docx, for example, will help.  

My guess is that the x-tika-ooxml and x-tika-msoffice are truncated files.  Common Crawl is truncating at 1MB, right?  
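
One way such a guess could be checked on a sample of files: OOXML documents are ZIP containers, and a ZIP file cut off before its end loses the end-of-central-directory record. The sketch below builds a tiny ZIP in memory, truncates it, and tests both versions with the standard library. It illustrates the truncation check only and says nothing about how Tika itself assigns application/x-tika-ooxml; the helper names and the sample entries are hypothetical.

```python
import io
import zipfile


def build_sample_zip():
    """Create a tiny ZIP in memory with OOXML-like entry names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def is_complete_zip(data):
    """True if the end-of-central-directory record can still be found."""
    return zipfile.is_zipfile(io.BytesIO(data))


complete = build_sample_zip()
truncated = complete[: len(complete) // 2]  # simulate a capped download

print(is_complete_zip(complete))
print(is_complete_zip(truncated))
# Both still start with the local file header signature "PK\x03\x04",
# so magic bytes alone cannot tell them apart.
print(truncated[:4] == complete[:4])
```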

Again, WOW!!!  Thank you!!!

Cheers,

          Tim
-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
Sent: Wednesday, July 5, 2017 8:43 AM
To: Allison, Timothy B. <ta...@mitre.org>
Cc: dominik.stadler@gmx.at; POI Developers List (dev@poi.apache.org) <de...@poi.apache.org>
Subject: Re: FW: Tika content detection and crawled "remote" content

Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika:

           Tika-1.15                       HTTP-Content-Type
  12520    application/x-tika-msoffice     application/octet-stream
   6681    application/x-tika-ooxml        application/octet-stream
   3793    application/x-tika-msoffice     text/plain
   3515    application/x-tika-msoffice     application/force-download
   2259    application/x-tika-ooxml        application/msword
   1911    application/x-tika-msoffice     unk
   1314    application/x-tika-msoffice     application/download
   1259    application/x-tika-ooxml        unk
   1068    application/x-tika-ooxml        application/force-download
    711    application/x-tika-msoffice     file/unknown
    ...

The initial intention is, of course, to help to improve the MIME detection in Tika core.
Among the detected office formats there is one conspicuous pair:

    127    application/msword      text/vnd.graphviz

Looks like *.dot is taken as indicator only for MSWord documents.

Let me know if I can help to extract any data sets!

Thanks,
Sebastian


On 07/05/2017 01:42 PM, Allison, Timothy B. wrote:
> Dominik,
>   Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!
> 
> 


Re: FW: Tika content detection and crawled "remote" content

Posted by Sebastian Nagel <wa...@googlemail.com>.
Yes, you'll get a few tens of thousands more (MS) Office documents thanks to Tika:

           Tika-1.15                       HTTP-Content-Type
  12520    application/x-tika-msoffice     application/octet-stream
   6681    application/x-tika-ooxml        application/octet-stream
   3793    application/x-tika-msoffice     text/plain
   3515    application/x-tika-msoffice     application/force-download
   2259    application/x-tika-ooxml        application/msword
   1911    application/x-tika-msoffice     unk
   1314    application/x-tika-msoffice     application/download
   1259    application/x-tika-ooxml        unk
   1068    application/x-tika-ooxml        application/force-download
    711    application/x-tika-msoffice     file/unknown
    ...

The initial intention is, of course, to help to improve the MIME detection in Tika core.
Among the detected office formats there is one conspicuous pair:

    127    application/msword      text/vnd.graphviz

It looks like the *.dot suffix is taken as an indicator for MS Word documents only, although the
servers sent these files as Graphviz.
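
A sketch of how the two readings of *.dot could be told apart from the first bytes instead of the suffix. It relies on the OLE2 compound file signature used by legacy MS Office files (which identifies the container, not Word specifically) and on the keywords that open a Graphviz graph definition. The function name and the fallback type are hypothetical, and this is not Tika's detection logic.

```python
import re

# Signature of OLE2 compound files, the container of legacy .doc/.dot files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# A Graphviz file typically opens with [strict] (graph | digraph).
GRAPHVIZ_START = re.compile(r"\s*(strict\s+)?(di)?graph\b", re.IGNORECASE)


def guess_dot_type(head):
    """Guess the type of a *.dot file from its first bytes."""
    if head.startswith(OLE2_MAGIC):
        return "application/msword"
    text = head.decode("utf-8", errors="replace")
    if GRAPHVIZ_START.match(text):
        return "text/vnd.graphviz"
    return "application/octet-stream"  # undecided


print(guess_dot_type(OLE2_MAGIC + b"\x00" * 8))
print(guess_dot_type(b"digraph G { a -> b; }"))
print(guess_dot_type(b"hello"))
```

Graphviz files that begin with a comment would fall through to the undecided case in this sketch.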

Let me know if I can help to extract any data sets!

Thanks,
Sebastian



FW: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Dominik,
  Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!


-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
Sent: Tuesday, July 4, 2017 6:18 AM
To: user@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types sent by the server in the HTTP header and as detected by Tika 1.15 [2].  It shows that content types by Tika are definitely clean
(1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).

A look on the "confusions" where Content-Type and Tika differ, shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:

            Tika-1.15                HTTP-Content-Type
1001968023  application/xhtml+xml    text/html
   2298146  application/rss+xml      text/xml
    617435  application/rss+xml      application/xml
    613525  text/html                unk
    361525  application/xhtml+xml    unk
    297707  application/rdf+xml      application/xml


However, there are a few dubious decisions, esp. the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

         Tika-1.15         HTTP-Content-Type
2047739  text/x-php        text/html
 681629  text/asp          text/html
 193095  text/x-coldfusion text/html
 172318  text/aspdotnet    text/html
 139033  text/x-jsp        text/html
  38415  text/x-cgi        text/html
  32092  text/x-php        text/xml
  18021  text/x-perl       text/html

Of course, due to misconfigurations some servers may deliver the script files unmodified but in general I wouldn't expect that this happens for millions of pages.  I've checked some of the affected URLs:

- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
    https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
    http://www.privi.com/product-details.asp?cno=C10910011
    http://mental-ray.de/Root_alt/Default.asp
    http://ekyrs.org/support/index.php?action=profile
    http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
    http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
    http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
    https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
    https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("<?php?>") present:
    http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
    http://www.proedinc.com/customer/content.aspx?redid=9
    http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
    http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
    http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
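
The first two groups above hint at a possible mechanism: if a detector only inspects a limited prefix of the document for HTML markers, then a fragment without <html>, or a long leading comment, leaves nothing to match. The following is a toy model of that idea only. The marker list and the 64-byte window are invented for illustration and are not Tika's actual magic patterns or limits.

```python
# Illustrative markers only; a real detector uses a richer pattern set.
HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")


def looks_like_html(data, window=64):
    """Toy prefix sniffer: look for an HTML marker in the first `window` bytes."""
    prefix = data[:window].lower()
    return any(marker in prefix for marker in HTML_MARKERS)


plain = b"<!DOCTYPE html><html><body>hi</body></html>"
# An HTML fragment: valid markup, but no doctype or <html> tag.
fragment = b"<div class='profile'><p>no html or doctype here</p></div>"
# A long comment pushes the declaration beyond the inspected window.
masked = b"<!-- " + b"x" * 200 + b" -->" + plain

print(looks_like_html(plain))                # marker found in the prefix
print(looks_like_html(fragment))             # no marker anywhere
print(looks_like_html(masked))               # marker exists, but too late
print(looks_like_html(masked, window=1024))  # a larger window finds it
```

Under this model, once content sniffing fails, the decision falls through to weaker evidence such as the file suffix, which would be consistent with the .php and .asp results observed.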


Obviously, certain file suffixes (.php, .aspx) should get less weight than the Content-Type sent by the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
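
One way to read "less weight" is as an ordering of evidence: content magic first, then the server's Content-Type, and a server-side-script suffix only as a last resort. The sketch below is hypothetical. The function resolve_type, the suffix table and the fallback type are assumptions for illustration and do not describe the logic of Nutch's MimeUtil or of Tika.

```python
# Suffix -> type that a name-based detector might guess (assumed mapping).
SCRIPT_SUFFIX_TYPES = {
    ".php": "text/x-php",
    ".asp": "text/asp",
    ".aspx": "text/aspdotnet",
    ".jsp": "text/x-jsp",
    ".cfm": "text/x-coldfusion",
}


def resolve_type(url_path, http_type, magic_type):
    """Pick a MIME type from three signals, in decreasing order of trust."""
    if magic_type:
        # A content-based match is the strongest evidence.
        return magic_type
    if http_type and http_type != "unk":
        # The server claims a type: prefer it over a script-language suffix.
        return http_type
    lowered = url_path.lower()
    for suffix, mime in SCRIPT_SUFFIX_TYPES.items():
        if lowered.endswith(suffix):
            return mime
    return "application/octet-stream"


print(resolve_type("/support/index.php", "text/html", None))
print(resolve_type("/support/index.php", None, None))
```

With this ordering a page served as text/html from a .php URL stays text/html, while a script file delivered without any usable Content-Type would still be labelled by its suffix.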

If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help!  The URL index [4] now contains a new field "mime-detected", which makes it easy to search or grep for confusion pairs.
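
As a rough sketch of what searching for confusion pairs could look like: the code below counts (detected, HTTP) pairs from index lines. Only the field name "mime-detected" comes from this message. The line layout (URL key, timestamp, JSON object), the "mime" field for the HTTP Content-Type, and the sample records are assumptions for illustration.

```python
import json
from collections import Counter


def confusion_pairs(lines):
    """Count (mime-detected, mime) pairs where the two values differ."""
    pairs = Counter()
    for line in lines:
        # Assumed layout: '<urlkey> <timestamp> <json>'
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except ValueError:
            continue  # skip lines whose JSON part is malformed
        http_type = record.get("mime", "unk")
        detected = record.get("mime-detected", "unk")
        if http_type != detected:
            pairs[(detected, http_type)] += 1
    return pairs


# Made-up sample lines in the assumed layout.
sample = [
    'org,example)/a.php 20170623000000 {"url": "http://example.org/a.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/b.php 20170623000001 {"url": "http://example.org/b.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/c 20170623000002 {"url": "http://example.org/c", "mime": "text/html", "mime-detected": "text/html"}',
    'not an index line',
]
pairs = confusion_pairs(sample)
for (detected, http_type), n in pairs.most_common():
    print(n, detected, http_type)
```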


Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/


FW: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
All,

> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.

This is an amazing step forward for sampling PDF files from Common Crawl.  I used to rely on the http-headers and/or file suffix, but now we also have Tika's judgment on every file in Common Crawl.

We still have to deal with the 1MB truncation (I think), but this is an amazing development.  Thank you, Sebastian!

Cheers,

             Tim





Re: Tika content detection and crawled "remote" content

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Hi Nick,

As I commented on TIKA-2419, I fixed the original issue of eml/emlx being
detected as HTML locally by increasing the magic priority of eml/emlx instead
of decreasing the HTML priority. Maybe that is an alternative to dropping the
xml priority in the future, but it can impact other things too.
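
To make the trade-off concrete: with priority-based magic detection, the highest-priority matching pattern wins, so raising one type's priority and lowering another's can both flip the same result, while affecting different neighbours. The following is a toy model only. The patterns and priority numbers are invented and do not reflect Tika's real definitions, which also match at specific offsets rather than anywhere in the data.

```python
def detect(data, rules):
    """Return the type of the highest-priority rule whose pattern occurs in data.

    rules: list of (mime_type, priority, pattern_bytes) tuples.
    """
    matches = [(priority, mime) for mime, priority, pattern in rules
               if pattern in data]
    if not matches:
        return "application/octet-stream"
    return max(matches)[1]


# An email whose body contains HTML: both patterns match.
message = (b"From: someone@example.org\r\nSubject: hi\r\n\r\n"
           b"<html><body>hello</body></html>")

rules_html_wins = [("text/html", 60, b"<html"), ("message/rfc822", 50, b"Subject:")]
rules_eml_raised = [("text/html", 60, b"<html"), ("message/rfc822", 70, b"Subject:")]
rules_html_lowered = [("text/html", 40, b"<html"), ("message/rfc822", 50, b"Subject:")]

print(detect(message, rules_html_wins))
print(detect(message, rules_eml_raised))
print(detect(message, rules_html_lowered))
```

Both adjusted rule sets classify the message as email, but lowering the HTML priority also changes how HTML competes with every other type at priorities between 40 and 60, which is the kind of side effect discussed in this thread.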

Luis

2017-07-05 11:07 GMT-03:00 Nick Burch <ni...@apache.org>:

> Having taken a "quick" look over lunch at some of the "programming
> language" ones, and gone down a rabbit hole... I think at least some of
> them are as described in TIKA-2419, where our change to the HTML magic
> priority to fix for HTML-containing formats like email had broken some
> things.
>
> I've done a quick fix for 1.16, but it'd be good to try the impact of
> other things, eg dropping the xml priority to match the html one to see if
> that helps / breaks other things
>
>
> Otherwise, for anything else (eg that word / graphviz one), please do open
> up JIRAs!
>
> Thanks
> Nick
>
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>
>> Why, yes, please!  JIRA with small samples would be fantastic.  I think
>> working in desc order of most common to least would be best...php, asp,
>> coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this
>> tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>>             Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: user@tika.apache.org
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on
>> Jira) or whether I can help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>
>>> This is FANTASTIC!!!  Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level.  We'll
>>> never be 100%, but most of the problems you describe _should_ be fixable.
>>>
>>>   > If anyone is interested in using the detected MIME types or anything
>>> else from Common Crawl - I'm happy to help!  The URL index [4] contains now
>>> a new field "mime-detected" which makes it easy to search or grep for
>>> confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus.  We used to
>>> rely on the http headers and/or file suffix to oversample non-html.  This
>>> will allow far cleaner pulls.
>>>
>>
>

RE: Tika content detection and crawled "remote" content

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 7 Jul 2017, Allison, Timothy B. wrote:
> Should we add a WARC parser? ☺

I think we should!

And also add support into Tika Batch for processing from them :)

Nick

Re: Tika content detection and crawled "remote" content

Posted by Chris Mattmann <ma...@apache.org>.
Yep!

From: "Allison, Timothy B." <ta...@mitre.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Friday, July 7, 2017 at 3:52 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: RE: Tika content detection and crawled "remote" content

Should we add a WARC parser? :)

From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).

You could do that with my good old Behemoth in 2 steps: WARC to Behemoth format, then run Tika on that.

On 6 July 2017 at 13:27, Sebastian Nagel <wa...@googlemail.com> wrote:

Hi,

> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
Done, see TIKA-2242.

>> Why, yes, please!  JIRA with small samples would be fantastic.

1000 randomly chosen examples per content-type are ready:

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
  tika_html_server_side_scripting_lang_php.warc.gz
  tika_html_server_side_scripting_lang_asp.warc.gz
  tika_html_server_side_scripting_lang_coldfusion.warc.gz
  tika_html_server_side_scripting_lang_jsp.warc.gz
  tika_html_server_side_scripting_lang_cgi.warc.gz
  tika_html_server_side_scripting_lang_perl.warc.gz

Note: there are few real PHP/JSP/Perl/... documents among them.

If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.

Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
and parsing is contained (URL, HTTP metadata, binary content).
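
To show what "everything required" means in practice, here is a minimal sketch that pulls the URL, the HTTP Content-Type and the payload out of a single, uncompressed WARC response record. The helper names are made up for this example, and the parser ignores gzip members, multi-record files, header folding and other edge cases that a real WARC library handles. The three returned values are what a detector would then be given.

```python
def _parse_headers(lines):
    """Turn 'Name: value' byte lines into a dict with lower-cased names."""
    headers = {}
    for line in lines:
        name, sep, value = line.partition(b":")
        if sep:
            key = name.decode("latin-1").strip().lower()
            headers[key] = value.decode("latin-1").strip()
    return headers


def parse_warc_response(record):
    """Split one WARC 'response' record into (URL, HTTP Content-Type, payload)."""
    warc_head, _, block = record.partition(b"\r\n\r\n")
    # First line is the version ('WARC/1.0'), the rest are WARC headers.
    warc_headers = _parse_headers(warc_head.split(b"\r\n")[1:])
    block = block[:int(warc_headers["content-length"])]
    http_head, _, payload = block.partition(b"\r\n\r\n")
    # First line is the HTTP status line, the rest are HTTP headers.
    http_headers = _parse_headers(http_head.split(b"\r\n")[1:])
    return (warc_headers.get("warc-target-uri"),
            http_headers.get("content-type"),
            payload)


# A hand-built record with made-up content.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>fragment</p>"
record = (b"WARC/1.0\r\nWARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.org/index.php\r\n"
          b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
          + http + b"\r\n\r\n")

url, content_type, payload = parse_warc_response(record)
print(url, content_type, payload)
```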

Thanks,
Sebastian





 

-- 


Open Source Solutions for Text Engineering


http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble


RE: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
>which have a pretty heavy/messy dependency tree

You've seen our pom, right?  We have you covered!

...
<dependency>
  <groupId>*</groupId>
  <artifactId>*</artifactId>
  <version>*</version>
</dependency>
...

From: Jackson, Andy [mailto:Andrew.Jackson@bl.uk]
Sent: Friday, July 7, 2017 7:19 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

In case it helps, I wrote some prototype modules to add ARC and WARC support to Tika:

https://github.com/ukwa/webarchive-discovery/tree/master/digipres-tika/src/main/java/uk/bl/wa/tika/parser/warc

...and extended Tika to use them:

https://github.com/ukwa/webarchive-discovery/blob/master/digipres-tika/src/main/java/uk/bl/wa/tika/PreservationParser.java#L62-L63

However, they are based on the Internet Archive's (W)ARC parsers, which have a pretty heavy/messy dependency tree. It would probably be better to build them on JWAT, which has few dependencies (but may not be quite as robust to edge cases as the IA ones).

https://sbforge.org/display/JWAT/JWAT

(see e.g. https://sbforge.org/display/JWAT/Reading+a+WARC+file)

Hope that helps,
Andy Jackson (UK Web Archive)

From: Timothy Allison <ta...@mitre.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Friday, 7 July 2017 at 11:52
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: RE: Tika content detection and crawled "remote" content

Should we add a WARC parser? :)

From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).

You could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth> in 2 steps: WARC to Behemoth format, then run Tika on that.





On 6 July 2017 at 13:27, Sebastian Nagel <wa...@googlemail.com>> wrote:
Hi,

> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
Done, see TIKA-2242.

>> Why, yes, please!  JIRA with small samples would be fantastic.

1000 randomly chosen examples per content-type are ready:

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
  tika_html_server_side_scripting_lang_php.warc.gz
  tika_html_server_side_scripting_lang_asp.warc.gz
  tika_html_server_side_scripting_lang_coldfusion.warc.gz
  tika_html_server_side_scripting_lang_jsp.warc.gz
  tika_html_server_side_scripting_lang_cgi.warc.gz
  tika_html_server_side_scripting_lang_perl.warc.gz

Note: there are few real PHP/JSP/Perl/... documents among them.

If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.

Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
and parsing is contained (URL, HTTP metadata, binary content).

Thanks,
Sebastian

On 07/05/2017 04:07 PM, Nick Burch wrote:
> Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> rabbit whole... I think at least some of them are as described in TIKA-2419, where our change to the
> HTML magic priority to fix for HTML-containing formats like email had broken some things.
>
> I've done a quick fix for 1.16, but it'd be good to try the impact of other things, eg dropping the
> xml priority to match the html one to see if that helps / breaks other things
>
>
> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
>
> Thanks
> Nick
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>> Why, yes, please!  JIRA with small samples would be fantastic.  I think working in desc order of
>> most common to least would be best...php, asp, coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>>             Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com<ma...@googlemail.com>]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: user@tika.apache.org<ma...@tika.apache.org>
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can
>> help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>> This is FANTASTIC!!!  Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level.  We'll never be 100%, but most of
>>> the problems you describe _should_ be fixable.
>>>
>>>   > If anyone is interested in using the detected MIME types or anything else from Common Crawl -
>>> I'm happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it
>>> easy to search or grep for confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus.  We used to rely on the http headers
>>> and/or file suffix to oversample non-html.  This will allow far cleaner pulls.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com<ma...@googlemail.com>]
>>> Sent: Tuesday, July 4, 2017 6:18 AM
>>> To: user@tika.apache.org<ma...@tika.apache.org>
>>> Subject: Tika content detection and crawled "remote" content
>>>
>>> Hi,
>>>
>>> recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch)
>>> with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage
>>> and isn't always correct [1].
>>>
>>> For the June 2017 crawl I've prepared a comparison of content types
>>> sent by the server in the HTTP header and as detected by Tika 1.15
>>> [2].  It shows that content types by Tika are definitely clean
>>> (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
>>>
>>> A look on the "confusions" where Content-Type and Tika differ, shows a mixed picture: some pairs
>>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:
>>>
>>>              Tika-1.15                HTTP-Content-Type
>>> 1001968023  application/xhtml+xml    text/html
>>>     2298146  application/rss+xml      text/xml
>>>      617435  application/rss+xml      application/xml
>>>      613525  text/html                unk
>>>      361525  application/xhtml+xml    unk
>>>      297707  application/rdf+xml      application/xml
>>>
>>>
>>> However, there are a few dubious decisions, esp. the group of web server-side scripting languages
>>> (ASP, JSP, PHP, ColdFusion, etc.):
>>>
>>>           Tika-1.15         HTTP-Content-Type
>>> 2047739  text/x-php        text/html
>>>   681629  text/asp          text/html
>>>   193095  text/x-coldfusion text/html
>>>   172318  text/aspdotnet    text/html
>>>   139033  text/x-jsp        text/html
>>>    38415  text/x-cgi        text/html
>>>    32092  text/x-php        text/xml
>>>    18021  text/x-perl       text/html
>>>
>>> Of course, due to misconfigurations some servers may deliver the script files unmodified but in
>>> general I wouldn't expect that this happens for millions of pages.  I've checked some of the
>>> affected URLs:
>>>
>>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>>>      https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>>>      http://www.privi.com/product-details.asp?cno=C10910011
>>>      http://mental-ray.de/Root_alt/Default.asp
>>>      http://ekyrs.org/support/index.php?action=profile
>>>      http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>>>
>>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>>>      http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>>>      http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>>>      https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>>>      https://de.e-stories.org/categories.php?&lan=nl&art=p
>>>
>>> - HTML with some scripting fragments ("<?php?>") present:
>>>      http://www.eco-ani-yao.org/shien/
>>>
>>> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
>>>      http://www.proedinc.com/customer/content.aspx?redid=9
>>>      http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>>>      http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>>>      http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>>>
>>>
>>> Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type
>>> sent from the responding server.
>>> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
>>>
>>> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
>>> happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it easy to
>>> search or grep for confusion pairs.
>>>
>>>
>>> Thanks and best,
>>> Sebastian
>>>
>>>
>>> [1] https://github.com/commoncrawl/nutch/issues/3
>>> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
>>> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>>>
>>
>



--

Open Source Solutions for Text Engineering

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>



Re: Tika content detection and crawled "remote" content

Posted by "Jackson, Andy" <An...@bl.uk>.
In case it helps, I wrote some prototype modules to add ARC and WARC support to Tika:

https://github.com/ukwa/webarchive-discovery/tree/master/digipres-tika/src/main/java/uk/bl/wa/tika/parser/warc

…and extended Tika to use them:

https://github.com/ukwa/webarchive-discovery/blob/master/digipres-tika/src/main/java/uk/bl/wa/tika/PreservationParser.java#L62-L63

However, they are based on the Internet Archive’s (W)ARC parsers, which have a pretty heavy/messy dependency tree. It would probably be better to build them on JWAT, which has few dependencies (but may not be quite as robust to edge cases as the IA ones).

https://sbforge.org/display/JWAT/JWAT

(see e.g. https://sbforge.org/display/JWAT/Reading+a+WARC+file)

Hope that helps,
Andy Jackson (UK Web Archive)

From: Timothy Allison <ta...@mitre.org>
Reply-To: user@tika.apache.org
Date: Friday, 7 July 2017 at 11:52
To: user@tika.apache.org
Subject: RE: Tika content detection and crawled "remote" content

Should we add a WARC parser? :)

From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).

You could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth> in 2 steps: WARC to Behemoth format, then run Tika on that.








RE: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Should we add a WARC parser? ☺

From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).

You could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth> in 2 steps: WARC to Behemoth format, then run Tika on that.






Re: Tika content detection and crawled "remote" content

Posted by Julien Nioche <li...@gmail.com>.
> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).


You could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth>
in 2 steps: WARC to Behemoth format, then run Tika on that.







-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>

Re: Tika content detection and crawled "remote" content

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
Done, see TIKA-2242.

>> Why, yes, please!  JIRA with small samples would be fantastic.

1000 randomly chosen examples per content-type are ready:

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
  tika_html_server_side_scripting_lang_php.warc.gz
  tika_html_server_side_scripting_lang_asp.warc.gz
  tika_html_server_side_scripting_lang_coldfusion.warc.gz
  tika_html_server_side_scripting_lang_jsp.warc.gz
  tika_html_server_side_scripting_lang_cgi.warc.gz
  tika_html_server_side_scripting_lang_perl.warc.gz

Note: there are few real PHP/JSP/Perl/... documents among them.

If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.

Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
and parsing is contained (URL, HTTP metadata, binary content).

Thanks,
Sebastian
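
On the WARC question: the three inputs named above (URL, HTTP metadata, binary content) can be pulled out of a gzipped WARC file with the Python standard library alone and then handed to whatever detector is being tested. The following is a minimal sketch, assuming well-formed records with a Content-Length header; the function names are hypothetical, header continuation lines are ignored, and the payload is returned as stored (any transfer or content encoding is left untouched).

```python
import gzip
import io


def read_headers(stream):
    """Read 'Name: value' lines up to the first blank line; names are lower-cased."""
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            return headers
        name, sep, value = line.decode("utf-8", "replace").partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()


def iter_responses(stream):
    """Yield (url, http_content_type, payload) for every WARC 'response' record."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of file
        if not line.startswith(b"WARC/"):
            continue                     # blank lines between records
        warc = read_headers(stream)
        block = stream.read(int(warc.get("content-length", 0)))
        if warc.get("warc-type") != "response":
            continue
        http = io.BytesIO(block)
        http.readline()                  # HTTP status line
        http_headers = read_headers(http)
        yield warc.get("warc-target-uri"), http_headers.get("content-type"), http.read()


def iter_warc_gz(path):
    """Convenience wrapper for a gzipped WARC file on disk."""
    with gzip.open(path, "rb") as stream:
        yield from iter_responses(stream)
```

Each yielded tuple carries the URL (usable as a resource-name hint), the server's Content-Type and the raw bytes, which is the information a detector needs.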

On 07/05/2017 04:07 PM, Nick Burch wrote:
> Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the
> HTML magic priority, made to fix detection of HTML-containing formats like email, broke some things.
> 
> I've done a quick fix for 1.16, but it'd be good to try the impact of other things, eg dropping the
> XML priority to match the HTML one, to see if that helps / breaks other things.
> 
> 
> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
> 
> Thanks
> Nick
> 


Re: Tika content detection and crawled "remote" content

Posted by Nick Burch <ni...@apache.org>.
Having taken a "quick" look over lunch at some of the "programming 
language" ones, and gone down a rabbit hole... I think at least some of 
them are as described in TIKA-2419, where our change to the HTML magic 
priority, made to fix detection of HTML-containing formats like email, 
broke some things.

I've done a quick fix for 1.16, but it'd be good to try the impact of 
other things, eg dropping the XML priority to match the HTML one, to see 
if that helps / breaks other things.


Otherwise, for anything else (eg that word / graphviz one), please do 
open up JIRAs!

Thanks
Nick

On 05/07/17 14:10, Allison, Timothy B. wrote:
> Why, yes, please!  JIRA with small samples would be fantastic.  I think working in desc order of most common to least would be best...php, asp, coldfusion.
> 
> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
> 
> Again, many thanks!
> 
> Cheers,
> 
>             Tim
> 


RE: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Why, yes, please!  JIRA with small samples would be fantastic.  I think working in descending order, from most common to least common, would be best: PHP, ASP, ColdFusion.

I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.

Again, many thanks!

Cheers,

           Tim
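
To work through the confusion pairs from most common to least, the rows can be ranked by count. A small sketch, assuming the rows use the same three whitespace-separated columns as the tables in this thread (count, Tika type, HTTP Content-Type); the function name is hypothetical, and rows whose Content-Type string contains spaces would need extra handling.

```python
def rank_pairs(rows):
    """Return (count, detected, declared) tuples for differing pairs, largest count first."""
    pairs = []
    for row in rows:
        parts = row.split()
        if len(parts) != 3 or not parts[0].isdigit():
            continue                     # header line or malformed row
        count, detected, declared = parts
        if detected != declared:
            pairs.append((int(count), detected, declared))
    return sorted(pairs, reverse=True)


rows = [
    "         Tika-1.15         HTTP-Content-Type",
    " 681629  text/asp          text/html",
    "2047739  text/x-php        text/html",
    " 193095  text/x-coldfusion text/html",
]
for count, detected, declared in rank_pairs(rows):
    print(count, detected, declared)
```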

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
Sent: Wednesday, July 5, 2017 9:03 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

Hi Tim,

thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can help by compiling smaller test sets.

Best,
Sebastian



Re: Tika content detection and crawled "remote" content

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Tim,

thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira)
or whether I can help by compiling smaller test sets.

Best,
Sebastian

On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
> This is FANTASTIC!!!  Thank you, Sebastian!
> 
> I suspect that we should try to fix these at the Tika level.  We'll never be 100%, but most of the problems you describe _should_ be fixable.
> 
>  > If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
> 
> This is an amazing step forward for our regression corpus.  We used to rely on the http headers and/or file suffix to oversample non-html.  This will allow far cleaner pulls.
> 


Re: Tika content detection and crawled "remote" content

Posted by Chris Mattmann <ma...@apache.org>.
Totally agree, thank you Common Crawl for running Tika!



On 7/5/17, 5:09 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    This is FANTASTIC!!!  Thank you, Sebastian!
    
    I suspect that we should try to fix these at the Tika level.  We'll never be 100%, but most of the problems you describe _should_ be fixable.
    
     > If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
    
    This is an amazing step forward for our regression corpus.  We used to rely on the http headers and/or file suffix to oversample non-html.  This will allow far cleaner pulls.
    



RE: Tika content detection and crawled "remote" content

Posted by "Allison, Timothy B." <ta...@mitre.org>.
This is FANTASTIC!!!  Thank you, Sebastian!

I suspect that we should try to fix these at the Tika level.  We'll never be 100%, but most of the problems you describe _should_ be fixable.

 > If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.

This is an amazing step forward for our regression corpus.  We used to rely on the HTTP headers and/or file suffix to oversample non-HTML content.  This will allow far cleaner pulls.
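
As a rough illustration of such a pull: the sketch below filters URL index lines for one confusion pair. The "mime-detected" field is the one announced in this thread; the line layout ("urlkey timestamp JSON") and the "mime" and "url" field names are assumptions about the index format, and the function name is hypothetical.

```python
import json


def confusion_pairs(cdx_lines, detected, declared):
    """Yield URLs whose detected type and HTTP Content-Type match the given pair."""
    for line in cdx_lines:
        # Assumed layout: "<urlkey> <timestamp> <json>"; the JSON starts at the first "{".
        start = line.find("{")
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except ValueError:
            continue                     # not a parsable index line
        if record.get("mime-detected") == detected and record.get("mime") == declared:
            yield record.get("url")
```

Calling confusion_pairs(lines, "text/x-php", "text/html") on the lines of an index file would list the pages Tika labelled as PHP although the server declared HTML.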
