Posted to user@tika.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/07/04 10:18:22 UTC
Tika content detection and crawled "remote" content
Hi,
recently I've plugged Tika's content detection into Common Crawl's crawler (a modified Nutch) with
the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage and isn't
always correct [1].
For the June 2017 crawl I've prepared a comparison of the content types sent by the server in the HTTP
header and those detected by Tika 1.15 [2]. It shows that the content types from Tika are definitely clean
(1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
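To illustrate why raw header values produce so many distinct "strings", here is a minimal Python sketch that reduces a Content-Type header value to a bare type/subtype. It is not the code used in the crawler; the function name is hypothetical, and returning "unk" for unusable values is an assumption modelled on the "unk" entries in the tables of this thread.

```python
def normalize_content_type(header_value):
    """Reduce a raw Content-Type header value to a bare 'type/subtype'.

    Hypothetical helper: returns "unk" when nothing usable is left.
    """
    if not header_value:
        return "unk"
    # Drop parameters such as "; charset=utf-8", then surrounding
    # whitespace and stray quotes, and compare case-insensitively.
    media_type = header_value.split(";", 1)[0].strip().strip("\"'").lower()
    # A usable media type has exactly one slash, two non-empty parts,
    # and no embedded spaces.
    parts = media_type.split("/")
    if len(parts) != 2 or not all(parts) or " " in media_type:
        return "unk"
    return media_type
```

Even with this kind of cleanup the header only says what the server claims, which is why it is compared with a content-based detector in the first place.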
A look at the "confusions", where Content-Type and Tika differ, shows a mixed picture: some pairs are
plausible, e.g., if Tika changes the type to a more precise subtype or detects a MIME type at all:
            Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml
However, there are a few dubious decisions, esp. the group of web server-side scripting languages
(ASP, JSP, PHP, ColdFusion, etc.):
         Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html
Of course, due to misconfigurations some servers may deliver the script files unmodified but in
general I wouldn't expect that this happens for millions of pages. I've checked some of the
affected URLs:
- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
http://www.privi.com/product-details.asp?cno=C10910011
http://mental-ray.de/Root_alt/Default.asp
http://ekyrs.org/support/index.php?action=profile
http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
- (overlong) comment block at start of HTML which "masks" the HTML declaration
http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
https://de.e-stories.org/categories.php?&lan=nl&art=p
- HTML with some scripting fragments ("<?php?>") present:
http://www.eco-ani-yao.org/shien/
- others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
http://www.proedinc.com/customer/content.aspx?redid=9
http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
Obviously certain file suffixes (.php, .aspx) should get less weight than the Content-Type sent
by the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
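One possible crawler-side answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what Nutch's MimeUtil or Tika actually does: the function name, the two sets, and the rule itself (trust the server's markup type when the detector reports a server-side script type) are all hypothetical. The script types listed are the ones from the table above.

```python
# Types the detector reported for pages that were served as HTML/XML
# (taken from the confusion table in this message).
SERVER_SIDE_SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/aspdotnet", "text/x-coldfusion",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}

# Assumption: header values we consider a plausible result of running
# a server-side script.
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "text/xml", "application/xml"}


def resolve_mime(detected, http_content_type):
    """Pick the final MIME type from the detected type and the HTTP header.

    Hypothetical policy: a script type is most likely a suffix artefact
    when the server says it delivered markup, so prefer the header then.
    In all other cases keep the detected type.
    """
    if detected in SERVER_SIDE_SCRIPT_TYPES and http_content_type in MARKUP_TYPES:
        return http_content_type
    return detected
```

With this rule `resolve_mime("text/x-php", "text/html")` yields `text/html`, while a plausible refinement such as `application/rss+xml` over `text/xml` is left untouched. The drawback is that genuinely misconfigured servers delivering raw script source would be mislabelled as HTML.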
If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
happy to help! The URL index [4] now contains a new field "mime-detected" which makes it easy to
search or grep for confusion pairs.
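As a rough illustration of such a search, the following Python sketch counts confusion pairs from index records. Only the field name "mime-detected" comes from this message; the name "mime" for the HTTP Content-Type field, the assumption that each line ends with a JSON object (optionally preceded by a key and timestamp), and the function name are hypothetical.

```python
import json
from collections import Counter


def count_confusion_pairs(lines):
    """Count (detected, sent) pairs where the two MIME types differ.

    Assumes each line contains one JSON object, possibly after some
    prefix text; everything from the first "{" is parsed as JSON.
    """
    pairs = Counter()
    for line in lines:
        start = line.find("{")
        if start < 0:
            continue  # no JSON payload on this line
        record = json.loads(line[start:])
        sent = record.get("mime", "unk")  # assumed field name
        detected = record.get("mime-detected", "unk")
        if sent != detected:
            pairs[(detected, sent)] += 1
    return pairs
```

Calling `count_confusion_pairs(lines).most_common(20)` would then list the most frequent pairs in the same detected/sent order as the tables above.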
Thanks and best,
Sebastian
[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
Re: Tika content detection and crawled "remote" content
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
a follow up based on Tika 1.16 for the July crawl:
#        Tika-1.16          HTTP-Content-Type
4580525  text/x-php         text/html
 842698  text/x-coldfusion  text/html
 579128  text/asp           text/html
 510323  text/aspdotnet     text/html
 255267  text/x-jsp         text/html
The full list is available at
s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.16-cc-main-2017-30.txt.xz
I hope to find some time in the coming weeks to try the WARC parser, take a closer look, and open
issues for the problems with HTML and scripting languages.
Thanks,
Sebastian
RE: FW: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
> The initial intention is, of course, to help to improve the MIME detection in Tika core.
Absolutely agree.
> Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika:
>        Tika-1.15                    HTTP-Content-Type
> 12520  application/x-tika-msoffice  application/octet-stream
>  6681  application/x-tika-ooxml     application/octet-stream
>  3793  application/x-tika-msoffice  text/plain
Agreed, as I look at the numbers they aren't huge, but the improvement for our test corpus development is fantastic. Even a few thousand extra docx, for example, will help.
My guess is that the x-tika-ooxml and x-tika-msoffice are truncated files. Common Crawl is truncating at 1MB, right?
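The truncation guess could be checked for the ZIP-based OOXML case with a small Python sketch like the one below. It rests on two facts about the ZIP format: a ZIP file starts with a local file header (`PK\x03\x04`), and its end-of-central-directory record sits at the end of the file, so a cut-off file loses it. The function name is hypothetical, and the check says nothing about OLE2-based files (x-tika-msoffice) or about where Common Crawl actually truncates.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"


def looks_like_truncated_zip(payload: bytes) -> bool:
    """Return True if the payload starts like a ZIP but has no end record.

    zipfile.is_zipfile() looks for the end-of-central-directory record,
    which a truncated ZIP (and hence a truncated OOXML file) is missing.
    """
    return payload.startswith(ZIP_LOCAL_HEADER) and not zipfile.is_zipfile(
        io.BytesIO(payload)
    )
```

If most records detected as application/x-tika-ooxml return True here, that would support the truncation explanation; if not, something else keeps the detector from naming a specific OOXML subtype.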
Again, WOW!!! Thank you!!!
Cheers,
Tim
Re: FW: Tika content detection and crawled "remote" content
Posted by Sebastian Nagel <wa...@googlemail.com>.
Yes, you'll get a few tens of thousands more (MS) Office documents thanks to Tika:
       Tika-1.15                    HTTP-Content-Type
12520  application/x-tika-msoffice  application/octet-stream
 6681  application/x-tika-ooxml     application/octet-stream
 3793  application/x-tika-msoffice  text/plain
 3515  application/x-tika-msoffice  application/force-download
 2259  application/x-tika-ooxml     application/msword
 1911  application/x-tika-msoffice  unk
 1314  application/x-tika-msoffice  application/download
 1259  application/x-tika-ooxml     unk
 1068  application/x-tika-ooxml     application/force-download
  711  application/x-tika-msoffice  file/unknown
...
The initial intention is, of course, to help to improve the MIME detection in Tika core.
Among the detected office formats there is one conspicuous pair:
127 application/msword text/vnd.graphviz
It looks like the *.dot suffix is taken as an indicator for MS Word documents only.
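The ambiguity of the suffix can be illustrated with a minimal Python sketch that looks at the content instead. It is not how Tika resolves the case; the function name and the fallback type are hypothetical. It relies on the OLE2 compound file signature used by legacy Word files and on the Graphviz DOT grammar, where a graph starts with `graph`, `digraph`, or `strict`. Leading comments in DOT files are not handled.

```python
# Signature of OLE2 compound files (legacy .doc/.dot Word documents).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def classify_dot_file(payload: bytes) -> str:
    """Guess whether a *.dot payload is a Word template or a Graphviz file."""
    if payload.startswith(OLE2_MAGIC):
        return "application/msword"
    # DOT keywords are case-insensitive; inspect only the beginning.
    head = payload[:1024].decode("utf-8", errors="replace").lstrip().lower()
    if head.startswith(("digraph", "graph", "strict")):
        return "text/vnd.graphviz"
    # Hypothetical fallback when neither check matches.
    return "application/octet-stream"
```

For the 127 records in the pair above, a check of this kind would show whether the server's text/vnd.graphviz was right.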
Let me know if I can help to extract any data sets!
Thanks,
Sebastian
FW: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Dominik,
Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!
FW: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
All,
> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
This is an amazing step forward for sampling PDF files from Common Crawl. I used to rely on the http-headers and/or file suffix, but now we also have Tika's judgment on every file in Common Crawl.
We still have to deal with the 1MB truncation (I think), but this is an amazing development. Thank you, Sebastian!
Cheers,
Tim
Re: Tika content detection and crawled "remote" content
Posted by Luís Filipe Nassif <lf...@gmail.com>.
Hi Nick,
As commented on TIKA-2419, I fixed the original issue of eml/emlx being
detected as html locally by increasing the magic priority of eml/emlx instead
of decreasing the html priority. Maybe that is an alternative to dropping the
xml priority in the future, but it can impact other things too.
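The trade-off between the two options can be pictured with a toy model of priority-based magic matching. This is a simplified sketch, not Tika's implementation: the byte patterns, the priority numbers and the 1024-byte window are all invented for illustration.

```python
def make_rules(html_priority, eml_priority):
    """Build (mime type, priority, magic bytes) rules; all values are illustrative."""
    return [
        ("text/html", html_priority, b"<html"),
        ("message/rfc822", eml_priority, b"Message-ID:"),
        ("text/x-php", 50, b"<?php"),
    ]


def detect(data, rules, window=1024):
    """Return the type of the highest-priority rule whose magic occurs in the window."""
    head = data[:window]
    matches = [(priority, mime) for mime, priority, magic in rules if magic in head]
    if not matches:
        return "application/octet-stream"
    # On equal priority the rule listed first wins.
    return max(matches, key=lambda match: match[0])[1]


# An e-mail with an HTML body, and an HTML page with an embedded PHP fragment.
EMAIL = b"Message-ID: <1@example.org>\r\nSubject: hi\r\n\r\n<html><body>hello</body></html>"
PAGE = b"<html><body><?php echo 1; ?></body></html>"
```

In this model, raising the e-mail priority (`make_rules(60, 70)`) classifies EMAIL as message/rfc822 and leaves PAGE as text/html, whereas lowering the HTML priority (`make_rules(40, 50)`) also fixes EMAIL but lets the PHP rule win for PAGE.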
Luis
2017-07-05 11:07 GMT-03:00 Nick Burch <ni...@apache.org>:
> Having taken a "quick" look over lunch at some of the "programming
> language" ones, and gone down a rabbit hole... I think at least some of
> them are as described in TIKA-2419, where our change to the HTML magic
> priority as a fix for HTML-containing formats like email had broken some
> things.
>
> I've done a quick fix for 1.16, but it'd be good to try the impact of
> other things, eg dropping the xml priority to match the html one to see if
> that helps / breaks other things
>
>
> Otherwise, for anything else (eg that word / graphviz one), please do open
> up JIRAs!
>
> Thanks
> Nick
>
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>
>> Why, yes, please! JIRA with small samples would be fantastic. I think
>> working in desc order of most common to least would be best...php, asp,
>> coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this
>> tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>> Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: user@tika.apache.org
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on
>> Jira) or whether I can help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>
>>> This is FANTASTIC!!! Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level. We'll
>>> never be 100%, but most of the problems you describe _should_ be fixable.
>>>
>>> > If anyone is interested in using the detected MIME types or anything
>>> else from Common Crawl - I'm happy to help! The URL index [4] contains now
>>> a new field "mime-detected" which makes it easy to search or grep for
>>> confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus. We used to
>>> rely on the http headers and/or file suffix to oversample non-html. This
>>> will allow far cleaner pulls.
RE: Tika content detection and crawled "remote" content
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 7 Jul 2017, Allison, Timothy B. wrote:
> Should we add a WARC parser? ☺
I think we should!
And also add support into Tika Batch for processing from them :)
Nick
Re: Tika content detection and crawled "remote" content
Posted by Chris Mattmann <ma...@apache.org>.
Yep!
From: "Allison, Timothy B." <ta...@mitre.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Friday, July 7, 2017 at 3:52 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: RE: Tika content detection and crawled "remote" content
Should we add a WARC parser? :)
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).
You could do that with my good old Behemoth (https://github.com/DigitalPebble/behemoth) in 2 steps: convert the WARC to Behemoth format, then run Tika on that.
On 6 July 2017 at 13:27, Sebastian Nagel <wa...@googlemail.com> wrote:
Hi,
> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
Done, see TIKA-2242.
>> Why, yes, please! JIRA with small samples would be fantastic.
1000 randomly chosen examples per content-type are ready:
https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
tika_html_server_side_scripting_lang_php.warc.gz
tika_html_server_side_scripting_lang_asp.warc.gz
tika_html_server_side_scripting_lang_coldfusion.warc.gz
tika_html_server_side_scripting_lang_jsp.warc.gz
tika_html_server_side_scripting_lang_cgi.warc.gz
tika_html_server_side_scripting_lang_perl.warc.gz
Note: there are few real PHP/JSP/Perl/... documents among them.
If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.
Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
and parsing is contained (URL, HTTP metadata, binary content).
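To illustrate that a WARC response record carries all three inputs, here is a minimal standard-library sketch that yields the URL, the HTTP headers and the payload of each response record. It is not an existing tool and makes simplifying assumptions: every record has a correct Content-Length, and no transfer or content encoding of the HTTP body is undone. The function names are hypothetical.

```python
import gzip


def read_warc_responses(stream):
    """Yield (url, http_headers, body) for each response record in a WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        warc_headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            warc_headers[name.strip().lower()] = value.strip()
        block = stream.read(int(warc_headers.get("content-length", 0)))
        if warc_headers.get("warc-type") != "response":
            continue
        # The block holds the raw HTTP response: status line, headers, blank line, body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        http_headers = {}
        for raw in http_head.split(b"\r\n")[1:]:  # skip the status line
            name, _, value = raw.decode("iso-8859-1").partition(":")
            http_headers[name.strip().lower()] = value.strip()
        yield warc_headers.get("warc-target-uri"), http_headers, body


def iter_warc_gz(path):
    """Read a .warc.gz file; gzip.open handles the usual multi-member layout."""
    with gzip.open(path, "rb") as stream:
        yield from read_warc_responses(stream)
```

Each yielded triple could then be handed to a detector: the URL and the Content-Type header as hints, the body as the bytes to inspect.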
Thanks,
Sebastian
--
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble
RE: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
>which have a pretty heavy/messy dependency tree
You've seen our pom, right? We have you covered!
...
<dependency>
<groupId>*</groupId>
<artifactId>*</artifactId>
<version>*</version>
</dependency>
...
From: Jackson, Andy [mailto:Andrew.Jackson@bl.uk]
Sent: Friday, July 7, 2017 7:19 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
In case it helps, I wrote some prototype modules to add ARC and WARC support to Tika:
https://github.com/ukwa/webarchive-discovery/tree/master/digipres-tika/src/main/java/uk/bl/wa/tika/parser/warc
...and extended Tika to use them:
https://github.com/ukwa/webarchive-discovery/blob/master/digipres-tika/src/main/java/uk/bl/wa/tika/PreservationParser.java#L62-L63
However, they are based on the Internet Archive's (W)ARC parsers, which have a pretty heavy/messy dependency tree. It would probably be better to build them on JWAT, which has few dependencies (but may not be quite as robust to edge cases as the IA ones).
https://sbforge.org/display/JWAT/JWAT
(see e.g. https://sbforge.org/display/JWAT/Reading+a+WARC+file)
Hope that helps,
Andy Jackson (UK Web Archive)
Re: Tika content detection and crawled "remote" content
Posted by "Jackson, Andy" <An...@bl.uk>.
In case it helps, I wrote some prototype modules to add ARC and WARC support to Tika:
https://github.com/ukwa/webarchive-discovery/tree/master/digipres-tika/src/main/java/uk/bl/wa/tika/parser/warc
…and extended Tika to use them:
https://github.com/ukwa/webarchive-discovery/blob/master/digipres-tika/src/main/java/uk/bl/wa/tika/PreservationParser.java#L62-L63
However, they are based on the Internet Archive’s (W)ARC parsers, which have a pretty heavy/messy dependency tree. It would probably be better to build them on JWAT, which has few dependencies (but may not be quite as robust to edge cases as the IA ones).
https://sbforge.org/display/JWAT/JWAT
(see e.g. https://sbforge.org/display/JWAT/Reading+a+WARC+file)
Hope that helps,
Andy Jackson (UK Web Archive)
From: Timothy Allison <ta...@mitre.org>
Reply-To: user@tika.apache.org
Date: Friday, 7 July 2017 at 11:52
To: user@tika.apache.org
Subject: RE: Tika content detection and crawled "remote" content
Should we add a WARC parser? :)
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).
you could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth> in 2 steps: WARC to Behemoth format, then run Tika on that
RE: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Should we add a WARC parser? ☺
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).
you could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth> in 2 steps: WARC to Behemoth format, then run Tika on that
Re: Tika content detection and crawled "remote" content
Posted by Julien Nioche <li...@gmail.com>.
>
> Is anyone aware of a tool to run Tika on a WARC file? Everything required
> for detection
> and parsing is contained (URL, HTTP metadata, binary content).
You could do that with my good old Behemoth
<https://github.com/DigitalPebble/behemoth> in 2 steps: convert the WARC to Behemoth
format, then run Tika on that.
--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
Re: Tika content detection and crawled "remote" content
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
Done, see TIKA-2242.
>> Why, yes, please! JIRA with small samples would be fantastic.
1000 randomly chosen examples per content-type are ready:
https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
tika_html_server_side_scripting_lang_php.warc.gz
tika_html_server_side_scripting_lang_asp.warc.gz
tika_html_server_side_scripting_lang_coldfusion.warc.gz
tika_html_server_side_scripting_lang_jsp.warc.gz
tika_html_server_side_scripting_lang_cgi.warc.gz
tika_html_server_side_scripting_lang_perl.warc.gz
Note: there are few real PHP/JSP/Perl/... documents among them.
If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.
Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
and parsing is contained (URL, HTTP metadata, binary content).
Thanks,
Sebastian
On 07/05/2017 04:07 PM, Nick Burch wrote:
> Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the
> HTML magic priority to fix for HTML-containing formats like email had broken some things.
>
> I've done a quick fix for 1.16, but it'd be good to try the impact of other things, eg dropping the
> xml priority to match the html one to see if that helps / breaks other things
>
>
> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
>
> Thanks
> Nick
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>> Why, yes, please! JIRA with small samples would be fantastic. I think working in desc order of
>> most common to least would be best...php, asp, coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>> Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: user@tika.apache.org
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can
>> help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>> This is FANTASTIC!!! Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of
>>> the problems you describe _should_ be fixable.
>>>
>>> > If anyone is interested in using the detected MIME types or anything else from Common Crawl -
>>> I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it
>>> easy to search or grep for confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus. We used to rely on the http headers
>>> and/or file suffix to oversample non-html. This will allow far cleaner pulls.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>>> Sent: Tuesday, July 4, 2017 6:18 AM
>>> To: user@tika.apache.org
>>> Subject: Tika content detection and crawled "remote" content
>>>
>>> Hi,
>>>
>>> recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch)
>>> with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage
>>> and isn't always correct [1].
>>>
>>> For the June 2017 crawl I've prepared a comparison of content types
>>> sent by the server in the HTTP header and as detected by Tika 1.15
>>> [2]. It shows that content types by Tika are definitely clean
>>> (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
>>>
>>> A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs
>>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME type at all:
>>>
>>>      count  Tika-1.15              HTTP-Content-Type
>>> 1001968023  application/xhtml+xml  text/html
>>>    2298146  application/rss+xml    text/xml
>>>     617435  application/rss+xml    application/xml
>>>     613525  text/html              unk
>>>     361525  application/xhtml+xml  unk
>>>     297707  application/rdf+xml    application/xml
>>>
>>>
>>> However, there are a few dubious decisions, esp. the group of web server-side scripting languages
>>> (ASP, JSP, PHP, ColdFusion, etc.):
>>>
>>>   count  Tika-1.15          HTTP-Content-Type
>>> 2047739  text/x-php         text/html
>>>  681629  text/asp           text/html
>>>  193095  text/x-coldfusion  text/html
>>>  172318  text/aspdotnet     text/html
>>>  139033  text/x-jsp         text/html
>>>   38415  text/x-cgi         text/html
>>>   32092  text/x-php         text/xml
>>>   18021  text/x-perl        text/html
>>>
>>> Of course, due to misconfigurations some servers may deliver the script files unmodified but in
>>> general I wouldn't expect that this happens for millions of pages. I've checked some of the
>>> affected URLs:
>>>
>>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening
>>> tag)
>>>
>>> https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>>>
>>> http://www.privi.com/product-details.asp?cno=C10910011
>>> http://mental-ray.de/Root_alt/Default.asp
>>> http://ekyrs.org/support/index.php?action=profile
>>> http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>>>
>>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>>> http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>>>
>>> http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>>>
>>> https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>>> https://de.e-stories.org/categories.php?&lan=nl&art=p
>>>
>>> - HTML with some scripting fragments ("<?php?>") present:
>>> http://www.eco-ani-yao.org/shien/
>>>
>>> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
>>> http://www.proedinc.com/customer/content.aspx?redid=9
>>> http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>>> http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>>> http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068
>>> f79
>>>
>>>
>>> Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type
>>> sent from the responding server.
>>> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
>>>
>>> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
>>> happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to
>>> search or grep for confusion pairs.
>>>
>>>
>>> Thanks and best,
>>> Sebastian
>>>
>>>
>>> [1] https://github.com/commoncrawl/nutch/issues/3
>>> [2]
>>> s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tik
>>> a-1.15-cc-main-2017-26.txt.xz
>>>
>>> https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/c
>>> ontent-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>> [3]
>>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/
>>> util/MimeUtil.java#L152 [4]
>>> http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>>>
>>
>
Re: Tika content detection and crawled "remote" content
Posted by Nick Burch <ni...@apache.org>.
Having taken a "quick" look over lunch at some of the "programming
language" ones, and gone down a rabbit hole... I think at least some of
them are as described in TIKA-2419, where our change to the HTML magic
priority, made to fix detection for HTML-containing formats like email,
had broken some things.
I've done a quick fix for 1.16, but it'd be good to try the impact of
other things, e.g. dropping the XML priority to match the HTML one, to see
if that helps / breaks other things.
Otherwise, for anything else (e.g. that Word / Graphviz one), please do
open up JIRAs!
Thanks
Nick
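To make the priority effect concrete, here is a toy model of priority-based magic matching. It is only a sketch and not Tika's actual detector: the rule set, the priority numbers and the 64-byte window are all made up for illustration. It shows how an HTML rule with a lower priority can lose to a scripting-language rule on a page that is plainly HTML.

```python
from typing import List, NamedTuple, Optional


class MagicRule(NamedTuple):
    mime: str
    priority: int    # higher priority wins when several rules match
    pattern: bytes   # byte sequence to look for
    max_offset: int  # pattern must start at or before this offset


# Hypothetical rules and priorities, chosen only to show the effect.
RULES = [
    MagicRule("text/html", 40, b"<html", 64),
    MagicRule("text/x-php", 50, b"<?php", 64),
]


def detect(data: bytes, rules: List[MagicRule] = RULES) -> Optional[str]:
    """Return the MIME type of the highest-priority matching rule, or None."""
    lowered = data.lower()
    matches = [r for r in rules if 0 <= lowered.find(r.pattern) <= r.max_offset]
    if not matches:
        return None
    return max(matches, key=lambda r: r.priority).mime


page = b"<html><body><?php echo 1; ?></body></html>"
print(detect(page))  # text/x-php: the PHP rule outranks the HTML rule

# Raising the HTML priority above the PHP one flips the result.
raised = [r._replace(priority=60) if r.mime == "text/html" else r for r in RULES]
print(detect(page, raised))  # text/html
```

In this toy model a tie between two matching rules is resolved by list order, which is one reason why matching two priorities may help some cases and break others.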
On 05/07/17 14:10, Allison, Timothy B. wrote:
> Why, yes, please! JIRA with small samples would be fantastic. I think working in desc order of most common to least would be best...php, asp, coldfusion.
>
> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
>
> Again, many thanks!
>
> Cheers,
>
> Tim
RE: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Why, yes, please! JIRA with small samples would be fantastic. I think working in desc order of most common to least would be best...php, asp, coldfusion.
I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
Again, many thanks!
Cheers,
Tim
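A small sketch of how that ordering could be derived from the comparison data. It assumes each row is "count, Tika type, HTTP Content-Type" separated by whitespace, as in the excerpt from the original post; the layout of the full file may differ, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Rows copied from the scripting-language table in the original post.
ROWS = """\
2047739 text/x-php text/html
681629 text/asp text/html
193095 text/x-coldfusion text/html
172318 text/aspdotnet text/html
139033 text/x-jsp text/html
38415 text/x-cgi text/html
32092 text/x-php text/xml
18021 text/x-perl text/html
"""


def parse(text: str) -> List[Tuple[int, str, str]]:
    """Parse 'count detected declared' rows, skipping anything else."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].isdigit():
            pairs.append((int(parts[0]), parts[1], parts[2]))
    return pairs


def by_detected_type(pairs: List[Tuple[int, str, str]]) -> List[Tuple[str, int]]:
    """Sum the counts per detected type and sort from most to least common."""
    totals: Dict[str, int] = {}
    for count, detected, _declared in pairs:
        totals[detected] = totals.get(detected, 0) + count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for mime, total in by_detected_type(parse(ROWS)):
    print(f"{total:>9}  {mime}")
```

For the rows above this puts text/x-php first, followed by text/asp and text/x-coldfusion.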
-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
Sent: Wednesday, July 5, 2017 9:03 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
Hi Tim,
thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can help by compiling smaller test sets.
Best,
Sebastian
Re: Tika content detection and crawled "remote" content
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Tim,
thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira)
or whether I can help by compiling smaller test sets.
Best,
Sebastian
On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
> This is FANTASTIC!!! Thank you, Sebastian!
>
> I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of the problems you describe _should_ be fixable.
>
> > If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
>
> This is an amazing step forward for our regression corpus. We used to rely on the http headers and/or file suffix to oversample non-html. This will allow far cleaner pulls.
Re: Tika content detection and crawled "remote" content
Posted by Chris Mattmann <ma...@apache.org>.
Totally agree, thank you Common Crawl for running Tika!
On 7/5/17, 5:09 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:
This is FANTASTIC!!! Thank you, Sebastian!
I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of the problems you describe _should_ be fixable.
> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
This is an amazing step forward for our regression corpus. We used to rely on the http headers and/or file suffix to oversample non-html. This will allow far cleaner pulls.
RE: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
This is FANTASTIC!!! Thank you, Sebastian!
I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of the problems you describe _should_ be fixable.
> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
This is an amazing step forward for our regression corpus. We used to rely on the http headers and/or file suffix to oversample non-html. This will allow far cleaner pulls.
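As a rough illustration of such a pull, the sketch below filters index records on the "mime-detected" field. Only that field name comes from this thread; the line layout (a key and timestamp followed by a JSON object), the "url" and "mime" field names, and the sample records are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator, Optional

HTML_TYPES = {"text/html", "application/xhtml+xml"}

# Hypothetical index lines; real records may carry different fields.
SAMPLE = [
    'com,example)/a 20170623000000 {"url": "http://example.com/a", '
    '"mime": "text/html", "mime-detected": "text/html"}',
    'com,example)/b.pdf 20170623000001 {"url": "http://example.com/b.pdf", '
    '"mime": "text/html", "mime-detected": "application/pdf"}',
    'com,example)/c 20170623000002 {"url": "http://example.com/c", '
    '"mime": "unk", "mime-detected": "application/xhtml+xml"}',
]


def parse_record(line: str) -> Optional[dict]:
    """Parse the JSON object at the end of an index line, if there is one."""
    start = line.find("{")
    if start < 0:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def non_html(lines: Iterable[str]) -> Iterator[dict]:
    """Yield records whose detected type is present and not HTML/XHTML."""
    for line in lines:
        rec = parse_record(line)
        detected = rec.get("mime-detected") if rec else None
        if detected and detected not in HTML_TYPES:
            yield rec


def is_confusion(rec: dict) -> bool:
    """True when the declared type and the detected type differ."""
    return rec.get("mime") != rec.get("mime-detected")


for rec in non_html(SAMPLE):
    print(rec["url"], rec["mime-detected"], is_confusion(rec))
```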
-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
Sent: Tuesday, July 4, 2017 6:18 AM
To: user@tika.apache.org
Subject: Tika content detection and crawled "remote" content
Hi,
recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch) with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage and isn't always correct [1].
For the June 2017 crawl I've prepared a comparison of the content types sent by the server in the HTTP header and those detected by Tika 1.15 [2]. It shows that the content types from Tika are definitely clean (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME type at all:
     Count  Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml
However, there are a few dubious decisions, esp. the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
  Count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html
Of course, due to misconfigurations some servers may deliver the script files unmodified but in general I wouldn't expect that this happens for millions of pages. I've checked some of the affected URLs:
- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
http://www.privi.com/product-details.asp?cno=C10910011
http://mental-ray.de/Root_alt/Default.asp
http://ekyrs.org/support/index.php?action=profile
http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
- (overlong) comment block at start of HTML which "masks" the HTML declaration
http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
https://de.e-stories.org/categories.php?&lan=nl&art=p
- HTML with some scripting fragments ("<?php?>") present:
http://www.eco-ani-yao.org/shien/
- others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
http://www.proedinc.com/customer/content.aspx?redid=9
http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type sent from the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
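One possible crawler-side answer, sketched under assumptions: drop the file-name hint for server-side script suffixes before handing a document to the detector, so that only the content and the Content-Type header can influence the result. The suffix list and the function name are hypothetical, and this is not what Nutch or Tika currently do.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical list of suffixes that describe the server-side script,
# not the format of the response body.
SCRIPT_SUFFIXES = {".php", ".asp", ".aspx", ".jsp", ".cfm", ".cgi", ".pl"}


def name_hint(url: str) -> Optional[str]:
    """Return a file name to use as a detection hint, or None to give no hint."""
    path = urlsplit(url).path
    suffix = posixpath.splitext(path)[1].lower()
    if suffix in SCRIPT_SUFFIXES:
        return None  # the suffix says nothing about what was delivered
    return posixpath.basename(path) or None


print(name_hint("http://ekyrs.org/support/index.php?action=profile"))  # None
print(name_hint("http://example.com/docs/report.pdf"))  # report.pdf
```

The same rule could live inside the detector instead; placing it in the crawler only keeps the change local to the component that knows the content came from a remote server.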
If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
Thanks and best,
Sebastian
[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/