Posted to dev@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/07/05 11:42:46 UTC
FW: Tika content detection and crawled "remote" content
Dominik,
Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!
-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
Sent: Tuesday, July 4, 2017 6:18 AM
To: user@tika.apache.org
Subject: Tika content detection and crawled "remote" content
Hi,
recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch) with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage and isn't always correct [1].
For the June 2017 crawl I've prepared a comparison of content types sent by the server in the HTTP header and as detected by Tika 1.15 [2]. It shows that the content types detected by Tika are definitely clean (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects a MIME type where the server sent none:
     count  Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml
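Pair counts like these can be tallied mechanically. The following is a minimal Python sketch, assuming each line of the diff file [2] has the three whitespace-separated columns shown in the table (count, Tika type, HTTP Content-Type). The real file layout may differ, and `parse_pairs` and `top_confusions` are hypothetical helper names, not part of Tika or Nutch.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    count: int
    detected: str  # type reported by Tika
    http: str      # type from the HTTP Content-Type header ("unk" if absent)


def parse_pairs(lines: Iterable[str]) -> List[Pair]:
    """Parse 'count detected http' lines; skip headers, blanks, malformed lines."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        pairs.append(Pair(int(fields[0]), fields[1], fields[2]))
    return pairs


def top_confusions(pairs: Iterable[Pair], n: int = 10) -> List[Pair]:
    """Return the n most frequent pairs where Tika and the server disagree."""
    differing = [p for p in pairs if p.detected != p.http]
    return sorted(differing, key=lambda p: p.count, reverse=True)[:n]


# A few rows copied from the table in this message.
SAMPLE = """\
Tika-1.15 HTTP-Content-Type
2298146 application/rss+xml text/xml
617435 application/rss+xml application/xml
613525 text/html unk
"""

if __name__ == "__main__":
    for p in top_confusions(parse_pairs(SAMPLE.splitlines())):
        print(f"{p.count:>10}  {p.detected:<24} {p.http}")
```

The header row is dropped because it does not start with a number, so the same function should work on input that keeps or omits it.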
However, there are a few dubious decisions, esp. the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
  count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html
Of course, due to misconfigurations some servers may deliver the script files unmodified but in general I wouldn't expect that this happens for millions of pages. I've checked some of the affected URLs:
- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
http://www.privi.com/product-details.asp?cno=C10910011
http://mental-ray.de/Root_alt/Default.asp
http://ekyrs.org/support/index.php?action=profile
http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
- (overlong) comment block at start of HTML which "masks" the HTML declaration
http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
https://de.e-stories.org/categories.php?&lan=nl&art=p
- HTML with some scripting fragments ("<?php?>") present:
http://www.eco-ani-yao.org/shien/
- others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
http://www.proedinc.com/customer/content.aspx?redid=9
http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type sent from the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
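To make the weighting idea concrete, here is a rough Python sketch of one possible policy. It is not how Tika or Nutch currently resolves types. The function `resolve_type` and the set `SCRIPT_TYPES` are hypothetical, and the set simply lists the script types from the table above.

```python
from typing import Optional

# Server-side script types taken from the table in this message. A remote
# response labelled with one of these is usually rendered output rather
# than script source.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/aspdotnet", "text/x-coldfusion",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}


def resolve_type(magic_type: Optional[str],
                 suffix_type: Optional[str],
                 http_type: Optional[str]) -> str:
    """Pick a MIME type from three hints (hypothetical policy).

    magic_type  -- result of content-based detection, if any
    suffix_type -- type implied by the URL's file suffix, if any
    http_type   -- Content-Type header sent by the server, if any
    """
    # Content-based evidence is preferred over the URL suffix.
    candidate = magic_type or suffix_type
    # If the candidate is a server-side script type and the server sent a
    # header, trust the header instead.
    if candidate in SCRIPT_TYPES and http_type:
        return http_type
    # Otherwise keep the candidate, then the header, then a generic default.
    return candidate or http_type or "application/octet-stream"


if __name__ == "__main__":
    print(resolve_type(None, "text/x-php", "text/html"))
```

The trade-off is that a misconfigured server that really does deliver script source with a text/html header would be recorded as text/html under this policy.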
If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] now contains a new field "mime-detected", which makes it easy to search or grep for confusion pairs.
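As an illustration of such a search, the Python sketch below filters index lines for a single confusion pair. Only the "mime-detected" field name comes from this message. The line layout ("urlkey timestamp JSON") and the "mime" and "url" field names are assumptions about the index format, and the sample records use made-up example.com URLs.

```python
import json
from typing import Iterable, Iterator, Tuple


def confusion_urls(lines: Iterable[str],
                   http_mime: str,
                   detected_mime: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, timestamp) for index records matching one confusion pair."""
    for line in lines:
        # Assumed layout: "<urlkey> <timestamp> <json>"
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue
        _, timestamp, payload = parts
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue
        # "mime" is assumed to hold the HTTP Content-Type value.
        if (record.get("mime") == http_mime
                and record.get("mime-detected") == detected_mime):
            yield record.get("url", ""), timestamp


# Made-up records for illustration only.
SAMPLE_LINES = [
    'com,example)/index.php 20170623000000 '
    '{"url": "http://example.com/index.php", "mime": "text/html", '
    '"mime-detected": "text/x-php"}',
    'com,example)/ 20170623000001 '
    '{"url": "http://example.com/", "mime": "text/html", '
    '"mime-detected": "text/html"}',
]

if __name__ == "__main__":
    for url, ts in confusion_urls(SAMPLE_LINES, "text/html", "text/x-php"):
        print(ts, url)
```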
Thanks and best,
Sebastian
[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
RE: FW: Tika content detection and crawled "remote" content
Posted by "Allison, Timothy B." <ta...@mitre.org>.
> The initial intention is, of course, to help to improve the MIME detection in Tika core.
Absolutely agree.
> Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika:
Tika-1.15 HTTP-Content-Type
12520 application/x-tika-msoffice application/octet-stream
6681 application/x-tika-ooxml application/octet-stream
3793 application/x-tika-msoffice text/plain
Agreed, as I look at the numbers they aren't huge, but the improvement for our test corpus development is fantastic. Even a few thousand extra docx, for example, will help.
My guess is that the x-tika-ooxml and x-tika-msoffice are truncated files. Common Crawl is truncating at 1MB, right?
Again, WOW!!! Thank you!!!
Cheers,
Tim
-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
Sent: Wednesday, July 5, 2017 8:43 AM
To: Allison, Timothy B. <ta...@mitre.org>
Cc: dominik.stadler@gmx.at; POI Developers List (dev@poi.apache.org) <de...@poi.apache.org>
Subject: Re: FW: Tika content detection and crawled "remote" content
Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika:
Tika-1.15 HTTP-Content-Type
12520 application/x-tika-msoffice application/octet-stream
6681 application/x-tika-ooxml application/octet-stream
3793 application/x-tika-msoffice text/plain
3515 application/x-tika-msoffice application/force-download
2259 application/x-tika-ooxml application/msword
1911 application/x-tika-msoffice unk
1314 application/x-tika-msoffice application/download
1259 application/x-tika-ooxml unk
1068 application/x-tika-ooxml application/force-download
711 application/x-tika-msoffice file/unknown
...
The initial intention is, of course, to help to improve the MIME detection in Tika core.
Among the detected office formats there is one conspicuous pair:
127 application/msword text/vnd.graphviz
Looks like *.dot is taken as indicator only for MSWord documents.
Let me know if I can help to extract any data sets!
Thanks,
Sebastian
On 07/05/2017 01:42 PM, Allison, Timothy B. wrote:
> Dominik,
> Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Tuesday, July 4, 2017 6:18 AM
> To: user@tika.apache.org
> Subject: Tika content detection and crawled "remote" content
>
> Hi,
>
> recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage and isn't always correct [1].
>
> For the June 2017 crawl I've prepared a comparison of content types
> sent by the server in the HTTP header and as detected by Tika 1.15
> [2]. It shows that content types by Tika are definitely clean
> (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
>
> A look on the "confusions" where Content-Type and Tika differ, shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:
>
> Tika-1.15 HTTP-Content-Type
> 1001968023 application/xhtml+xml text/html
> 2298146 application/rss+xml text/xml
> 617435 application/rss+xml application/xml
> 613525 text/html unk
> 361525 application/xhtml+xml unk
> 297707 application/rdf+xml application/xml
>
>
> However, there are a few dubious decisions, esp. the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
>
> Tika-1.15 HTTP-Content-Type
> 2047739 text/x-php text/html
> 681629 text/asp text/html
> 193095 text/x-coldfusion text/html
> 172318 text/aspdotnet text/html
> 139033 text/x-jsp text/html
> 38415 text/x-cgi text/html
> 32092 text/x-php text/xml
> 18021 text/x-perl text/html
>
> Of course, due to misconfigurations some servers may deliver the script files unmodified but in general I wouldn't expect that this happens for millions of pages. I've checked some of the affected URLs:
>
> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>
> https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
> http://www.privi.com/product-details.asp?cno=C10910011
> http://mental-ray.de/Root_alt/Default.asp
> http://ekyrs.org/support/index.php?action=profile
> http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>
> - (overlong) comment block at start of HTML which "masks" the HTML declaration
> http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>
> http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
> https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
> https://de.e-stories.org/categories.php?&lan=nl&art=p
>
> - HTML with some scripting fragments ("<?php?>") present:
> http://www.eco-ani-yao.org/shien/
>
> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
> http://www.proedinc.com/customer/content.aspx?redid=9
> http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
> http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>
> http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>
>
> Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type sent from the responding server.
> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
>
> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
>
>
> Thanks and best,
> Sebastian
>
>
> [1] https://github.com/commoncrawl/nutch/issues/3
> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>
> https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
Re: FW: Tika content detection and crawled "remote" content
Posted by Sebastian Nagel <wa...@googlemail.com>.
Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika:
count  Tika-1.15                    HTTP-Content-Type
12520  application/x-tika-msoffice  application/octet-stream
 6681  application/x-tika-ooxml     application/octet-stream
 3793  application/x-tika-msoffice  text/plain
 3515  application/x-tika-msoffice  application/force-download
 2259  application/x-tika-ooxml     application/msword
 1911  application/x-tika-msoffice  unk
 1314  application/x-tika-msoffice  application/download
 1259  application/x-tika-ooxml     unk
 1068  application/x-tika-ooxml     application/force-download
  711  application/x-tika-msoffice  file/unknown
...
The initial intention is, of course, to help to improve the MIME detection in Tika core.
Among the detected office formats there is one conspicuous pair:
127 application/msword text/vnd.graphviz
Looks like *.dot is taken as indicator only for MSWord documents.
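The ambiguity could be handled along these lines. This is a small Python sketch, assuming a suffix table that allows several candidate types per suffix and uses the server's Content-Type to choose between them. `SUFFIX_CANDIDATES` and `type_for_suffix` are hypothetical names and do not reflect Tika's actual glob handling.

```python
from typing import Dict, List, Optional

# Hypothetical suffix table: one suffix may map to several candidate types.
# The first entry is the default when nothing else disambiguates.
SUFFIX_CANDIDATES: Dict[str, List[str]] = {
    ".dot": ["application/msword", "text/vnd.graphviz"],
}


def type_for_suffix(suffix: str, http_type: Optional[str]) -> Optional[str]:
    """Choose a type for a suffix, letting the HTTP header break ties."""
    candidates = SUFFIX_CANDIDATES.get(suffix.lower(), [])
    if not candidates:
        return None
    # If the server's header names one of the candidates, prefer it.
    if http_type in candidates:
        return http_type
    return candidates[0]


if __name__ == "__main__":
    print(type_for_suffix(".dot", "text/vnd.graphviz"))
```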
Let me know if I can help to extract any data sets!
Thanks,
Sebastian
On 07/05/2017 01:42 PM, Allison, Timothy B. wrote:
> Dominik,
> Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Tuesday, July 4, 2017 6:18 AM
> To: user@tika.apache.org
> Subject: Tika content detection and crawled "remote" content
>
> Hi,
>
> recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage and isn't always correct [1].
>
> For the June 2017 crawl I've prepared a comparison of content types sent by the server in the HTTP header and as detected by Tika 1.15 [2]. It shows that content types by Tika are definitely clean
> (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
>
> A look on the "confusions" where Content-Type and Tika differ, shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:
>
> Tika-1.15 HTTP-Content-Type
> 1001968023 application/xhtml+xml text/html
> 2298146 application/rss+xml text/xml
> 617435 application/rss+xml application/xml
> 613525 text/html unk
> 361525 application/xhtml+xml unk
> 297707 application/rdf+xml application/xml
>
>
> However, there are a few dubious decisions, esp. the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
>
> Tika-1.15 HTTP-Content-Type
> 2047739 text/x-php text/html
> 681629 text/asp text/html
> 193095 text/x-coldfusion text/html
> 172318 text/aspdotnet text/html
> 139033 text/x-jsp text/html
> 38415 text/x-cgi text/html
> 32092 text/x-php text/xml
> 18021 text/x-perl text/html
>
> Of course, due to misconfigurations some servers may deliver the script files unmodified but in general I wouldn't expect that this happens for millions of pages. I've checked some of the affected URLs:
>
> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>
> https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
> http://www.privi.com/product-details.asp?cno=C10910011
> http://mental-ray.de/Root_alt/Default.asp
> http://ekyrs.org/support/index.php?action=profile
> http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>
> - (overlong) comment block at start of HTML which "masks" the HTML declaration
> http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>
> http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
> https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
> https://de.e-stories.org/categories.php?&lan=nl&art=p
>
> - HTML with some scripting fragments ("<?php?>") present:
> http://www.eco-ani-yao.org/shien/
>
> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
> http://www.proedinc.com/customer/content.aspx?redid=9
> http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
> http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
> http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>
>
> Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type sent from the responding server.
> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
>
> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs.
>
>
> Thanks and best,
> Sebastian
>
>
> [1] https://github.com/commoncrawl/nutch/issues/3
> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>
> https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>