Posted to user@tika.apache.org by Oscar Rieken Jr via user <us...@tika.apache.org> on 2022/07/25 20:35:24 UTC
Datasets for testing large number of attachments
I am currently trying to validate our Tika setup and am looking for a set of example data I could use.
I found this directory: https://corpora.tika.apache.org/base/docs/cc_large/
Should I just download that data set, or is there another place with multiple file types?
Forgive me, this is all new to me and I'm just trying to figure it all out.
Re: Datasets for testing large number of attachments
Posted by Oscar Rieken Jr <os...@cofense.com.INVALID>.
Yeah, I think I'd want to pull the data from CommonCrawl and go from there.
Do you have a link to that script?
From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 2:59 PM
To: Oscar Rieken Jr <os...@cofense.com>
Cc: user@tika.apache.org <us...@tika.apache.org>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
We have ~1.9TB. But I'd skip cc_large because that's just a copy of some directories under commoncrawl3_refetched.
If you want to pull fresher data out of CommonCrawl, I have undocumented scripts to do that. I could add documentation.
These are the top 100 mime types and counts. This db was generated on a slightly earlier version of the corpus/corpora, but it should be close enough.
MIME_STRING cnt
application/pdf 768490
text/plain 472041
text/html 429707
application/x-tika-msoffice 297990
image/png 190815
application/octet-stream 190645
image/jpeg 179533
application/xhtml+xml 151830
application/x-bzip2 124204
application/x-tika-ooxml 122523
application/x-bzip 107435
application/xml 107003
application/zip 93467
application/x-sh 88712
application/gzip 73535
image/gif 66713
application/zlib 46483
text/calendar 40385
application/postscript 35526
application/rss+xml 34428
application/atom+xml 28950
multipart/appledouble 27602
image/svg+xml 25771
application/vnd.oasis.opendocument.text 25753
application/rdf+xml 24890
application/vnd.google-earth.kml+xml 24049
application/rtf 23915
application/x-matroska 19437
application/x-shockwave-flash 18879
video/quicktime 18546
application/epub+zip 18205
application/vnd.ms-excel 17465
application/x-xz 16869
text/x-vcard 16772
application/java-vm 16761
audio/mpeg 15534
message/rfc822 14405
application/vnd.oasis.opendocument.spreadsheet 12659
application/x-bibtex-text-file 12261
application/x-rar-compressed; version=4 12123
text/x-php 10870
text/x-diff 10080
video/mp4 8281
audio/mp4 8221
application/x-msdownload 8019
application/x-bittorrent 7964
image/vnd.microsoft.icon 7382
application/mbox 6799
application/x-x509-cert; format=der 6597
audio/vnd.wave 6550
image/bmp 6411
application/x-endnote-refer 5922
image/vnd.djvu 5874
text/x-matlab 5734
application/vnd.apple.mpegurl 5511
image/tiff 5430
image/webp 4972
application/vnd.oasis.opendocument.presentation 3989
text/x-jsp 3973
text/x-csrc 3555
video/x-ms-wmv 3453
video/x-m4v 3443
application/x-dbf 3381
text/x-chdr 3263
text/x-perl 3124
application/x-rpm 3023
application/x-mobipocket-ebook 2726
audio/midi 2697
application/vnd.oasis.opendocument.graphics 2675
application/vnd.ms-excel.sheet.4 2591
application/x-font-ttf 2575
application/xspf+xml 2557
text/x-python 2416
audio/vorbis 2354
application/msword 2223
application/ogg 2222
application/x-gtar 2181
audio/x-mpegurl 2067
video/x-flv 1969
audio/x-ms-wma 1874
image/icns 1857
application/x-object 1823
application/x-7z-compressed 1795
application/x-msdownload; format=pe32 1784
application/x-debian-package 1700
application/x-mysql-table-definition 1669
image/vnd.dxf; format=ascii 1664
application/x-sqlite3 1606
application/x-berkeley-db; format=hash 1457
application/x-executable 1455
video/mpeg 1366
application/pkcs7-signature 1359
application/x-ms-asx 1266
image/vnd.zbrush.pcx 1247
image/vnd.dwg 1243
application/fits 1217
application/xslfo+xml 1206
application/x-sharedlib 1185
audio/prs.sid 1173
text/x-vcalendar 1156
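To judge how close this mix is to what you need, the listing above can be turned into percentages. The following is a minimal sketch, assuming the table has been saved to a text file in the same "MIME_STRING cnt" layout. The function name mime_share and the file name are hypothetical, and the shares are relative to the rows listed, not to the whole corpus.

```shell
#!/bin/bash
# Hypothetical helper: read "MIME_STRING cnt" rows on stdin and print each
# type with its count and its share of the listed total (tab-separated).
# The count is taken as the last field because some MIME strings contain
# spaces (e.g. "application/x-rar-compressed; version=4").
mime_share() {
  awk '
    $NF !~ /^[0-9]+$/ { next }               # skip the header and blank lines
    {
      count = $NF
      name = $0
      sub(/[ \t]+[0-9]+[ \t]*$/, "", name)   # drop the trailing count
      names[++n] = name
      counts[n] = count
      total += count
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%.1f%%\n", names[i], counts[i], 100 * counts[i] / total
    }
  '
}
```

Usage, with mime_counts.txt standing in for wherever the table was saved: `mime_share < mime_counts.txt`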
On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <os...@cofense.com> wrote:
We were thinking of something around 2TB of data with a good mix of Excel files, images, PDFs, text, and PowerPoint files. So I guess a mix of everything.
From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: user@tika.apache.org <us...@tika.apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
What Nick said...
cc_large is a sample of some of the larger documents from commoncrawl3_refetched.
If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar. That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem. Obv, don't put the tika-core tests jar on your class path in production. See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.
On the corpora, as Nick said, let us know what you want and we can help you select files.
Cheers,
Tim
On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use
If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. The main set is in
tika-parsers/src/test/resources/test-documents/
If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from Common Crawl, govdocs and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection.
Nick
Re: Datasets for testing large number of attachments
Posted by Oscar Rieken Jr <os...@cofense.com.INVALID>.
Awesome, thanks, I'll give this a shot!
From: Nicholas DiPiazza <ni...@gmail.com>
Date: Tuesday, July 26, 2022 at 3:13 PM
To: user@tika.apache.org <us...@tika.apache.org>, tallison@apache.org <ta...@apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
Script I used back in the day to do what you are looking for:
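#!/bin/bash
# Download and unpack the govdocs1 zip archives one at a time.
# Archive names are zero-padded to three digits (002 ... 999).
for i in $(seq -f "%03g" 2 999)
do
  # Only unzip when the download succeeded.
  wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip" && unzip "$i.zip"
  # Remove the archive (or the empty file left by a failed download).
  rm -f "$i.zip"
done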
not sure if it still works
On Tue, Jul 26, 2022 at 1:59 PM Tim Allison <ta...@apache.org> wrote:
As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch of truncated files. We refetched some and put those under commoncrawl3_refetched.
Re: Datasets for testing large number of attachments
Posted by Nicholas DiPiazza <ni...@gmail.com>.
Script I used back in the day to do what you are looking for:
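#!/bin/bash
# Download and unpack the govdocs1 zip archives one at a time.
# Archive names are zero-padded to three digits (002 ... 999).
for i in $(seq -f "%03g" 2 999)
do
  # Only unzip when the download succeeded.
  wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip" && unzip "$i.zip"
  # Remove the archive (or the empty file left by a failed download).
  rm -f "$i.zip"
done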
not sure if it still works
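Since it is unclear whether the download location is still live, it may help to generate the URL list first and check one or two entries by hand before starting a long run. A minimal sketch, assuming the same base URL as the script above; the function name govdocs_urls and the file name urls.txt are hypothetical.

```shell
#!/bin/bash
# Print one govdocs1 zip URL per line for archives START..END.
# The base URL is copied from the script above and may no longer resolve.
govdocs_urls() {
  local start="$1" end="$2"
  local base="http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles"
  local i
  for i in $(seq -f "%03g" "$start" "$end"); do
    printf '%s/%s.zip\n' "$base" "$i"
  done
}
```

For example, `govdocs_urls 2 999 > urls.txt` writes the list, and `wget --continue --input-file=urls.txt` can then fetch it and resume after an interruption.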
Re: Datasets for testing large number of attachments
Posted by Tim Allison <ta...@apache.org>.
As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch
of truncated files. We refetched some and put those under
commoncrawl3_refetched.
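One rough way to spot candidates for truncation in a downloaded directory is to list files whose size sits exactly at the limit. This is only a sketch under assumptions: it takes "1MB" to mean 1048576 bytes, the function name find_truncated is hypothetical, and a file of exactly that size is merely a candidate, not proof of truncation.

```shell
#!/bin/bash
# Hypothetical helper: list files under DIR whose size is exactly LIMIT bytes.
# LIMIT defaults to 1048576 (1 MiB), which is an assumption about where the
# truncation happens; pass a different value if your data differs.
find_truncated() {
  local dir="$1" limit="${2:-1048576}"
  # The "c" suffix tells find to compare the size in bytes.
  find "$dir" -type f -size "${limit}c"
}
```

For example, `find_truncated ./commoncrawl` prints the matching paths, where ./commoncrawl stands in for wherever the files were unpacked.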
> video/mpeg 1366
> application/pkcs7-signature 1359
> application/x-ms-asx 1266
> image/vnd.zbrush.pcx 1247
> image/vnd.dwg 1243
> application/fits 1217
> application/xslfo+xml 1206
> application/x-sharedlib 1185
> audio/prs.sid 1173
> text/x-vcalendar 1156
>
>
>
>
> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <
> oscar.riekenjr@cofense.com> wrote:
>
>> We were thinking something around 2TB of data with a good mix of excel,
>> images, pdfs, text and powerpoints. So I guess a mix of everything.
>>
>>
>>
>> *From: *Tim Allison <ta...@apache.org>
>> *Date: *Tuesday, July 26, 2022 at 9:19 AM
>> *To: *user@tika.apache.org <us...@tika.apache.org>
>> *Cc: *Oscar Rieken Jr <os...@cofense.com>,
>> corpora-dev@tika.apache.org <co...@tika.apache.org>
>> *Subject: *Re: Datasets for testing large number of attachments
>>
>> External Email
>>
>> What Nick said...
>>
>>
>>
>> cc_large is a sample of some of the larger documents from
>> commoncrawl3_refetched.
>>
>>
>>
>> If you want to give your pipeline a workout, I also recommend using the
>> MockParser that is available in the tika-core tests jar. That allows you
>> to instrument an OOM and timeouts and system exits and all sorts of other
>> mayhem. Obv, don't put the tika-core tests jar on your class path in
>> production. See the files in
>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
>> for examples of how to trigger bad behavior with the MockParser.
>>
>>
>>
>> On the corpora, as Nick said, let us know what you want and we can help
>> you select files.
>>
>>
>>
>> Cheers,
>>
>>
>>
>> Tim
>>
>>
>>
>>
>>
>> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
>>
>> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
>> > I am currently trying to validate our Tika setup and was looking for a
>> > set of example data I could use
>>
>> If you want a small number of files of lots of different types, the test
>> files in the Tika source tree will work. Main set are in
>> tika-parsers/src/test/resources/test-documents/
>>
>> If you want a very large number of files, then the Tika Corpora
>> collection
>> is a good source. We have a few different collections, including stuff
>> from common crawl, govdocs and bug trackers. If you can let us know what
>> sort of file types and how many, we can suggest the best corpora
>> collection
>>
>> Nick
>>
>>
Re: Datasets for testing large number of attachments
Posted by Tim Allison <ta...@apache.org>.
We have ~1.9TB. But I'd skip cc_large because that's just a copy of some
directories under commoncrawl3_refetched.
If you want to pull fresher data out of CommonCrawl, I have undocumented
scripts to do that. I could add documentation.
These are the top 100 mime types and counts. This db was generated on a
slightly earlier version of the corpus/corpora, but it should be close
enough.
MIME_STRING cnt
application/pdf 768490
text/plain 472041
text/html 429707
application/x-tika-msoffice 297990
image/png 190815
application/octet-stream 190645
image/jpeg 179533
application/xhtml+xml 151830
application/x-bzip2 124204
application/x-tika-ooxml 122523
application/x-bzip 107435
application/xml 107003
application/zip 93467
application/x-sh 88712
application/gzip 73535
image/gif 66713
application/zlib 46483
text/calendar 40385
application/postscript 35526
application/rss+xml 34428
application/atom+xml 28950
multipart/appledouble 27602
image/svg+xml 25771
application/vnd.oasis.opendocument.text 25753
application/rdf+xml 24890
application/vnd.google-earth.kml+xml 24049
application/rtf 23915
application/x-matroska 19437
application/x-shockwave-flash 18879
video/quicktime 18546
application/epub+zip 18205
application/vnd.ms-excel 17465
application/x-xz 16869
text/x-vcard 16772
application/java-vm 16761
audio/mpeg 15534
message/rfc822 14405
application/vnd.oasis.opendocument.spreadsheet 12659
application/x-bibtex-text-file 12261
application/x-rar-compressed; version=4 12123
text/x-php 10870
text/x-diff 10080
video/mp4 8281
audio/mp4 8221
application/x-msdownload 8019
application/x-bittorrent 7964
image/vnd.microsoft.icon 7382
application/mbox 6799
application/x-x509-cert; format=der 6597
audio/vnd.wave 6550
image/bmp 6411
application/x-endnote-refer 5922
image/vnd.djvu 5874
text/x-matlab 5734
application/vnd.apple.mpegurl 5511
image/tiff 5430
image/webp 4972
application/vnd.oasis.opendocument.presentation 3989
text/x-jsp 3973
text/x-csrc 3555
video/x-ms-wmv 3453
video/x-m4v 3443
application/x-dbf 3381
text/x-chdr 3263
text/x-perl 3124
application/x-rpm 3023
application/x-mobipocket-ebook 2726
audio/midi 2697
application/vnd.oasis.opendocument.graphics 2675
application/vnd.ms-excel.sheet.4 2591
application/x-font-ttf 2575
application/xspf+xml 2557
text/x-python 2416
audio/vorbis 2354
application/msword 2223
application/ogg 2222
application/x-gtar 2181
audio/x-mpegurl 2067
video/x-flv 1969
audio/x-ms-wma 1874
image/icns 1857
application/x-object 1823
application/x-7z-compressed 1795
application/x-msdownload; format=pe32 1784
application/x-debian-package 1700
application/x-mysql-table-definition 1669
image/vnd.dxf; format=ascii 1664
application/x-sqlite3 1606
application/x-berkeley-db; format=hash 1457
application/x-executable 1455
video/mpeg 1366
application/pkcs7-signature 1359
application/x-ms-asx 1266
image/vnd.zbrush.pcx 1247
image/vnd.dwg 1243
application/fits 1217
application/xslfo+xml 1206
application/x-sharedlib 1185
audio/prs.sid 1173
text/x-vcalendar 1156
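To judge whether the corpus has enough of a given family of types, the table above can be parsed and summed by prefix. The following is a minimal sketch, not part of the corpora tooling: `parse_mime_counts` and `total_for` are hypothetical helper names, and `SAMPLE` is a short excerpt of the rows above, not the full table. Splitting on the last space keeps MIME strings with parameters (for example `; version=4`) intact.

```python
def parse_mime_counts(text):
    """Parse 'mime/type[; param=value] count' lines into a dict."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("MIME_STRING"):
            continue  # skip blank lines and the header row
        mime, _, cnt = line.rpartition(" ")
        counts[mime] = int(cnt)
    return counts


def total_for(counts, prefixes):
    """Sum the counts of all MIME strings starting with any given prefix."""
    return sum(n for mime, n in counts.items() if mime.startswith(prefixes))


# Excerpt of the table above, used only as sample input.
SAMPLE = """\
MIME_STRING cnt
application/pdf 768490
text/plain 472041
image/png 190815
image/jpeg 179533
application/x-rar-compressed; version=4 12123
"""

counts = parse_mime_counts(SAMPLE)
image_total = total_for(counts, ("image/",))
```

`prefixes` must be a tuple because `str.startswith` accepts a tuple of alternatives, so several families can be summed in one call.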
On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <os...@cofense.com>
wrote:
> We were thinking something around 2TB of data with a good mix of excel,
> images, pdfs, text and powerpoints. So I guess a mix of everything.
>
>
>
> *From: *Tim Allison <ta...@apache.org>
> *Date: *Tuesday, July 26, 2022 at 9:19 AM
> *To: *user@tika.apache.org <us...@tika.apache.org>
> *Cc: *Oscar Rieken Jr <os...@cofense.com>,
> corpora-dev@tika.apache.org <co...@tika.apache.org>
> *Subject: *Re: Datasets for testing large number of attachments
>
> External Email
>
> What Nick said...
>
>
>
> cc_large is a sample of some of the larger documents from
> commoncrawl3_refetched.
>
>
>
> If you want to give your pipeline a workout, I also recommend using the
> MockParser that is available in the tika-core tests jar. That allows you
> to instrument an OOM and timeouts and system exits and all sorts of other
> mayhem. Obv, don't put the tika-core tests jar on your class path in
> production. See the files in
> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
> for examples of how to trigger bad behavior with the MockParser.
>
>
>
> On the corpora, as Nick said, let us know what you want and we can help
> you select files.
>
>
>
> Cheers,
>
>
>
> Tim
>
>
>
>
>
> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
>
> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
>
>
Re: Datasets for testing large number of attachments
Posted by Oscar Rieken Jr via user <us...@tika.apache.org>.
We were thinking something around 2TB of data with a good mix of Excel, images, PDFs, text and PowerPoints. So I guess a mix of everything.
From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: user@tika.apache.org <us...@tika.apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
External Email
What Nick said...
cc_large is a sample of some of the larger documents from commoncrawl3_refetched.
If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar. That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem. Obv, don't put the tika-core tests jar on your class path in production. See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.
On the corpora, as Nick said, let us know what you want and we can help you select files.
Cheers,
Tim
On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use
If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. Main set are in
tika-parsers/src/test/resources/test-documents/
If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from common crawl, govdocs and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection
Nick
Re: Datasets for testing large number of attachments
Posted by Tim Allison <ta...@apache.org>.
What Nick said...
cc_large is a sample of some of the larger documents from
commoncrawl3_refetched.
If you want to give your pipeline a workout, I also recommend using the
MockParser that is available in the tika-core tests jar. That allows you
to instrument an OOM and timeouts and system exits and all sorts of other
mayhem. Obv, don't put the tika-core tests jar on your class path in
production. See the files in
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
for examples of how to trigger bad behavior with the MockParser.
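One way to observe that kind of mayhem safely is to run each parse in a child process and classify how it ended. This is only a sketch under assumptions: it presumes your pipeline can be invoked as an external command on one file, and `run_one` and `timeout_s` are hypothetical names, not part of Tika. An OOM or system exit in the child should show up as a non-zero exit code, and a hang as a timeout.

```python
import subprocess


def run_one(cmd, timeout_s=60):
    """Run one parse command and classify it as 'ok', 'timeout', or 'crash'."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "crash"
```

`cmd` is whatever argument list starts your pipeline on a single file. Keeping the parse in a separate process means a crash or hang costs one file rather than the whole run.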
On the corpora, as Nick said, let us know what you want and we can help you
select files.
Cheers,
Tim
On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
>
Re: Datasets for testing large number of attachments
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use
If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. The main set is in
tika-parsers/src/test/resources/test-documents/
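To see which file types a local copy of that directory actually covers, a quick inventory by extension can help. This is a sketch, assuming you have checked out the Tika source and pass the path to the test-documents directory yourself; `count_by_extension` is a hypothetical helper name. File extensions are only a rough proxy for the real type.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count files under root by lower-cased extension ('(none)' if absent)."""
    counts = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            counts[p.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_by_extension(path).most_common(20)` lists the twenty most frequent extensions first.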
If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from common crawl, govdocs and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection
Nick