Posted to user@tika.apache.org by Oscar Rieken Jr via user <us...@tika.apache.org> on 2022/07/25 20:35:24 UTC

Datasets for testing large number of attachments

I am currently trying to validate our Tika setup and am looking for a set of example data I could use.

I found this directory: https://corpora.tika.apache.org/base/docs/cc_large/
Would I just download that data set, or is there another place with multiple file types?

Forgive me, this is all new to me; I'm just trying to figure it all out.


Re: Datasets for testing large number of attachments

Posted by Oscar Rieken Jr <os...@cofense.com.INVALID>.
Yeah, I think I'd want to pull the data from CommonCrawl and go from there.

Do you have a link to that script?

From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 2:59 PM
To: Oscar Rieken Jr <os...@cofense.com>
Cc: user@tika.apache.org <us...@tika.apache.org>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
We have ~1.9TB.  But I'd skip cc_large because that's just a copy of some directories under commoncrawl3_refetched.

If you want to pull fresher data out of CommonCrawl, I have undocumented scripts to do that.  I could add documentation.

These are the top 100 mime types and counts.  This db was generated on a slightly earlier version of the corpus/corpora, but it should be close enough.

MIME_STRING    cnt
application/pdf    768490
text/plain    472041
text/html    429707
application/x-tika-msoffice    297990
image/png    190815
application/octet-stream    190645
image/jpeg    179533
application/xhtml+xml    151830
application/x-bzip2    124204
application/x-tika-ooxml    122523
application/x-bzip    107435
application/xml    107003
application/zip    93467
application/x-sh    88712
application/gzip    73535
image/gif    66713
application/zlib    46483
text/calendar    40385
application/postscript    35526
application/rss+xml    34428
application/atom+xml    28950
multipart/appledouble    27602
image/svg+xml    25771
application/vnd.oasis.opendocument.text    25753
application/rdf+xml    24890
application/vnd.google-earth.kml+xml    24049
application/rtf    23915
application/x-matroska    19437
application/x-shockwave-flash    18879
video/quicktime    18546
application/epub+zip    18205
application/vnd.ms-excel    17465
application/x-xz    16869
text/x-vcard    16772
application/java-vm    16761
audio/mpeg    15534
message/rfc822    14405
application/vnd.oasis.opendocument.spreadsheet    12659
application/x-bibtex-text-file    12261
application/x-rar-compressed; version=4    12123
text/x-php    10870
text/x-diff    10080
video/mp4    8281
audio/mp4    8221
application/x-msdownload    8019
application/x-bittorrent    7964
image/vnd.microsoft.icon    7382
application/mbox    6799
application/x-x509-cert; format=der    6597
audio/vnd.wave    6550
image/bmp    6411
application/x-endnote-refer    5922
image/vnd.djvu    5874
text/x-matlab    5734
application/vnd.apple.mpegurl    5511
image/tiff    5430
image/webp    4972
application/vnd.oasis.opendocument.presentation    3989
text/x-jsp    3973
text/x-csrc    3555
video/x-ms-wmv    3453
video/x-m4v    3443
application/x-dbf    3381
text/x-chdr    3263
text/x-perl    3124
application/x-rpm    3023
application/x-mobipocket-ebook    2726
audio/midi    2697
application/vnd.oasis.opendocument.graphics    2675
application/vnd.ms-excel.sheet.4    2591
application/x-font-ttf    2575
application/xspf+xml    2557
text/x-python    2416
audio/vorbis    2354
application/msword    2223
application/ogg    2222
application/x-gtar    2181
audio/x-mpegurl    2067
video/x-flv    1969
audio/x-ms-wma    1874
image/icns    1857
application/x-object    1823
application/x-7z-compressed    1795
application/x-msdownload; format=pe32    1784
application/x-debian-package    1700
application/x-mysql-table-definition    1669
image/vnd.dxf; format=ascii    1664
application/x-sqlite3    1606
application/x-berkeley-db; format=hash    1457
application/x-executable    1455
video/mpeg    1366
application/pkcs7-signature    1359
application/x-ms-asx    1266
image/vnd.zbrush.pcx    1247
image/vnd.dwg    1243
application/fits    1217
application/xslfo+xml    1206
application/x-sharedlib    1185
audio/prs.sid    1173
text/x-vcalendar    1156
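
When choosing a subset by type, it can help to roll a table like the one above up to top-level media types. The following is a minimal sketch, assuming the table has been saved to a local file with one row per line and the count as the last whitespace-separated field. The function name `summarize_mime_counts` and the file name in the usage comment are hypothetical, not from this thread.

```shell
#!/bin/bash
# Sum counts per top-level media type (application, text, image, ...).
# Assumption: each data row ends with a numeric count; rows whose last
# field is not a number (such as the MIME_STRING/cnt header) are skipped.
summarize_mime_counts() {
  awk '$NF ~ /^[0-9]+$/ {
         split($1, parts, "/")      # "application/pdf" -> "application"
         totals[parts[1]] += $NF    # the count is always the last field
       }
       END { for (t in totals) print totals[t], t }' "$1" | sort -rn
}

# Usage (hypothetical file name):
#   summarize_mime_counts mime_counts.txt
```

Rows with parameters, such as "application/x-rar-compressed; version=4", still work because only the text before the first slash and the final field are used.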




On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <os...@cofense.com> wrote:
We were thinking something around 2TB of data with a good mix of Excel, images, PDFs, text, and PowerPoints. So I guess a mix of everything.

From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: user@tika.apache.org <us...@tika.apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
What Nick said...

cc_large is a sample of some of the larger documents from commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar.  That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem.  Obv, don't put the tika-core tests jar on your class path in production.  See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.

On the corpora, as Nick said, let us know what you want and we can help you select files.

Cheers,

        Tim


On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. The main set is in
tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from Common Crawl, govdocs, and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection.

Nick
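
To see which file types a local directory of test documents actually covers, a quick count by extension can help. This is a rough sketch, assuming a local checkout or download already exists. The helper name `count_by_extension` is hypothetical, and a file extension is only a loose proxy for the type Tika would detect.

```shell
#!/bin/bash
# Count files per extension (case-insensitive) under a directory.
# Files without a dot in their name are ignored.
count_by_extension() {
  find "$1" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn
}

# Usage (path is whatever directory you downloaded or checked out):
#   count_by_extension ./test-documents
```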

Re: Datasets for testing large number of attachments

Posted by Oscar Rieken Jr <os...@cofense.com.INVALID>.
Awesome, thanks, I'll give this a shot!

From: Nicholas DiPiazza <ni...@gmail.com>
Date: Tuesday, July 26, 2022 at 3:13 PM
To: user@tika.apache.org <us...@tika.apache.org>, tallison@apache.org <ta...@apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
Script I used back in the day to do what you are looking for:

#!/bin/bash
# Download, unpack, and delete each govdocs1 zip archive (002.zip through 999.zip).
for i in $(seq -f "%03g" 2 999)
do
  wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip"
  unzip "$i.zip"
  rm "$i.zip"
done

not sure if it still works


On Tue, Jul 26, 2022 at 1:59 PM Tim Allison <ta...@apache.org> wrote:
As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch of truncated files.  We refetched some and put those under commoncrawl3_refetched.


Re: Datasets for testing large number of attachments

Posted by Nicholas DiPiazza <ni...@gmail.com>.
Script I used back in the day to do what you are looking for:

#!/bin/bash
# Download, unpack, and delete each govdocs1 zip archive (002.zip through 999.zip).
for i in $(seq -f "%03g" 2 999)
do
  wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip"
  unzip "$i.zip"
  rm "$i.zip"
done

not sure if it still works
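
Since the script may be stale, one way to check before starting a long download is to probe a single archive URL first. This is a minimal sketch, assuming `wget` is installed. The base URL is copied from the script above and may no longer resolve. The helper names `zip_url` and `url_exists` are hypothetical.

```shell
#!/bin/bash
# Base URL copied from the script above; whether it still resolves is unknown.
BASE_URL="http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles"

# Build the archive URL for a numeric index, zero-padded to three digits.
# Pass a plain decimal number (2, 42, 999), not an already padded string.
zip_url() {
  printf '%s/%03d.zip\n' "$BASE_URL" "$1"
}

# Return success if the server answers for the URL, without saving the file.
url_exists() {
  wget -q --spider "$1"
}

# Usage:
#   url_exists "$(zip_url 2)" && echo "still available"
```

If the probe fails, the archives may have moved, and the download loop would only leave behind empty zip files.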


Re: Datasets for testing large number of attachments

Posted by Tim Allison <ta...@apache.org>.
One warning, though: Common Crawl truncates files at 1MB, so the corpora
contain a bunch of truncated files.  We refetched some of them and put those
under commoncrawl3_refetched.
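
One way to spot candidates for truncation is by file size. This is only a sketch under two assumptions that are not confirmed in this thread: that "1MB" means exactly 1 MiB (1,048,576 bytes), and that a truncated fetch is cut at precisely that length. `find_possibly_truncated` is a hypothetical helper name, not part of Tika or the corpora tooling.

```python
import os

# Assumption: the 1MB cutoff is 1 MiB; adjust if the corpus shows otherwise.
TRUNCATION_LIMIT = 1024 * 1024


def find_possibly_truncated(root, limit=TRUNCATION_LIMIT):
    """Return sorted paths under root whose size equals the limit exactly."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == limit:
                hits.append(path)
    return sorted(hits)
```

A file that happens to be exactly that size is not necessarily truncated, so the result is a list to check, not a verdict.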

On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <ta...@apache.org> wrote:

> We have ~1.9TB.  But I'd skip cc_large because that's just a copy of some
> directories under commoncrawl3_refetched.
>
> If you want to pull fresher data out of CommonCrawl, I have undocumented
> scripts to do that.  I could add documentation.
>
> These are the top 100 mime types and counts.  This db was generated on a
> slightly earlier version of the corpus/corpora, but it should be close
> enough.
>
> MIME_STRING    cnt
> application/pdf    768490
> text/plain    472041
> text/html    429707
> application/x-tika-msoffice    297990
> image/png    190815
> application/octet-stream    190645
> image/jpeg    179533
> application/xhtml+xml    151830
> application/x-bzip2    124204
> application/x-tika-ooxml    122523
> application/x-bzip    107435
> application/xml    107003
> application/zip    93467
> application/x-sh    88712
> application/gzip    73535
> image/gif    66713
> application/zlib    46483
> text/calendar    40385
> application/postscript    35526
> application/rss+xml    34428
> application/atom+xml    28950
> multipart/appledouble    27602
> image/svg+xml    25771
> application/vnd.oasis.opendocument.text    25753
> application/rdf+xml    24890
> application/vnd.google-earth.kml+xml    24049
> application/rtf    23915
> application/x-matroska    19437
> application/x-shockwave-flash    18879
> video/quicktime    18546
> application/epub+zip    18205
> application/vnd.ms-excel    17465
> application/x-xz    16869
> text/x-vcard    16772
> application/java-vm    16761
> audio/mpeg    15534
> message/rfc822    14405
> application/vnd.oasis.opendocument.spreadsheet    12659
> application/x-bibtex-text-file    12261
> application/x-rar-compressed; version=4    12123
> text/x-php    10870
> text/x-diff    10080
> video/mp4    8281
> audio/mp4    8221
> application/x-msdownload    8019
> application/x-bittorrent    7964
> image/vnd.microsoft.icon    7382
> application/mbox    6799
> application/x-x509-cert; format=der    6597
> audio/vnd.wave    6550
> image/bmp    6411
> application/x-endnote-refer    5922
> image/vnd.djvu    5874
> text/x-matlab    5734
> application/vnd.apple.mpegurl    5511
> image/tiff    5430
> image/webp    4972
> application/vnd.oasis.opendocument.presentation    3989
> text/x-jsp    3973
> text/x-csrc    3555
> video/x-ms-wmv    3453
> video/x-m4v    3443
> application/x-dbf    3381
> text/x-chdr    3263
> text/x-perl    3124
> application/x-rpm    3023
> application/x-mobipocket-ebook    2726
> audio/midi    2697
> application/vnd.oasis.opendocument.graphics    2675
> application/vnd.ms-excel.sheet.4    2591
> application/x-font-ttf    2575
> application/xspf+xml    2557
> text/x-python    2416
> audio/vorbis    2354
> application/msword    2223
> application/ogg    2222
> application/x-gtar    2181
> audio/x-mpegurl    2067
> video/x-flv    1969
> audio/x-ms-wma    1874
> image/icns    1857
> application/x-object    1823
> application/x-7z-compressed    1795
> application/x-msdownload; format=pe32    1784
> application/x-debian-package    1700
> application/x-mysql-table-definition    1669
> image/vnd.dxf; format=ascii    1664
> application/x-sqlite3    1606
> application/x-berkeley-db; format=hash    1457
> application/x-executable    1455
> video/mpeg    1366
> application/pkcs7-signature    1359
> application/x-ms-asx    1266
> image/vnd.zbrush.pcx    1247
> image/vnd.dwg    1243
> application/fits    1217
> application/xslfo+xml    1206
> application/x-sharedlib    1185
> audio/prs.sid    1173
> text/x-vcalendar    1156
>
>
>
>
> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <
> oscar.riekenjr@cofense.com> wrote:
>
>> We were thinking something around 2TB of data with a good mix of excel,
>> images, pdfs, text and powerpoints. So I guess a mix of everything.
>>
>>
>>
>> *From: *Tim Allison <ta...@apache.org>
>> *Date: *Tuesday, July 26, 2022 at 9:19 AM
>> *To: *user@tika.apache.org <us...@tika.apache.org>
>> *Cc: *Oscar Rieken Jr <os...@cofense.com>,
>> corpora-dev@tika.apache.org <co...@tika.apache.org>
>> *Subject: *Re: Datasets for testing large number of attachments
>>
>> External Email
>>
>> What Nick said...
>>
>>
>>
>> cc_large is a sample of some of the larger documents from
>> commoncrawl3_refetched.
>>
>>
>>
>> If you want to give your pipeline a workout, I also recommend using the
>> MockParser that is available in the tika-core tests jar.  That allows you
>> to instrument an OOM and timeouts and system exits and all sorts of other
>> mayhem.  Obv, don't put the tika-core tests jar on your class path in
>> production.  See the files in
>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
>> for examples of how to trigger bad behavior with the MockParser.
>>
>>
>>
>> On the corpora, as Nick said, let us know what you want and we can help
>> you select files.
>>
>>
>>
>> Cheers,
>>
>>
>>
>>         Tim
>>
>>
>>
>>
>>
>> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
>>
>> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
>> > I am currently trying to validate our Tika setup and was looking for a
>> > set of example data I could use
>>
>> If you want a small number of files of lots of different types, the test
>> files in the Tika source tree will work. Main set are in
>> tika-parsers/src/test/resources/test-documents/
>>
>> If you want a very large number of files, then the Tika Corpora
>> collection
>> is a good source. We have a few different collections, including stuff
>> from common crawl, govdocs and bug trackers. If you can let us know what
>> sort of file types and how many, we can suggest the best corpora
>> collection
>>
>> Nick
>>
>>

Re: Datasets for testing large number of attachments

Posted by Tim Allison <ta...@apache.org>.
We have ~1.9TB.  But I'd skip cc_large because that's just a copy of some
directories under commoncrawl3_refetched.

If you want to pull fresher data out of CommonCrawl, I have undocumented
scripts to do that.  I could add documentation.

These are the top 100 MIME types and their counts.  The database was generated
from a slightly earlier version of the corpora, but it should be close
enough.

MIME_STRING    cnt
application/pdf    768490
text/plain    472041
text/html    429707
application/x-tika-msoffice    297990
image/png    190815
application/octet-stream    190645
image/jpeg    179533
application/xhtml+xml    151830
application/x-bzip2    124204
application/x-tika-ooxml    122523
application/x-bzip    107435
application/xml    107003
application/zip    93467
application/x-sh    88712
application/gzip    73535
image/gif    66713
application/zlib    46483
text/calendar    40385
application/postscript    35526
application/rss+xml    34428
application/atom+xml    28950
multipart/appledouble    27602
image/svg+xml    25771
application/vnd.oasis.opendocument.text    25753
application/rdf+xml    24890
application/vnd.google-earth.kml+xml    24049
application/rtf    23915
application/x-matroska    19437
application/x-shockwave-flash    18879
video/quicktime    18546
application/epub+zip    18205
application/vnd.ms-excel    17465
application/x-xz    16869
text/x-vcard    16772
application/java-vm    16761
audio/mpeg    15534
message/rfc822    14405
application/vnd.oasis.opendocument.spreadsheet    12659
application/x-bibtex-text-file    12261
application/x-rar-compressed; version=4    12123
text/x-php    10870
text/x-diff    10080
video/mp4    8281
audio/mp4    8221
application/x-msdownload    8019
application/x-bittorrent    7964
image/vnd.microsoft.icon    7382
application/mbox    6799
application/x-x509-cert; format=der    6597
audio/vnd.wave    6550
image/bmp    6411
application/x-endnote-refer    5922
image/vnd.djvu    5874
text/x-matlab    5734
application/vnd.apple.mpegurl    5511
image/tiff    5430
image/webp    4972
application/vnd.oasis.opendocument.presentation    3989
text/x-jsp    3973
text/x-csrc    3555
video/x-ms-wmv    3453
video/x-m4v    3443
application/x-dbf    3381
text/x-chdr    3263
text/x-perl    3124
application/x-rpm    3023
application/x-mobipocket-ebook    2726
audio/midi    2697
application/vnd.oasis.opendocument.graphics    2675
application/vnd.ms-excel.sheet.4    2591
application/x-font-ttf    2575
application/xspf+xml    2557
text/x-python    2416
audio/vorbis    2354
application/msword    2223
application/ogg    2222
application/x-gtar    2181
audio/x-mpegurl    2067
video/x-flv    1969
audio/x-ms-wma    1874
image/icns    1857
application/x-object    1823
application/x-7z-compressed    1795
application/x-msdownload; format=pe32    1784
application/x-debian-package    1700
application/x-mysql-table-definition    1669
image/vnd.dxf; format=ascii    1664
application/x-sqlite3    1606
application/x-berkeley-db; format=hash    1457
application/x-executable    1455
video/mpeg    1366
application/pkcs7-signature    1359
application/x-ms-asx    1266
image/vnd.zbrush.pcx    1247
image/vnd.dwg    1243
application/fits    1217
application/xslfo+xml    1206
application/x-sharedlib    1185
audio/prs.sid    1173
text/x-vcalendar    1156
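
To judge how well the corpus covers a particular mix of types, the counts can be parsed and grouped. A rough sketch: `parse_counts` and `share` are hypothetical helper names, only a handful of rows from the table above are pasted in, and the resulting share is relative to those pasted rows, not to the whole corpus.

```python
def parse_counts(table_text):
    """Parse 'mime    count' rows into a dict, skipping the header row."""
    counts = {}
    for line in table_text.strip().splitlines():
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # header row or malformed line
        counts[parts[0]] = int(parts[1])
    return counts


def share(counts, prefix):
    """Fraction of files in counts whose MIME type starts with prefix."""
    total = sum(counts.values())
    matched = sum(n for mime, n in counts.items() if mime.startswith(prefix))
    return matched / total


# A few rows copied from the table above; paste the full table for real use.
SAMPLE = """\
MIME_STRING    cnt
application/pdf    768490
text/plain    472041
image/png    190815
image/jpeg    179533
application/vnd.ms-excel    17465
application/x-rar-compressed; version=4    12123
"""

counts = parse_counts(SAMPLE)
print(f"image/* share of pasted rows: {share(counts, 'image/'):.1%}")
```

Splitting from the right keeps MIME strings with parameters, such as `application/x-rar-compressed; version=4`, intact as a single key.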




On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <os...@cofense.com>
wrote:

> We were thinking something around 2TB of data with a good mix of excel,
> images, pdfs, text and powerpoints. So I guess a mix of everything.
>
>
>
> *From: *Tim Allison <ta...@apache.org>
> *Date: *Tuesday, July 26, 2022 at 9:19 AM
> *To: *user@tika.apache.org <us...@tika.apache.org>
> *Cc: *Oscar Rieken Jr <os...@cofense.com>,
> corpora-dev@tika.apache.org <co...@tika.apache.org>
> *Subject: *Re: Datasets for testing large number of attachments
>
> External Email
>
> What Nick said...
>
>
>
> cc_large is a sample of some of the larger documents from
> commoncrawl3_refetched.
>
>
>
> If you want to give your pipeline a workout, I also recommend using the
> MockParser that is available in the tika-core tests jar.  That allows you
> to instrument an OOM and timeouts and system exits and all sorts of other
> mayhem.  Obv, don't put the tika-core tests jar on your class path in
> production.  See the files in
> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
> for examples of how to trigger bad behavior with the MockParser.
>
>
>
> On the corpora, as Nick said, let us know what you want and we can help
> you select files.
>
>
>
> Cheers,
>
>
>
>         Tim
>
>
>
>
>
> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
>
> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
>
>

Re: Datasets for testing large number of attachments

Posted by Oscar Rieken Jr via user <us...@tika.apache.org>.
We were thinking of something around 2TB of data with a good mix of Excel files, images, PDFs, text and PowerPoint files. So I guess a mix of everything.

From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: user@tika.apache.org <us...@tika.apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
What Nick said...

cc_large is a sample of some of the larger documents from commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar.  That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem.  Obv, don't put the tika-core tests jar on your class path in production.  See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.
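
A minimal sketch of generating such trigger files for a test run. The element and attribute names used here (`mock`, `write`, `throw` with a `class` attribute, `hang` with `millis`) are assumptions based on the example files in the directory linked above; check them against those files before relying on them. `mock_document` is a hypothetical helper, and the files only have an effect when the MockParser from the tika-core tests jar is on the classpath.

```python
import xml.etree.ElementTree as ET


def mock_document(*actions):
    """Build a <mock> XML string from (tag, attributes, text) tuples."""
    root = ET.Element("mock")
    for tag, attrs, text in actions:
        child = ET.SubElement(root, tag, attrs)
        child.text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical cases: one document that throws, one that hangs for 30 seconds.
# Tag and attribute names are assumed, not taken from a specification.
throw_doc = mock_document(
    ("write", {"element": "p"}, "some content"),
    ("throw", {"class": "java.io.IOException"}, "simulated failure"),
)
hang_doc = mock_document(("hang", {"millis": "30000"}, None))

print(throw_doc)
print(hang_doc)
```

Writing each string to its own .xml file and mixing a few of them into a batch of ordinary documents gives a quick check of how the pipeline copes with parser failures.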

On the corpora, as Nick said, let us know what you want and we can help you select files.

Cheers,

        Tim


On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. Main set are in
tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from common crawl, govdocs and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection
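
For the "small number of files of lots of different types" case, a few files per extension can be picked from a local checkout. A sketch only: `pick_per_extension` is a hypothetical helper, the location of the test-documents directory depends on the Tika version you have checked out, and a file extension is just a rough stand-in for the real type.

```python
import os
from collections import defaultdict


def pick_per_extension(root, per_type=2):
    """Return {extension: [paths]} with at most per_type files per extension."""
    by_ext = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            if len(by_ext[ext]) < per_type:
                by_ext[ext].append(os.path.join(dirpath, name))
    return dict(by_ext)
```

For example, `pick_per_extension("path/to/test-documents", per_type=3)` would return up to three files for every extension found under that directory.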

Nick

Re: Datasets for testing large number of attachments

Posted by Oscar Rieken Jr <os...@cofense.com.INVALID>.
We were thinking something around 2TB of data with a good mix of excel, images, pdfs, text and powerpoints. So I guess a mix of everything.

From: Tim Allison <ta...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: user@tika.apache.org <us...@tika.apache.org>
Cc: Oscar Rieken Jr <os...@cofense.com>, corpora-dev@tika.apache.org <co...@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
External Email
What Nick said...

cc_large is a sample of some of the larger documents from commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar.  That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem.  Obv, don't put the tika-core tests jar on your class path in production.  See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.

On the corpora, as Nick said, let us know what you want and we can help you select files.

Cheers,

        Tim


On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. Main set are in
tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from common crawl, govdocs and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection

Nick

Re: Datasets for testing large number of attachments

Posted by Tim Allison <ta...@apache.org>.
What Nick said...

cc_large is a sample of some of the larger documents from
commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the
MockParser that is available in the tika-core tests jar.  That allows you
to instrument an OOM and timeouts and system exits and all sorts of other
mayhem.  Obv, don't put the tika-core tests jar on your class path in
production.  See the files in
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
for examples of how to trigger bad behavior with the MockParser.
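
As a rough sketch of what such trigger files might look like, the Python
snippet below writes a few small mock documents to a directory. The element
names (<mock>, <throw>, <hang>, <oom/>, <system_exit/>) are my recollection
of example.xml in the directory linked above, so treat them as assumptions
and check them against those files. The file names and the write_mock_docs
helper are made up for illustration.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumption: these element names mirror example.xml in the test-documents/mock
# directory of the Tika source tree; verify them there before relying on them.
MOCK_DOCS = {
    # Ask the parser to throw a checked exception with a message.
    "throw_io_exception.xml":
        '<mock><throw class="java.io.IOException">simulated failure</throw></mock>',
    # Ask the parser to sleep for 10 seconds to exercise timeout handling.
    "hang_10s.xml":
        '<mock><hang millis="10000" heavy="false" interruptible="false"/></mock>',
    # Ask the parser to run out of memory.
    "oom.xml": "<mock><oom/></mock>",
    # Ask the parser to call System.exit.
    "system_exit.xml": "<mock><system_exit/></mock>",
}


def write_mock_docs(out_dir):
    """Write each mock document to out_dir and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, body in MOCK_DOCS.items():
        ET.fromstring(body)  # fail early if a body is not well-formed XML
        path = out / name
        path.write_text(
            '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n",
            encoding="utf-8",
        )
        written.append(path)
    return written
```

Calling write_mock_docs("mock-docs") and mixing the resulting files into a
test batch should only have an effect when the tika-core tests jar is on the
class path, so keep both the jar and these files out of production.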

On the corpora, as Nick said, let us know what you want and we can help you
select files.

Cheers,

        Tim


On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <ap...@gagravarr.org> wrote:

> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
>

Re: Datasets for testing large number of attachments

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a 
> set of example data I could use

If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. The main set is in
tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including material
from Common Crawl, govdocs and bug trackers. If you can let us know what
sort of file types and how many, we can suggest the best corpora
collection.

Nick
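
To work out "what sort of file types and how many" for a directory you
already have locally (for example a checkout of the test documents), a
minimal Python sketch like the one below can give a rough inventory. It
guesses types from file names only, using the standard mimetypes module, so
it is not Tika's detection and anything without a recognizable extension is
counted as "unknown". The count_by_type name is just for illustration.

```python
import mimetypes
from collections import Counter
from pathlib import Path


def count_by_type(root):
    """Count files under root by MIME type guessed from the file name."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # guess_type looks only at the name, never at the file content.
            guessed, _ = mimetypes.guess_type(path.name)
            counts[guessed or "unknown"] += 1
    return counts
```

Printing count_by_type("test-documents").most_common(20) gives a quick
table of the most frequent guessed types, which is enough to check whether a
sample has the mix you are after.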