Posted to user@tika.apache.org by Tim Allison <ta...@apache.org> on 2022/10/03 13:25:45 UTC

metadata keys

All,

  I recently extracted metadata keys from 1 million files in our
regression corpus and did a group by.  This allows insight into common
metadata keys.

  I've included two views: one looks at overall counts, and the other
breaks down metadata keys by mime type.
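
As a rough illustration of the group-by described above, here is a minimal Python sketch. It assumes each file has already been reduced to a (mime type, metadata dict) pair. The `count_keys` function and the sample records are hypothetical and are not the code used to build the linked files.

```python
from collections import Counter, defaultdict


def count_keys(records):
    """Count metadata keys overall and per mime type.

    `records` is an iterable of (mime_type, metadata_dict) pairs.
    """
    overall = Counter()
    by_mime = defaultdict(Counter)
    for mime, metadata in records:
        for key in metadata:
            overall[key] += 1
            by_mime[mime][key] += 1
    return overall, by_mime


# Hypothetical sample data, for illustration only.
sample = [
    ("image/tiff", {"Content-Type": "image/tiff", "tiff:ImageWidth": "640"}),
    ("application/pdf", {"Content-Type": "application/pdf", "dc:title": "Report"}),
    ("image/tiff", {"Content-Type": "image/tiff", "tiff:ImageWidth": "800"}),
]
overall, by_mime = count_keys(sample)
```

Here `overall` corresponds to the first view and `by_mime` to the second; `overall.most_common()` would list the keys from most to least frequent.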

  Please let us know if you find anything interesting or have any questions.

https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz

   Best,

            Tim

Re: metadata keys

Posted by Tim Allison <ta...@apache.org>.

I shouldn't be, but I'm disheartened by how many metadata keys are not
namespaced.  I don't think we can do anything about these in 2.x, but
for 3.x, we should be thinking about namespacing all the keys that
don't have a natural dc: or other standard namespace.
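
One way to get a first list of un-namespaced keys is a simple prefix check. The sketch below is only a heuristic under stated assumptions: it treats a key as namespaced when it starts with an ASCII identifier followed by a colon. `has_namespace` and `split_by_namespace` are hypothetical helper names, not Tika API.

```python
import re

# Assumption: a namespace prefix is an ASCII identifier followed by ":".
# Hyphens are excluded so that keys such as "MboxParser-http://..." are
# not mistaken for namespaced keys.
NAMESPACE_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9_]*:\S")


def has_namespace(key):
    """Return True if the key starts with something like 'dc:' or 'tiff:'."""
    return NAMESPACE_PREFIX.match(key) is not None


def split_by_namespace(keys):
    """Partition keys into (namespaced, bare) lists, preserving order."""
    namespaced, bare = [], []
    for key in keys:
        (namespaced if has_namespace(key) else bare).append(key)
    return namespaced, bare


namespaced, bare = split_by_namespace(
    ["tiff:ImageWidth", "Content-Type", "dcterms:title", "dcterms.title", "height"]
)
```

A prefix check like this says nothing about whether the prefix is a recognized standard, so its output would still need review by hand.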

I'm also, frankly, bewildered by the amount of custom / non-standard
metadata.  Again, this shouldn't surprise me, but...wow.

https://issues.apache.org/jira/browse/TIKA-3872


On Fri, Oct 7, 2022 at 9:23 AM Tim Allison <ta...@apache.org> wrote:
>
> Is there anything that leaps out that needs attention?
>
> On Fri, Oct 7, 2022 at 7:12 AM Markus Jelsma <ma...@openindex.io> wrote:
> >
> > Ah, there are some differences this time, except for MboxParser, of course :)
> >
> > Very nice to see this happening, it wasn't present/noticed in the other set
> > tiff:ImageWidth,727519
> > tiff:ImageLength,727512
> >
> > This time there are also quite a few keys with whitespace in them:
> > Dimension HorizontalPixelSize,166272
> > Dimension VerticalPixelSize,166272
> >
> > Attempts to do some JavaScript:
> > <script>,1
> > var gcse = document.createElement(,1
> > var s = document.getElementsByTagName(,1
> >
> > Something that appears to be a 'tag cloud' of a Dutch blog about travelling to Thailand:
> > "thailand,thailand forum,bangkok,chiang mai,vakantie,accommodatie,hotel,surat thani,tuktuk,eiland,krabi,phuket,sukothai,phi phi,khao sok,guesthouse,national park,isaan,monnik,samui,panghan,bergvolk,eiland,trein,vliegtuig,ayutthaya,visum,thai,sawasdee,tempelflower",1
> >
> > More tag clouds:
> > "homoeopathy, homopati, homeopathy, homeopati, hormon, alopaty, allopaty, alopati, biochemic, biokemik, biokimia",1
> > "homopati, homopathy, homeopati, homoeopati, biochemic, biokimia",1
> >
> > Chinese, Cyrillic and Arabic mixed with Latin. Especially Arabic is weird when displayed correctly with the ,1 on its left:
> > custom:Шифр,1
> > custom:тавсўф,1
> > custom:آموزش ایندیزاین,1
> > custom:关键字,1
> >
> > Escaping gone mad:
> > "\""content-type\""",5
> >
> > There are also e-mail addresses that I am not going to put down here. And I must say, after looking through it, MboxParser did still surprise me.
> >
> > Thanks,
> > Markus
> >
> > On Thu, Oct 6, 2022 at 5:27 PM Tim Allison <ta...@apache.org> wrote:
> >>
> >> I reprocessed a million files and wrote proper UTF-8 csv files.  This
> >> did away with any risk of me botching something via copy/paste from
> >> stdout.
> >>
> >> https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz
> >>
> >> On Mon, Oct 3, 2022 at 4:03 PM Markus Jelsma <ma...@openindex.io> wrote:
> >> >
> >> > Hi Tim,
> >> >
> >> > I would expect that many strange keys are actually present in the source data, and are not due to an error somewhere in Tika or its dependencies. Although mboxparser could have an issue somewhere.
> >> >
> >> > But it might be an idea to map some bad keys to their proper counterpart, such as keywords, content-type and friends.
> >> >
> >> > Regards,
> >> > Markus
> >> >
> >> > On Mon, Oct 3, 2022 at 5:10 PM Tim Allison <ta...@apache.org> wrote:
> >> >>
> >> >> Thank you, Markus, for looking through these sheets.  There's a chance
> >> >> I botched the encodings in transferring data from one location to
> >> >> another.  Let me take another look, and yes, we've got to make some
> >> >> improvements to the mbox parser.
> >> >>
> >> >> More digging for me to do on the data and your findings!
> >> >>
> >> >> Thank you!
> >> >>
> >> >> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
> >> >> <ma...@openindex.io> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > These aggregations of large real world sets are always interesting to look through. Especially because they are bound to have a lot of garbage and peculiarities. There are probably some badly chosen key names, and very likely many programming errors.
> >> >> >
> >> >> > Some interesting examples:
> >> >> >
> >> >> > what is this:
> >> >> > Р’С‹Р±РµСЂРёС‚Рµ_СЂР°СЃС€РёСЂРµРЅРёРµ_РґР»СЏ_РїР°РєРѕРІРєРё
> >> >> >
> >> >> > the usual mixing of double-colon variants, there are also many escaped quotes:
> >> >> > ”keywords” and \"keywords\"
> >> >> >
> >> >> > these two are identical, but given a large enough set, they might not be:
> >> >> > height 512205
> >> >> > width 512205
> >> >> >
> >> >> > mboxparser spews out a lot of garbage, incredible:
> >> >> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> >> >> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> >> >> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
> >> >> >
> >> >> > really, it does:
> >> >> > MboxParser-_blank">http 3
> >> >> > MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
> >> >> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
> >> >> >
> >> >> > non-Latin scripts are expected, this is simplified Chinese:
> >> >> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))
> >> >> >
> >> >> > perhaps shortest possible key name:
> >> >> > T 4
> >> >> >
> >> >> > mboxparser, again, this time with XML tags:
> >> >> > MboxParser-ype>state</span></font></st1:placetype></st1 4
> >> >> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
> >> >> >
> >> >> > the set seems to contain stuff from adult sites:
> >> >> > xhamster-site-verification
> >> >> >
> >> >> > for some reason, the Dutch government always pops up in large sets:
> >> >> > custom:OVERHEID.Informatietype/DC.type  13
> >> >> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13
> >> >> >
> >> >> > there are 18 different ways to spell/use Content-Type, of which four are, of course, with mboxparser:
> >> >> > Content-Type    6612729
> >> >> > content_type    14
> >> >> > \"Content-Type\"        9
> >> >> > \"content-type\"        5
> >> >> >
> >> >> > the inevitable encoding error:
> >> >> > pdf:docinfo:custom:-ý§ Q 10
> >> >> > pagerankâ„¢ 50
> >> >> >
> >> >> > what.is.this:
> >> >> > Laisv371DiskusijuIrK363rybosForumas 4
> >> >> >
> >> >> > hey, another contender for the shortest key name:
> >> >> > M 4
> >> >> >
> >> >> > there are 67 unique dcterms key names, but their counts are not very high:
> >> >> > DCTERMS.title   44
> >> >> > dcterms.title   26
> >> >> > dcterms:title   13
> >> >> > dcterms.Title   3
> >> >> >
> >> >> > there is also a Content-Type in Russian:
> >> >> > Тип-содержимое 3
> >> >> >
> >> >> > someone wants to remove your dust:
> >> >> > Dust_Removal_Data 339
> >> >> >
> >> >> > there are 908 unique unknown tags, no idea what that is:
> >> >> > Exif_IFD0:Unknown_tag_(0x8482)  36
> >> >> > Unknown_tag_(0x00bf)    36
> >> >> > Exif_SubIFD:Unknown_tag_(0x9009)        35
> >> >> > Unknown_tag_(0x00a0)    35
> >> >> > Unknown_tag_(0x050e)    35
> >> >> >
> >> >> > ah, the winner of the shortest key name (line 2235):
> >> >> > 71
> >> >> >
> >> >> > longest key, guess who:
> >> >> > MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps        3
> >> >> >
> >> >> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six most frequently used Arabic symbols are not present. I wonder why. But there is an RTL-script present, Hebrew. It is always strange to meet terms/words of RTL-scripts in an otherwise general LTR-world.
> >> >> >
> >> >> > I was a bit disappointed not to find any obscene terms. The set seemed to be large enough for at least some general curse words.
> >> >> >
> >> >> > MboxParser is the real winner with 1763 unique keys, this is really absurd!
> >> >> >
> >> >> > Thanks, this was fun!
> >> >> > Markus
> >> >> >
> >> >> > On Mon, Oct 3, 2022 at 3:26 PM Tim Allison <ta...@apache.org> wrote:
> >> >> >>
> >> >> >> All,
> >> >> >>
> >> >> >>   I recently extracted metadata keys from 1 million files in our
> >> >> >> regression corpus and did a group by.  This allows insight into common
> >> >> >> metadata keys.
> >> >> >>
> >> >> >>   I've included two views, one looks at overall counts, and the other
> >> >> >> breaks down metadata keys by mime type.
> >> >> >>
> >> >> >>   Please let us know if you find anything interesting or have any questions.
> >> >> >>
> >> >> >> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> >> >> >> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
> >> >> >>
> >> >> >>    Best,
> >> >> >>
> >> >> >>             Tim
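
Several of the garbled keys quoted above look like UTF-8 bytes that were decoded with a single-byte code page somewhere along the way. Assuming that is what happened, re-encoding with the wrong code page and decoding as UTF-8 can recover the original text. The helper `repair_mojibake` and the choice of cp1252 and cp1251 are assumptions for illustration; the real cause in the corpus may differ.

```python
def repair_mojibake(text, wrong_encoding):
    """Undo a UTF-8-read-as-single-byte mix-up.

    Returns the text unchanged if it cannot be round-tripped, which is
    the case for keys that were never garbled in this way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "pagerank" followed by the UTF-8 bytes of the trademark sign, misread as cp1252.
garbled_latin = "pagerank\u00e2\u201e\u00a2"
repaired_latin = repair_mojibake(garbled_latin, "cp1252")

# Simulate the Cyrillic case: UTF-8 bytes misread as cp1251, then repaired.
original = "Выберите"
garbled_cyrillic = original.encode("utf-8").decode("cp1251")
repaired_cyrillic = repair_mojibake(garbled_cyrillic, "cp1251")
```

Keys that are plain ASCII or legitimately non-Latin pass through unchanged, so a check like this could be run over a whole key list to see which entries change.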

Re: metadata keys

Posted by Tim Allison <ta...@apache.org>.

Is there anything that leaps out that needs attention?


Re: metadata keys

Posted by Markus Jelsma <ma...@openindex.io>.

Ah, there are some differences this time, except for MboxParser, of course
:)

Very nice to see this happening, it wasn't present/noticed in the other set
tiff:ImageWidth,727519
tiff:ImageLength,727512

This time there are also quite a few keys with whitespace in them:
Dimension HorizontalPixelSize,166272
Dimension VerticalPixelSize,166272

Attempts to do some JavaScript:
<script>,1
var gcse = document.createElement(,1
var s = document.getElementsByTagName(,1

Something that appears to be a 'tag cloud' of a Dutch blog about travelling
to Thailand:
"thailand,thailand forum,bangkok,chiang
mai,vakantie,accommodatie,hotel,surat
thani,tuktuk,eiland,krabi,phuket,sukothai,phi phi,khao
sok,guesthouse,national
park,isaan,monnik,samui,panghan,bergvolk,eiland,trein,vliegtuig,ayutthaya,visum,thai,sawasdee,tempelflower",1

More tag clouds:
"homoeopathy, homopati, homeopathy, homeopati, hormon, alopaty, allopaty,
alopati, biochemic, biokemik, biokimia",1
"homopati, homopathy, homeopati, homoeopati, biochemic, biokimia",1

Chinese, Cyrillic and Arabic mixed with Latin. Especially Arabic is weird
when displayed correctly with the ,1 on its left:
custom:Шифр,1
custom:тавсўф,1
custom:آموزش ایندیزاین,1
custom:关键字,1

Escaping gone mad:
"\""content-type\""",5

There are also e-mail addresses that I am not going to put down here. And I
must say, after looking through it, MboxParser did still surprise me.
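
Keys like the ones above could be picked out automatically with a few pattern checks. This is a minimal sketch, not anything Tika does today: the `flag_key` function, the check names and the length limit are all hypothetical choices.

```python
import re

# Hypothetical checks; each pairs a label with a pattern that marks a key
# as suspicious. Parentheses are deliberately not flagged, because keys
# such as "Unknown_tag_(0x00bf)" contain them legitimately.
CHECKS = [
    ("whitespace", re.compile(r"\s")),
    ("markup_or_code", re.compile(r"[<>{}=;]")),
    ("quote_or_escape", re.compile(r'["\\\u201c\u201d]')),
]


def flag_key(key, max_length=100):
    """Return a list of labels describing why a key looks suspicious."""
    flags = [name for name, pattern in CHECKS if pattern.search(key)]
    if len(key) > max_length:
        flags.append("too_long")
    return flags


flags_dimension = flag_key("Dimension HorizontalPixelSize")
flags_script = flag_key("var gcse = document.createElement(")
flags_clean = flag_key("tiff:ImageWidth")
```

An empty list means none of the checks fired; it does not mean the key is a good one.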

Thanks,
Markus


Re: metadata keys

Posted by Tim Allison <ta...@apache.org>.

I reprocessed a million files and wrote proper UTF-8 csv files.  This
did away with any risk of me botching something via copy/paste from
stdout.

https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz
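
For anyone reading the files, here is a minimal Python sketch of writing and re-reading key counts as UTF-8 CSV. It is not the code that produced the archive; `write_counts` and the sample rows are hypothetical. It also shows how a key that already contains backslashes and quotes ends up looking doubly escaped once standard CSV quoting is applied.

```python
import csv
import io


def write_counts(rows, handle):
    """Write (key, count) rows as CSV to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    for key, count in rows:
        writer.writerow([key, count])


# Hypothetical sample rows; the last key is backslash, quote, content-type,
# backslash, quote.
rows = [
    ("tiff:ImageWidth", 727519),
    ("custom:关键字", 1),
    ('\\"content-type\\"', 5),
]

buffer = io.StringIO()
write_counts(rows, buffer)
text = buffer.getvalue()

# Round-trip through UTF-8 bytes, as a file on disk would.
# For a real file: open(path, "w", encoding="utf-8", newline="").
encoded = text.encode("utf-8")
decoded_rows = list(csv.reader(io.StringIO(encoded.decode("utf-8"))))
```

Reading the files with a CSV parser rather than splitting on commas gives back the original key, including keys that themselves contain commas or quotes.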


Re: metadata keys

Posted by Markus Jelsma <ma...@openindex.io>.

Hi Tim,

I would expect that many strange keys are actually present in the source
data, and are not due to an error somewhere in Tika or its dependencies.
Although mboxparser could have an issue somewhere.

But it might be an idea to map some bad keys to their proper counterpart,
such as keywords, content-type and friends.
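
To make the mapping idea concrete, here is a minimal sketch. It assumes a small hand-written table of canonical names; the `CANONICAL` table, the `normalize_key` function and the choice of target names are hypothetical and do not reflect what Tika does or should do.

```python
# Hypothetical canonical names, keyed by a lower-cased, hyphenated form.
CANONICAL = {
    "content-type": "Content-Type",
    "keywords": "keywords",
}

# Characters stripped from both ends of a key before lookup: straight
# quotes, backslashes, curly double quotes and spaces.
STRIP_CHARS = "\"'\\\u201c\u201d "


def normalize_key(key):
    """Map a badly written key to its canonical form, if one is known."""
    cleaned = key.strip(STRIP_CHARS)
    lookup = cleaned.lower().replace("_", "-")
    # Unknown keys are returned untouched rather than guessed at.
    return CANONICAL.get(lookup, key)


variants = ['\\"Content-Type\\"', "content_type", "\u201dkeywords\u201d", "tiff:ImageWidth"]
normalized = [normalize_key(v) for v in variants]
```

Returning unknown keys unchanged keeps the mapping conservative: only variants that are explicitly listed get merged.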

Regards,
Markus

On Mon, Oct 3, 2022 at 5:10 PM Tim Allison <ta...@apache.org> wrote:

> Thank you, Markus, for looking through these sheets.  There's a chance
> I botched the encodings in transferring data from one location to
> another.  Let me take another look, and yes, we've got to make some
> improvements to the mbox parser.
>
> More digging for me to do on the data and your findings!
>
> Thank you!
>
> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
> <ma...@openindex.io> wrote:
> >
> > Hi,
> >
> > These aggregations of large real world sets are always interesting to look through. Especially because they are bound to have a lot of garbage and peculiarities. There are probably some badly chosen key names, and very likely many programming errors.
> >
> > Some interesting examples:
> >
> > what is this:
> > Р’С‹Р±РµСЂРёС‚Рµ_СЂР°СЃС€РёСЂРµРЅРёРµ_РґР»СЏ_РїР°РєРѕРІРєРё
> >
> > the usual mixing of double-colon variants, there are also many escaped quotes:
> > ”keywords” and \"keywords\"
> >
> > these two are identical, but given a large enough set, they might not be:
> > height 512205
> > width 512205
> >
> > mboxparser spews out a lot of garbage, incredible:
> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
> >
> > really, it does:
> > MboxParser-_blank">http 3
> >
> MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
> >
> > non-Latin scripts are expected, this is simplified Chinese:
> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style
> (?))
> >
> > perhaps shortest possible key name:
> > T 4
> >
> > mboxparser, again, this time with XML tags:
> > MboxParser-ype>state</span></font></st1:placetype></st1 4
> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1
> 4
> >
> > the set seems to contain stuff from adult sites:
> > xhamster-site-verification
> >
> > for some reason, the Dutch government always pops up in large sets:
> > custom:OVERHEID.Informatietype/DC.type  13
> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13
> >
> > there are 18 different ways to spell/use Content-Type, of which four
> are, of course, with mboxparser:
> > Content-Type    6612729
> > content_type    14
> > \"Content-Type\"        9
> > \"content-type\"        5
> >
> > the inevitable encoding error:
> > pdf:docinfo:custom:-ý§ Q 10
> > pagerankâ„¢ 50
> >
> > what.is.this:
> > Laisv371DiskusijuIrK363rybosForumas 4
> >
> > hey, another contenter for the shortest key name:
> > M 4
> >
> > there are 67 unique dcterms key names, but their counts are not very
> high:
> > DCTERMS.title   44
> > dcterms.title   26
> > dcterms:title   13
> > dcterms.Title   3
> >
> > there is also a Content-Type in Russian:
> > Тип-содержимое 3
> >
> > someone wants to remove your dust:
> > Dust_Removal_Data 339
> >
> > there are 908 unique unknown tags, no idea what that is:
> > Exif_IFD0:Unknown_tag_(0x8482)  36
> > Unknown_tag_(0x00bf)    36
> > Exif_SubIFD:Unknown_tag_(0x9009)        35
> > Unknown_tag_(0x00a0)    35
> > Unknown_tag_(0x050e)    35
> >
> > ah, the winner of the shortest key name (line 2235):
> > 71
> >
> > longest key, guess who:
> > MboxParser-
> http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps
>       3
> >
> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the
> six most frequently used Arabic symbols are not present. I wonder why. But
> there is an RTL-script present, Hebrew. It is always strange to meet
> terms/wors of RTL-scripts in an otherwise general LTR-world.
> >
> > I was a bit disappointed not to find any obscene terms. The set seemed
> to be large enough for at least some general curse words.
> >
> > MboxParser is the real winner with 1763 unique keys, this is really
> absurd!
> >
> > Thanks, this was fun!
> > Markus
> >
> > Op ma 3 okt. 2022 om 15:26 schreef Tim Allison <ta...@apache.org>:
> >>
> >> All,
> >>
> >>   I recently extracted metadata keys from 1 million files in our
> >> regression corpus and did a group by.  This allows insight into common
> >> metadata keys.
> >>
> >>   I've included two views, one looks at overall counts, and the other
> >> breaks down metadata keys by mime type.
> >>
> >>   Please let us know if you find anything interesting or have any
> questions.
> >>
> >>
> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> >>
> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
> >>
> >>    Best,
> >>
> >>             Tim
>

Re: metadata keys

Posted by Tim Allison <ta...@apache.org>.

Thank you, Markus, for looking through these sheets.  There's a chance
I botched the encodings in transferring data from one location to
another.  Let me take another look, and yes, we've got to make some
improvements to the mbox parser.

More digging for me to do on the data and your findings!

Thank you!

On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
<ma...@openindex.io> wrote:
>
> Hi,
>
> These aggregations of large real-world sets are always interesting to look through, especially because they are bound to have a lot of garbage and peculiarities. There are probably some badly chosen key names, and very likely many programming errors.
>
> Some interesting examples:
>
> what is this:
> Р’С‹Р±РµСЂРёС‚Рµ_СЂР°СЃС€РёСЂРµРЅРёРµ_РґР»СЏ_РїР°РєРѕРІРєРё
>
> the usual mixing of double-quote variants, there are also many escaped quotes:
> ”keywords” and \"keywords\"
>
> these two are identical, but given a large enough set, they might not be:
> height 512205
> width 512205
>
> mboxparser spews out a lot of garbage, incredible:
> MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
>
> really, it does:
> MboxParser-_blank">http 3
> MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
> MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
>
> non-Latin scripts are expected, this is simplified Chinese:
> if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))
>
> perhaps shortest possible key name:
> T 4
>
> mboxparser, again, this time with XML tags:
> MboxParser-ype>state</span></font></st1:placetype></st1 4
> MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
>
> the set seems to contain stuff from adult sites:
> xhamster-site-verification
>
> for some reason, the Dutch government always pops up in large sets:
> custom:OVERHEID.Informatietype/DC.type  13
> custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13
>
> there are 18 different ways to spell/use Content-Type, of which four are, of course, with mboxparser:
> Content-Type    6612729
> content_type    14
> \"Content-Type\"        9
> \"content-type\"        5
>
> the inevitable encoding error:
> pdf:docinfo:custom:-ý§ Q 10
> pagerankâ„¢ 50
>
> what.is.this:
> Laisv371DiskusijuIrK363rybosForumas 4
>
> hey, another contender for the shortest key name:
> M 4
>
> there are 67 unique dcterms key names, but their counts are not very high:
> DCTERMS.title   44
> dcterms.title   26
> dcterms:title   13
> dcterms.Title   3
>
> there is also a Content-Type in Russian:
> Тип-содержимое 3
>
> someone wants to remove your dust:
> Dust_Removal_Data 339
>
> there are 908 unique unknown tags, no idea what that is:
> Exif_IFD0:Unknown_tag_(0x8482)  36
> Unknown_tag_(0x00bf)    36
> Exif_SubIFD:Unknown_tag_(0x9009)        35
> Unknown_tag_(0x00a0)    35
> Unknown_tag_(0x050e)    35
>
> ah, the winner of the shortest key name (line 2235):
> 71
>
> longest key, guess who:
> MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps        3
>
> Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six most frequently used Arabic symbols are not present. I wonder why. But there is an RTL script present, Hebrew. It is always strange to meet terms/words of RTL scripts in an otherwise general LTR world.
>
> I was a bit disappointed not to find any obscene terms. The set seemed to be large enough for at least some general curse words.
>
> MboxParser is the real winner with 1763 unique keys, this is really absurd!
>
> Thanks, this was fun!
> Markus

Re: metadata keys

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

These aggregations of large real-world sets are always interesting to look
through, especially because they are bound to have a lot of garbage and
peculiarities. There are probably some badly chosen key names, and very
likely many programming errors.

Some interesting examples:

what is this:
Р’С‹Р±РµСЂРёС‚Рµ_СЂР°СЃС€РёСЂРµРЅРёРµ_РґР»СЏ_РїР°РєРѕРІРєРё

the usual mixing of double-quote variants, there are also many escaped
quotes:
”keywords” and \"keywords\"

these two are identical, but given a large enough set, they might not be:
height 512205
width 512205

mboxparser spews out a lot of garbage, incredible:
MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3

really, it does:
MboxParser-_blank">http 3
MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3

non-Latin scripts are expected, this is simplified Chinese:
if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))

perhaps shortest possible key name:
T 4

mboxparser, again, this time with XML tags:
MboxParser-ype>state</span></font></st1:placetype></st1 4
MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4

the set seems to contain stuff from adult sites:
xhamster-site-verification

for some reason, the Dutch government always pops up in large sets:
custom:OVERHEID.Informatietype/DC.type  13
custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13

there are 18 different ways to spell/use Content-Type, of which four are,
of course, with mboxparser:
Content-Type    6612729
content_type    14
\"Content-Type\"        9
\"content-type\"        5

the inevitable encoding error:
pdf:docinfo:custom:-ý§ Q 10
pagerankâ„¢ 50
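
A hedged sketch of how this kind of damage can sometimes be undone: the
second key looks like UTF-8 bytes that were decoded as Windows-1252, so
re-encoding with the wrong codec and decoding as UTF-8 may recover the
original. The function name and the default codec pair are assumptions for
illustration; whether the damage happened in the source files or while
moving the data around is not known.

```python
def repair_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Try to reverse a wrong decode.

    Encodes the text with the codec it was (presumably) wrongly decoded
    with, then decodes the resulting bytes with the intended codec.
    Returns the input unchanged when the round trip is not possible,
    which is the case for most text that was never damaged.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    # "pagerank" followed by U+00E2, U+201E, U+00A2, as in the listing.
    damaged = "pagerank\u00e2\u201e\u00a2"
    print(repair_mojibake(damaged))  # pagerank followed by the trademark sign
```

The key under "what is this" near the top of this message appears to be the
same kind of damage with Windows-1251 as the wrong codec, so calling the
function with wrong="cp1251" may be worth a try there.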

what.is.this:
Laisv371DiskusijuIrK363rybosForumas 4

hey, another contender for the shortest key name:
M 4

there are 67 unique dcterms key names, but their counts are not very high:
DCTERMS.title   44
dcterms.title   26
dcterms:title   13
dcterms.Title   3
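
Spelling variants like these (and the Content-Type ones above) can be found
mechanically. A rough sketch, assuming the counts are available as
(key, count) rows: fold each key by case-folding it and dropping
punctuation, then report folded forms that have more than one spelling.
The fold rule is an assumption chosen for illustration.

```python
import re
from collections import defaultdict

# Rows copied from the dcterms listing above: (key, count).
ROWS = [
    ("DCTERMS.title", 44),
    ("dcterms.title", 26),
    ("dcterms:title", 13),
    ("dcterms.Title", 3),
]


def fold(key):
    """Case-fold the key and drop punctuation, whitespace and underscores."""
    return re.sub(r"[\W_]+", "", key.casefold())


def variant_groups(rows):
    """Group (key, count) rows by folded key; keep groups with 2+ spellings."""
    groups = defaultdict(list)
    for key, count in rows:
        groups[fold(key)].append((key, count))
    return {folded: members for folded, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    for folded, members in variant_groups(ROWS).items():
        total = sum(count for _, count in members)
        print(folded, total, [key for key, _ in members])
```

Folding this aggressively can also merge keys that only look alike, so the
groups are candidates for review rather than a ready-made mapping.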

there is also a Content-Type in Russian:
Тип-содержимое 3

someone wants to remove your dust:
Dust_Removal_Data 339

there are 908 unique unknown tags, no idea what that is:
Exif_IFD0:Unknown_tag_(0x8482)  36
Unknown_tag_(0x00bf)    36
Exif_SubIFD:Unknown_tag_(0x9009)        35
Unknown_tag_(0x00a0)    35
Unknown_tag_(0x050e)    35

ah, the winner of the shortest key name (line 2235):
71

longest key, guess who:
MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps        3

Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six
most frequently used Arabic symbols are not present. I wonder why. But
there is an RTL script present, Hebrew. It is always strange to meet
terms/words of RTL scripts in an otherwise general LTR world.

I was a bit disappointed not to find any obscene terms. The set seemed to
be large enough for at least some general curse words.

MboxParser is the real winner with 1763 unique keys, this is really absurd!

Thanks, this was fun!
Markus

On Mon, Oct 3, 2022 at 15:26, Tim Allison <ta...@apache.org> wrote:

> All,
>
>   I recently extracted metadata keys from 1 million files in our
> regression corpus and did a group by.  This allows insight into common
> metadata keys.
>
>   I've included two views, one looks at overall counts, and the other
> breaks down metadata keys by mime type.
>
>   Please let us know if you find anything interesting or have any
> questions.
>
> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
>
>    Best,
>
>             Tim
>
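
The group-by described in the quoted message might look roughly like the
sketch below. It assumes each file's metadata is available as a dict and
that the mime type is stored under "Content-Type"; how the actual
extraction was run is not stated in the thread, so the record shape and the
function name are assumptions. Each key is counted once per record.

```python
from collections import Counter


def count_keys(records):
    """Count metadata keys overall and per mime type.

    records: iterable of dicts mapping metadata key -> value.
    Returns (overall, by_mime), where overall counts each key once per
    record and by_mime counts (mime type, key) pairs.
    """
    overall = Counter()
    by_mime = Counter()
    for metadata in records:
        # Assumed location of the mime type; "unknown" if it is missing.
        mime = metadata.get("Content-Type", "unknown")
        for key in metadata:
            overall[key] += 1
            by_mime[(mime, key)] += 1
    return overall, by_mime


if __name__ == "__main__":
    sample = [
        {"Content-Type": "image/tiff", "tiff:ImageWidth": "10"},
        {"Content-Type": "image/tiff", "tiff:ImageWidth": "20", "T": "x"},
        {"Content-Type": "text/html", "T": "y"},
    ]
    overall, by_mime = count_keys(sample)
    for key, count in overall.most_common():
        print(key, count)
```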