Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2021/11/26 16:25:15 UTC
Parsing WikiData truthy - version of 2021-11-17
Parsing Wikidata truthy - version of 2021-11-17
File:
31,248,632,791 (30G) wikidata-20211117-truthy-BETA.nt.bz2
Triples: (by line count):
6,356,805,162 triples (6.36 billion)
Wikidata is about 15-16B triples at the same point in time, so truthy is about 40% of it.
Wikidata grows at ~1B triples a quarter and presumably truthy grows
proportionally: ~400e6 triples/quarter.
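The 40% share and the ~400e6/quarter estimate follow from the figures above. A minimal sketch of that arithmetic in Python, using only the numbers quoted in this message (the constant and function names are made up for illustration):

```python
# Figures quoted in the message; the 15-16B range is the author's estimate.
TRUTHY_TRIPLES = 6_356_805_162
WIKIDATA_TRIPLES_RANGE = (15e9, 16e9)
WIKIDATA_GROWTH_PER_QUARTER = 1e9  # "~1B a quarter"


def truthy_share(full_triples: float) -> float:
    """Fraction of the full dump that the truthy dump represents."""
    return TRUTHY_TRIPLES / full_triples


# Share of truthy at each end of the 15-16B estimate.
shares = [truthy_share(n) for n in WIKIDATA_TRIPLES_RANGE]

# Assumes truthy grows in proportion to the full dump, as the message does.
growth = [share * WIKIDATA_GROWTH_PER_QUARTER for share in shares]

for full, share, per_quarter in zip(WIKIDATA_TRIPLES_RANGE, shares, growth):
    print(f"full={full:.0e}  share={share:.1%}  truthy growth/quarter={per_quarter:,.0f}")
```

Both ends of the range land close to 40%, which is where the ~400e6 triples per quarter comes from.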
Machine: Dell XPS laptop : 2021 model : Core i7 @3GHz
PCIe NVMe SSD
(==> not a server machine!)
To decompress and count lines: ("bzcat -d | wc -l" -- no Jena)
(4h 4m 40s)
=> 434,089 lines per second.
Parse:
"riot --sink --check"
43,958.52 sec : 6,356,805,162 Triples : Rate: 144,609.19 per second
0 errors : 1,427 warnings
(12h 12m 39s)
... which is really very clean.
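As a quick cross-check of the parse figures, the rate and the elapsed time can be recomputed from the triple count and the seconds reported by riot. This is only a sketch of the arithmetic, not anything riot itself runs; the helper names are hypothetical.

```python
# Values copied from the riot output line above.
TRIPLES = 6_356_805_162
ELAPSED_SECONDS = 43_958.52


def rate_per_second(count: int, seconds: float) -> float:
    """Average throughput over the whole run."""
    return count / seconds


def format_hms(seconds: float) -> str:
    """Render a duration as 'Hh Mm Ss', rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


rate = rate_per_second(TRIPLES, ELAPSED_SECONDS)
print(f"{rate:,.0f} triples per second over {format_hms(ELAPSED_SECONDS)}")
```

The recomputed rate agrees with the reported one to within the rounding of the elapsed time.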
Summary of warnings:
Code/Count/Message
(sorted by frequency, with 2 illustrative examples of each):
Code:13 834 DEFAULT_PORT_SHOULD_BE_OMITTED
    http://www.andreykashechkin.com:80/
    https://www.lucee.org:443/
Code:58 194 PROHIBITED_COMPONENT_PRESENT
    https://sales@malopus.com.ua
    https://info@hsdf.org.ng
Code:12 153 PORT_SHOULD_NOT_BE_EMPTY
    cvs://anonymous@linuxlibertine.cvs.sourceforge.net:/cvsroot/linuxlibertine
    http://https://chr.bg
Code:201 78 Bad lexical form: XSD dateTime
    '-3500000000-01-01T00:00:00Z'
    '5000000000000-01-01T00:00:00Z'
Code:11 35 LOWERCASE_PREFERRED
    Https://slovencivangliji.com
    Https://sysdevunicallib@gmail.com
Code:57 27 REQUIRED_COMPONENT_MISSING
    http://:www.bjbcollege.in
    https:///w/files/pdfs/todesopfer-in-berlin-17-10032.pdf
Code:0 26 ILLEGAL_CHARACTER
    http://www.tandfonline.com/loi/wzes20#.VKw-y2MhDTohttp://www.tandfonline.com/loi/wzes20#.VKw-y2MhDTo
    http://norgeskart.no/geoportal/#!?zoom=3&lon=626883.62&lat=7113922.81&wms=http:%2F%2Fwms.miljodirektoratet.no%2Fgeoserver%2Ffjordkatalogen%2Fwms&project=geonorge&layers=1002#6%252F-58549%252F6698183%252Fl%252Fwms%252F%255Bhttp:%252F%252Fwms.miljodirektoratet.no%252Fgeoserver%252Ffjordkatalogen%252Fwms%255D%252F+fjordkatalogen_grenser%252F+fjordkatalogen_omrade
Code:36 21 HAS_PASSWORD
    cvs://anonymous:@offsystem.cvs.sourceforge.net/cvsroot/offsystem
    cvs://:pserver:cvs:@cvs.cvsnt.org:/cvsnt
Code:300 16 Illegal character in IRI
    https://en.wiktionary.org/wiki/[U+D83C]...
    https://en.wiktionary.org/wiki/?[U+DE97]...
Code:30 15 ILLEGAL_PERCENT_ENCODING
    http://vestnik.dgu.ru/pol.aspx?razdel=2&rznameru=%u0421%u0435%u0440%u0438%u044f2%3a+%u0413%u0443%u043c%u0430%u043d%u0438%u0442%u0430%u0440%u043d%u044b%u0435+%u043d%u0430%u0443%u043a%u0438+&rznameen=Humanitarian+sciences
    https://dbs.ossolineum.pl/lwow/pozycja.php?id=6053&s=9&search=b%
Code:5 8 CONTROL_CHARACTER
    http://www.vill.nishiawakura.okayama.jp/wp/‚±‚Ç‚à%7D‘ŠÙ/
    http://kaminomachi.jp/ŽO“‡%7D‘ŠÙ/
Code:35 7 BAD_IDN
    http://www.villagodipiovene.it
    http://miracle-march.com(閉鎖)
Code:55 6 UNICODE_WHITESPACE
    http://sfi-cybium.fr/fr/les-campagnes-de-chalutages-expérimentaux
        -en-mer-du-nord
    http://sfi-cybium.fr/fr/inventaire-systématique-des-blenniidae
        des-côtes-tunisiennes
Code:100 5 Language not valid
    zh-classical
    zh-classical
Code:33 1 DNS_LABEL_DASH_START_OR_END
    http://-www.plovdiv-press.bg
Code:46 1 NOT_NFC
    http://www.barcelonaopenbancsabadell.com/index.php?ACD=10550055-3176246-&AID=3176246〈=en
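The per-code counts in this summary can be checked against the "1,427 warnings" total reported by riot. The sketch below is an illustration only: it assumes each summary row has the shape "Code:<code> <count> <message>" as shown above, and the regular expression and function name are not part of Jena.

```python
import re

# Header rows copied from the summary above, plus one example line
# to show that non-header lines are skipped.
SUMMARY = """\
Code:13 834 DEFAULT_PORT_SHOULD_BE_OMITTED
http://www.andreykashechkin.com:80/
Code:58 194 PROHIBITED_COMPONENT_PRESENT
Code:12 153 PORT_SHOULD_NOT_BE_EMPTY
Code:201 78 Bad lexical form: XSD dateTime
Code:11 35 LOWERCASE_PREFERRED
Code:57 27 REQUIRED_COMPONENT_MISSING
Code:0 26 ILLEGAL_CHARACTER
Code:36 21 HAS_PASSWORD
Code:300 16 Illegal character in IRI
Code:30 15 ILLEGAL_PERCENT_ENCODING
Code:5 8 CONTROL_CHARACTER
Code:35 7 BAD_IDN
Code:55 6 UNICODE_WHITESPACE
Code:100 5 Language not valid
Code:33 1 DNS_LABEL_DASH_START_OR_END
Code:46 1 NOT_NFC
"""

# Assumed row shape: "Code:<code> <count> <message>".
ROW = re.compile(r"^Code:(\d+)\s+(\d+)\s+(.+)$")


def parse_summary(text: str) -> list:
    """Return (code, count, message) tuples for every summary row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return rows


rows = parse_summary(SUMMARY)
total = sum(count for _, count, _ in rows)
print(f"{len(rows)} warning codes, {total} warnings in total")
```

The 16 codes add up to the 1,427 warnings reported for the run.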
Re: Parsing WikiData truthy - version of 2021-11-17
Posted by Andy Seaborne <an...@apache.org>.
On 26/11/2021 19:09, Andy Seaborne wrote:
>
>
> On 26/11/2021 16:51, Marco Neumann wrote:
>> I got the following 107 warnings during parsing of the complete Wikidata
>> truthy - version of 2021-11-17
>
> Those are included in the report at start-of-thread.
> Full report:
> https://gist.github.com/afs/15719d46299bcf7346e3c314ac109040
The archives don't present the plain text nicely.
The summary as plain text, not reformatted:
https://gist.github.com/afs/7b20b36391e186b262cdb485dbfae681
The [U+D83C] markers are surrogate code points (each is one half of a surrogate pair) and should not appear in UTF-8.
U+D83C U+DF1F is 🌟 (U+1F31F) and should be encoded into bytes as UTF-8:
xF0 x9F x8C x9F
https://en.wiktionary.org/wiki/%F0%9F%8C%9F
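For readers who want to reproduce the mapping, here is a minimal Python sketch of how a UTF-16 surrogate pair combines into one code point, and how that code point is then written as UTF-8 and percent-encoded. It illustrates the encoding rules only; it is not how Jena processes IRIs, and combine_surrogates is a hypothetical helper.

```python
from urllib.parse import quote


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83C, 0xDF1F)
char = chr(code_point)

# The character is encoded to UTF-8 directly; the surrogates never
# appear as separate code points in the byte sequence.
utf8_bytes = char.encode("utf-8")

# Percent-encoding the UTF-8 bytes gives the form usable in an IRI path.
escaped = quote(char)

print(f"U+{code_point:X}", utf8_bytes.hex(" "), escaped)
```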
Re: Parsing WikiData truthy - version of 2021-11-17
Posted by Andy Seaborne <an...@apache.org>.
On 26/11/2021 16:51, Marco Neumann wrote:
> I got the following 107 warnings during parsing of the complete Wikidata
> truthy - version of 2021-11-17
Those are included in the report at start-of-thread.
Full report:
https://gist.github.com/afs/15719d46299bcf7346e3c314ac109040
You are reporting a loader run? The loader runs with the equivalent of
"riot --nocheck", which is more forgiving.
[U+D83C] (which is one half of a surrogate pair and illegal in UTF-8)
means the code point 0xD83C was found after UTF-8 decoding. Surrogate
code points should not appear because the character ought to be encoded
as UTF-8, not in UTF-16 style.
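To make the point concrete, here is a small Python sketch of how a surrogate ends up in decoded text. The bytes ED A0 BC are what results from applying the UTF-8 bit layout to the surrogate 0xD83C on its own (UTF-16 style). A strict UTF-8 decoder rejects them, and a lenient one yields the lone surrogate. This uses Python's own codecs as a stand-in and says nothing about how Jena decodes input; lone_surrogates is a hypothetical helper.

```python
def lone_surrogates(text: str) -> list:
    """Return the indexes of surrogate code points (U+D800..U+DFFF) in text."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


# The surrogate 0xD83C written with the UTF-8 bit layout; not valid UTF-8.
utf16_style_bytes = b"\xed\xa0\xbc"

try:
    utf16_style_bytes.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# A lenient decode lets the surrogate through as a code point.
leaked = utf16_style_bytes.decode("utf-8", errors="surrogatepass")

prefix = "https://en.wiktionary.org/wiki/"
iri = prefix + leaked

print("strict decoder accepts:", strict_accepts)
print("surrogate positions:", lone_surrogates(iri))
```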
Andy
>
> http://www.lotico.com/truthy.txt
>
Re: Parsing WikiData truthy - version of 2021-11-17
Posted by Marco Neumann <ma...@gmail.com>.
I have posted these warnings to Twitter as well, and one of the UK Wikidata
users is currently updating them manually. Andy, could you please look up
the triples at the offending line numbers that @Tagishsimon was asking for?
https://twitter.com/neumarcx/status/1463485394910072845
Thank you
--
---
Marco Neumann
KONA
Re: Parsing WikiData truthy - version of 2021-11-17
Posted by Marco Neumann <ma...@gmail.com>.
I got the following 107 warnings during parsing of the complete Wikidata
truthy - version of 2021-11-17
http://www.lotico.com/truthy.txt
--
---
Marco Neumann
KONA