Posted to user@any23.apache.org by lewis john mcgibbney <le...@apache.org> on 2021/01/21 21:05:49 UTC
Fwd: WebDataCommons releases 86.3 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 15.3 million websites

FYI folks

---------- Forwarded message ---------
From: Lewis John Mcgibbney <le...@gmail.com>
Date: Thu, Jan 21, 2021 at 1:04 PM
Subject: Re: WebDataCommons releases 86.3 billion quads Microdata, Embedded
JSON-LD, RDFa, and Microformat data originating from 15.3 million websites
To: Web Data Commons <we...@googlegroups.com>


Congratulations on the new dataset release.
The statistics are really interesting.
Really good to hear that Any23 is performing nominally. :)

On Thursday, 21 January 2021 at 02:00:44 UTC-8 apri...@gmail.com wrote:

> Hi all,
>
> we are happy to announce the new release of the WebDataCommons Microdata,
> JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the September 2020 version of the Common
> Crawl covering 3.4 billion HTML pages which originate from 34.5 million
> websites (pay-level domains). For the extraction of structured data, the
> newest version (2.4) of the Any23 library was used.
>
> In summary, we found structured data within 1.7 billion HTML pages out of
> the 3.4 billion pages contained in the crawl (50%). These pages originate
> from 15.3 million different pay-level domains out of the 34.5 million
> pay-level domains covered by the crawl (44.3%). Last year, we only found
> structured data in 37% of the pages and on 37.2% of the pay-level domains.
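
The two coverage percentages above can be re-derived from the rounded counts in the announcement. The following is a minimal sketch for checking that arithmetic; the variable and function names are illustrative and do not come from WebDataCommons or Any23.

```python
# Figures quoted in the announcement (rounded as published).
pages_total = 3.4e9        # HTML pages in the September 2020 crawl
pages_with_data = 1.7e9    # pages containing structured data
plds_total = 34.5e6        # pay-level domains covered by the crawl
plds_with_data = 15.3e6    # pay-level domains with structured data


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)

print(f"pages with structured data: {page_share}%")
print(f"pay-level domains with structured data: {pld_share}%")
```

Because the inputs are themselves rounded, the results are only as precise as the published figures.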
>
> Approximately 7.8 million of the 2020 websites use Microdata, 7.6 million
> websites use JSON-LD, and 3.3 million websites make use of RDFa.
> Microformats are used by more than 4 million websites within the crawl.
>
>
>
> *Statistics about the December 2020 Release:*
>
> Basic statistics about the December 2020 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used together
> with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2020-12/stats/stats.html
>
>
>
> *Markup Format Adoption*
>
> The page below provides an overview of trends in the adoption of the
> different markup formats as well as widely used schema.org classes in the
> timespan 2012 to 2020:
>
> http://webdatacommons.org/structureddata/#toc3
>
> Comparing the statistics from the new 2020 release to the statistics about
> the 2019 release of the data sets
>
> http://webdatacommons.org/structureddata/2019-12/stats/stats.html
>
> we can observe that although the overall number of pages in the crawl is
> 38.9% larger than in the crawl used for the 2019 release, the
> corresponding growth in terms of domains is only 7.9%, indicating that the
> crawl corpus used this year is much deeper than last year's.
> However, we see that more and more websites annotate their content,
> as the yearly increase of the domains having annotated data was more than
> 28%. The markup format with the largest domain growth in adoption (>50%) is
> JSON-LD. The growing trend of the JSON-LD format becomes even more obvious
> for certain domains, such as hotels.com and yahoo.com, which have switched
> from using Microdata to using JSON-LD as their dominant markup language.
> Concerning the vocabulary adoption, schema.org continues to be the most
> dominant vocabulary. More concretely, the classes schema:WebPage,
> schema:Product, schema:Rating, schema:Organization and schema:Person saw a
> major adoption increase in comparison to 2019 (>40%). Looking at the
> richness of JSON-LD descriptions, we notice that the average number of
> triples per URL has grown from 29 in 2019 to 41 in 2020 and has now reached
> a similar level of detail as the Microdata annotations (avg 39 triples per
> URL).
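
The "deeper crawl" reading follows from dividing the two growth factors: if pages grew by 38.9% and domains by only 7.9%, the average number of pages per domain must have grown as well. A rough sketch of that reasoning, using only figures quoted in the announcement (the variable names are illustrative):

```python
# Year-over-year growth rates quoted in the announcement.
page_growth = 0.389     # the crawl has 38.9% more pages than the 2019 crawl
domain_growth = 0.079   # the crawl has 7.9% more pay-level domains

# Pages per domain scale by the ratio of the two growth factors.
depth_factor = (1 + page_growth) / (1 + domain_growth)
depth_growth_pct = round((depth_factor - 1) * 100, 1)

# Average crawl depth in 2020, from the rounded totals in the announcement.
pages_per_domain_2020 = round(3.4e9 / 34.5e6, 1)

# Relative growth of JSON-LD triples per URL (29 in 2019, 41 in 2020).
jsonld_growth_pct = round((41 - 29) / 29 * 100, 1)

print(f"pages per domain grew by about {depth_growth_pct}%")
print(f"about {pages_per_domain_2020} pages per domain in 2020")
print(f"JSON-LD triples per URL grew by about {jsonld_growth_pct}%")
```

These are back-of-the-envelope values derived from rounded inputs, not statistics published by the project.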
>
>
>
> *Download*
>
> The overall size of the December 2020 RDFa, Microdata, Embedded JSON-LD
> and Microformat data sets is 86.3 billion RDF quads. For download, we split
> the data into 21,346 files with a total size of 1.9 TB.
>
>
> http://webdatacommons.org/structureddata/2020-12/stats/how_to_get_the_data.html
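
For anyone who wants to peek into a downloaded file, here is a minimal sketch that counts quads per source host. It assumes the files are line-based N-Quads in which the fourth term (the graph label) is the URL of the page the quad was extracted from; check the download page for the actual layout. The helper names are hypothetical, and the whitespace-based parsing is deliberately naive rather than a full N-Quads parser.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlsplit


def graph_iri(line: str) -> Optional[str]:
    """Return the graph IRI (last term before the final dot), or None."""
    body = line.strip()
    if not body or body.startswith("#") or not body.endswith("."):
        return None
    tokens = body[:-1].split()
    if len(tokens) < 4:
        return None
    last = tokens[-1]
    # Assumption: the graph label is always an IRI written as <...>.
    if last.startswith("<") and last.endswith(">"):
        return last[1:-1]
    return None


def quads_per_host(lines: Iterable[str]) -> Counter:
    """Count quads per host name of the page they were extracted from."""
    counts: Counter = Counter()
    for line in lines:
        iri = graph_iri(line)
        host = urlsplit(iri).hostname if iri else None
        if host:
            counts[host] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    '_:b0 <http://schema.org/name> "Blue mug"@en <https://shop.example.com/item/1> .',
    '_:b0 <http://schema.org/color> "blue" <https://shop.example.com/item/1> .',
    '_:b1 <http://schema.org/name> "Jane Doe" <https://blog.example.org/about> .',
]
counts = quads_per_host(sample)
print(counts.most_common())
```

If the files are gzip-compressed, the same function could be fed from `gzip.open(path, "rt", encoding="utf-8")`. Note that a host name is only a stand-in for a pay-level domain; mapping hosts to pay-level domains needs a public suffix list.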
>
> In addition, we have created separate files for over 43 different
> schema.org classes, each including all quads extracted from pages that use
> a specific schema.org class.
>
>
> http://webdatacommons.org/structureddata/2020-12/stats/schema_org_subsets.html
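
The idea behind such a class-specific subset can be sketched as a two-step filter: first find the pages (graph labels) that contain an rdf:type statement for the class, then keep every quad from those pages. This is only an illustration under the assumption of line-based N-Quads with the page URL as graph label; it is not the script WebDataCommons uses, and `split_quad` and `class_subset` are hypothetical names.

```python
from typing import Iterable, List, Optional, Tuple

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def split_quad(line: str) -> Optional[Tuple[str, str, str, str]]:
    """Naively split one N-Quads line into (subject, predicate, object, graph)."""
    body = line.strip()
    if not body or body.startswith("#") or not body.endswith("."):
        return None
    body = body[:-1].rstrip()
    try:
        # Subject and predicate never contain whitespace; the graph is last.
        subject, predicate, rest = body.split(None, 2)
        obj, graph = rest.rsplit(None, 1)
    except ValueError:
        return None
    if not (graph.startswith("<") and graph.endswith(">")):
        return None
    return subject, predicate, obj, graph


def class_subset(lines: Iterable[str], class_iri: str) -> List[str]:
    """Keep all quads from pages that declare at least one entity of class_iri."""
    parsed = [(line, split_quad(line)) for line in lines]
    target = f"<{class_iri}>"
    pages = {
        quad[3]
        for _, quad in parsed
        if quad is not None and quad[1] == RDF_TYPE and quad[2] == target
    }
    return [line for line, quad in parsed if quad is not None and quad[3] in pages]


# Made-up sample lines for illustration only.
sample = [
    f"_:p1 {RDF_TYPE} <http://schema.org/Product> <https://shop.example.com/item/1> .",
    '_:p1 <http://schema.org/name> "Blue mug"@en <https://shop.example.com/item/1> .',
    f"_:a1 {RDF_TYPE} <http://schema.org/Person> <https://blog.example.org/about> .",
    '_:a1 <http://schema.org/name> "Jane Doe" <https://blog.example.org/about> .',
]
product_quads = class_subset(sample, "http://schema.org/Product")
print(len(product_quads))
```

The sketch keeps all lines in memory, which is fine for a sample but not for files of this size; a real run would stream each file twice or rely on quads from one page being stored contiguously.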
>
>
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
> thus enabling the WebDataCommons project.
> + the Any23 project for providing and maintaining their great library of
> structured data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
>
>
> *General Information about the WebDataCommons Project*
>
> The WebDataCommons project has extracted structured data from the Common
> Crawl, the largest web corpus available to the public, every year since
> 2012 and provides the extracted data for public download in order to
> support researchers and companies in exploiting the wealth of information
> that is available on the Web. Besides the yearly extractions of semantic
> annotations from webpages, the WebDataCommons project also provides large
> hyperlink graphs, the largest public corpus of web tables, two corpora of
> product data, as well as a collection of hypernyms extracted from billions
> of web pages for public download. General information about the
> WebDataCommons project can be found at
>
> http://webdatacommons.org/
>
>
> Have fun with the new data set.
>
>
> Cheers,
>
> Anna Primpeli, Alexander Brinkmann and Chris Bizer
>

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc