Posted to users@maven.apache.org by Sebastian Hellmann <he...@informatik.uni-leipzig.de> on 2019/09/11 19:17:26 UTC

[ANN] Maven for Data Beta

Dear all,

we have developed a "Maven for Data". It is still in beta, but we already 
use it in production to publish DBpedia's data 
(https://en.wikipedia.org/wiki/DBpedia, https://wiki.dbpedia.org/).

While I attached a lot of specific information about the data and 
management part below, I would like to highlight the Maven parts:

# Download of Data

* http://databus.dbpedia.org is inspired by Maven Central and Archiva. 
Software is much smaller than data and, of course, we cannot host all of 
the data; therefore we keep only the metadata, with links to decentralized 
downloadURLs

* The data can be viewed on the website and also downloaded via SPARQL: 
http://dev.dbpedia.org/Download_Data

* The version URLs serve as configuration parameters for the download plugin: 
http://dev.dbpedia.org/Databus_Derive_Maven_Integration

  <version>https://databus.dbpedia.org/dbpedia/enrichment/mappingbased-literals/2019.03.01</version>

This is equivalent to <dependency> for software, but it downloads the data 
into "target/databus/download".

The goal "databus-derive:clone" can be called before "exec", so that the 
software can use the downloaded data.
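
As an illustration only (this is not code from the plugin), the version URL 
above can be split into its parts with a few lines of Python. The sketch 
assumes that the four path segments are user, group, artifact and version, 
which is inferred from the $user/$group/$artifact/$version layout of the 
upload directory; the names DatabusVersion, parse_version_url and 
download_dir are hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple
from urllib.parse import urlparse


class DatabusVersion(NamedTuple):
    """Parts of a Databus version URL (assumed layout, see lead-in)."""
    user: str
    group: str
    artifact: str
    version: str


def parse_version_url(url: str) -> DatabusVersion:
    """Split a version URL into user, group, artifact and version."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) != 4:
        raise ValueError(f"expected 4 path segments, got {len(segments)}: {url}")
    return DatabusVersion(*segments)


def download_dir(project_root: str) -> PurePosixPath:
    """Directory the announcement names as the download target."""
    return PurePosixPath(project_root) / "target" / "databus" / "download"


if __name__ == "__main__":
    v = parse_version_url(
        "https://databus.dbpedia.org/dbpedia/enrichment/"
        "mappingbased-literals/2019.03.01"
    )
    print(v.artifact, v.version)
    print(download_dir("/home/me/project"))
```

Software started via "exec" could then look for its input files under the 
directory returned by download_dir.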


# Upload

* we have an upload plugin 
http://dev.dbpedia.org/Databus_Upload_User_Manual with:

** mvn validate -> check account and consistency

** mvn prepare-package (goal databus:metadata) -> collects metadata in 
target/databus/$artifact/$version/dataid.ttl

** mvn package -> copies data into a package directory on the server, 
often /var/www/html/databusrepo/$user/$group/$artifact/$version

** mvn deploy -> posts the dataid.ttl to databus.dbpedia.org

** We configure it with pom.xml and Markdown documentation: 
https://github.com/dbpedia/databus-maven-plugin/tree/master/dbpedia/mappings

* the derive plugin will be merged with features of the Databus Client: 
http://dev.dbpedia.org/Databus_Client
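
The upload steps above can be summarised in a short Python sketch. It only 
restates the phase order and the two path templates from this mail; the 
function names and the repo_root default are illustrative assumptions, not 
part of the plugin.

```python
from pathlib import PurePosixPath

# Maven lifecycle phases used by the upload plugin, in execution order,
# with the action described for each phase in this mail.
UPLOAD_PHASES = [
    ("validate", "check account and consistency"),
    ("prepare-package", "collect metadata into dataid.ttl (goal databus:metadata)"),
    ("package", "copy data into the package directory on the server"),
    ("deploy", "post dataid.ttl to databus.dbpedia.org"),
]


def dataid_path(artifact: str, version: str) -> PurePosixPath:
    """Expand target/databus/$artifact/$version/dataid.ttl."""
    return PurePosixPath("target/databus") / artifact / version / "dataid.ttl"


def package_dir(user: str, group: str, artifact: str, version: str,
                repo_root: str = "/var/www/html/databusrepo") -> PurePosixPath:
    """Expand $repo_root/$user/$group/$artifact/$version.

    The default repo_root is the directory the mail calls the usual one.
    """
    return PurePosixPath(repo_root) / user / group / artifact / version


if __name__ == "__main__":
    for phase, action in UPLOAD_PHASES:
        print(f"mvn {phase}: {action}")
    print(dataid_path("mappingbased-literals", "2019.03.01"))
    print(package_dir("dbpedia", "enrichment", "mappingbased-literals", "2019.03.01"))
```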


Overall, it does not have all features yet and is not in a state where 
we could remove the "-SNAPSHOT", but we are running several thousand 
files through it each month.

Databus comes with Mods, which serve as Continuous Integration for data 
tests (parsing and SHACL), similar to Jenkins and Travis.


We would like to thank Maven for all its cool features. It is really 
good and we could work very effectively with it. Thanks to the 
flexibility, we could also bend it to fit data better.

Do you have any suggestions for potential cooperation?


All the best,

Sebastian



-------- Forwarded Message --------
Subject: 	[ANN] DBpedia’s Databus and strategic initiative to facilitate 
1 Billion derived Knowledge Graphs by and for Consumers until 2025
Resent-Date: 	Wed, 11 Sep 2019 09:25:41 +0000
Resent-From: 	public-ld4lt@w3.org
Date: 	Wed, 11 Sep 2019 11:23:44 +0200
From: 	Sebastian Hellmann <he...@informatik.uni-leipzig.de>
To: 	public-ld4lt@w3.org <pu...@w3.org>




[Please forward to interested colleagues]

We are proud to announce that the DBpedia Databus website at 
https://databus.dbpedia.org/ and the SPARQL API at 
https://databus.dbpedia.org/(repo/sparql|yasgui) 
(documentation: http://dev.dbpedia.org/Download_Data) are in public beta now. 
The system is usable (eat-your-own-dog-food tested) following a “working 
software over comprehensive documentation” approach. Due to its many 
components (website, SPARQL endpoints, Keycloak, Mods, upload client, 
download client, and data debugging), we estimate approximately six 
months in beta to fix bugs, implement all features and improve the 
details. If you have any feedback or questions, please use 
the DBpedia Forum (https://forum.dbpedia.org/), the “report issues” 
button, or dbpedia@infai.org.


The full document is available at: 
https://databus.dbpedia.org/dbpedia/publication/strategy/2019.09.09/strategy_databus_initiative.pdf


We are looking forward to the feedback and discussion at the 14th 
DBpedia Community Meeting at SEMANTiCS 2019 in Karlsruhe 
(https://wiki.dbpedia.org/events/14th-dbpedia-community-meeting-karlsruhe) 
on September 12th or online.


########
# Excerpt
########


      DBpedia Databus

The DBpedia Databus is a platform to capture invested effort by data 
consumers who needed better data quality (fitness for use) in order to 
use the data and give improvements back to the data source and other 
consumers. DBpedia Databus enables anybody to build an automated 
DBpedia-style extraction, mapping and testing for any data they need. 
Databus incorporates features from DNS, Git, RSS, online forums and 
Maven to harness the full workpower of data consumers.


      Vision

Professional consumers of data worldwide have already built stable 
cleaning and refinement chains for all available datasets, but their 
efforts are invisible and not reusable. Deep, cleaned data silos exist 
beyond the reach of publishers and other consumers, trapped locally in 
pipelines.

*Data is not oil that flows out of inflexible pipelines*. Databus breaks 
existing pipelines into individual components that together form a 
decentralized, but centrally coordinated data network in which data can 
flow back to previous components, the original sources, or end up being 
consumed by external components.

The Databus provides a platform for re-publishing these files with very 
little effort (leaving file traffic as the only cost factor) while offering 
the full benefits of built-in system features such as automated 
publication, structured querying, automatic ingestion, as well as 
pluggable automated analysis, data testing via continuous integration, 
and automated application deployment *(software with data)*. The impact 
is highly synergistic: just a few thousand professional consumers and 
research projects can expose millions of cleaned datasets, which are on 
par with what has long existed in deep silos and pipelines.


    1 Billion interconnected, quality-controlled Knowledge Graphs until 2025

As we are inverting the paradigm from a publisher-centric view to a data 
consumer network, we will open the download valve to enable discovery 
of and access to massive amounts of cleaner data than published by the 
original source. The main DBpedia Knowledge Graph - cleaned data from 
Wikipedia in all languages and Wikidata - alone has 600k file downloads 
per year, complemented by downloads at over 20 chapters, 
e.g. http://es.dbpedia.org/, as well as over 8 million daily hits on the 
main Virtuoso endpoint. Community extensions from the alpha phase such 
as DBkWik (https://databus.dbpedia.org/sven-h/dbkwik/dbkwik/2019.09.02) 
and LinkedHypernyms 
(https://databus.dbpedia.org/propan/lhd/linked-hypernyms) are being 
loaded onto the bus and consolidated, and we expect this number to reach 
over 100 by the end of the year. Companies and organisations who 
have previously uploaded their backlinks to 
https://github.com/dbpedia/links will be able to 
migrate to the databus. Other datasets are cleaned and posted. In two of 
our research projects, LOD-GEOSS 
(https://www.enargus.de/pub/bscw.cgi/?op=enargus.eps2&s=14&q=BASF%20SE&v=10&m=2&id=1216225&p=1) 
and PLASS (http://plass.io/), we will re-publish open 
datasets, clean them and create collections, which will result in 
DBpedia-style knowledge graphs for energy systems and supply-chain 
management.

The *full document* is available at: 
https://databus.dbpedia.org/dbpedia/publication/strategy/2019.09.09/strategy_databus_initiative.pdf

