Posted to users@jena.apache.org by Leigh Dodds <le...@ldodds.com> on 2012/08/21 17:55:57 UTC
Cleaning triples with Riot
Hi,
I'm doing some testing of TDB for a client. They have data in an older
RDB database which accepted triples that TDB now rejects.
Is there a way I can run a data dump through riot to clean it (i.e.
leaving only acceptable triples) or getting TDB to reject triples but
continue to load the rest?
Apologies if this is an FAQ. I know others have hit this issue before,
but couldn't find a good solution.
Cheers,
L.
--
Leigh Dodds
Freelance Technologist
Open Data, Linked Data Geek
t: @ldodds
w: ldodds.com
e: leigh@ldodds.com
Re: Cleaning triples with Riot
Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey Leigh,
I had this problem some months ago. At that time there was no
mechanism to "exclude" invalid triples, so I changed the Jena API and
included my own parsing process. But it was quite complicated and it
only works for my data. I think there are two less complicated ways.
*first:*
- You could look for an N-Triples parser and do the checking before you
load into TDB. I used the Sesame API some time ago, and it was quite good.
*second:*
- You could load the data triple by triple into TDB, and if one triple
fails you just load the next one. But if you do this you can only use
tdbloader and not tdbloader2, which slows down the loading process immensely.
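The second approach can be sketched in plain Java. This is only an illustration under stated assumptions: `loadOne` is a hypothetical stand-in for whatever call adds a single triple to the store (it is not a Jena API), and the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class SkipOnFailureLoader {
    /**
     * Feeds each line to loadOne. A RuntimeException raised for one line is
     * recorded and loading continues with the next line.
     * Returns the lines that were rejected.
     */
    static List<String> loadAll(Iterable<String> lines, Consumer<String> loadOne) {
        List<String> rejected = new ArrayList<>();
        for (String line : lines) {
            try {
                loadOne.accept(line);
            } catch (RuntimeException ex) {
                rejected.add(line);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>();
        // Hypothetical loader: refuses any line containing "BAD".
        Consumer<String> loadOne = line -> {
            if (line.contains("BAD")) {
                throw new IllegalArgumentException(line);
            }
            store.add(line);
        };
        List<String> rejected = loadAll(Arrays.asList("t1", "BAD t2", "t3"), loadOne);
        System.out.println(store.size() + " loaded, " + rejected.size() + " rejected");
    }
}
```

Keeping the rejected lines in a list makes it possible to write them to a separate file for later inspection.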
Regards
Stefan
Re: Cleaning triples with Riot
Posted by Andy Seaborne <an...@apache.org>.
On 03/09/12 12:41, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Tue, Aug 28, 2012 at 04:53:55PM +0100, Andy Seaborne wrote:
>>> Can anyone tell me how to rewrite this portion of code so that the parser
>>> will throw an exception for the invalid integer ?
>>
>> No need - TDB ought to handle that (invalid values aren't supposed to be
>> errors - things just get a little less efficient).
>>
>> "riot --validate" will issue a warning.
>>
>> The TDB system is supposed to handle out-of-range number - plain old
>> bug, now fixed in SVN.
>
> Thank you. I replaced jena-tdb-0.9.3.jar with
> jena-tdb-0.9.4-20120829.061613-27.jar in my jena-2.7.3 distribution and was
> able to do create the TDB. Unfortunately, dumping the TDB does not seem to
> be possible with the new jar:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: com/hp/hpl/jena/sparql/core/DatasetGraphTrackActive
>
> Well - you already said that replacing jar files in lib/ might break things.
> I will use the old TDB .jar for the dump.
>
> Nightly builds of the Jena distribution - not only of the components - would
> be nice.
>
> Regards,
>
> Michael Brunnbauer
>
Michael,
You'll need to use a consistent set of jars, all from the 20120829 build.
In this case, ARQ now contains DatasetGraphTrackActive, which was
moved up from TDB.
The two ways to get a consistent set of jars are:
1/ Use Maven
2/ Download the combined distribution
https://repository.apache.org/content/groups/snapshots/org/apache/jena/apache-jena/2.7.4-SNAPSHOT/
currently:
apache-jena-2.7.4-20120903.055710-30.zip
which has all the jars in it
Andy
Re: Cleaning triples with Riot
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
On Tue, Aug 28, 2012 at 04:53:55PM +0100, Andy Seaborne wrote:
> >Can anyone tell me how to rewrite this portion of code so that the parser
> >will throw an exception for the invalid integer ?
>
> No need - TDB ought to handle that (invalid values aren't supposed to be
> errors - things just get a little less efficient).
>
> "riot --validate" will issue a warning.
>
> The TDB system is supposed to handle out-of-range number - plain old
> bug, now fixed in SVN.
Thank you. I replaced jena-tdb-0.9.3.jar with
jena-tdb-0.9.4-20120829.061613-27.jar in my jena-2.7.3 distribution and was
able to create the TDB. Unfortunately, dumping the TDB does not seem to
be possible with the new jar:
Exception in thread "main" java.lang.NoClassDefFoundError: com/hp/hpl/jena/sparql/core/DatasetGraphTrackActive
Well - you already said that replacing jar files in lib/ might break things.
I will use the old TDB .jar for the dump.
Nightly builds of the Jena distribution - not only of the components - would
be nice.
Regards,
Michael Brunnbauer
--
++ Michael Brunnbauer
++ netEstate GmbH
++ Geisenhausener Straße 11a
++ 81379 München
++ Tel +49 89 32 19 77 80
++ Fax +49 89 32 19 77 89
++ E-Mail brunni@netestate.de
++ http://www.netestate.de/
++
++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
++ USt-IdNr. DE221033342
++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Re: Cleaning triples with Riot
Posted by Andy Seaborne <an...@apache.org>.
On 28/08/12 16:19, Michael Brunnbauer wrote:
...
> My current problem is that the tool will not remove this invalid graph from
> dbpedia 3.8 with an out of range integer:
>
> <http://dbpedia.org/resource/Ridgeland_Township,_Iroquois_County,_Illinois> <http://dbpedia.org/property/postalCode> "6095560968"^^<http://www.w3.org/2001/XMLSchema#int> <http://en.wikipedia.org/wiki/Ridgeland_Township,_Iroquois_County,_Illinois?oldid=443535342#absolute-line=77> .
>
> tdbloader2 will fail with this exception:
>
> com.hp.hpl.jena.datatypes.DatatypeFormatException: Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] during parse -org.apache.xerces.impl.dv.InvalidDatatypeValueException: cvc-maxInclusive-valid: Value '6095560968' is not facet-valid with respect to maxInclusive '2147483647' for type 'int'.
> at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.getValue(LiteralLabelImpl.java:326)
> at com.hp.hpl.jena.tdb.store.NodeId.inline(NodeId.java:210)
> at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:49)
> at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
> at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
> at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
> at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
> at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
> at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
> at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
> at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
> at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
> at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
> Can anyone tell me how to rewrite this portion of code so that the parser
> will throw an exception for the invalid integer ?
No need - TDB ought to handle that (invalid values aren't supposed to be
errors - things just get a little less efficient).
"riot --validate" will issue a warning.
The TDB system is supposed to handle out-of-range numbers. This was a plain
old bug, now fixed in SVN.
> import com.hp.hpl.jena.rdf.model.Model;
> import com.hp.hpl.jena.rdf.model.ModelFactory;
>
> Model model = ModelFactory.createDefaultModel();
> try {
> model.read(new StringReader(chunk.toString()), normalizeUrl(graph), "N-TRIPLE");
> } catch (final RuntimeException ex) {
>
> I think that tdbloader* really really needs an option to ignore invalid
> triples/quads instead of throwing an exception. The data out there will always
> be messy and even the DBpedia people do not get it right.
6 billion isn't that big ... choosing xsd:int was a bit limiting.
OK - it is in this case as the right answer is 403.
http://en.wikipedia.org/wiki/Ridgeland_Township,_Iroquois_County,_Illinois
Why anyone uses xsd:int is a mystery to me.
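To make the range problem concrete: xsd:int is a 32-bit type whose largest value is 2147483647 (the maxInclusive in the Xerces message), whereas xsd:integer is unbounded. A minimal sketch of the check in plain Java follows. The class and method names are invented for this example, and it does not use Jena's own datatype machinery.

```java
import java.math.BigInteger;

final class XsdIntRange {
    private static final BigInteger MIN = BigInteger.valueOf(Integer.MIN_VALUE);
    private static final BigInteger MAX = BigInteger.valueOf(Integer.MAX_VALUE);

    /** True if the lexical form is an optionally signed integer within the xsd:int range. */
    static boolean fitsXsdInt(String lexical) {
        if (!lexical.matches("[+-]?[0-9]+")) {
            return false;
        }
        BigInteger value = new BigInteger(lexical);
        return value.compareTo(MIN) >= 0 && value.compareTo(MAX) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(fitsXsdInt("6095560968")); // prints false
        System.out.println(fitsXsdInt("403"));        // prints true
    }
}
```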
Andy
Re: Cleaning triples with Riot
Posted by Michael Brunnbauer <br...@netestate.de>.
Hi all,
Find attached the source code for a (sloppily written) tool to clean compressed
N-Quads dumps for tdbloader2. The tool assumes that the named graphs are not
scattered across the dump and processes one named graph at a time. If there is
an exception, the corresponding named graph is not written to stdout.
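The per-graph idea can be sketched as follows. This is not the attached tool, only a rough plain-Java illustration: it assumes every input line is a quad whose graph name is the last term before the final dot, that all lines of one graph are adjacent, and that validity is decided by a caller-supplied check (`isValid` is a hypothetical placeholder, as are the class and method names).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class GraphwiseFilter {
    /** Returns the last term before the final '.', taken to be the graph name. */
    static String graphOf(String line) {
        String s = line.trim();
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        int cut = Math.max(s.lastIndexOf(' '), s.lastIndexOf('\t'));
        return s.substring(cut + 1);
    }

    /** Keeps only those graphs in which every line passes isValid. */
    static List<String> clean(List<String> lines, Predicate<String> isValid) {
        List<String> out = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        String current = null;
        boolean ok = true;
        for (String line : lines) {
            String graph = graphOf(line);
            if (!graph.equals(current)) {
                // A new graph starts: emit the previous one if it was clean.
                if (ok) {
                    out.addAll(buffer);
                }
                buffer.clear();
                current = graph;
                ok = true;
            }
            buffer.add(line);
            if (!isValid.test(line)) {
                ok = false;
            }
        }
        if (ok) {
            out.addAll(buffer);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "<s1> <p> \"fine\" <g1> .",
            "<s2> <p> \"BAD\" <g1> .",
            "<s3> <p> \"fine\" <g2> .");
        // Hypothetical validity check: a line containing "BAD" is invalid.
        for (String line : clean(input, l -> !l.contains("BAD"))) {
            System.out.println(line);
        }
    }
}
```

Buffering one graph at a time keeps memory use bounded by the largest graph rather than by the whole dump.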
My current problem is that the tool will not remove this invalid graph from
dbpedia 3.8 with an out of range integer:
<http://dbpedia.org/resource/Ridgeland_Township,_Iroquois_County,_Illinois> <http://dbpedia.org/property/postalCode> "6095560968"^^<http://www.w3.org/2001/XMLSchema#int> <http://en.wikipedia.org/wiki/Ridgeland_Township,_Iroquois_County,_Illinois?oldid=443535342#absolute-line=77> .
tdbloader2 will fail with this exception:
com.hp.hpl.jena.datatypes.DatatypeFormatException: Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] during parse -org.apache.xerces.impl.dv.InvalidDatatypeValueException: cvc-maxInclusive-valid: Value '6095560968' is not facet-valid with respect to maxInclusive '2147483647' for type 'int'.
at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.getValue(LiteralLabelImpl.java:326)
at com.hp.hpl.jena.tdb.store.NodeId.inline(NodeId.java:210)
at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:49)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
I am not very confident with Java and the Jena API and my Java programmer
is currently not available.
Can anyone tell me how to rewrite this portion of code so that the parser
will throw an exception for the invalid integer ?
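import java.io.StringReader;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// chunk holds the text of one named graph; graph is its name.
Model model = ModelFactory.createDefaultModel();
try {
    model.read(new StringReader(chunk.toString()), normalizeUrl(graph), "N-TRIPLE");
} catch (final RuntimeException ex) {
    // Parse failed: this named graph is skipped and not written to stdout.
}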
I think that tdbloader* really really needs an option to ignore invalid
triples/quads instead of throwing an exception. The data out there will always
be messy and even the DBpedia people do not get it right.
Regards,
Michael Brunnbauer
On Fri, Aug 24, 2012 at 01:03:26PM +0100, Andy Seaborne wrote:
> On 21/08/12 16:55, Leigh Dodds wrote:
> >Hi,
> >
> >I'm doing some testing of TDB for a client. They have data in an older
> >RDB database which accepted triples that TDB now rejects.
>
> What's being rejected?
>
> If it's syntax, then text processing n-triples is usually necessary.
>
> >Is there a way I can run a data dump through riot to clean it (i.e.
> >leaving only acceptable triples) or getting TDB to reject triples but
> >continue to load the rest?
>
> If you want to look at the triples and do some checking, then parse to
> a Sink<Triple> and do the tests you want, or set up the
> parser with a particular profile - ParserProfileChecker is the
> validating one.
>
> Whatever you do, doing it and producing a clean load file for TDB is
> better than trying to fix up as you load.
>
> Andy
>
> >
> >Apologies if this is an FAQ. I know others have hit this issue before,
> >but couldn't find a good solution.
> >
> >Cheers,
> >
> >L.
> >
--
++ Michael Brunnbauer
++ netEstate GmbH
++ Geisenhausener Straße 11a
++ 81379 München
++ Tel +49 89 32 19 77 80
++ Fax +49 89 32 19 77 89
++ E-Mail brunni@netestate.de
++ http://www.netestate.de/
++
++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
++ USt-IdNr. DE221033342
++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Re: Cleaning triples with Riot
Posted by Frank Lee <fr...@yahoo.com>.
Hi, Andy,
It's easy to get the model for a named graph from a local TDB dataset:
String tdbDir = "c:\\tdb";
Dataset ds = TDBFactory.createDataset(tdbDir);
String ngUri = "http://xxx ..";
Model model = ds.getNamedModel(ngUri);
However, if we run a Fuseki server with TDB remotely, how can we get the model for a specified named graph?
For instance, the Fuseki server runs on a remote Linux machine with IP address 172.25.19.233, and the TDB directory is /home/tdb.
Thanks.
Frank
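One option for the remote case is the SPARQL 1.1 Graph Store HTTP Protocol, which Fuseki implements: an HTTP GET on the dataset's graph store endpoint with a `graph` parameter returns that named graph, and the response can then be read into a Model. The sketch below only builds the request URL. The port 3030, the dataset name "ds", the "/data" service name and the example graph URI are assumptions that do not come from this thread, and the class and method names are invented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class GraphStoreUrl {
    /** Builds the Graph Store Protocol URL that addresses one named graph. */
    static String forGraph(String graphStoreEndpoint, String graphUri) {
        try {
            return graphStoreEndpoint + "?graph=" + URLEncoder.encode(graphUri, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Assumed endpoint: adjust port, dataset name and service name to the actual server.
        String endpoint = "http://172.25.19.233:3030/ds/data";
        System.out.println(forGraph(endpoint, "http://example.org/graph1"));
    }
}
```

The graph URI has to be percent-encoded because it is itself a URI carried inside a query string.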
Re: Cleaning triples with Riot
Posted by Andy Seaborne <an...@apache.org>.
On 21/08/12 16:55, Leigh Dodds wrote:
> Hi,
>
> I'm doing some testing of TDB for a client. They have data in an older
> RDB database which accepted triples that TDB now rejects.
What's being rejected?
If it's syntax, then text processing n-triples is usually necessary.
> Is there a way I can run a data dump through riot to clean it (i.e.
> leaving only acceptable triples) or getting TDB to reject triples but
> continue to load the rest?
If you want to look at the triples and do some checking, then parse to
a Sink<Triple> and do the tests you want, or set up the
parser with a particular profile - ParserProfileChecker is the
validating one.
Whatever you do, doing it and producing a clean load file for TDB is
better than trying to fix up as you load.
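The sink idea amounts to a filtering decorator: the parser pushes each item into a sink, and a checking sink forwards only the items that pass a test to the sink that writes the clean load file. The sketch below shows the pattern in plain Java so that it runs without Jena. `ItemSink` is a local stand-in interface, not the Jena Sink type, and strings stand in for triples; all names here are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Local stand-in for a sink of parsed items; not the Jena interface itself. */
interface ItemSink<T> {
    void send(T item);
    void close();
}

/** Forwards only the items that pass the check and counts the rest. */
final class CheckingSink<T> implements ItemSink<T> {
    private final Predicate<T> check;
    private final ItemSink<T> next;
    private long rejected = 0;

    CheckingSink(Predicate<T> check, ItemSink<T> next) {
        this.check = check;
        this.next = next;
    }

    @Override
    public void send(T item) {
        if (check.test(item)) {
            next.send(item);
        } else {
            rejected++;
        }
    }

    @Override
    public void close() {
        next.close();
    }

    long rejectedCount() {
        return rejected;
    }
}

final class CheckingSinkDemo {
    public static void main(String[] args) {
        List<String> kept = new ArrayList<>();
        // Destination sink: here it just collects items; a real one would write a file.
        ItemSink<String> collector = new ItemSink<String>() {
            @Override public void send(String item) { kept.add(item); }
            @Override public void close() { }
        };
        // Hypothetical test: accept only lines that start with an IRI.
        CheckingSink<String> sink = new CheckingSink<>(s -> s.startsWith("<"), collector);
        sink.send("<s> <p> <o> .");
        sink.send("not a triple");
        sink.close();
        System.out.println(kept.size() + " kept, " + sink.rejectedCount() + " rejected");
    }
}
```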
Andy
>
> Apologies if this is an FAQ. I know others have hit this issue before,
> but couldn't find a good solution.
>
> Cheers,
>
> L.
>