Posted to users@jena.apache.org by Leigh Dodds <le...@ldodds.com> on 2012/08/21 17:55:57 UTC

Cleaning triples with Riot

Hi,

I'm doing some testing of TDB for a client. They have data in an older
RDB database which accepted triples that TDB now rejects.

Is there a way I can run a data dump through riot to clean it (i.e.
leaving only acceptable triples), or get TDB to reject bad triples but
continue to load the rest?

Apologies if this is an FAQ. I know others have hit this issue before,
but I couldn't find a good solution.

Cheers,

L.

-- 
Leigh Dodds
Freelance Technologist
Open Data, Linked Data Geek
t: @ldodds
w: ldodds.com
e: leigh@ldodds.com

Re: Cleaning triples with Riot

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hey Leigh,
I had this problem some months ago. At that time there was no
mechanism to "exclude" invalid triples, so I changed the Jena API and
included my own parsing process. But it was quite complicated and it
only works for my data. I think there are two less complicated ways.

*first:*
- you could look for an N-Triples parser and do the checking before you
load the dump into TDB. I used the Sesame API some time ago and it was quite good.

*second:*
- you could load the data into TDB triple by triple, and if loading fails on one
triple you just move on to the next one. But if you do this you can only use
tdbloader and not tdbloader2, which slows down the loading process immensely.
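
A minimal sketch of the second approach, assuming the load step is wrapped in a callback that throws a RuntimeException on a bad triple. The class name SkipBadTriples and the loader callback are hypothetical stand-ins, not part of Jena or tdbloader; a real loader would add the parsed triple to the TDB dataset instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class SkipBadTriples {

    /**
     * Passes each N-Triples line to the loader, one at a time.
     * A line whose load throws is recorded and skipped; loading continues
     * with the next line. Returns the rejected lines.
     */
    static List<String> loadSkippingFailures(List<String> lines, Consumer<String> loader) {
        List<String> rejected = new ArrayList<>();
        for (String line : lines) {
            try {
                loader.accept(line);
            } catch (RuntimeException ex) {
                rejected.add(line);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        List<String> loaded = new ArrayList<>();
        // Stand-in loader (assumption): rejects any line that does not end in " ."
        Consumer<String> loader = line -> {
            if (!line.endsWith(" .")) {
                throw new IllegalArgumentException("incomplete statement: " + line);
            }
            loaded.add(line);
        };
        List<String> rejected = loadSkippingFailures(Arrays.asList(
                "<http://example.org/s> <http://example.org/p> \"ok\" .",
                "<http://example.org/s> <http://example.org/p> \"truncated"), loader);
        System.out.println(loaded.size() + " loaded, " + rejected.size() + " rejected");
    }
}
```

Keeping the rejected lines makes it possible to inspect or repair them afterwards rather than losing them silently.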

Regards
Stefan


On 21.08.2012 17:55, Leigh Dodds wrote:
> Hi,
>
> I'm doing some testing of TDB for a client. They have data in an older
> RDB database which accepted triples that TDB now rejects.
>
> Is there a way I can run a data dump through riot to clean it (i.e.
> leaving only acceptable triples) or getting TDB to reject triples but
> continue to load the rest?
>
> Apologies if this is an FAQ. I know others have hit this issue before,
> but couldn't find a good solution.
>
> Cheers,
>
> L.
>


Re: Cleaning triples with Riot

Posted by Andy Seaborne <an...@apache.org>.
On 03/09/12 12:41, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Tue, Aug 28, 2012 at 04:53:55PM +0100, Andy Seaborne wrote:
>>> Can anyone tell me how to rewrite this portion of code so that the parser
>>> will throw an exception for the invalid integer ?
>>
>> No need - TDB ought to handle that (invalid values aren't supposed to be
>> errors - things just get a little less efficient).
>>
>> "riot --validate" will issue a warning.
>>
>> The TDB system is supposed to handle out-of-range number - plain old
>> bug, now fixed in SVN.
>
> Thank you. I replaced jena-tdb-0.9.3.jar with
> jena-tdb-0.9.4-20120829.061613-27.jar in my jena-2.7.3 distribution and was
> able to create the TDB. Unfortunately, dumping the TDB does not seem to
> be possible with the new jar:
>
>   Exception in thread "main" java.lang.NoClassDefFoundError: com/hp/hpl/jena/sparql/core/DatasetGraphTrackActive
>
> Well - you already said that replacing jar files in lib/ might break things.
> I will use the old TDB .jar for the dump.
>
> Nightly builds of the Jena distribution - not only of the components - would
> be nice.
>
> Regards,
>
> Michael Brunnbauer
>

Michael,

You'll need to use a consistent set of jars - all from the 20120829 build.
In this case, the version of ARQ now has DatasetGraphTrackActive in it,
moved up from TDB.

The two ways to get a consistent set of jars are:

1/ Use maven
2/ Download the combined distribution

https://repository.apache.org/content/groups/snapshots/org/apache/jena/apache-jena/2.7.4-SNAPSHOT/

currently:
apache-jena-2.7.4-20120903.055710-30.zip

which has all the jars in it

	Andy


Re: Cleaning triples with Riot

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Tue, Aug 28, 2012 at 04:53:55PM +0100, Andy Seaborne wrote:
> >Can anyone tell me how to rewrite this portion of code so that the parser
> >will throw an exception for the invalid integer ?
> 
> No need - TDB ought to handle that (invalid values aren't supposed to be 
> errors - things just get a little less efficient).
> 
> "riot --validate" will issue a warning.
> 
> The TDB system is supposed to handle out-of-range number - plain old 
> bug, now fixed in SVN.

Thank you. I replaced jena-tdb-0.9.3.jar with 
jena-tdb-0.9.4-20120829.061613-27.jar in my jena-2.7.3 distribution and was
able to create the TDB. Unfortunately, dumping the TDB does not seem to
be possible with the new jar:

 Exception in thread "main" java.lang.NoClassDefFoundError: com/hp/hpl/jena/sparql/core/DatasetGraphTrackActive

Well - you already said that replacing jar files in lib/ might break things.
I will use the old TDB .jar for the dump.

Nightly builds of the Jena distribution - not only of the components - would 
be nice.

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: Cleaning triples with Riot

Posted by Andy Seaborne <an...@apache.org>.
On 28/08/12 16:19, Michael Brunnbauer wrote:
...

> My current problem is that the tool will not remove this invalid graph from
> dbpedia 3.8 with an out of range integer:
>
> <http://dbpedia.org/resource/Ridgeland_Township,_Iroquois_County,_Illinois> <http://dbpedia.org/property/postalCode> "6095560968"^^<http://www.w3.org/2001/XMLSchema#int> <http://en.wikipedia.org/wiki/Ridgeland_Township,_Iroquois_County,_Illinois?oldid=443535342#absolute-line=77> .
>
> tdbloader2 will fail with this exception:
>
> com.hp.hpl.jena.datatypes.DatatypeFormatException: Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] during parse -org.apache.xerces.impl.dv.InvalidDatatypeValueException: cvc-maxInclusive-valid: Value '6095560968' is not facet-valid with respect to maxInclusive '2147483647' for type 'int'.
> 	at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.getValue(LiteralLabelImpl.java:326)
> 	at com.hp.hpl.jena.tdb.store.NodeId.inline(NodeId.java:210)
> 	at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:49)
> 	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
> 	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
> 	at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
> 	at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
> 	at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
> 	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
> 	at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
> 	at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
> 	at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
> 	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)

> Can anyone tell me how to rewrite this portion of code so that the parser
> will throw an exception for the invalid integer ?

No need - TDB ought to handle that (invalid values aren't supposed to be 
errors - things just get a little less efficient).

"riot --validate" will issue a warning.

The TDB system is supposed to handle out-of-range numbers - this was a
plain old bug, now fixed in SVN.
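
For illustration only, here is a rough sketch of the range test behind that warning, using nothing but the Java standard library. The class and method names are hypothetical and this is not how riot or TDB implement the check; it assumes the lexical form has no surrounding whitespace. The bounds are those of a 32-bit signed integer, matching the maxInclusive '2147483647' in the exception message.

```java
import java.math.BigInteger;

final class XsdIntCheck {
    private static final BigInteger MIN = BigInteger.valueOf(Integer.MIN_VALUE);
    private static final BigInteger MAX = BigInteger.valueOf(Integer.MAX_VALUE);

    /** True if the lexical form is an optionally signed decimal within the 32-bit range. */
    static boolean isValidXsdInt(String lexical) {
        if (!lexical.matches("[+-]?[0-9]+")) {
            return false;
        }
        // Drop an explicit plus sign before converting.
        String digits = lexical.startsWith("+") ? lexical.substring(1) : lexical;
        BigInteger value = new BigInteger(digits);
        return value.compareTo(MIN) >= 0 && value.compareTo(MAX) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidXsdInt("6095560968")); // false: above 2147483647
        System.out.println(isValidXsdInt("403"));        // true
    }
}
```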

> import com.hp.hpl.jena.rdf.model.Model;
> import com.hp.hpl.jena.rdf.model.ModelFactory;
>
>          Model model = ModelFactory.createDefaultModel();
>          try {
>              model.read(new StringReader(chunk.toString()), normalizeUrl(graph), "N-TRIPLE");
>          } catch (final RuntimeException ex) {
>
> I think that tdbloader* really really needs an option to ignore invalid
> triples/quads instead of throwing an exception. The data out there will always
> be messy and even the DBpedia people do not get it right.

6 billion isn't that big ... choosing xsd:int was a bit limiting.

OK - it is in this case as the right answer is 403.

http://en.wikipedia.org/wiki/Ridgeland_Township,_Iroquois_County,_Illinois

Why anyone uses xsd:int is a mystery to me.

	Andy


>
> Regards,
>
> Michael Brunnbauer
>
> On Fri, Aug 24, 2012 at 01:03:26PM +0100, Andy Seaborne wrote:
>> On 21/08/12 16:55, Leigh Dodds wrote:
>>> Hi,
>>>
>>> I'm doing some testing of TDB for a client. They have data in an older
>>> RDB database which accepted triples that TDB now rejects.
>>
>> What's being rejected?
>>
>> If it's syntax, then text processing n-triples is usually necessary.
>>
>>> Is there a way I can run a data dump through riot to clean it (i.e.
>>> leaving only acceptable triples) or getting TDB to reject triples but
>>> continue to load the rest?
>>
>> If you want to look at the triples and do some checking, then parse to
>> a Sink<Triple> and do the tests you want, or do it via a call that sets
>> up the parser with a particular profile - ParserProfileChecker is the
>> validating one.
>>
>> Whatever you do, doing it and producing a clean load file for TDB is
>> better than trying to fix up as you load.
>>
>> 	Andy
>>
>>>
>>> Apologies if this is an FAQ. I know others have hit this issue before,
>>> but couldn't find a good solution.
>>>
>>> Cheers,
>>>
>>> L.
>>>
>


Re: Cleaning triples with Riot

Posted by Michael Brunnbauer <br...@netestate.de>.
Hi all,

Please find attached the source code for a (sloppily written) tool to clean
compressed N-Quads dumps for tdbloader2. The tool assumes that the named graphs
are not scattered in the dump and processes one named graph at a time. If there
is an exception, the corresponding named graph is not written to stdout.
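
The per-graph filtering described above can be sketched roughly as follows. This is not the attached tool: the class name, the graphOf helper, and the isValid predicate are hypothetical, and the sketch assumes every line is a complete quad whose last term is the graph label and that all lines of a graph are consecutive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class GraphFilter {

    /** Returns the last term of an N-Quads line, taken here to be the graph label. */
    static String graphOf(String line) {
        String s = line.trim();
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        String[] terms = s.split("\\s+");
        return terms[terms.length - 1];
    }

    /** Keeps only the graphs in which every line passes the validity test. */
    static List<String> keepCleanGraphs(List<String> lines, Predicate<String> isValid) {
        List<String> out = new ArrayList<>();
        List<String> chunk = new ArrayList<>();
        String current = null;
        boolean clean = true;
        for (String line : lines) {
            String graph = graphOf(line);
            if (!graph.equals(current)) {
                // A new graph starts: emit the previous one only if it was clean.
                if (clean) {
                    out.addAll(chunk);
                }
                chunk.clear();
                current = graph;
                clean = true;
            }
            chunk.add(line);
            if (!isValid.test(line)) {
                clean = false;
            }
        }
        if (clean) {
            out.addAll(chunk);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "<http://example.org/a> <http://example.org/p> \"1\" <http://example.org/g1> .",
                "<http://example.org/a> <http://example.org/p> \"BAD\" <http://example.org/g1> .",
                "<http://example.org/b> <http://example.org/p> \"2\" <http://example.org/g2> .");
        // Stand-in validity test (assumption): any line containing BAD is invalid.
        keepCleanGraphs(dump, line -> !line.contains("BAD")).forEach(System.out::println);
    }
}
```

Buffering one graph at a time keeps memory use bounded by the largest graph, which is why the consecutive-lines assumption matters.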

My current problem is that the tool will not remove this invalid graph from
dbpedia 3.8 with an out of range integer:

<http://dbpedia.org/resource/Ridgeland_Township,_Iroquois_County,_Illinois> <http://dbpedia.org/property/postalCode> "6095560968"^^<http://www.w3.org/2001/XMLSchema#int> <http://en.wikipedia.org/wiki/Ridgeland_Township,_Iroquois_County,_Illinois?oldid=443535342#absolute-line=77> .

tdbloader2 will fail with this exception:

com.hp.hpl.jena.datatypes.DatatypeFormatException: Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] Lexical form '6095560968' is not a legal instance of Datatype[http://www.w3.org/2001/XMLSchema#int -> class java.lang.Integer] during parse -org.apache.xerces.impl.dv.InvalidDatatypeValueException: cvc-maxInclusive-valid: Value '6095560968' is not facet-valid with respect to maxInclusive '2147483647' for type 'int'.
	at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.getValue(LiteralLabelImpl.java:326)
	at com.hp.hpl.jena.tdb.store.NodeId.inline(NodeId.java:210)
	at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:49)
	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
	at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
	at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
	at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
	at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
	at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)

I am not very confident with Java and the Jena API and my Java programmer
is currently not available.

Can anyone tell me how to rewrite this portion of code so that the parser
will throw an exception for the invalid integer ?

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

        Model model = ModelFactory.createDefaultModel();
        try {
            model.read(new StringReader(chunk.toString()), normalizeUrl(graph), "N-TRIPLE");
        } catch (final RuntimeException ex) {

I think that tdbloader* really really needs an option to ignore invalid 
triples/quads instead of throwing an exception. The data out there will always
be messy and even the DBpedia people do not get it right.

Regards,

Michael Brunnbauer

On Fri, Aug 24, 2012 at 01:03:26PM +0100, Andy Seaborne wrote:
> On 21/08/12 16:55, Leigh Dodds wrote:
> >Hi,
> >
> >I'm doing some testing of TDB for a client. They have data in an older
> >RDB database which accepted triples that TDB now rejects.
> 
> What's being rejected?
> 
> If it's syntax, then text processing n-triples is usually necessary.
> 
> >Is there a way I can run a data dump through riot to clean it (i.e.
> >leaving only acceptable triples) or getting TDB to reject triples but
> >continue to load the rest?
> 
> If you want to look at the triples and do some checking, then parse to
> a Sink<Triple> and do the tests you want, or do it via a call that sets
> up the parser with a particular profile - ParserProfileChecker is the
> validating one.
> 
> Whatever you do, doing it and producing a clean load file for TDB is 
> better than trying to fix up as you load.
> 
> 	Andy
> 
> >
> >Apologies if this is an FAQ. I know others have hit this issue before,
> >but couldn't find a good solution.
> >
> >Cheers,
> >
> >L.
> >

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: Cleaning triples with Riot

Posted by Frank Lee <fr...@yahoo.com>.
Hi, Andy, 

It's easy to get the model for a named graph from a local TDB dataset:

String tdbDir = "c:\\tdb";
Dataset ds = TDBFactory.createDataset(tdbDir);

String ngUri = "http://xxx ..";
Model model = ds.getNamedModel(ngUri);

However, if we run a Fuseki server with TDB remotely, how can we get the model for the specified named graph?
For instance, the Fuseki server runs on a remote Linux machine with IP address 172.25.19.233, and the TDB directory is /home/tdb.

Thanks.

Frank


+++



________________________________
 From: Andy Seaborne <an...@apache.org>
To: users@jena.apache.org 
Sent: Friday, August 24, 2012 5:03 AM
Subject: Re: Cleaning triples with Riot
 
On 21/08/12 16:55, Leigh Dodds wrote:
> Hi,
>
> I'm doing some testing of TDB for a client. They have data in an older
> RDB database which accepted triples that TDB now rejects.

What's being rejected?

If it's syntax, then text processing n-triples is usually necessary.

> Is there a way I can run a data dump through riot to clean it (i.e.
> leaving only acceptable triples) or getting TDB to reject triples but
> continue to load the rest?

If you want to look at the triples and do some checking, then parse to
a Sink<Triple> and do the tests you want, or do it via a call that sets
up the parser with a particular profile - ParserProfileChecker is the
validating one.

Whatever you do, doing it and producing a clean load file for TDB is 
better than trying to fix up as you load.

    Andy

>
> Apologies if this is an FAQ. I know others have hit this issue before,
> but couldn't find a good solution.
>
> Cheers,
>
> L.
>

Re: Cleaning triples with Riot

Posted by Andy Seaborne <an...@apache.org>.
On 21/08/12 16:55, Leigh Dodds wrote:
> Hi,
>
> I'm doing some testing of TDB for a client. They have data in an older
> RDB database which accepted triples that TDB now rejects.

What's being rejected?

If it's syntax, then text processing n-triples is usually necessary.

> Is there a way I can run a data dump through riot to clean it (i.e.
> leaving only acceptable triples) or getting TDB to reject triples but
> continue to load the rest?

If you want to look at the triples and do some checking, then parse to
a Sink<Triple> and do the tests you want, or do it via a call that sets
up the parser with a particular profile - ParserProfileChecker is the
validating one.

Whatever you do, doing it and producing a clean load file for TDB is 
better than trying to fix up as you load.

	Andy

>
> Apologies if this is an FAQ. I know others have hit this issue before,
> but couldn't find a good solution.
>
> Cheers,
>
> L.
>