Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/04/17 18:46:37 UTC

tdbloader skip bad file

I'm trying to tdbload several .rdf files like this

    $ tdbloader --quiet --graph=... --loc=... <file1.rdf> <file2.rdf> <file3.rdf> ...

The problem is that if one file raises an exception (e.g. a bad IRI), the whole batch is dropped and no triples are loaded from any file. I've tried calling tdbloader once per file, but that seems significantly slower.
Is there a command-line argument I can use to tell tdbloader to skip bad .rdf files but keep loading the good ones?

Re: tdbloader skip bad file

Posted by "A. Soroka" <aj...@virginia.edu>.
You can file a ticket for that functionality at the Jena JIRA instance:

https://issues.apache.org/jira/browse/JENA

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 10:28 AM, Laura Morales <la...@mail.com> wrote:
> 
>> Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.
> 
> It would be very handy if riot had an "--graph=..." option as well, such that I could immediately output all XML files into n-quads with a graph label (and `cat` all of them into a single .nq file).


Re: tdbloader skip bad file

Posted by "A. Soroka" <aj...@virginia.edu>.
One of the several advantages of N-Triples (and this is not an accident) is how easy it is to use standard Posix tools with it, e.g. cut, sed, grep, etc.
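
As a rough illustration of that point (a sketch, not something posted in the thread): because N-Triples puts one statement on each line, ordinary line-oriented tools can count or slice the data. The function names are made up for the example, and the field-based helpers assume the terms are separated by single spaces and that the input has no comment lines.

```shell
# Hypothetical helpers that read line-oriented N-Triples on stdin.

# Count statements, ignoring blank lines and comment lines.
nt_count() {
    grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#'
}

# List the distinct subjects (the first space-separated field).
nt_subjects() {
    cut -d' ' -f1 | sort -u
}

# Keep only the statements whose predicate is the IRI given as $1.
# Subjects never contain spaces, so the predicate is always field 2.
nt_with_predicate() {
    awk -v p="<$1>" '$2 == p'
}
```

For instance, `nt_count < data.nt` would print the number of statements in a file called data.nt (a placeholder name).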

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 11:46 AM, Laura Morales <la...@mail.com> wrote:
> 
>> In the meantime, you can use something like sed for this, something like: sed -e "s|\(.*\)|\1 <mygraphuri>|"
> 
> ah, right! This is a good suggestion. This seems to work: sed "s/\(.*\) \.$/\1 <graph> ./"  (all triples have a period at the end).
> I think I'll use this until RIOT has a --graph option that would be much more easy to work with :)


Re: tdbloader skip bad file

Posted by Laura Morales <la...@mail.com>.
> In the meantime, you can use something like sed for this, something like: sed -e "s|\(.*\)|\1 <mygraphuri>|"

Ah, right! This is a good suggestion. This seems to work: sed "s/\(.*\) \.$/\1 <graph> ./"  (all triples have a period at the end).
I think I'll use this until RIOT has a --graph option, which would be much easier to work with :)
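
A minimal sketch of that workaround wrapped in a function, for anyone who wants to reuse it. The graph label has to go before the terminating period for the result to be valid N-Quads, which is what the second sed expression does. The function name and the example IRIs are hypothetical, and the sketch assumes every statement ends in " ." and that the graph IRI contains no `|`, `&` or backslash characters (they would be interpreted by sed).

```shell
# Hypothetical helper: read N-Triples on stdin, write N-Quads on stdout,
# placing every statement in the graph whose IRI is given as $1.
nt_to_nq() {
    graph_uri=$1
    # Insert "<graph>" just before the final " ." of each line.
    # "|" is the sed delimiter because IRIs contain "/".
    sed "s|\(.*\) \.\$|\1 <${graph_uri}> .|"
}
```

Something like `riot --output=ntriples file.rdf | nt_to_nq http://example.org/graph >> all.nq` would then build up a single N-Quads file; the file names and the IRI here are placeholders.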

Re: tdbloader skip bad file

Posted by "A. Soroka" <aj...@virginia.edu>.
In the meantime, you can use sed for this, something like: sed -e "s|\(.*\)|\1 <mygraphuri>|"

---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 10:28 AM, Laura Morales <la...@mail.com> wrote:
> 
>> Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.
> 
> It would be very handy if riot had an "--graph=..." option as well, such that I could immediately output all XML files into n-quads with a graph label (and `cat` all of them into a single .nq file).


Re: tdbloader skip bad file

Posted by Laura Morales <la...@mail.com>.
> Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.

It would be very handy if riot had an "--graph=..." option as well, such that I could immediately output all XML files into n-quads with a graph label (and `cat` all of them into a single .nq file).

Re: tdbloader skip bad file

Posted by "A. Soroka" <aj...@virginia.edu>.
If you don't have a specific reason to use RDF/XML inside your workflow, you almost certainly shouldn't. It's one of the most expensive RDF serializations to process. Convert to something cheaper (preferably stream-able, like N-triples, as Andy says) as early as possible.

As for the costs of validation, depending on your operating resources, it might be worthwhile to use something like GNU parallel or xargs -P to run several riot invocations in parallel. That will only pay off if the startup time for riot is very small compared to the time it takes to run over a given file, which will depend on the size of your files. In this case it seems unlikely to help much, but it may be useful at a different time. The load itself cannot be parallelized this way: only one process at a time can act against a given TDB database, so you can run only one tdbloader at a time against it.
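
A sketch of the xargs -P variant, in case it is useful. The function name is hypothetical, and it assumes an xargs that supports -0 and -P (GNU and BSD xargs do) and a riot on the PATH whose exit code is 0 for a valid file, as described elsewhere in this thread.

```shell
# Hypothetical helper: validate files with riot, up to $1 at a time, and
# print the names of the files that pass. Output order is not guaranteed.
validate_parallel() {
    jobs=$1
    shift
    # With no files, printf would still emit one empty name, so stop here.
    [ "$#" -gt 0 ] || return 0
    # NUL-separated names keep file names with spaces intact.
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$jobs" sh -c '
            riot --validate "$1" >/dev/null 2>&1 && printf "%s\n" "$1"
            exit 0' sh
}
```

For example, `validate_parallel 4 *.rdf > valid-files.txt` would run four validations at a time; the output file name is a placeholder. The inner `exit 0` only stops xargs from reporting an error for each invalid file.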


---
A. Soroka
The University of Virginia Library

> On Apr 18, 2017, at 5:38 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> 
> On 18/04/17 10:19, Laura Morales wrote:
>>> riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.
>>> 
>>> So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.
>> 
>> Thank you.
>> Unfortunately, running "riot --validate" on each file doesn't seem much faster than running tdbloader on each file individually. Processing all the files seems to take approximately the same time.
>> 
> 
> running tdbloader with bad data can corrupt the database.
> 
> It's a bulk loader - not a fix-up-the data tool.
> 
> If they take about the same time, then the parse costs dominate - which is possible with RDF/XML on small data files.
> 
> If performance matters, parse/validate and output N-triples, then load the N-triples.
> 
>    Andy


Re: tdbloader skip bad file

Posted by Andy Seaborne <an...@apache.org>.

On 18/04/17 10:19, Laura Morales wrote:
>> riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.
>>
>> So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.
>
> Thank you.
> Unfortunately, running "riot --validate" on each file doesn't seem much faster than running tdbloader on each file individually. Processing all the files seems to take approximately the same time.
>

Running tdbloader with bad data can corrupt the database.

It's a bulk loader - not a fix-up-the-data tool.

If they take about the same time, then the parse costs dominate - which 
is possible with RDF/XML on small data files.

If performance matters, parse/validate and output N-triples, then load 
the N-triples.
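
A minimal sketch of that convert-first step, so each input is parsed only once. The function name is hypothetical; it assumes riot is on the PATH, that `--output=ntriples` selects N-Triples output, that riot exits non-zero when a file cannot be parsed, and that every input file name has an extension to strip.

```shell
# Hypothetical helper: convert each input file to N-Triples next to the
# original (foo.rdf -> foo.nt) and print the names of the files written.
to_ntriples() {
    for f in "$@"; do
        out="${f%.*}.nt"
        if riot --output=ntriples "$f" >"$out"; then
            printf '%s\n' "$out"
        else
            # Drop the partial output of a file that failed to parse.
            rm -f "$out"
            printf 'skipping invalid file: %s\n' "$f" >&2
        fi
    done
}
```

The .nt files that survive can then be given to a single tdbloader call; riot's own error messages still go to stderr, so the broken inputs can be identified and fixed.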

     Andy

Re: tdbloader skip bad file

Posted by Laura Morales <la...@mail.com>.
> riot sets the Unix return code to 0 on success and 1 on failure in the usual Unix fashion.
> 
> So build up a list of valid files by looping on the input files then load all the valid ones in one go with tdbloader.

Thank you.
Unfortunately, running "riot --validate" on each file doesn't seem much faster than running tdbloader on each file individually. Processing all the files seems to take approximately the same time.

Re: tdbloader skip bad file

Posted by Andy Seaborne <an...@apache.org>.


On 17/04/17 22:56, Laura Morales wrote:
>> Check the data before loading.
>>
>> This is generally good practice.
>>
>> Call "riot --validate" before loading to check each file.
>
>
> Let's say I've downloaded these RDF files [1]. Some of those files are broken. How can I check-and-load all those files with a bash script? Should I loop all files, call riot for each of them singularly, then parse the riot output for each file?
>
> [1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml

riot sets the Unix return code to 0 on success and 1 on failure in the 
usual Unix fashion.

So build up a list of valid files by looping on the input files then 
load all the valid ones in one go with tdbloader.

The broken ones need fixing to be loadable.
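
A sketch of what such a loop could look like, written as a function so it can be dropped into a script. The function name is made up; the sketch assumes riot and tdbloader are on the PATH and relies only on the exit code described above and on the tdbloader options already used in the original question.

```shell
# Hypothetical helper: validate each file with riot, then bulk-load all the
# valid ones with a single tdbloader call.
#   $1 = TDB location, $2 = graph IRI, remaining arguments = input files
load_valid() {
    loc=$1
    graph=$2
    shift 2
    n=$#
    # Rotate through the arguments: take each file off the front and put it
    # back at the end only if riot accepts it.
    while [ "$n" -gt 0 ]; do
        f=$1
        shift
        n=$((n - 1))
        if riot --validate "$f" >/dev/null 2>&1; then
            set -- "$@" "$f"
        else
            printf 'skipping invalid file: %s\n' "$f" >&2
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo 'no valid files to load' >&2
        return 1
    fi
    tdbloader --quiet --graph="$graph" --loc="$loc" "$@"
}
```

It would be called as, say, `load_valid /path/to/db http://example.org/graph *.rdf`, where the location and graph IRI are placeholders. The skipped file names go to stderr so they can be collected and fixed later.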

     Andy

Re: tdbloader skip bad file

Posted by Laura Morales <la...@mail.com>.
> Check the data before loading.
> 
> This is generally good practice.
> 
> Call "riot --validate" before loading to check each file.


Let's say I've downloaded these RDF files [1]. Some of those files are broken. How can I check and load all those files with a bash script? Should I loop over all the files, call riot on each of them individually, then parse the riot output for each file?

[1] https://svn.apache.org/repos/asf/comdev/projects.apache.org/data/projects.xml

Re: tdbloader skip bad file

Posted by Andy Seaborne <an...@apache.org>.

On 17/04/17 19:46, Laura Morales wrote:
> I'm trying to tdbload several .rdf files like this
>
>     $ tdbloader --quiet --graph=... --loc=... <file1.rdf> <file2.rdf> <file3.rdf> ...
>
> problem is, if one file raises an exception (eg. bad IRI), the whole bunch is dropped, and no triples are loaded from any file. I've tried calling tdbloader for each file, but it seems significantly slower.

Yes.

If the database is empty, tdbloader can use its optimized loading;
otherwise it has to add the data with special care as to index creation,
which is much less efficient.

> Is there some command line argument that I can use to tell tdbloader to skip bad .rdf files, but keep loading the good ones?

Check the data before loading.

This is generally good practice.

Call "riot --validate" before loading to check each file.

     Andy