Posted to dev@jena.apache.org by "Andy Seaborne (Jira)" <ji...@apache.org> on 2021/04/10 09:57:00 UTC
[jira] [Commented] (JENA-2083) Support skipping/ignoring errors with tdbloader

    [ https://issues.apache.org/jira/browse/JENA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318462#comment-17318462 ] 

Andy Seaborne commented on JENA-2083:
-------------------------------------

Hi there - how much data is there (in triples)?

> some of the files are incorrectly serialized,

What sort of errors? Some errors are limited to the current triple, but others would cause several following triples to be dropped while the parser does some kind of recovery.

> It is not feasible right now to sort out the defective files from the good ones before running tdbloader.

The best approach is to validate the files first by running them through {{riot}}. Parsing is faster than loading and, with separate files, can be done in parallel.
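As a rough sketch of that validation pass, the Python below runs one {{riot --validate}} process per file on a small thread pool and sorts the files into good and bad lists. The function names, the worker count, and the assumption that a non-zero exit status marks a bad file are illustrative and not taken from the Jena documentation, so check how your version of {{riot}} reports errors before relying on it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def validate_file(path, command=("riot", "--validate")):
    """Run the validator on one file; return (path, ok, stderr).

    Assumption: the validator exits with a non-zero status when the
    file has errors. Verify this for your riot version.
    """
    result = subprocess.run([*command, str(path)], capture_output=True, text=True)
    return path, result.returncode == 0, result.stderr


def partition_files(paths, command=("riot", "--validate"), workers=4):
    """Validate files in parallel and split them into (good, bad) lists."""
    good, bad = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so both lists stay sorted as given.
        results = pool.map(lambda p: validate_file(p, command), paths)
        for path, ok, _stderr in results:
            (good if ok else bad).append(path)
    return good, bad
```

Calling {{partition_files(sorted(pathlib.Path("data").glob("*.nt")))}} (the directory name is a placeholder) gives the list of files that are safe to hand to the loader and the list that needs repair. Threads are enough here because each one only waits on an external process.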

With any kind of recovery, the end result is that you don't know what data has actually been loaded and so when using the data, there will be unpredictable problems (e.g. queries mysteriously not matching). Such problems are painful and expensive to diagnose and fix. As a general rule, bad data in a database is hard and time-consuming to fix.

N-Triples can be processed by text handling tools, which is useful for patching up systematic errors.

tdb2.tdbloader provides a number of algorithms. Some work by manipulating the transaction system and some work with parallelism, both of which make error recovery hard and have an impact on performance.

The basic algorithm is transactional (as is loading into Fuseki with a TDB2 database). A file, or set of files, will load completely or not at all and leave the database in the state it was in before.
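To illustrate, here is a minimal sketch of driving one such all-or-nothing load from Python. The helper name and the database path are made up for the example; {{--loader=basic}} and {{--loc}} are the tdb2.tdbloader options for choosing the algorithm and the database directory.

```python
import subprocess


def build_loader_command(db_dir, files, loader="basic"):
    """Build a single tdb2.tdbloader invocation covering all given files."""
    return ["tdb2.tdbloader", f"--loader={loader}", "--loc", str(db_dir),
            *[str(f) for f in files]]


def load_files(db_dir, files, loader="basic"):
    """Run the loader once; return True if it exited with status 0.

    Assumption: a failed load is reported through a non-zero exit status.
    """
    command = build_loader_command(db_dir, files, loader)
    return subprocess.run(command).returncode == 0
```

With many thousands of files the argument list may exceed the operating system's command-line length limit, in which case the files would need to be concatenated or loaded in batches.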



> Support skipping/ignoring errors with tdbloader
> -----------------------------------------------
>
>                 Key: JENA-2083
>                 URL: https://issues.apache.org/jira/browse/JENA-2083
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: TDB, TDB2
>            Reporter: Timothy Higinbottom
>            Priority: Major
>
> Hi all,
> I have a fairly large (~22,000) number of N-Triples files I hope to import into TDB2 to query with Fuseki.
> I boosted the RAM allotted to the JVM and used the parallel mode from tdb2.tdbloader. This whizzed through the first 1,000 of the files.
> However, some of the files are incorrectly serialized, so they caused errors when Jena tried to read them. It is not feasible right now to sort out the defective files from the good ones before running tdbloader.
> It would be great if tdbloader could add an option to skip the files that error so that it can continue to process the other files.
> The main reason this should be part of tdbloader itself is that the alternative (running xargs or a loop in Bash) decreases performance because then the loading is effectively synchronous and the user can't take advantage of the tdbloader modes and batching.
> Thanks for this great project!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)