Posted to dev@jena.apache.org by "Rob Vesse (JIRA)" <ji...@apache.org> on 2017/04/19 08:35:41 UTC
[jira] [Commented] (JENA-1325) RIOT parse many files at once, output only valid ones

    [ https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974317#comment-15974317 ] 

Rob Vesse commented on JENA-1325:
---------------------------------

I don't believe that this is something that can be supported. What you need to understand is that RIOT is a streaming tool, i.e. it continuously reads just enough data to be able to produce the next triple/quad. It does not read a file in its entirety and then process it; it reads and processes the file as it goes, so triples from a file may already have been output by the time an error is found later in that same file. Reading files in their entirety is not scalable because it requires caching the data in memory. Yes, your files are mainly small and suitable for this, but many people's files will not be!

Please remember that the command line tools are not intended to do everything. Everything they do is backed by Jena code, so if a command line tool does not have the exact behaviour you require, you should be able to write your own tool using the Jena APIs.
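
As a rough illustration of what such a tool could look like, here is a minimal sketch of the "buffer each file, emit it only if it parsed cleanly" approach. It deliberately uses only the Java standard library: ValidOnlyWriter and FileConverter are hypothetical names invented for this example and are not part of Jena. The comments indicate where Jena calls such as RDFDataMgr.read and RDFDataMgr.write would plug in. This trades memory for the all-or-nothing behaviour, so it only suits inputs that are individually small.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Jena class): writes output only for files that convert cleanly. */
final class ValidOnlyWriter {

    /**
     * Hypothetical hook: parse one input and write its triples to out,
     * throwing on any parse error. With Jena on the classpath, an
     * implementation could read the file into a Model using
     * RDFDataMgr.read(model, input.toString()) and serialise it using
     * RDFDataMgr.write(out, model, Lang.NTRIPLES), letting RiotException
     * propagate. That wiring is an assumption and is not shown here.
     */
    @FunctionalInterface
    interface FileConverter {
        void convert(Path input, OutputStream out) throws Exception;
    }

    /** Converts each input in turn; returns the inputs that were skipped. */
    static List<Path> writeValid(List<Path> inputs, FileConverter converter, OutputStream sink) {
        List<Path> skipped = new ArrayList<>();
        for (Path input : inputs) {
            // Buffer the whole file's output so a late error leaves nothing behind.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try {
                converter.convert(input, buffer);
            } catch (Exception e) {
                // Any failure discards the buffered output for this file only.
                skipped.add(input);
                continue;
            }
            try {
                buffer.writeTo(sink);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return skipped;
    }
}
```

Because everything runs in one JVM, the per-file start-up cost of invoking the command line tool repeatedly goes away, and the returned list shows which files were rejected.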

> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
>                 Key: JENA-1325
>                 URL: https://issues.apache.org/jira/browse/JENA-1325
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>         Environment: GNU/Linux
>            Reporter: Laura
>              Labels: easyfix, performance
>
> This issue is more or less related to this other one https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm using RIOT to validate them and dump the valid ones into N-Triples files. The problem is that calling RIOT on each file is not going to cut it. The overhead is significant enough that this operation is just too slow (hours). So I've tried to call RIOT only once on all files together using:
>     riot \
>         --verbose \
>         --stop \
>         --check \
>         --strict \
>         --output=nt \
>         files/*.rdf > files.nt
> and in this way validation is much faster. The problem is that it's still dumping invalid files to the .nt output file. I'm downloading these files from the Internet, so I'm not going to fix them myself; I just want to skip bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad data, and I'm not asking for this. I'm suggesting however to add an *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise any ERROR
> I think this is well in the scope of RIOT functionalities. Could this option please be added to RIOT?
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)