Posted to dev@jena.apache.org by "Luigi Asprino (Jira)" <ji...@apache.org> on 2021/06/14 14:53:00 UTC
[jira] [Created] (JENA-2117) Is it possible to ignore RiotParseException in Apache Jena?
Luigi Asprino created JENA-2117:
-----------------------------------
Summary: Is it possible to ignore RiotParseException in Apache Jena?
Key: JENA-2117
URL: https://issues.apache.org/jira/browse/JENA-2117
Project: Apache Jena
Issue Type: Question
Components: RIOT
Affects Versions: Jena 3.17.0
Reporter: Luigi Asprino
I'm parsing a file serialized in N-Quads format that contains some statements with bad characters (the Apache Jena parser throws a RiotParseException saying "Bad character encoding"). Is there any way (e.g. an RDFParser setting) to ignore such exceptions and continue parsing the rest of the file?
This is how I parse the file:
```
AtomicInteger ai = new AtomicInteger();
StreamRDF s = new StreamRDFBase() {
    @Override
    public void triple(Triple triple) {
        collect();
    }

    // Count every statement and print progress every 10,000 statements.
    private void collect() {
        // Use the value returned by incrementAndGet() so the check stays
        // correct when the sink is called from several threads.
        int n = ai.incrementAndGet();
        if (n % 10000 == 0) {
            System.out.println(n);
        }
    }

    @Override
    public void quad(Quad quad) {
        collect();
    }
};
InputStream is = new GZIPInputStream(new FileInputStream(new File("data.nq.gz")), 4 * 1024);
RDFParser.create().source(is).lang(Lang.NQUADS).strict(false).parse(s);
```
This is the file I'm trying to read: https://www.dropbox.com/s/yfrexouusz62m5n/data.nq.gz?dl=0
The first problem in the file (at least) is at line 899908:
```
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 899908, col: 154] Bad character encoding
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:72)
at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
```
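To inspect the offending input before parsing, the lines that fail to decode can be listed up front. The following is only a sketch using the Java standard library, under the assumption that "Bad character encoding" is caused by bytes that are not valid UTF-8; the class and method names are hypothetical, and the reported line numbers are not guaranteed to match the ones Jena prints.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jena.
class MalformedLineFinder {

    /** Returns the 1-based numbers of the lines whose bytes are not valid UTF-8. */
    public static List<Long> findMalformedLines(InputStream in) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<Long> bad = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        long lineNo = 1;
        int b;
        // Splitting on the raw byte 0x0A is safe: it never occurs inside
        // a multi-byte UTF-8 sequence.
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (!isValidUtf8(decoder, line.toByteArray())) {
                    bad.add(lineNo);
                }
                line.reset();
                lineNo++;
            } else {
                line.write(b);
            }
        }
        // Last line without a trailing newline.
        if (line.size() > 0 && !isValidUtf8(decoder, line.toByteArray())) {
            bad.add(lineNo);
        }
        return bad;
    }

    public static boolean isValidUtf8(CharsetDecoder decoder, byte[] bytes) {
        try {
            // decode(ByteBuffer) resets the decoder and runs a complete decoding pass.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample: the second line holds an invalid two-byte sequence.
        byte[] data = {'a', '\n', (byte) 0xC3, (byte) 0x28, '\n', 'c', '\n'};
        System.out.println(findMalformedLines(new ByteArrayInputStream(data)));
    }
}
```

For the real file, the `GZIPInputStream` from the snippet above could be passed in, preferably wrapped in a `BufferedInputStream`, since the sketch reads one byte at a time.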
I have experienced this "problem" many times and I found this workaround to cope with it.
```
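BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 4 * 1024);
br.lines().parallel().forEach(l -> {
    try {
        // Build a fresh parser for each line: a shared RDFParserBuilder
        // must not be modified from several threads at once.
        RDFParser.create().lang(Lang.NQUADS).fromString(l).parse(s);
    } catch (Exception e) {
        // Print the rejected line and move on to the next one.
        System.err.println(l);
    }
});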
```
However, this is slower, and it works only when each line of the input holds one complete statement (as in N-Triples and N-Quads).
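One possible alternative to parsing line by line is to drop the undecodable lines first and then run the parser once over the cleaned input. The sketch below uses only the Java standard library and assumes, again, that the failures come from bytes that are not valid UTF-8; the class and method names are hypothetical, and lines with other syntax errors would still make the parser stop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jena.
class ValidLineCopier {

    /** Copies every line that is valid UTF-8 to out; returns the number of skipped lines. */
    public static long copyValidLines(InputStream in, OutputStream out) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        long skipped = 0;
        int b;
        while ((b = in.read()) != -1) {
            line.write(b); // keep the newline so the output stays line-based
            if (b == '\n') {
                if (!writeIfValid(decoder, line, out)) {
                    skipped++;
                }
            }
        }
        // Last line without a trailing newline.
        if (line.size() > 0 && !writeIfValid(decoder, line, out)) {
            skipped++;
        }
        return skipped;
    }

    /** Writes the buffered line to out if it decodes cleanly, then clears the buffer. */
    private static boolean writeIfValid(CharsetDecoder decoder, ByteArrayOutputStream line,
                                        OutputStream out) throws IOException {
        byte[] bytes = line.toByteArray();
        line.reset();
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on malformed input
        } catch (CharacterCodingException e) {
            return false;
        }
        out.write(bytes);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample: the second line holds an invalid two-byte sequence.
        byte[] data = {'a', '\n', (byte) 0xC3, (byte) 0x28, '\n', 'c', '\n'};
        ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
        long skipped = copyValidLines(new ByteArrayInputStream(data), cleaned);
        System.out.println("skipped lines: " + skipped);
    }
}
```

The cleaned bytes could be written to a temporary file and handed to `RDFParser.create().source(...)` as in the first snippet, which keeps parsing to a single sequential pass at the cost of reading the data twice.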