Posted to dev@jena.apache.org by "Luigi Asprino (Jira)" <ji...@apache.org> on 2021/06/14 14:53:00 UTC
[jira] [Created] (JENA-2117) Is it possible to ignore RiotParseException in Apache Jena?

Luigi Asprino created JENA-2117:
-----------------------------------

             Summary: Is it possible to ignore RiotParseException in Apache Jena?
                 Key: JENA-2117
                 URL: https://issues.apache.org/jira/browse/JENA-2117
             Project: Apache Jena
          Issue Type: Question
          Components: RIOT
    Affects Versions: Jena 3.17.0
            Reporter: Luigi Asprino


I'm parsing a file serialized in N-Quads format that contains some statements with bad characters (the Apache Jena parser throws a RiotParseException saying "Bad character encoding"). Is there any way (e.g. an RDFParser setting) to ignore such exceptions and continue parsing the file?
 
This is how I parse the file:
 
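```
// Imports needed by this snippet
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPInputStream;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFBase;
import org.apache.jena.sparql.core.Quad;

AtomicInteger ai = new AtomicInteger();
StreamRDF s = new StreamRDFBase() {

    @Override
    public void triple(Triple triple) {
        collect();
    }

    @Override
    public void quad(Quad quad) {
        collect();
    }

    // Count every statement and print progress every 10000 statements.
    private void collect() {
        int n = ai.incrementAndGet();
        if (n % 10000 == 0) {
            System.out.println(n);
        }
    }
};

InputStream is = new GZIPInputStream(new FileInputStream(new File("data.nq.gz")), 4 * 1024);
RDFParser.create().source(is).lang(Lang.NQUADS).strict(false).parse(s);
```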

This is the file I'm trying to read: https://www.dropbox.com/s/yfrexouusz62m5n/data.nq.gz?dl=0
The first problem (at least) is at line 899908:

```
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 899908, col: 154] Bad character encoding
	at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
	at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
	at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
	at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:72)
	at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:53)
	at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
	at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
	at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
	at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
	at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
	at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
```
 
I have experienced this "problem" many times and I found this workaround to cope with it. 

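```
// "is" and "s" are the InputStream and StreamRDF from the snippet above.
BufferedReader br = new BufferedReader(new InputStreamReader(is), 4 * 1024);
br.lines().parallel().forEach(l -> {
    try {
        // Build a fresh parser for each line: an RDFParserBuilder is mutable,
        // so sharing one instance across parallel threads can mix up lines.
        RDFParser.create().lang(Lang.NQUADS).fromString(l).parse(s);
    } catch (Exception e) {
        // Print the line that failed to parse and carry on with the rest.
        System.err.println(l);
    }
});
```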
But this is slower and works only if the input file has one triple per line. 
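
A possible pre-processing step is to locate the offending lines before the data reaches the parser. The sketch below uses only the Java standard library. It assumes that the "Bad character encoding" error corresponds to byte sequences that are not valid UTF-8, which is not confirmed here. The names `Utf8LineCheck`, `isValidUtf8` and `badLines` are hypothetical, and the same one-statement-per-line limitation applies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache Jena.
class Utf8LineCheck {

    // Returns true if the bytes decode as strict UTF-8 (no replacement characters).
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the 1-based numbers of the lines that are not valid UTF-8.
    // Splitting on the byte 0x0A is safe because that byte never occurs
    // inside a multi-byte UTF-8 sequence.
    static List<Long> badLines(InputStream in) throws IOException {
        List<Long> bad = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        long lineNo = 1;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (!isValidUtf8(line.toByteArray())) {
                    bad.add(lineNo);
                }
                line.reset();
                lineNo++;
            } else {
                line.write(b);
            }
        }
        // Check a final line that has no trailing newline.
        if (line.size() > 0 && !isValidUtf8(line.toByteArray())) {
            bad.add(lineNo);
        }
        return bad;
    }
}
```

For the file above, `in` could be the same `GZIPInputStream`, ideally wrapped in a `BufferedInputStream` since the sketch reads one byte at a time. The reported lines could then be skipped or repaired before the whole file is handed to the parser in a single pass.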


