Posted to dev@jena.apache.org by "Andy Seaborne (Jira)" <ji...@apache.org> on 2021/06/14 17:09:00 UTC

[jira] [Reopened] (JENA-2117) Is it possible to ignore RiotParseException in Apache Jena?

     [ https://issues.apache.org/jira/browse/JENA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne reopened JENA-2117:
---------------------------------

> Is it possible to ignore RiotParseException in Apache Jena?
> -----------------------------------------------------------
>
>                 Key: JENA-2117
>                 URL: https://issues.apache.org/jira/browse/JENA-2117
>             Project: Apache Jena
>          Issue Type: Question
>          Components: RIOT
>    Affects Versions: Jena 3.17.0
>            Reporter: Luigi Asprino
>            Priority: Major
>
> I'm parsing a file serialized in N-Quads format that contains some statements with badly encoded characters (the Apache Jena parser throws a RiotParseException saying "Bad character encoding"). Is there any way (e.g. an RDFParser setting) to ignore such exceptions and continue parsing the rest of the file?
>  
> This is how I parse the file:
>  
> ```
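> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
> import java.util.concurrent.atomic.AtomicInteger;
> import java.util.zip.GZIPInputStream;
> import org.apache.jena.graph.Triple;
> import org.apache.jena.riot.Lang;
> import org.apache.jena.riot.RDFParser;
> import org.apache.jena.riot.system.StreamRDF;
> import org.apache.jena.riot.system.StreamRDFBase;
> import org.apache.jena.sparql.core.Quad;
>
> // Counts every triple or quad the parser emits and prints progress.
> AtomicInteger ai = new AtomicInteger();
> StreamRDF s = new StreamRDFBase() {
>     @Override
>     public void triple(Triple triple) {
>         collect();
>     }
>
>     @Override
>     public void quad(Quad quad) {
>         collect();
>     }
>
>     private void collect() {
>         // Use the value returned by incrementAndGet() so the progress
>         // check stays correct when called from several threads.
>         int n = ai.incrementAndGet();
>         if (n % 10000 == 0) {
>             System.out.println(n);
>         }
>     }
> };
> InputStream is = new GZIPInputStream(new FileInputStream(new File("data.nq.gz")), 4 * 1024);
> RDFParser.create().source(is).lang(Lang.NQUADS).strict(false).parse(s);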
> ```
> This is the file I'm trying to read: https://www.dropbox.com/s/yfrexouusz62m5n/data.nq.gz?dl=0
> The first problem in the file (at least) is at line 899908:
> ```
> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 899908, col: 154] Bad character encoding
> 	at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
> 	at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
> 	at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
> 	at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:72)
> 	at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:53)
> 	at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
> 	at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
> 	at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
> 	at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
> 	at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
> 	at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
> ```
>  
> I have run into this problem many times, and I use the following workaround to cope with it:
> ```
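> // Decode explicitly as UTF-8 instead of relying on the platform default charset.
> BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 4 * 1024);
> br.lines().parallel().forEach(l -> {
>     try {
>         // Create a fresh builder per line: a shared RDFParserBuilder is mutable,
>         // so setting its content from several threads at once is a race.
>         RDFParser.create().lang(Lang.NQUADS).fromString(l).parse(s);
>     } catch (Exception e) {
>         // Print the offending line and carry on with the rest.
>         System.err.println(l);
>     }
> });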
> ```
> But this is slower and works only if the input file has one triple per line. 
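
For illustration only, the line-by-line idea in the workaround above can be sketched without any Jena dependency as a strict UTF-8 check applied to each line before it reaches a parser. This is a minimal sketch, not a RIOT feature and not an answer to whether RIOT has such a setting: the class name `ValidLineFilter` and its methods are hypothetical, and it assumes the input is meant to be UTF-8 with one statement per line.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper: hands only well-formed UTF-8 lines to a consumer.
final class ValidLineFilter {

    /** Strictly decodes the bytes as UTF-8; returns null if they are malformed. */
    public static String decodeStrict(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /**
     * Splits the stream on '\n' (safe in UTF-8, where 0x0A never occurs inside
     * a multi-byte sequence), passes each well-formed line to goodLines, and
     * returns the number of lines skipped because of bad encoding.
     */
    public static long forEachValidLine(InputStream in, Consumer<String> goodLines)
            throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        long skipped = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b != '\n') {
                line.write(b);
                continue;
            }
            skipped += emit(line, goodLines);
        }
        if (line.size() > 0) {
            // Last line without a trailing newline.
            skipped += emit(line, goodLines);
        }
        return skipped;
    }

    // Returns 1 if the buffered line was skipped, 0 if it was delivered.
    private static int emit(ByteArrayOutputStream line, Consumer<String> goodLines) {
        String text = decodeStrict(line.toByteArray());
        line.reset();
        if (text == null) {
            return 1;
        }
        goodLines.accept(text);
        return 0;
    }
}
```

In the setting of this issue, the consumer would be something like `l -> RDFParser.create().lang(Lang.NQUADS).fromString(l).parse(s)`, still wrapped in a try/catch for other syntax errors. Because the sketch reads one byte at a time, the input stream should be wrapped in a `BufferedInputStream`. It shares the limitation noted above: it only works when each statement sits on its own line.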



--
This message was sent by Atlassian Jira
(v8.3.4#803005)