Posted to dev@jena.apache.org by "Andy Seaborne (Commented) (JIRA)" <ji...@apache.org> on 2012/03/21 20:09:43 UTC

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888 ] 

Andy Seaborne commented on JENA-225:
------------------------------------

This issue is not related to transactions per se.  Normally, node caching hides the fact that the DB has been corrupted by illegal UTF-8.

The transaction system just happens to highlight the problem because it works without the high-level node caches that make the actions idempotent.

The attached file shows it can happen for a raw storage dataset.  The code resets the system storage cache to remove all node table caches.

Also, in the code snippet, print out the size of the byte buffer after 'encode' and it will show that it is short.

The problem is in the encoding of chars to bytes.  The java.nio.Charset encoder needs "onMalformedInput(CodingErrorAction.REPLACE)" to be set and it isn't.  With that set, the bad Unicode codepoint (a high surrogate without a following low surrogate to make a surrogate pair) is replaced with a '?' character -- the standard Java charset replacement.
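
For illustration only (a minimal sketch, not the TDB code; the class name EncoderCheck and the use of the convenience encode(CharBuffer) call are just for the example), the setting makes the difference between a truncated buffer and a full-length one with '?' substituted:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;

    public class EncoderCheck {
        public static void main(String[] args) throws CharacterCodingException {
            String s = "Hello \uDAE0 World";   // lone high surrogate -- not legal Unicode

            // With REPLACE set, the bad codepoint becomes '?' (the encoder's default
            // replacement) and the output buffer has the full expected length.
            CharsetEncoder enc = Charset.forName("UTF-8").newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
            ByteBuffer bb = enc.encode(CharBuffer.wrap(s));
            System.out.println("encoded length = " + bb.remaining());

            // Without REPLACE, the convenience encode(CharBuffer) throws
            // MalformedInputException, and the lower-level
            // encode(CharBuffer, ByteBuffer, boolean) loop stops at the bad char unless
            // the CoderResult is checked -- which matches the short buffer seen after 'encode'.
        }
    }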

A more ambitious fix is to not use the Java encoders/decoders, which are sensitive to codepoint legality, and drop down to custom code that applies only the UTF-8 encoding rules without checking for legal codepoints.   This would make TDB robust, though something else may break when the data leaves the JVM and is read in elsewhere, because the data is not legal Unicode.
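
As a rough sketch of that approach (not the actual InStreamUTF8/OutStreamUTF8 code; the LenientUTF8 class and write method are invented for the example), each char is written with the UTF-8 bit patterns for its 16-bit value and no legality check:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Rule-only UTF-8 encoding of Java chars: no check for legal codepoints, so a lone
    // surrogate such as \uDAE0 is written as an ordinary three-byte sequence.  (Surrogate
    // pairs come out as two three-byte sequences rather than strict 4-byte UTF-8.)
    public class LenientUTF8 {
        public static void write(OutputStream out, CharSequence cs) throws IOException {
            for (int i = 0; i < cs.length(); i++) {
                int c = cs.charAt(i);
                if (c < 0x80) {                       // 1 byte:  0xxxxxxx
                    out.write(c);
                } else if (c < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
                    out.write(0xC0 | (c >> 6));
                    out.write(0x80 | (c & 0x3F));
                } else {                              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                    out.write(0xE0 | (c >> 12));
                    out.write(0x80 | ((c >> 6) & 0x3F));
                    out.write(0x80 | (c & 0x3F));
                }
            }
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(buf, "Hello \uDAE0 World");
            System.out.println("bytes written = " + buf.size());   // nothing dropped or replaced
        }
    }

A matching lenient decoder recovers the original chars, which keeps the node table round-trip stable even when the lexical form is not legal Unicode.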

Classes InStreamUTF8 and OutStreamUTF8 in ARQ show the encoding algorithm.  They are slightly slower (a few percent) than the standard Java encoders (which are probably native code) when used in RIOT on large files needing multiple seconds of decoding time. It will only show in TDB on very large literals (100k+ ?).  Normally, lexical forms are less than a few hundred bytes and the difference is not measurable (the custom codec may even be faster due to lower startup costs).  It is well below the rest of the database processing costs.
                
> TDB datasets can be corrupted by performing certain operations within a transaction 
> ------------------------------------------------------------------------------------
>
>                 Key: JENA-225
>                 URL: https://issues.apache.org/jira/browse/JENA-225
>             Project: Apache Jena
>          Issue Type: Bug
>    Affects Versions: TDB 0.9.0
>         Environment: jena-tdb-0.9.0-incubating
>            Reporter: Sam Tunnicliffe
>         Attachments: ReportBadUnicode1.java
>
>
> In a web application, we read some triples in an HTTP POST, using a LangTurtle instance and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8. 
> We then write the parsed Triples back out (to temporary storage) using OutputLangUtils.write. At some later time, these Triples are then re-read, again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8, before being inserted into a TDB dataset. 
> We have found it possible for the input data to contain character strings which pass through the various parsers/serializers but which cause TDB's transaction layer to fail in such a way that recovery from the journal is ineffective. 
> Eliminating transactions from the code path enables the database to be updated successfully.
> The stacktrace from TDB looks like this: 
> org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello 
> 	at org.openjena.riot.tokens.TokenizerText.exception(TokenizerText.java:1209)
> 	at org.openjena.riot.tokens.TokenizerText.readString(TokenizerText.java:620)
> 	at org.openjena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:248)
> 	at org.openjena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:112)
> 	at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(NodecSSE.java:105)
> 	at com.hp.hpl.jena.tdb.lib.NodeLib.decode(NodeLib.java:93)
> 	at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:234)
> 	at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:228)
> 	at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
> 	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(NodeTableTrans.java:188)
> 	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(NodeTableTrans.java:306)
> 	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(NodeTableTrans.java:266)
> 	at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(Transaction.java:131)
> 	at com.hp.hpl.jena.tdb.transaction.Transaction.commit(Transaction.java:112)
> 	at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(DatasetGraphTxn.java:40)
> 	at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(DatasetGraphTransaction.java:106)
> 	at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(DatasetGraphTrackActive.java:60)
> 	at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(DatasetImpl.java:143)
> At least part of the issue seems to stem from NodecSSE (I know this isn't actual Unicode escaping, but it's derived from the user input we've received). 
> import java.nio.ByteBuffer;
> import com.hp.hpl.jena.graph.Node;
> import com.hp.hpl.jena.tdb.lib.NodeLib;
> String s = "Hello \uDAE0 World";         // lone high surrogate
> Node literal = Node.createLiteral(s);
> ByteBuffer bb = NodeLib.encode(literal); // encoded buffer comes back short
> NodeLib.decode(bb);                      // throws RiotParseException: Broken token

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira