Posted to dev@jena.apache.org by "Sam Tunnicliffe (Created) (JIRA)" <ji...@apache.org> on 2012/03/21 18:59:40 UTC

[jira] [Created] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

TDB datasets can be corrupted by performing certain operations within a transaction 
------------------------------------------------------------------------------------

                 Key: JENA-225
                 URL: https://issues.apache.org/jira/browse/JENA-225
             Project: Apache Jena
          Issue Type: Bug
    Affects Versions: TDB 0.9.0
         Environment: jena-tdb-0.9.0-incubating
            Reporter: Sam Tunnicliffe


In a web application, we read some triples in an HTTP POST, using a LangTurtle instance and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8. 
We then write the parsed Triples back out (to temporary storage) using OutputLangUtils.write. At some later time, these Triples are re-read, again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8, before being inserted into a TDB dataset. 
We have found it possible for the input data to contain character strings which pass through the various parsers/serializers but which cause TDB's transaction layer to error in such a way as to make recovery from journals ineffective. 

Eliminating transactions from the code path enables the database to be updated successfully.

The stacktrace from TDB looks like this: 
{code}
org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello 
	at org.openjena.riot.tokens.TokenizerText.exception(TokenizerText.java:1209)
	at org.openjena.riot.tokens.TokenizerText.readString(TokenizerText.java:620)
	at org.openjena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:248)
	at org.openjena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:112)
	at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(NodecSSE.java:105)
	at com.hp.hpl.jena.tdb.lib.NodeLib.decode(NodeLib.java:93)
	at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:234)
	at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:228)
	at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(NodeTableTrans.java:188)
	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(NodeTableTrans.java:306)
	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(NodeTableTrans.java:266)
	at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(Transaction.java:131)
	at com.hp.hpl.jena.tdb.transaction.Transaction.commit(Transaction.java:112)
	at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(DatasetGraphTxn.java:40)
	at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(DatasetGraphTransaction.java:106)
	at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(DatasetGraphTrackActive.java:60)
	at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(DatasetImpl.java:143)
{code}

At least part of the issue seems to stem from NodecSSE (I know this isn't actual Unicode escaping, but it's derived from the user input we've received). 

{code}
String s = "Hello \uDAE0 World";         // contains an unpaired high surrogate
Node literal = Node.createLiteral(s);
ByteBuffer bb = NodeLib.encode(literal); // the returned buffer is short: the bad codepoint is dropped
NodeLib.decode(bb);                      // throws the RiotParseException shown above
{code}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-225:
-------------------------------

    Attachment: JENA-225-v1.patch

Potential fix that sets the charset encoder/decoder to replace bad codepoints with the default replacement char (a '?').

Caveat: string data does not round-trip; hashing and equality change; caches may be affected (needs checking).

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237177#comment-13237177 ] 

Andy Seaborne commented on JENA-225:
------------------------------------

TDB switched to using the binary-safe BlockUTF8.

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Sam Tunnicliffe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236455#comment-13236455 ] 

Sam Tunnicliffe commented on JENA-225:
--------------------------------------

The attached patch seems like a sensible solution to me. Yes, the strings aren't round-trippable, but the behaviour is predictable and, most importantly, safe for the DB.

[jira] [Issue Comment Edited] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888 ] 

Andy Seaborne edited comment on JENA-225 at 3/21/12 7:39 PM:
-------------------------------------------------------------

This issue is not related to transactions per se.  Normally, node caching hides the fact that the DB has been corrupted by illegal UTF-8.

The transaction system just happens to highlight the problem, as it works without the high-level node caches that make the actions idempotent.

The attached file shows it can happen for a raw storage dataset.  The code resets the system storage cache to remove all node table caches.

Also, in the code snippet, print out the size of the byte buffer after 'encode' and it will show it is short.

The problem is in the encoding of chars to bytes.  The java.nio charset encoder needs "onMalformedInput(CodingErrorAction.REPLACE)" to be set, and it isn't.  With that set, the encoder replaces the bad Unicode codepoint (a high surrogate without a following low surrogate to make a surrogate pair) with a '?' character -- the standard Java charset replacement.
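
A minimal sketch of that encoder setup (illustrative only, not the actual patch):

{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Illustrative sketch: a UTF-8 encoder with REPLACE set, so a lone high
// surrogate becomes '?' instead of silently shortening the output.
CharsetEncoder enc = Charset.forName("UTF-8").newEncoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE);
// encode(...) declares CharacterCodingException, but with REPLACE set it
// substitutes rather than failing on malformed input.
ByteBuffer bb = enc.encode(CharBuffer.wrap("Hello \uDAE0 World"));
{code}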

A more ambitious fix is to not use the Java encoders/decoders, which are sensitive to codepoint legality, and drop down to custom code that applies only the UTF-8 encoding rules, without checking for legal codepoints.  This would make TDB robust, though something else may break when the data leaves the JVM and is read in elsewhere, because the data is not legal Unicode.

Class BlockUTF8 is code to do String <-> ByteBuffer conversion. Classes InStreamUTF8 and OutStreamUTF8 in ARQ are the UTF-8 algorithm over input and output streams.  The latter are slightly slower (a few percent) than the standard Java encoders when used in RIOT on large files needing multiple seconds of decoding time.

Differences in speed will only show in TDB on very large literals (100k+?).  Normally, lexical forms are less than a few hundred bytes and the difference is not measurable (the custom codec may even be faster due to lower startup costs).  It is well below the rest of the database processing costs.
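
A minimal sketch of the codepoint-agnostic approach (illustrative; the real BlockUTF8 handles buffer management and more cases):

{code}
import java.nio.ByteBuffer;

// Illustrative sketch: emit the raw UTF-8 byte pattern for one 16-bit char
// without checking codepoint legality -- an unpaired surrogate is encoded
// like any other value in the three-byte range.
static void writeCharUtf8(char c, ByteBuffer out) {
    if (c < 0x80) {
        out.put((byte) c);                           // 0xxxxxxx
    } else if (c < 0x800) {
        out.put((byte) (0xC0 | (c >> 6)));           // 110xxxxx
        out.put((byte) (0x80 | (c & 0x3F)));         // 10xxxxxx
    } else {
        out.put((byte) (0xE0 | (c >> 12)));          // 1110xxxx
        out.put((byte) (0x80 | ((c >> 6) & 0x3F)));  // 10xxxxxx
        out.put((byte) (0x80 | (c & 0x3F)));         // 10xxxxxx
    }
}
{code}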

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235827#comment-13235827 ] 

Hudson commented on JENA-225:
-----------------------------

Integrated in Jena_ARQ #510 (See [https://builds.apache.org/job/Jena_ARQ/510/])
    Partial fix for JENA-225.
This does not fix the problem completely for TDB because strings are (still) not round-trip-safe. (Revision 1303934)

     Result = SUCCESS
andy : 
Files : 
* /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/atlas/lib/Chars.java

[jira] [Issue Comment Edited] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888 ] 

Andy Seaborne edited comment on JENA-225 at 3/21/12 7:32 PM:
-------------------------------------------------------------

This issue is not related to transactions per se.  Normally, node caching hides the fact that the DB has been corrupted by illegal UTF-8.

The transaction system just happens to highlight the problem, as it works without the high-level node caches that make the actions idempotent.

The attached file shows it can happen for a raw storage dataset.  The code resets the system storage cache to remove all node table caches.

Also, in the code snippet, print out the size of the byte buffer after 'encode' and it will show it is short.

The problem is in the encoding of chars to bytes.  The java.nio charset encoder needs "onMalformedInput(CodingErrorAction.REPLACE)" to be set, and it isn't.  With that set, the encoder replaces the bad Unicode codepoint (a high surrogate without a following low surrogate to make a surrogate pair) with a '?' character -- the standard Java charset replacement.

A more ambitious fix is to not use the Java encoders/decoders, which are sensitive to codepoint legality, and drop down to custom code that applies only the UTF-8 encoding rules, without checking for legal codepoints.  This would make TDB robust, though something else may break when the data leaves the JVM and is read in elsewhere, because the data is not legal Unicode.

Classes InStreamUTF8 and OutStreamUTF8 in ARQ show the encoding algorithm.  They are slightly slower (a few percent) than the standard Java encoders when used in RIOT on large files needing multiple seconds of decoding time. It will only show in TDB on very large literals (100k+?).  Normally, lexical forms are less than a few hundred bytes and the difference is not measurable (the custom codec may even be faster due to lower startup costs).  It is well below the rest of the database processing costs.

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236528#comment-13236528 ] 

Andy Seaborne commented on JENA-225:
------------------------------------

The patch should stop DB crashes, but I'm not sure what's going to happen if the lexical form is mangled by the Java decoder and it puts a "?" in.  That changes its Java hash and its MD5 hash, so it might lead to inconsistency.

So the proper fix is to use a codec that is binary-robust.  BlockUTF8 has had tests added and is now aligned to the way Java handles codepoint 0 (illegal in Unicode; Java encodes it as (char)0, while modified UTF-8 uses the byte pair 0xC0 0x80).  However, TDB only needs the cycle chars->bytes->chars to work, and this variance only affects the bytes->chars->bytes round trip.
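
A quick check of that chars->bytes->chars property (sketch only; the BlockUTF8 method names here are assumed, not confirmed):

{code}
import java.nio.ByteBuffer;
import org.openjena.atlas.lib.BlockUTF8;

// Sketch: the chars -> bytes -> chars cycle should be stable even for an
// unpaired surrogate. Assumes BlockUTF8.fromChars and BlockUTF8.toString.
String s = "Hello \uDAE0 World";
ByteBuffer bb = ByteBuffer.allocate(4 * s.length());
BlockUTF8.fromChars(s, bb);
bb.flip();
assert s.equals(BlockUTF8.toString(bb));
{code}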

[jira] [Updated] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-225:
-------------------------------

    Attachment: ReportBadUnicode1.java

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888 ] 

Andy Seaborne commented on JENA-225:
------------------------------------

This issue is not related to transactions per se.  Normally, node caching hides the fact that the DB has been corrupted by illegal UTF-8.

The transaction system just happens to highlight the problem, as it works without the high-level node caches that make the actions idempotent.

The attached file shows it can happen for a raw storage dataset.  The code resets the system storage cache to remove all node table caches.

Also, in the code snippet, print out the size of the byte buffer after 'encode' and it will show it is short.

The problem is in the encoding of chars to bytes.  The java.nio charset encoder needs "onMalformedInput(CodingErrorAction.REPLACE)" to be set, and it isn't.  With that set, the encoder replaces the bad Unicode codepoint (a high surrogate without a following low surrogate to make a surrogate pair) with a '?' character -- the standard Java charset replacement.

A more ambitious fix is to not use the Java encoders/decoders, which are sensitive to codepoint legality, and drop down to custom code that applies only the UTF-8 encoding rules, without checking for legal codepoints.  This would make TDB robust, though something else may break when the data leaves the JVM and is read in elsewhere, because the data is not legal Unicode.

Classes InStreamUTF8 and OutStreamUTF8 in ARQ show the encoding algorithm.  They are slightly slower (a few percent) than the standard Java encoders (which are probably native code) when used in RIOT on large files needing multiple seconds of decoding time. It will only show in TDB on very large literals (100k+?).  Normally, lexical forms are less than a few hundred bytes and the difference is not measurable (the custom codec may even be faster due to lower startup costs).  It is well below the rest of the database processing costs.

[jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction

Posted by "Andy Seaborne (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235815#comment-13235815 ] 

Andy Seaborne commented on JENA-225:
------------------------------------

(hearing nothing) v1 patch applied to ARQ.

This does not completely fix the situation, because strings with illegal Unicode in them are not round-trip-safe: Java encodes the lone surrogate as a "?", not the surrogate.