Posted to dev@jena.apache.org by "Andy Seaborne (Jira)" <ji...@apache.org> on 2022/03/03 13:06:00 UTC

[jira] [Comment Edited] (JENA-2225) TDB/TDB2 dataset size stat serialized incorrectly for large datasets

    [ https://issues.apache.org/jira/browse/JENA-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500711#comment-17500711 ] 

Andy Seaborne edited comment on JENA-2225 at 3/3/22, 1:05 PM:
--------------------------------------------------------------

Not sure if opening a new issue would be better, but I guess we're not done here. We did not notice this earlier because I was not aware that TDB2 expects the stats file in TDB2_LOCATION/DataXXX.

Now that the stats are actually being loaded, the change to long values causes further parse errors during the reordering setup/application, because some code still assumes integer values (see Item.asInteger, called from StatsMatcher.init, in the trace below):

{noformat}
java.lang.NumberFormatException: For input string: "16666525095"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Integer.parseInt(Integer.java:652)
	at java.base/java.lang.Integer.parseInt(Integer.java:770)
	at org.apache.jena.sparql.sse.Item.asInteger(Item.java:275)
	at org.apache.jena.sparql.engine.optimizer.StatsMatcher.init(StatsMatcher.java:123)
	at org.apache.jena.sparql.engine.optimizer.StatsMatcher.<init>(StatsMatcher.java:97)
	at org.apache.jena.sparql.engine.optimizer.reorder.ReorderLib.weighted(ReorderLib.java:84)
	at org.apache.jena.tdb2.store.TDB2StorageBuilder.chooseReorderTransformation(TDB2StorageBuilder.java:352)
	at org.apache.jena.tdb2.store.TDB2StorageBuilder.build(TDB2StorageBuilder.java:112)
	at org.apache.jena.tdb2.sys.StoreConnection.make(StoreConnection.java:91)
	at org.apache.jena.tdb2.sys.StoreConnection.connectCreate(StoreConnection.java:59)
	at org.apache.jena.tdb2.sys.DatabaseOps.createSwitchable(DatabaseOps.java:100)
	at org.apache.jena.tdb2.sys.DatabaseOps.create(DatabaseOps.java:81)
	at org.apache.jena.tdb2.sys.DatabaseConnection.build(DatabaseConnection.java:101)
	at org.apache.jena.tdb2.sys.DatabaseConnection.lambda$make$0(DatabaseConnection.java:72)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1705)
	at org.apache.jena.tdb2.sys.DatabaseConnection.make(DatabaseConnection.java:72)
	at org.apache.jena.tdb2.sys.DatabaseConnection.connectCreate(DatabaseConnection.java:61)
	at org.apache.jena.tdb2.sys.DatabaseConnection.connectCreate(DatabaseConnection.java:52)
	at org.apache.jena.tdb2.DatabaseMgr.DB_ConnectCreate(DatabaseMgr.java:41)
	at org.apache.jena.tdb2.DatabaseMgr.connectDatasetGraph(DatabaseMgr.java:46)
	at org.apache.jena.tdb2.TDB2Factory.connectDataset(TDB2Factory.java:40)
	at tdb2.cmdline.ModTDBDataset.createDataset(ModTDBDataset.java:105)
	at arq.cmdline.ModDataset.getDataset(ModDataset.java:35)
	at arq.query.getDataset(query.java:179)
	at arq.query.queryExec(query.java:226)
	at arq.query.exec(query.java:157)
	at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
	at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
	at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
	at tdb2.tdbquery.main(tdbquery.java:30)
{noformat}
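
To make the failure in the trace concrete, here is a minimal standalone sketch. It does not use any Jena classes; the class and method names (CountParsing, parseCount, fitsInInt) are made up for illustration. It only shows that the count from the trace is too large for Integer.parseInt, while Long.parseLong accepts it:

```java
// Standalone illustration only; CountParsing is a hypothetical helper, not Jena code.
final class CountParsing {

    /** Parses a count as a long, so values above Integer.MAX_VALUE are accepted. */
    static long parseCount(String lexical) {
        return Long.parseLong(lexical);
    }

    /** Returns true if Integer.parseInt accepts the string, false if it throws. */
    static boolean fitsInInt(String lexical) {
        try {
            Integer.parseInt(lexical);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String count = "16666525095";            // value taken from the trace
        System.out.println(fitsInInt(count));    // prints false
        System.out.println(parseCount(count));   // prints 16666525095
    }
}
```

Under that assumption, any reader of the stats file that goes through an int-based parse will fail on such a count once it is written as a long.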


was (Author: lorenzb):
Not sure if opening a new issue would be better, but I guess we're not done here. We did not notice this earlier because I was not aware that TDB2 expects the stats file in TDB2_LOCATION/DataXXX.

Now that the stats are actually being loaded, the change to long values causes further parse errors during the reordering setup/application, because some code still assumes integer values (see Item.asInteger, called from StatsMatcher.init, in the trace below):

{{
java.lang.NumberFormatException: For input string: "16666525095"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Integer.parseInt(Integer.java:652)
	at java.base/java.lang.Integer.parseInt(Integer.java:770)
	at org.apache.jena.sparql.sse.Item.asInteger(Item.java:275)
	at org.apache.jena.sparql.engine.optimizer.StatsMatcher.init(StatsMatcher.java:123)
	at org.apache.jena.sparql.engine.optimizer.StatsMatcher.<init>(StatsMatcher.java:97)
	at org.apache.jena.sparql.engine.optimizer.reorder.ReorderLib.weighted(ReorderLib.java:84)
	at org.apache.jena.tdb2.store.TDB2StorageBuilder.chooseReorderTransformation(TDB2StorageBuilder.java:352)
	at org.apache.jena.tdb2.store.TDB2StorageBuilder.build(TDB2StorageBuilder.java:112)
	at org.apache.jena.tdb2.sys.StoreConnection.make(StoreConnection.java:91)
	at org.apache.jena.tdb2.sys.StoreConnection.connectCreate(StoreConnection.java:59)
	at org.apache.jena.tdb2.sys.DatabaseOps.createSwitchable(DatabaseOps.java:100)
	at org.apache.jena.tdb2.sys.DatabaseOps.create(DatabaseOps.java:81)
	at org.apache.jena.tdb2.sys.DatabaseConnection.build(DatabaseConnection.java:101)
	at org.apache.jena.tdb2.sys.DatabaseConnection.lambda$make$0(DatabaseConnection.java:72)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1705)
	at org.apache.jena.tdb2.sys.DatabaseConnection.make(DatabaseConnection.java:72)
	at org.apache.jena.tdb2.sys.DatabaseConnection.connectCreate(DatabaseConnection.java:61)
	at org.apache.jena.tdb2.sys.DatabaseConnection.connectCreate(DatabaseConnection.java:52)
	at org.apache.jena.tdb2.DatabaseMgr.DB_ConnectCreate(DatabaseMgr.java:41)
	at org.apache.jena.tdb2.DatabaseMgr.connectDatasetGraph(DatabaseMgr.java:46)
	at org.apache.jena.tdb2.TDB2Factory.connectDataset(TDB2Factory.java:40)
	at tdb2.cmdline.ModTDBDataset.createDataset(ModTDBDataset.java:105)
	at arq.cmdline.ModDataset.getDataset(ModDataset.java:35)
	at arq.query.getDataset(query.java:179)
	at arq.query.queryExec(query.java:226)
	at arq.query.exec(query.java:157)
	at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
	at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
	at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
	at tdb2.tdbquery.main(tdbquery.java:30)
}}

> TDB/TDB2 dataset size stat serialized incorrectly for large datasets
> --------------------------------------------------------------------
>
>                 Key: JENA-2225
>                 URL: https://issues.apache.org/jira/browse/JENA-2225
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: TDB, TDB2
>    Affects Versions: Jena 4.3.1
>            Reporter: Lorenz Bühmann
>            Assignee: Andy Seaborne
>            Priority: Minor
>             Fix For: Jena 4.4.0
>
>
> When computing the TDB/TDB2 stats via the CLI, the dataset size is serialized incorrectly for large datasets.
> For example, for the latest Wikidata Truthy we get
> {noformat}
> (count -1983667112)){noformat}
> This happens because, in both TDB and TDB2, the corresponding `Stats.java` class forces an integer-typed Node even though the value is a long:
> {code:java}
> if ( count >= 0 )
>     addPair(meta.getList(), StatsMatcher.COUNT, NodeFactoryExtra.intToNode((int)count)) ; {code}
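
The effect of the (int) cast quoted above can be reproduced without Jena. The following is a small sketch with made-up names (CountNarrowing, narrow, fitsInInt); the input is the count that appears in the stack trace of the comment, not necessarily the count behind the negative value quoted in the issue description:

```java
// Standalone illustration only; CountNarrowing is a hypothetical helper, not Jena code.
final class CountNarrowing {

    /** Narrowing cast as in the quoted snippet: keeps only the low 32 bits. */
    static int narrow(long count) {
        return (int) count;
    }

    /** Returns true if the long value can be represented as an int without loss. */
    static boolean fitsInInt(long count) {
        return count >= Integer.MIN_VALUE && count <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long count = 16666525095L;
        System.out.println(fitsInInt(count));  // prints false
        System.out.println(narrow(count));     // prints -513344089

        // Math.toIntExact refuses the lossy conversion instead of wrapping silently.
        try {
            Math.toIntExact(count);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

The check count >= 0 in the quoted code runs on the long value before the cast, so it passes, and the wrapped negative int is what ends up in the serialized stats.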


