You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@metamodel.apache.org by kaspersorensen <gi...@git.apache.org> on 2018/11/23 16:40:18 UTC
[GitHub] metamodel pull request #194: METAMODEL-1205: JDK >8 support in core and exce...
GitHub user kaspersorensen opened a pull request:
https://github.com/apache/metamodel/pull/194
METAMODEL-1205: JDK >8 support in core and excel (with POI v 4.0.0)
This is a step along the way to get to resolution of METAMODEL-1205 (supporting builds on JDKs > 8).
The main change in this PR is that I upgraded Apache POI to version 4.0.0.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kaspersorensen/metamodel METAMODEL-1205-core-and-excel
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metamodel/pull/194.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #194
----
commit e93f1e9070d9ad23f50143a83d5ee6083a3ab578
Author: Kasper Sørensen <i....@...>
Date: 2018-11-23T15:54:25Z
METAMODEL-1205: Fixed assertion that depends on JDK's exception msg
commit e3b3b9ca240e462fa80c4ccf28175ce8e81f7c11
Author: Kasper Sørensen <i....@...>
Date: 2018-11-23T16:08:39Z
METAMODEL-1205: Upgraded Apache POI to have JDK >8 support
commit faa0f88caa30563d1d46ddecd01e8819f3daffb6
Author: Kasper Sørensen <i....@...>
Date: 2018-11-23T16:38:18Z
METAMODEL-1205: Fixed changes after POI upgraded to version 4.0.0
I ran into an issue that reminded me of what we saw some time back [1]
with writing to a file-based workbook failing with message "user-mapped
section open". I found a reasonable workaround where workbooks are being
loaded based on their InputStream rather than their File reference
whenever we access the workbook in an update callback.
[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=58480
----
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the issue:
https://github.com/apache/metamodel/pull/194
No I think this is perfect. Thanks again for doing extra due diligence on the PR. I think we can merge ...
Did you by any chance look at the two other JDK9+ related PRs?
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the issue:
https://github.com/apache/metamodel/pull/194
I like it! Thanks a bunch for testing it.
---
[GitHub] metamodel pull request #194: METAMODEL-1205: JDK >8 support in core and exce...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/metamodel/pull/194
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
Okay, I'll give it a spin.
But first, I'll hunt around for a large Excel file (or make one) :)
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
Okay, made an even stupider test, this time doing multiple inserts, from multiple threads (I think we had problems with that on Excel previously) :)
```java
import java.io.File;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Table;
public class ExcelTest {
private final ExcelDataContext dataContext;
private ExcelTest(ExcelDataContext dataContext) {
this.dataContext = dataContext;
}
private void go() throws InterruptedException {
long startTime = System.currentTimeMillis();
ExecutorService executor = Executors.newFixedThreadPool(8);
List<Callable<Long>> workers = Collections.nCopies(16, this::addStuff);
try {
final long combinedTime = executor.invokeAll(workers).stream().mapToLong(future -> {
try {
return future.get();
} catch (InterruptedException | ExecutionException e) {
throw new IllegalStateException(e);
}
}).sum();
long endTime = System.currentTimeMillis();
System.out.println(
"Combined execution time: " + combinedTime + "ms, real time: " + (endTime - startTime) + "ms");
} finally {
executor.shutdown();
}
}
private long addStuff() {
long startTime = System.currentTimeMillis();
Schema schema = dataContext.getDefaultSchema();
Table table = schema.getTable(0);
Column column1 = table.getColumn(0);
Column column2 = table.getColumn(1);
Column column3 = table.getColumn(2);
char[] chars = "abcdefghijklmnopqrstuvwxyz".toCharArray();
dataContext.executeUpdate(updateCallback -> {
for (int i = 0; i < 100; i++) {
updateCallback.insertInto(table).value(column1, i).value(column2, i * 10)
.value(column3, chars[i % chars.length]).execute();
}
});
long endTime = System.currentTimeMillis();
final long time = endTime - startTime;
System.out.println("Thread '" + Thread.currentThread().getName() + "' execution time: " + time + "ms");
return time;
}
public static void main(String[] args) throws InterruptedException {
ExcelTest excelTest = new ExcelTest(new ExcelDataContext(new File("C:\\Users\\Dennis\\Documents\\empty.xlsx")));
excelTest.go();
}
}
```
The test runs fine, at least I saw no issues over 10-15 runs. It's obvious that workers are blocked while others are writing, but that's fair, as long as nothing goes wrong.
XLSX is really slow when inserting:
```
Thread 'pool-1-thread-7' execution time: 2203ms
Thread 'pool-1-thread-2' execution time: 3333ms
Thread 'pool-1-thread-6' execution time: 4424ms
Thread 'pool-1-thread-3' execution time: 5528ms
Thread 'pool-1-thread-8' execution time: 6629ms
Thread 'pool-1-thread-5' execution time: 7697ms
Thread 'pool-1-thread-1' execution time: 8776ms
Thread 'pool-1-thread-4' execution time: 9880ms
Thread 'pool-1-thread-1' execution time: 2246ms
Thread 'pool-1-thread-5' execution time: 4444ms
Thread 'pool-1-thread-8' execution time: 6621ms
Thread 'pool-1-thread-3' execution time: 8854ms
Thread 'pool-1-thread-6' execution time: 11074ms
Thread 'pool-1-thread-2' execution time: 13257ms
Thread 'pool-1-thread-7' execution time: 15490ms
Thread 'pool-1-thread-4' execution time: 8893ms
Combined execution time: 119349ms, real time: 18827ms
```
(The combined numbers gets quite inflated by the waiting)
XLS is quite a bit faster:
```
Thread 'pool-1-thread-7' execution time: 244ms
Thread 'pool-1-thread-2' execution time: 260ms
Thread 'pool-1-thread-5' execution time: 274ms
Thread 'pool-1-thread-8' execution time: 287ms
Thread 'pool-1-thread-3' execution time: 303ms
Thread 'pool-1-thread-1' execution time: 319ms
Thread 'pool-1-thread-1' execution time: 16ms
Thread 'pool-1-thread-4' execution time: 352ms
Thread 'pool-1-thread-6' execution time: 369ms
Thread 'pool-1-thread-4' execution time: 35ms
Thread 'pool-1-thread-1' execution time: 68ms
Thread 'pool-1-thread-3' execution time: 118ms
Thread 'pool-1-thread-8' execution time: 154ms
Thread 'pool-1-thread-5' execution time: 188ms
Thread 'pool-1-thread-2' execution time: 224ms
Thread 'pool-1-thread-7' execution time: 264ms
Combined execution time: 3475ms, real time: 566ms
```
XLSX on the master branch was maybe a little bit faster, but I didn't really run it often enough to make sure:
```
Thread 'pool-1-thread-3' execution time: 1779ms
Thread 'pool-1-thread-1' execution time: 2903ms
Thread 'pool-1-thread-4' execution time: 4040ms
Thread 'pool-1-thread-2' execution time: 5155ms
Thread 'pool-1-thread-8' execution time: 6249ms
Thread 'pool-1-thread-5' execution time: 7354ms
Thread 'pool-1-thread-6' execution time: 8440ms
Thread 'pool-1-thread-7' execution time: 9552ms
Thread 'pool-1-thread-6' execution time: 2214ms
Thread 'pool-1-thread-5' execution time: 4377ms
Thread 'pool-1-thread-8' execution time: 6600ms
Thread 'pool-1-thread-2' execution time: 8795ms
Thread 'pool-1-thread-4' execution time: 11014ms
Thread 'pool-1-thread-1' execution time: 13242ms
Thread 'pool-1-thread-3' execution time: 15466ms
Thread 'pool-1-thread-7' execution time: 8781ms
Combined execution time: 115961ms, real time: 18391ms
```
XLS was very slow compared to the new branch, though:
```
Thread 'pool-1-thread-3' execution time: 1061ms
Thread 'pool-1-thread-4' execution time: 1896ms
Thread 'pool-1-thread-7' execution time: 2743ms
Thread 'pool-1-thread-1' execution time: 3582ms
Thread 'pool-1-thread-5' execution time: 4438ms
Thread 'pool-1-thread-8' execution time: 5276ms
Thread 'pool-1-thread-6' execution time: 6170ms
Thread 'pool-1-thread-2' execution time: 7016ms
Thread 'pool-1-thread-6' execution time: 1682ms
Thread 'pool-1-thread-8' execution time: 3432ms
Thread 'pool-1-thread-5' execution time: 5119ms
Thread 'pool-1-thread-1' execution time: 6836ms
Thread 'pool-1-thread-7' execution time: 8505ms
Thread 'pool-1-thread-4' execution time: 10180ms
Thread 'pool-1-thread-3' execution time: 11874ms
Thread 'pool-1-thread-2' execution time: 6791ms
Combined execution time: 86601ms, real time: 13865ms
```
To make sure, I checked the Excel documents, and everything seems correct (here the blocking is also obvious, as values are progressing from low to high values, with no intertwined values from other threads).
I think the conclusion must be that in general it works as it should, and with quite a good boost on XLS speed, _maybe_ at the cost of a little bit of writing speed on the XLXS side.
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
D'oh, of course they're blocking, I'm doing the loop inside an update script! Let me know if you want me to change it. :-)
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
Yeah, I agree, everything looks good.
Nope, not really. I'll take a look in a moment.
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the issue:
https://github.com/apache/metamodel/pull/194
I think our tests should cover most of the stuff, but the one thing that I will admit was somewhat complicated was what I noted in the third commit:
> I ran into an issue that reminded me of what we saw some time back [1]
> with writing to a file-based workbook failing with message "user-mapped
> section open". I found a reasonable workaround where workbooks are being
> loaded based on their InputStream rather than their File reference
> whenever we access the workbook in an update callback.
>
> [1] https://bz.apache.org/bugzilla/show_bug.cgi?id=58480
Or in terms of testing ... It would be interesting to compare performance and ability to do multiple inserts into the same sheet. I have a small fear that this new approach is slower because it is instantiating the Workbook based on `InputStream` instead of `File`. But OTOH I don't know of an alternative, because the `File` based thing is creating file system locks that prevent those inserts from happening without having to do very silly sleeps or something like that.
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
LGTM, but I haven't tested yet. Guess such a big jump should have some kind of manual test. Anything I should look specifically at trying out?
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
Made a (bit stupid) test:
```Java
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Table;
import java.io.File;
public class Main {
public static void main(String[] args) {
long startTime = System.currentTimeMillis();
ExcelDataContext excelDataContext = new ExcelDataContext(new File("C:\\Users\\Dennis\\Documents\\testlarge.xlsx"));
Schema schema = excelDataContext.getDefaultSchema();
Table table = schema.getTable(0);
Column column = table.getColumn(0);
Column column2 = table.getColumn(1);
DataSet ds = excelDataContext.query().from(table).select(column).where(column2).gt(2).execute();
long rows = 0;
while (ds.next()) {
rows++;
if("NOWAY".equals(ds.getRow().getValue(0))) {
throw new RuntimeException("NO WAY!");
}
}
long endTime = System.currentTimeMillis();
System.out.println("Total execution time: " + (endTime-startTime) + "ms, for " + rows + " rows");
}
}
```
(Applied the `where` to give MM a little bit of calculating work, and did the NOWAY test just to ensure that Java doesn't optimize the loop out. It's been too long for me to remember a better way)
The source sheet is 2 columns of numbers, one of text, and 610560 rows of it on XLSX, 65536 on XLS (maximum). I newer really did anything with the third column, but I guess we're mostly testing data loading anyway.
It looks like improvements for XLS, and a toss-up for XLSX to me (scale is milliseconds). I could average to get a winner there, but I'm pretty sure there's not enough numbers anyway. Though it looks like the old one is maybe a bit more peaky (also noticed that when I, errrrr, tested my test):
![image](https://user-images.githubusercontent.com/1282832/48960652-9df01c80-ef6e-11e8-9894-4ea505f13493.png)
---
[GitHub] metamodel issue #194: METAMODEL-1205: JDK >8 support in core and excel (with...
Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the issue:
https://github.com/apache/metamodel/pull/194
D'oh! Forgot about the multiple inserts.
It's getting a bit late here, but I'll try those out Sunday (family visit tomorrow).
---