You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@metamodel.apache.org by LosD <gi...@git.apache.org> on 2015/09/09 11:33:56 UTC

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

GitHub user LosD opened a pull request:

    https://github.com/apache/metamodel/pull/49

    Use File instead of InputStream for Excel when possible

    The memory usage by Apache POI is quite a bit lower when using File instead of InputStream. This makes sure that FileResource's internal File object will be passed object to POI.
    
    POI has some problems with saving to an already opened file, so a temp file is used for saving.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/LosD/metamodel master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metamodel/pull/49.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #49
    
----
commit 053c34684f6f46948bc3ad3a71a5a366ac9526f8
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-03T13:41:44Z

    Use missing and exists filters for null comparisons.
    
    This will make isNull() and isNotNull() work. They failed before, as a
     term query, which does not support null values, was used. A test was
      also un-ignored; It was ignored as it used to fail, but the fix
      for METAMODEL-172 fixed the test as well.

commit d22f299956bc8ef9752b0cee920b1d5594e4b693
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-03T14:07:59Z

    Improves exception handling

commit 9221755629d258c069f45ecda98f53399df08781
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-08T12:33:10Z

    An attempt to lower memory usage of Excel
    
    This is currently not properly functioning (creating fails)

commit 79d222009570a725ede6b5293b5096d3839188f3
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-09T08:46:51Z

    Use tempfile when writing resource to keep POI happy
    
    This seems to actually make my approach work properly.

commit 250fd691032bcd2d1505f312682a3b99cf3570f7
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-09T09:02:15Z

    Merge remote-tracking branch 'origin/master'

commit ca65aab89657c22e2d15b09c94dbbadaa199e640
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-09T09:07:47Z

    Removes memory test

commit 517e22c0d5e25e5fcf703595824297d0cf364de3
Author: Dennis Du Krøger <de...@humaninference.com>
Date:   2015-09-09T09:10:14Z

    Merge branch 'feature/excel-memory-experiment'

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-139346283
  
    Damn. And we only see this happen on Travis, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-138855818
  
    Memory usage on a huge file, 50 MB/ ~1000000 million records (Excel max).
    
    InputStream:
    ![memory use inputstream-based](https://cloud.githubusercontent.com/assets/1282832/9758412/b8994af8-56e6-11e5-86f9-44144c17577e.png)
    
    File:
    ![memory use file-based](https://cloud.githubusercontent.com/assets/1282832/9758418/c1a02766-56e6-11e5-9f80-8c658d251fb1.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-139151999
  
    Okay, I'm officially giving up on this approach. I'm not sure, but the main issue seems to be the mixing of write-only SXSSFWorkbook and read-write XSSFWorkbook.
    
    I suggest we unmix them, maybe with some kind of flag for output-only mode, if that is even feasible. If not, I think we should just drop it, or see if XSSF alone could work for big Excel sheets, while still giving us memory benefits compared to the InputStream (though I doubt it).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-138859494
  
    Added :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-145392109
  
    Please see #56 ... Tried many things, and yes it was reproducable on my Mac :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/49#discussion_r41101020
  
    --- Diff: excel/src/main/java/org/apache/metamodel/excel/ExcelDataContext.java ---
    @@ -209,13 +205,13 @@ protected void onSchemaCacheRefreshed() {
             return null;
         }
     
    -    private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate(Ref<InputStream> inputStream)
    +    private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate()
                 throws MetaModelException {
             if (_spreadsheetReaderDelegate == null) {
                 synchronized (this) {
                     if (_spreadsheetReaderDelegate == null) {
                         try {
    -                        if (POIXMLDocument.hasOOXMLHeader(inputStream.get())) {
    +                        if (POIXMLDocument.hasOOXMLHeader(getInputStream())) {
    --- End diff --
    
    The InputStream here is never closed. In fact I think this code should be rewritten like this:
    
    ```
        private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate() throws MetaModelException {
            if (_spreadsheetReaderDelegate == null) {
                synchronized (this) {
                    if (_spreadsheetReaderDelegate == null) {
                        _spreadsheetReaderDelegate = _resource.read(new Func<InputStream, SpreadsheetReaderDelegate>() {
                            @Override
                            public SpreadsheetReaderDelegate eval(InputStream inputStream) {
                                try {
                                    if (POIXMLDocument.hasOOXMLHeader(inputStream)) {
                                        return new XlsxSpreadsheetReaderDelegate(_configuration);
                                    } else {
                                        return new DefaultSpreadsheetReaderDelegate(_configuration);
                                    }
                                } catch (IOException e) {
                                    logger.error("Could not identify spreadsheet type, using default", e);
                                    return new DefaultSpreadsheetReaderDelegate(_configuration);
                                }
                            }
                        });
                    }
                }
            }
            return _spreadsheetReaderDelegate;
        }
    ```
    
    ... and then we should delete ```getInputStream()``` and ```getInputStreamRef()```.
    
    ... but I tried that, and it may be better, but it didn't make my Java not crash :-P


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-139346801
  
    Yeah, but I'm pretty sure it is somehow timing related, so I'll bet that users will see it too. There's nothing that should be Travis specific.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/49#discussion_r41100980
  
    --- Diff: excel/src/main/java/org/apache/metamodel/excel/ExcelUpdateCallback.java ---
    @@ -78,7 +87,24 @@ public ExcelDataContext getDataContext() {
     
         protected void close() {
             if (_workbook != null) {
    -            ExcelUtils.writeWorkbook(_dataContext, _workbook);
    +
    +            final Resource resource;
    +            if(_dataContext.getResource() instanceof FileResource){
    --- End diff --
    
    Trying to find what the issue might be ... And this if-sentence is making me wonder: Why is it here, and if it is supposed to write non-file resources to a temp-file, isn't it all wrong then? Shouldn't it be "if resources is NOT a FileResource"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-138888902
  
    Seems Travis is exploding. JVM segfaults in both 7 and 8.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/49#issuecomment-138859162
  
    +1
    
    Code looks good to me.
    Nice findings wrt. memory improvement and "good enough" documentation for me to prove it.
    
    Minor: I am missing a JIRA issue reference and an update to CHANGES.md


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metamodel/pull/49


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/49#discussion_r41101294
  
    --- Diff: excel/src/main/java/org/apache/metamodel/excel/ExcelUpdateCallback.java ---
    @@ -78,7 +87,24 @@ public ExcelDataContext getDataContext() {
     
         protected void close() {
             if (_workbook != null) {
    -            ExcelUtils.writeWorkbook(_dataContext, _workbook);
    +
    +            final Resource resource;
    +            if(_dataContext.getResource() instanceof FileResource){
    --- End diff --
    
    I think it was an attempt to avoid file locking issues I saw with File-based Excel writing on close. In the end of my attempts there was quite a bit of throwing stuff at the wall to see what stuck (I couldn't test locally, as only Travis would fail). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/49#discussion_r41101077
  
    --- Diff: excel/src/main/java/org/apache/metamodel/excel/ExcelDataContext.java ---
    @@ -209,13 +205,13 @@ protected void onSchemaCacheRefreshed() {
             return null;
         }
     
    -    private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate(Ref<InputStream> inputStream)
    +    private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate()
                 throws MetaModelException {
             if (_spreadsheetReaderDelegate == null) {
                 synchronized (this) {
                     if (_spreadsheetReaderDelegate == null) {
                         try {
    -                        if (POIXMLDocument.hasOOXMLHeader(inputStream.get())) {
    +                        if (POIXMLDocument.hasOOXMLHeader(getInputStream())) {
    --- End diff --
    
    Wait, does that mean that your Java is doing like Travis? We're totally doing this on your computer. If it's your old laptop, bring it! If it's the Mac... I'd consider it a hardware issue :-P 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Use File instead of InputStream for Excel ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/49#discussion_r41101201
  
    --- Diff: excel/src/main/java/org/apache/metamodel/excel/ExcelDataContext.java ---
    @@ -209,13 +205,13 @@ protected void onSchemaCacheRefreshed() {
             return null;
         }
     
    -    private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate(Ref<InputStream> inputStream)
    +    private SpreadsheetReaderDelegate getSpreadsheetReaderDelegate()
                 throws MetaModelException {
             if (_spreadsheetReaderDelegate == null) {
                 synchronized (this) {
                     if (_spreadsheetReaderDelegate == null) {
                         try {
    -                        if (POIXMLDocument.hasOOXMLHeader(inputStream.get())) {
    +                        if (POIXMLDocument.hasOOXMLHeader(getInputStream())) {
    --- End diff --
    
    Rewrite looks good, except I'm not sure I like the logger.error() when still falling back. Either cancel with an error, or try to recover with a warning. 
    
    (this may be the same as before, but Github's diff interface is terrible on the mobile, so I can't see it) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---