Posted to dev@jackrabbit.apache.org by "Julian Sedding (JIRA)" <ji...@apache.org> on 2010/03/18 09:52:27 UTC

[jira] Created: (JCR-2576) DbInputStream does not support mark()/reset() when exhausted.

DbInputStream does not support mark()/reset() when exhausted.
-------------------------------------------------------------

                 Key: JCR-2576
                 URL: https://issues.apache.org/jira/browse/JCR-2576
             Project: Jackrabbit Content Repository
          Issue Type: Bug
          Components: jackrabbit-core
    Affects Versions: 2.0.0
            Reporter: Julian Sedding


The DbDataStore implementation uses a DbInputStream to read binary properties from the database. When a new binary property is created, Jackrabbit attempts to index it. Tika's CharsetDetector is used in the process, which marks the input stream, reads the first 8000 bytes and then resets the stream.

This results in the stacktrace shown at the end of the issue, if the following two conditions hold true:
* the property is larger than the minRecordLength configuration of the Datastore and
* the property is smaller than 8000 bytes
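The failure pattern can be reproduced in isolation. The following is a minimal sketch, not Jackrabbit or Tika code: AutoClosingStream is a hypothetical stand-in for a stream that releases its source at EOF, and probeFails imitates the mark / read up to 8000 bytes / reset sequence described above. The minRecordLength condition is not modelled; the sketch assumes it only decides whether the data store stream is involved at all.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ExhaustedResetDemo {

    /** Hypothetical stand-in for a stream that closes itself at EOF. */
    static class AutoClosingStream extends InputStream {
        private InputStream source;

        AutoClosingStream(byte[] data) {
            source = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            if (source == null) {
                return -1;                 // already exhausted and released
            }
            int b = source.read();
            if (b == -1) {
                source = null;             // auto-close when EOF is reached
            }
            return b;
        }

        @Override
        public boolean markSupported() {
            return true;
        }

        @Override
        public synchronized void mark(int readLimit) {
            if (source != null) {
                source.mark(readLimit);
            }
        }

        @Override
        public synchronized void reset() throws IOException {
            if (source == null) {
                throw new EOFException();  // nothing left to rewind
            }
            source.reset();
        }
    }

    /** Marks, reads up to 8000 bytes, resets; returns true if reset() fails. */
    static boolean probeFails(int size) throws IOException {
        InputStream in = new AutoClosingStream(new byte[size]);
        byte[] buf = new byte[8000];
        in.mark(buf.length);
        int total = 0;
        int n;
        while (total < buf.length
                && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        try {
            in.reset();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }
}
```

Under these assumptions probeFails(5000) returns true, because the read loop hits EOF before reset() is called. probeFails(10000) returns false, since the loop stops at 8000 bytes while the source is still open.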

The DbInputStream needs to have the following properties:
1. lazy instantiation of the underlying stream
2. auto-close underlying stream when EOF is reached
3. fully support mark()/reset() even if the underlying stream is auto-closed due to 2.
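One way to combine the three properties is sketched below. This is an illustration under stated assumptions, not the attached patch and not the actual DbInputStream: the class name LazyReplayStream and the Opener callback are hypothetical, and the stream keeps its own copy of the bytes read since mark() so that reset() no longer depends on the underlying stream.

```java
import java.io.IOException;
import java.io.InputStream;

public class LazyReplayStream extends InputStream {

    /** Hypothetical callback that opens the real stream on first use. */
    public interface Opener {
        InputStream open() throws IOException;
    }

    private final Opener opener;
    private InputStream in;        // opened lazily, null again after EOF
    private boolean exhausted;
    private byte[] recorded;       // bytes seen since mark(); null = no valid mark
    private int recordedLen;
    private int replayPos;         // next recorded byte to hand out again

    public LazyReplayStream(Opener opener) {
        this.opener = opener;
    }

    @Override
    public int read() throws IOException {
        if (recorded != null && replayPos < recordedLen) {
            return recorded[replayPos++] & 0xff;   // replay after reset()
        }
        int b = readLive();
        if (b != -1 && recorded != null) {
            if (recordedLen == recorded.length) {
                recorded = null;                   // read limit exceeded
            } else {
                recorded[recordedLen++] = (byte) b;
                replayPos = recordedLen;
            }
        }
        return b;
    }

    private int readLive() throws IOException {
        if (exhausted) {
            return -1;
        }
        if (in == null) {
            in = opener.open();                    // 1. lazy instantiation
        }
        int b = in.read();
        if (b == -1) {
            exhausted = true;
            in.close();                            // 2. auto-close at EOF
            in = null;
        }
        return b;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // Carry over bytes that were recorded but not yet replayed.
        int pending = (recorded == null) ? 0 : recordedLen - replayPos;
        byte[] fresh = new byte[Math.max(readLimit, pending)];
        if (pending > 0) {
            System.arraycopy(recorded, replayPos, fresh, 0, pending);
        }
        recorded = fresh;
        recordedLen = pending;
        replayPos = 0;
    }

    @Override
    public synchronized void reset() throws IOException {
        if (recorded == null) {
            throw new IOException("no valid mark");
        }
        replayPos = 0;                             // 3. works after EOF too
    }

    @Override
    public void close() throws IOException {
        exhausted = true;
        recorded = null;
        if (in != null) {
            in.close();
            in = null;
        }
    }
}
```

The sketch allocates the full read limit in mark(), which is acceptable for an 8000-byte probe but would need a growable buffer for large limits. It also overrides only the single-byte read(), so bulk reads go through the default InputStream implementation.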


12.03.2010 15:53:28 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 165)
java.io.EOFException
        at org.apache.jackrabbit.core.data.db.DbInputStream.reset(DbInputStream.java:180)
        at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:156)
        at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:156)
        at org.apache.tika.parser.txt.CharsetDetector.setText(CharsetDetector.java:131)
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:77)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
        at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (JCR-2576) DbInputStream does not support mark()/reset() when exhausted.

Posted by "Thomas Mueller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Mueller reassigned JCR-2576:
-----------------------------------

    Assignee: Thomas Mueller



[jira] Commented: (JCR-2576) DbInputStream does not support mark()/reset() when exhausted.

Posted by "Thomas Mueller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846855#action_12846855 ] 

Thomas Mueller commented on JCR-2576:
-------------------------------------

Thanks a lot for the patch! I think the only remaining issue is that closeOriginalStream() should not set originalStream to null.

However, I would like to simplify things a bit by implementing the mark()/reset() features in a different layer (using BufferedInputStream if possible).

A similar issue exists with TempFileInputStream by the way.
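The layering idea can be sketched as follows. This is only an illustration of wrapping in java.io.BufferedInputStream; the thread does not show the change that was eventually committed. RawStream is a hypothetical stream with no mark support of its own that releases its source at EOF.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class BufferedMarkLayer {

    /** Hypothetical raw stream: no mark support, releases its source at EOF. */
    static class RawStream extends InputStream {
        private InputStream source;

        RawStream(byte[] data) {
            source = new ByteArrayInputStream(data);
        }

        boolean isClosed() {
            return source == null;
        }

        @Override
        public int read() throws IOException {
            if (source == null) {
                return -1;
            }
            int b = source.read();
            if (b == -1) {
                source.close();
                source = null;             // auto-close at EOF
            }
            return b;
        }
    }

    /** Reads until buf is full or EOF; returns the number of bytes read. */
    static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        int n;
        while (total < buf.length
                && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        return total;
    }

    /** Mark, read to EOF, reset, read again; true if both passes match. */
    static boolean roundTrip(byte[] data) throws IOException {
        RawStream raw = new RawStream(data);
        InputStream in = new BufferedInputStream(raw); // adds mark()/reset()
        in.mark(8000);
        byte[] first = new byte[8000];
        int n1 = readUpTo(in, first);
        in.reset();                        // served from the buffer only
        byte[] second = new byte[8000];
        int n2 = readUpTo(in, second);
        return raw.isClosed() && n1 == data.length && n2 == n1
                && Arrays.equals(first, second);
    }
}
```

Because BufferedInputStream.reset() only rewinds its internal buffer, it does not matter that the raw stream has already closed itself. The mark stays valid only while no more than the read limit passed to mark() has been read, which is enough for an 8000-byte probe.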



[jira] Updated: (JCR-2576) DbInputStream does not support mark()/reset() when exhausted.

Posted by "Julian Sedding (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julian Sedding updated JCR-2576:
--------------------------------

    Attachment: DbInputStream.patch

I have started working on a patch, which is not fully functional yet. Unfortunately I currently don't have time to finish it off. It should illustrate a possible approach to solve the problem though.



[jira] Resolved: (JCR-2576) DbInputStream does not support mark()/reset() when exhausted.

Posted by "Thomas Mueller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Mueller resolved JCR-2576.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.1



[jira] Updated: (JCR-2576) DbInputStream does not support mark()/reset() when exhausted.

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-2576:
-------------------------------

    Fix Version/s: 2.1.0
                       (was: 2.0.1)
