Posted to dev@tika.apache.org by "Andrzej Bialecki (Created) (JIRA)" <ji...@apache.org> on 2012/02/17 02:13:59 UTC

[jira] [Created] (TIKA-864) Metadata.formatDate should use ThreadLocal

Metadata.formatDate should use ThreadLocal
------------------------------------------

                 Key: TIKA-864
                 URL: https://issues.apache.org/jira/browse/TIKA-864
             Project: Tika
          Issue Type: Improvement
          Components: metadata
            Reporter: Andrzej Bialecki 


Currently this is a synchronized method that uses a single instance of DateFormat. Instead it could use a pool of ThreadLocal DateFormat instances and avoid the sync blocking.
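A minimal sketch of the ThreadLocal approach proposed above, for illustration only. The class name and the date pattern are assumptions (the issue does not state which format Metadata.formatDate emits), and this is not Tika's code. Each thread lazily gets its own DateFormat, so no lock is needed around format().

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical class, not part of Tika.
final class ThreadLocalDateFormatSketch {

    // Assumed pattern: an ISO 8601 style UTC timestamp.
    private static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss'Z'";

    // One DateFormat per thread, created on first use by that thread.
    private static final ThreadLocal<DateFormat> FORMAT =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                return format;
            });

    // No synchronization: the formatter is never shared between threads.
    static String formatDate(Date date) {
        return FORMAT.get().format(date);
    }
}
```

The trade-off is that every thread that ever calls formatDate keeps a formatter alive until the thread dies or the value is explicitly removed.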

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (TIKA-864) Metadata.formatDate causes blocking in concurrent use

Posted by "Jukka Zitting (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210313#comment-13210313 ] 

Jukka Zitting edited comment on TIKA-864 at 2/17/12 3:10 PM:
-------------------------------------------------------------

bq. perhaps the best solution here is just to add a bit of custom code that formats the requested string directly

Done in revision 1245600.
                


        

[jira] [Resolved] (TIKA-864) Metadata.formatDate causes blocking in concurrent use

Posted by "Jukka Zitting (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-864.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Jukka Zitting

bq. perhaps the best solution here is just to add a bit of custom code that formats the requested string directly

Done in revision 1245600.
                


        

[jira] [Commented] (TIKA-864) Metadata.formatDate should use ThreadLocal

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210240#comment-13210240 ] 

Jukka Zitting commented on TIKA-864:
------------------------------------

Like in TIKA-865, is this a real measurable performance bottleneck? If not, I suggest we keep the code as is.
                


        

[jira] [Commented] (TIKA-864) Metadata.formatDate should use ThreadLocal

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210238#comment-13210238 ] 

Andrzej Bialecki  commented on TIKA-864:
----------------------------------------

Good point. Maybe Tika should use Joda-Time instead of the built-in DateFormat and Calendar classes. Not only is it much faster, it also provides thread-safe classes for date parsing and formatting (http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html).
                


        

[jira] [Updated] (TIKA-864) Metadata.formatDate causes blocking in concurrent use

Posted by "Jukka Zitting (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-864:
-------------------------------

    Summary: Metadata.formatDate causes blocking in concurrent use  (was: Metadata.formatDate should use ThreadLocal)

bq. threads were very often blocked on a few sync blocks, among others on this one and the one in TIKA-865.

Reason enough for me, thanks for the background.

I updated the issue summary to identify the problem to be solved instead of a proposed solution. Using ThreadLocals is troublesome as already mentioned by Nick.

bq. Joda-Time

I'm not too excited about adding extra dependencies to tika-core. In TIKA-495 (which led to the use of a synchronized static variable) the FastDateFormat class from Commons Lang was considered as an alternative, but there, too, the overhead of an extra dependency (or of embedding just that class) was a problem.

The formatDate() contract is pretty straightforward, so perhaps the best solution here is
just to add a bit of custom code that formats the requested string directly from the given
date object without needing extra formatter classes.
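A rough sketch of what such custom formatting code could look like, for illustration only. The class name, the helper, and the output format (a UTC timestamp such as 2012-02-17T02:13:59Z) are assumptions, and this is not the code from any actual commit. The idea is that a fresh Calendar per call leaves no shared mutable state, so neither a lock nor a ThreadLocal is required.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical class, not part of Tika.
final class DirectDateFormatSketch {

    // Only read, never modified, so sharing it between threads is safe.
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    static String formatDate(Date date) {
        // A new Calendar for every call: nothing mutable is shared.
        Calendar calendar = new GregorianCalendar(UTC, Locale.ROOT);
        calendar.setTime(date);

        StringBuilder sb = new StringBuilder(20);
        appendPadded(sb, calendar.get(Calendar.YEAR), 4);
        sb.append('-');
        appendPadded(sb, calendar.get(Calendar.MONTH) + 1, 2); // MONTH is zero-based
        sb.append('-');
        appendPadded(sb, calendar.get(Calendar.DAY_OF_MONTH), 2);
        sb.append('T');
        appendPadded(sb, calendar.get(Calendar.HOUR_OF_DAY), 2);
        sb.append(':');
        appendPadded(sb, calendar.get(Calendar.MINUTE), 2);
        sb.append(':');
        appendPadded(sb, calendar.get(Calendar.SECOND), 2);
        sb.append('Z');
        return sb.toString();
    }

    // Appends value left-padded with zeros to at least the given width.
    private static void appendPadded(StringBuilder sb, int value, int width) {
        String digits = Integer.toString(value);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        sb.append(digits);
    }
}
```

The cost is one short-lived Calendar allocation per call, in exchange for no extra dependency and no per-thread state to clean up.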
                


        

[jira] [Commented] (TIKA-864) Metadata.formatDate causes blocking in concurrent use

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210694#comment-13210694 ] 

Andrzej Bialecki  commented on TIKA-864:
----------------------------------------

Thanks Jukka, this solved the issue nicely - however, I just noticed that the javadoc for that method is now incorrect, because it claims the method is synchronized and points to TIKA-495. This comment should be removed.
                


        

[jira] [Commented] (TIKA-864) Metadata.formatDate should use ThreadLocal

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210207#comment-13210207 ] 

Nick Burch commented on TIKA-864:
---------------------------------

If we did store them on a ThreadLocal, then how would we allow them to be cleaned up?

For example, Tomcat will give you warnings if you leave ThreadLocals behind, so we'd need to provide a way to clean them up that someone using Tika inside a webapp could use.
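To make the cleanup concern concrete, here is a hedged sketch of what such a hook might look like. The class name, the release() method, and the date pattern are all hypothetical; nothing like this is claimed to exist in Tika. The standard ThreadLocal.remove() call drops only the calling thread's value, so each pooled thread has to run the cleanup itself.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical class, not part of Tika.
final class CleanableDateFormatSketch {

    // Assumed pattern, chosen only for the example.
    private static final ThreadLocal<DateFormat> FORMAT =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat format =
                        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                return format;
            });

    static String formatDate(Date date) {
        return FORMAT.get().format(date);
    }

    // Hypothetical cleanup hook: discards the formatter held by the
    // current thread. A later formatDate call on this thread simply
    // creates a new one.
    static void release() {
        FORMAT.remove();
    }
}
```

A webapp would have to call release() on every container thread that touched the formatter, for instance in a finally block at the end of each request, which is the kind of burden on callers the question above points at.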
                


        

[jira] [Commented] (TIKA-864) Metadata.formatDate should use ThreadLocal

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210247#comment-13210247 ] 

Andrzej Bialecki  commented on TIKA-864:
----------------------------------------

I noticed this issue when profiling a larger application that uses a configurable pool of threads (hundreds) to process the Enron data-set (the version in plain text RFC822 format, available at http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz). I didn't quantify the impact of this particular method call on the whole process, but I saw that threads were very often blocked on a few sync blocks, among others on this one and the one in TIKA-865.
                
