You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2007/09/26 15:39:50 UTC

[jira] Created: (TIKA-33) Stateless parsers

Stateless parsers
-----------------

                 Key: TIKA-33
                 URL: https://issues.apache.org/jira/browse/TIKA-33
             Project: Tika
          Issue Type: Improvement
          Components: general
            Reporter: Jukka Zitting


As a next step towards the proposed Parser model from the mailing list I'd like to make the current (as after TIKA-31) Parser class stateless by removing the Content, InputStream, and other related member variables and giving them as parameters to a parse method call.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (TIKA-33) Stateless parsers

Posted by Carsten Ziegeler <cz...@apache.org>.
Jukka Zitting wrote:
>> 2) In ParserFactory, we have:
>>
>> } catch (Exception e) {
>>     logger.error("Unable to instantiate parser: " + className, e);
>>     throw new TikaException(e.getMessage());
>> }
>>
>> When we adapt an exception to a TikaException, would it make sense to wrap
>> the entire exception, and not just its getMessage()?
> 
> +1

And I think we should not log the exception in this case. Otherwise the
exception is logged more than once.

Carsten


-- 
Carsten Ziegeler
cziegeler@apache.org

Re: [jira] Updated: (TIKA-33) Stateless parsers

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 9/26/07, kbennett <kb...@bbsinc.biz> wrote:
> 1) While you are modifying the Parser class, could we change getContents()
> to not swallow exceptions?
> [...]
> and modify the method declaration to throw a TikaException?

Sure, that makes sense.

> 2) In ParserFactory, we have:
>
> } catch (Exception e) {
>     logger.error("Unable to instantiate parser: " + className, e);
>     throw new TikaException(e.getMessage());
> }
>
> When we adapt an exception to a TikaException, would it make sense to wrap
> the entire exception, and not just its getMessage()?

+1

> 3) In Parser.getContents(), we could use Commons Lang StringUtils to make
> the code more nullsafe and a bit more concise by replacing:
>
> int length = Math.min(contentStr.length(), 500);
> String summary = contentStr.substring(0, length);
>
> --- with: ---
>
> String summary = StringUtils.left(contentStr, 500);

-1 I'm not sure if that's worth the extra dependency to commons-lang.

> It's too bad we can't have a custom object...then we could have a
> getSummary() method that would do this so we don't run the risk of the
> summary getting out of sync with respect to the fulltext content.

I don't think we have any cases where the fulltext or summary would
change after parsing.

> Same for getValue() always being getValues().get(0).

Good idea, though I'd really like to replace the whole Content object
stuff with a different metadata mechanism.

BR,

Jukka Zitting

Re: [jira] Updated: (TIKA-33) Stateless parsers

Posted by kbennett <kb...@bbsinc.biz>.
Jukka -

Thanks for these improvements.  I noticed that, unlike me, you put the
stream closes in finally blocks.  Thank you.  My bad. ;)

Some questions:

1) While you are modifying the Parser class, could we change getContents()
to not swallow exceptions?  For example, change:

} catch (Exception e) {
    logger.error("Parse error: " + e.getMessage(), e);
    return "";
}

--- to something like: ---

} catch (TikaException e) {
    logger.error("Parse error: " + e.getMessage(), e);
    throw e;
} catch (Exception e) {
    logger.error("Parse error: " + e.getMessage(), e);
    throw new TikaException(e);
}

and modify the method declaration to throw a TikaException?


2) In ParserFactory, we have:

} catch (Exception e) {
    logger.error("Unable to instantiate parser: " + className, e);
    throw new TikaException(e.getMessage());
}

When we adapt an exception to a TikaException, would it make sense to wrap
the entire exception, and not just its getMessage()? 
TikaException.getMessage() could be modified to include the cause's
getMessage() in its return value if there is one.  The advantages are not
huge, but it seems to me marginally better to do it that way.  One reason to
preserve the exception cause is that Throwable has a getLocalizedMessage()
that the wrapped exception might implement, and we lose access to it if we
only wrap its getMessage().


3) In Parser.getContents(), we could use Commons Lang StringUtils to make
the code more nullsafe and a bit more concise by replacing:

int length = Math.min(contentStr.length(), 500);
String summary = contentStr.substring(0, length);

--- with: ---

String summary = StringUtils.left(contentStr, 500);

(Javadoc for StringUtils.left() is at
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#left(java.lang.String,%20int%29).

It's too bad we can't have a custom object...then we could have a
getSummary() method that would do this so we don't run the risk of the
summary getting out of sync with respect to the fulltext content.  Same for
getValue() always being getValues().get(0).

Regards,
Keith


JIRA jira@apache.org wrote:
> 
> 
>      [
> https://issues.apache.org/jira/browse/TIKA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
> 
> Jukka Zitting updated TIKA-33:
> ------------------------------
> 
>     Attachment: TIKA-33.patch
> 
> Proposed patch.
> 
> This patch replaces the getContentStr(), getContents() and
> getContent(String) methods on Parser with the following method:
> 
>     String getContents(InputStream stream, Map<String, Content> contents)
> 
> The internal implementation is still the same as before, minus the state
> handling parts.
> 
> This change makes the current ParserFactory, ParseUtils, and test case
> code somewhat cumbersome, but I didn't want to streamline them yet too
> much to keep the patch clearly focused.
> 
> 

-- 
View this message in context: http://www.nabble.com/-jira--Created%3A-%28TIKA-33%29-Stateless-parsers-tf4522441.html#a12903197
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


[jira] Updated: (TIKA-33) Stateless parsers

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-33:
------------------------------

    Attachment: TIKA-33.patch

Proposed patch.

This patch replaces the getContentStr(), getContents() and getContent(String) methods on Parser with the following method:

    String getContents(InputStream stream, Map<String, Content> contents)

The internal implementation is still the same as before, minus the state handling parts.

This change makes the current ParserFactory, ParseUtils, and test case code somewhat cumbersome, but I didn't want to streamline them yet too much to keep the patch clearly focused.

> Stateless parsers
> -----------------
>
>                 Key: TIKA-33
>                 URL: https://issues.apache.org/jira/browse/TIKA-33
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>         Attachments: TIKA-33.patch
>
>
> As a next step towards the proposed Parser model from the mailing list I'd like to make the current (as after TIKA-31) Parser class stateless by removing the Content, InputStream, and other related member variables and giving them as parameters to a parse method call.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-33) Stateless parsers

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-33.
-------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Committed in revision 580874.

> Stateless parsers
> -----------------
>
>                 Key: TIKA-33
>                 URL: https://issues.apache.org/jira/browse/TIKA-33
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>         Attachments: TIKA-33.patch
>
>
> As a next step towards the proposed Parser model from the mailing list I'd like to make the current (as after TIKA-31) Parser class stateless by removing the Content, InputStream, and other related member variables and giving them as parameters to a parse method call.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.