You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Keith R. Bennett (JIRA)" <ji...@apache.org> on 2007/09/25 18:19:50 UTC
[jira] Commented: (TIKA-31) protected Parser.parse(InputStream stream, Iterable contents)

    [ https://issues.apache.org/jira/browse/TIKA-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530149 ] 

Keith R. Bennett commented on TIKA-31:
--------------------------------------

Jukka -

I see you also removed the swallowed exceptions.  Thanks!

I have some comments and questions.  Some may not be related to your patch, but the original code.

----

In the new parse method:

protected String parse(InputStream stream, Iterable<Content> contents)

This means that contents needs to be populated with the metadata keys before being passed to parse(), right?  Is that the intention?  Wouldn't we want the parser to be responsible to know what metadata should go into contents and put it there itself?

----

Currently, even with the changes, the parser is stateful.  I'm aware you said that this was a step *on the way to* making it stateless.  However, the current code has a serious bug that I think we should fix, even if it's a minimal fix (such as my second suggestion below).

Parser.java has an input stream nonstatic member named 'is'.  This is set via setInputStream().  The parser can then be called on to return various kinds of content via getStrContent(), getContents(), and getContent().  It is possible to call setInputStream() to give it a different stream to parse.  However, any of the parse methods will reuse the parsed content of the previous stream, because the boolean 'parsed' will not have been reset to false.

Possible alternatives:

1) pass the input stream in only in the parse method, remove the input stream member variable, and return a document content object that contains the results (the full string content, the contents map).

2) have setInputStream() reset parsed to false

3) pass the InputStream into the object in its constructor, and make it immutable.  This will guarantee that each parser will be used for only one document, which is safe, but not very economical in cases where constructing a parser is nontrivial.

----

In XMLParser, Intellij Idea is reporting that:

} else if (obj instanceof CDATA) {

will never be evaluated to true (since CDATA extends Text, which comes before it in the list of conditions).

This occurs in the methods:

extractContent()
concatOccurance()

(BTW, "occurance" is a misspelling; it should be "occurrence".)

I can enter a Jira issue and submit a patch if you like.

----


Minor stuff:

In MSExcelParser, "a Excel document" should be "an Excel document".
In RTFParser, "a Excel document" should be "an RTF document".

----

Thanks,
- Keith


> protected Parser.parse(InputStream stream, Iterable<Content> contents)
> ----------------------------------------------------------------------
>
>                 Key: TIKA-31
>                 URL: https://issues.apache.org/jira/browse/TIKA-31
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>         Attachments: TIKA-31.patch
>
>
> In order to push towards a stateless Parser interface, I'd like to propose implementing the current Parser.getContents() method (as it exists after TIKA-26) in terms of a stateless abstract method with the following signature:
>     protected abstract String parse(InputStream stream, Iterable<Content> contents) throws IOException, TikaException;
> This method would return the fulltext content of the given stream as the String return value and place any extra metadata into the given set of Content instances. With this information the current Parser.getContents() method could populate a fulltext (and summary) Content entry and any regexp Content entries universally for any Parser classes. Also, we could centralize state handling and exception processing to a single class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.