Posted to dev@tika.apache.org by "Jörg Ehrlich (JIRA)" <ji...@apache.org> on 2012/10/28 18:03:12 UTC

[jira] [Commented] (TIKA-775) Embed Capabilities

    [ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485665#comment-13485665 ] 

Jörg Ehrlich commented on TIKA-775:
-----------------------------------

Hi Ray,

I think it would be great if Tika could also write Metadata back to files, and it would be great to start on this sooner rather than later.
But I have a couple of comments regarding your proposed implementation:

1) Right now the Parsers do both content and metadata extraction. The proposed embedder does only Metadata embedding, which is fine because updating of content would be out of scope for Tika.
But if we introduce separate APIs to embed just metadata, I think it would make sense to also introduce APIs to only extract metadata. Actually, at Adobe we had to stop using Tika to retrieve Metadata from specific file formats because it always parses the whole content, which is simply too heavy an operation to scale in a larger system.
So I planned to get started on a new API and adjustments to parsers to just retrieve Metadata from files, but have not had time for this yet. I guess it would make sense to synchronize these two new APIs, right?
Being able to just parse Metadata from files is actually also very important for the embedding of it, which I will explain further down.

2) Your documentation does not really specify in detail the behavior of the metadata update that should happen.
Does it always update all metadata in the file, i.e. does it delete properties that are not in the Metadata object? Or does it only update those properties that are provided in the Metadata object? How do I delete properties then? Do I make the property empty? But in most metadata containers an empty value is a valid property value and should not delete the property.
Where does the embedding take place? A lot of file formats have several metadata containers with similar properties. Does the embed method update all of them? Or just the ones the parsers were looking at? What happens in case of inconsistencies? Do you read/write from specific fields or do you reconcile all of them together?
What happens for properties where the file format specific fields have a fixed length or different encodings? Do you just write as much as possible and the rest is simply ignored? 

For all such questions, you have to think about whether it makes sense to let the client configure the embedder, to provide a callback API so the client can decide when specific scenarios arise, or to have the embedder always just make a best guess for the client.
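
One way to make the update semantics from point 2 explicit is to separate "set" from "remove", so that an empty value never implies deletion and properties that are not mentioned are left alone. The following is only a minimal sketch under that assumption: MetadataChanges is a hypothetical class (not part of Tika or the proposed patch), and plain string maps stand in for a real metadata container.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical change set (not a Tika class): explicit sets and removals. */
final class MetadataChanges {
    private final Map<String, String> toSet = new LinkedHashMap<>();
    private final Set<String> toRemove = new LinkedHashSet<>();

    /** Sets a property; an empty string is stored as a valid value. */
    MetadataChanges set(String name, String value) {
        toRemove.remove(name);
        toSet.put(name, value);
        return this;
    }

    /** Requests deletion of a property, independent of any value. */
    MetadataChanges remove(String name) {
        toSet.remove(name);
        toRemove.add(name);
        return this;
    }

    /** Returns a copy of the existing metadata with the changes applied. */
    Map<String, String> applyTo(Map<String, String> existing) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        result.keySet().removeAll(toRemove); // explicit deletions only
        result.putAll(toSet);                // updates and additions
        return result;                       // unmentioned properties survive
    }
}
```

With this shape, the embedder still needs the metadata already present in the file as input, and questions such as which container to write to or how to handle fixed-length fields remain open.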

In any such case, it is usually important for the client to get the original metadata from the file before writing it back, so that no properties are wrongly deleted or changed. But it is even more important for the Embedder, as in most cases it would have to read the metadata anyway in order to know how to update the file properly. It usually has to check whether an in-place update of metadata can happen or whether the whole file has to be restructured because the metadata chunks have grown too large to fit where they were before.
That's why I think it would be important to have a get-only-metadata API and Parser capabilities available before starting to write it back.

3) This also leads me to the topic of error recovery and safe updating of files. I think the documentation should be more clear about what the Embedder will do in case of an error and what is expected by the client. 
There are all sorts of reasons the embedding could fail. If that happens, the original file usually ends up being corrupt and lost for the user. So it usually makes sense (for smaller files) to do a safe update, which means writing the update to a new file and then swapping it with the original one after the update was successful.
But what about scenarios where a partial update is possible? You often have files where just specific metadata sections are corrupt because some tool did not read the spec and wrote it wrongly. But the rest of the file is still ok, so other parts could still be updated. Do you want to provide a callback API for the client to be able to react to error scenarios and decide what he wants to do? The embedder could do a best guess action, but that is usually quite dangerous for the user's files.
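
The safe update described in point 3 can be sketched roughly as follows. This is an illustration only: SafeUpdate and its Rewriter callback are hypothetical names (not part of Tika or the patch), and the code assumes Java 7 or later for java.nio.file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Tika class): write to a new file, then swap. */
final class SafeUpdate {

    /** Hypothetical callback that produces the updated file content. */
    interface Rewriter {
        void rewrite(InputStream original, OutputStream updated) throws IOException;
    }

    static void update(Path file, Rewriter rewriter) throws IOException {
        // Create the temp file next to the original so the move stays on one file system.
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(file);
                 OutputStream out = Files.newOutputStream(tmp)) {
                rewriter.rewrite(in, out);
            }
            // Replace the original only after the rewrite finished without error.
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failed rewrite.
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the rewrite throws, the original file is left untouched and the temporary file is removed. Whether the swap itself is atomic depends on the platform and file system, and a partial update of a file with some corrupt sections is not covered by this sketch.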

4) I take it that the expectation is that all parsers could also potentially implement the Embedder interface, so that both reading and writing are in one place? Otherwise you probably end up with all sorts of inconsistencies between the two implementations regarding what metadata fields are read from where and what should be updated when, etc.

5) Why do you pass in an InputStream? That would mean the Embedder has to open up its own OutputStream to be able to write. That would imply that Tika knows how to properly create OutputStreams in the client's environment. Wouldn't it be better to leave the client in control here? And why do you want to return the InputStream?
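
To illustrate the alternative raised in point 5, here is a minimal sketch of an interface where the client supplies both streams and nothing is returned. StreamEmbedder and HeaderLineEmbedder are hypothetical names, a plain string map stands in for the Metadata object, and the "header line" format is invented purely for the example; none of this is taken from the patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical interface: the client creates and owns both streams. */
interface StreamEmbedder {
    void embed(Map<String, String> metadata, InputStream original, OutputStream target)
            throws IOException;
}

/** Toy implementation for an invented text format: "# key=value" lines, then the original bytes. */
final class HeaderLineEmbedder implements StreamEmbedder {
    @Override
    public void embed(Map<String, String> metadata, InputStream original, OutputStream target)
            throws IOException {
        // Write one header line per metadata property.
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String line = "# " + e.getKey() + "=" + e.getValue() + "\n";
            target.write(line.getBytes(StandardCharsets.UTF_8));
        }
        // Copy the original content unchanged after the header.
        byte[] buffer = new byte[8192];
        int n;
        while ((n = original.read(buffer)) != -1) {
            target.write(buffer, 0, n);
        }
        target.flush(); // not closed here: the client opened the streams and closes them
    }
}
```

With this shape the client decides where the output goes (a file, a temporary file, a repository stream), and the embedder needs no knowledge of the client's environment.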

6) I also agree with Jukka's comments that for such an important new feature we should give this some more thought. I think your proposal works OK for the external embedder scenario, but I am not so sure about other scenarios.

Sorry that I did not speak up earlier. This issue has been around for quite a while.
Regards
Jörg
                
> Embed Capabilities
> ------------------
>
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>
>         Attachments: embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder implementation meant to be extended or configured are added.  These classes are essentially a reverse flow of the existing Parser and ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then verify the operation by parsing the resulting stream.
