Posted to sanselan-dev@incubator.apache.org by Charles Matthew Chen <ch...@gmail.com> on 2007/12/18 05:52:33 UTC

Re: Sanselan: capabilities, questions, thoughts

Hi Endre,

   Thanks for your thoughtful, fascinating comments (as always).

   As you've probably noticed, work on the EXIF rewrite issue hasn't
come very far in the last few months as a) I haven't had much time for
Sanselan and b) I've been more focused on aspects of the project
brought on by the move to Apache, such as starting to rewrite the unit
tests, etc.  I'm hoping to get back to that soon, and I thank you for
your patience.  Given that the holidays are near, not much is likely
to happen before early January.

   Vector - good point, it's just an old habit.  My philosophy has
always been to make optimization the LAST concern, behind correctness,
clean design, etc.  So, my habit has always been to use thread-safe
data structures from the get-go and replace them with non-thread-safe
versions later, when the design stabilizes.  However, you are right...
thread-safety issues are avoided entirely in Sanselan by having almost
no persistent state in the library, so using ArrayList should not be a
problem.
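
   A minimal sketch of the change being discussed, assuming a Java 5 or
later compiler.  The class name TagNameList is hypothetical and is not
taken from the Sanselan source; it only shows a field declared as List
and backed by ArrayList, with a synchronized wrapper available if
shared state is ever introduced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder class, for illustration only; not Sanselan code.
class TagNameList {
    // Declared as List so the implementation can be swapped in one place.
    // ArrayList is unsynchronized, unlike Vector.
    private final List<String> names = new ArrayList<String>();

    void add(String name) {
        names.add(name);
    }

    int size() {
        return names.size();
    }

    // If the list ever has to be shared between threads, it can be
    // wrapped instead of going back to Vector.
    List<String> synchronizedView() {
        return Collections.synchronizedList(names);
    }
}
```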

   I agree, supporting all String encodings is critical.
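
   As a rough illustration of decoding with an explicit encoding rather
than the platform default, here is a small sketch.  The class and
method names are hypothetical, and it assumes Java 7 or later for
StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not Sanselan code.
class ExplicitCharsetExample {
    // Decodes raw bytes with a caller-chosen charset, so the result
    // does not depend on the platform's default encoding.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Convenience for ISO-8859-1, where each byte maps to the Unicode
    // code point with the same numeric value.
    static String decodeLatin1(byte[] raw) {
        return decode(raw, StandardCharsets.ISO_8859_1);
    }
}
```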

   TIFF pointer/reference integrity: so, let's break down the issue:
a) are any pointers/references from within an exif segment going to
point outside of that segment?  I don't think so.  I have yet to see
an example of this.  b) do we know of any pointer/reference tags (i.e.
from Phil Thompson's tag encyclopedia) that Sanselan isn't handling
properly?  I need to review this, but I'm not aware of any.  c) do
maker notes contain pointers/references?  It sounds like they do.
Supporting all of the (many, undocumented) maker notes is outside the
scope of Sanselan at this point.  If someone wants to volunteer to
work on this, it would be great, but I don't have the interest or the
time for such a massive project.
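
   Question (a) can be checked mechanically while parsing.  The sketch
below assumes offsets are measured from the start of the EXIF data, as
described in the earlier mail quoted further down; the class and
method names are hypothetical and not part of Sanselan.

```java
// Hypothetical helper, for illustration only; not Sanselan code.
class OffsetCheck {
    // Returns true if the byte range [offset, offset + length) lies
    // entirely inside a segment of segmentLength bytes. Offsets are
    // assumed to be relative to the start of the EXIF data.
    static boolean isInsideSegment(long offset, long length, long segmentLength) {
        // Comparing against segmentLength - length avoids overflow
        // in offset + length for large values.
        return offset >= 0 && length >= 0 && offset <= segmentLength - length;
    }
}
```

   A parser could log or count every offset for which this returns
false, which would turn up any real-world example of a pointer that
leaves the segment.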

   True, discarding Maker Notes is a terrible option, but I see no
other choice at this point.  Is there any library (in any language)
that supports rewriting the majority of Maker Notes?  I don't think
even Phil Thompson's Perl library does this?  Am I wrong?

   I agree, Sanselan should offer access to both raw and parsed tag data.
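
   One possible shape for this, as a sketch only: the raw bytes are
the primary data and any interpretation is derived on demand.  The
class RawTagField and its methods are hypothetical names, not the
Sanselan API, and the ASCII accessor assumes a NUL-terminated value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical field class, for illustration only; not Sanselan code.
class RawTagField {
    private final int tagId;
    private final byte[] rawBytes;

    RawTagField(int tagId, byte[] rawBytes) {
        this.tagId = tagId;
        this.rawBytes = rawBytes.clone(); // defensive copy
    }

    int getTagId() {
        return tagId;
    }

    // Raw access: exactly the bytes that were read from the file.
    byte[] getRawBytes() {
        return rawBytes.clone();
    }

    // Parsed access, derived on demand: the bytes up to the first NUL
    // (or the end of the value) interpreted as ASCII text.
    String asAsciiString() {
        int end = 0;
        while (end < rawBytes.length && rawBytes[end] != 0) {
            end++;
        }
        return new String(rawBytes, 0, end, StandardCharsets.US_ASCII);
    }
}
```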

   I always enjoy your thoughts, thank you again.

Charles.

ps. I hope you don't mind if I cc the list at apache.


On Dec 14, 2007 7:38 AM, Endre Stølsvik <En...@stolsvik.com> wrote:
> Hi again!
>
> How's Sanselan coming along?
>
> Is the move to Apache taking longer than anticipated? After hanging
> around on Apache mailing lists for many years, those incubations seem to
> extend into forever pretty much always..
>
> But, remembering our discussion about metadata, I came over a new page
> today that I hadn't read (by Phil again), that points to the direct
> problem..
>
> To start off on our previous thread:
>
> >>      * The derivation of Exif from the TIFF file structure using offset
> >> pointers in the files means that data can be spread anywhere within a
> >> file, which means that software is likely to corrupt any pointers or
> >> corresponding data that it doesn't decode/encode. This is why most image
> >> editors damage or remove the Exif metadata (particularly the MakerNote)
> >> to some extent upon saving.
> >
> > The statement that "the pointers can refer to data anywhere in the
> > file" is wrong.  The TIFF file structure of the EXIF data is embedded
> > in the JPEG/JFIF APP1 segment; offset pointers within that EXIF data
> > are relative to the start of that EXIF data, and are local to the APP1
> > segment.
> >
> > It is true that the directory and value offsets can point anywhere
> > within that segment which (see my previous email) means that we'll
> > always have to parse and write the entire directory structure.  This
> > isn't incompatible with generally preserving binary compatibility at
> > the field level.
>
> Check http://www.sno.phy.queensu.ca/~phil/exiftool/canon_raw.html
>
> Relevant text (he compares TIFF to CIFF, Canon's "Camera Image File
> Format"):
>
>    " A short rant about TIFF inadequacies:
>
> TIFF format on the other hand, really sucks in comparison (this includes
> JPEG too, since JPEG uses TIFF format to store the EXIF information).
> The main problems are the use of absolute offsets and the ambiguity
> between integers and pointers (such as those used for custom IFD's).
> Because absolute offsets require adjusting whenever anything is moved in
> the file, the format of ALL contained data structures must be understood
> to properly edit the file. This results in an impossible situation when
> presented with undocumented custom structures like those used in the
> maker notes written by modern digital cameras. This is why it is so
> common for image editors to either scramble the maker notes or discard
> them completely. The official TIFF recommendation is to discard unknown
> information when rewriting the image (as Photoshop does), but for many,
> including myself, this option is simply unacceptable. "
>
> >
> >>      * The standard defines a MakerNote tag, which allows camera
> >> manufacturers to place any custom format metadata in the file. This is
> >> used increasingly by camera manufacturers to store a myriad of camera
> >> settings not listed in the Exif standard, such as shooting modes,
> >> post-processing settings, serial number, focusing modes, etc. As this
> >> tag format is proprietary and manufacturer-specific, it can be
> >> prohibitively difficult to retrieve this information from an image (or
> >> properly preserve it when rewriting an image). Some manufacturers
> >> encrypt portions of the information; for example, Nikon encrypts the
> >> detailed lens data in their newer MakerNote data versions.[3]
> >>
> >
> > I definitely plan on supporting MakerNote data eventually, but this is
> > non-trivial since it is vendor-specific.  In the meantime, it is easy
> > to preserve binary compatibility for MakerNotes.
>
> So, basically, the above suggests that *if you do not* support
> MakerNote, you might end up scrambling it, since there might be pointers
> within it that _should have been_ updated, but if you simply binary
> "dump" them into the next file, will end up pointing wrong.
>
>
> You do know about Metadata Extractor? It is a nice tool that has had
> quite a bit of research put into it. It also supports a bit of MakerNote
> for several brands. However, now that I have to dig down into it to use
> it, I (subjectively, of course) find several design flaws.
>
> (Beware: rant coming up..)
>
> For one, he always uses default string conversions, sticking an array of
> bytes into the String constructor _without specifying encoding_, which I
> find simply god-awfully horrendous (Seriously, it pretty much pisses me
> off that people in these days can do such extreme and world-ignorant
> blunders). This obviously doesn't make much difference in the _vast_
> majority of cases, since a) Latin1 dominates the world, and is standard
> encoding on many (older) machines, and b) Since UTF-8 "fallbacks" to
> Latin-1 as long as the high-bit isn't set, it will work on much of the
> rest too. Even EBCDIC on IBM frames fallback for most of the pure
> characters and numbers to ASCII/Latin-1. However, it still feels VERY
> wobbly: Why not just stick in ,"ISO-8859-1" in the constructor???
>    Secondly, there is no way to get hold of the original binary data,
> only the somewhat parsed data. So, if I don't trust his conversion
> (which I do not necessarily, see above, and below (II/MM)!), I'm not at
> liberty to do the interpretation myself.
>    Third, I don't like the OO design - but that might be subjective.
> However, I believe a "parse the structure - then parse tag values on
> demand" would be much better suited: the structure is "picked apart",
> letting each directory and tag carry its raw id, and raw binary
> data/value. Then, when/if traversing the structure, a tag would "look
> up" its symbolic name, and parse/decode/interpret its value, on the fly,
> pretty much using the interpreting stuff as a utility, not an intrinsic
> part of the parsing. This way, the primary data was what actually came
> directly out of the file, while interpreted data was derived on the fly.
> This would also open up for easy parser-pluggability/extendability, as
> mentioned further down.
>    Fourth, if he doesn't understand a tag, he parses it into _integers_,
> not bytes. I cannot comprehend why. Again, bytes should be primary, and
> "parsed integers" should be derived. (I've come to understand that
> integers are somewhat "native" to TIFF/Exif. However, e.g. text strings
> are not. And when you don't know something, then _bytes_ are the only
> correct format (since that's the lowest denominator you can get in this
> situation; there aren't out-of-octet boundaries anywhere)).
>    Fifth - he reads out the "MM"/"II" part, but doesn't ever seem to use
> the result, thus I question the integrity of the parsed data on yet
> another level.
>    Sixth - there should be provisions to add "user parsing": If I've
> managed to decode more tags, I should be able to stick that into the
> framework, in wait for the fold-in of my new findings into the base
> code. Also overrides.
>
> I'm at this point inclined to find some way to use his parsing of the
> structure, but add entry points to the code so that I can use his
> parsing as raw utilities, at my choosing, since there is so much research
> and quirk-handling put into it. Or, for example, use your stuff for
> parsing, and let his stuff decode/interpret whatever you've not done
> yet. Or do it myself..
>
>
> I'm now going to delve into your code! I'm so looking forward to the
> metadata structure write code, excited to see how you've solved the
> actual read/modify/write path..
>
> Okay, started:
>
> One HUGE thing: Why do you use Vector, *AT ALL*? You do know that Vector
> is worse than death? If the argument is that it's going to run on java
> 1.0 or 1.1, you're fucked already, since you in the same file use Map,
> which is 1.2.
>
> Vector is *synchronized* - every damn operation! Do a search-replace
> Vector->ArrayList (but of course use List on declarations), and you will
> probably get some absolutely free performance improvements.
>    (At this point, I cannot see any reason for wanting synchronized
> code - but then again, I've read about 5 classes by now, so maybe you
> have some absolutely amazing multi-threading going on in the
> parsing/decoding of a single image).
>
> Kind regards,
> Endre.
>
>
>

Re: Sanselan: capabilities, questions, thoughts

Posted by Charles Matthew Chen <ch...@gmail.com>.

Endre,

   I'd like to apologize for my lapse in judgment.

Charles.


Here is the sanselan-dev subscription info - you're right, it needs to
be added to our wiki.

Mailing Lists
-------------

To get involved with the Apache Sanselan project, start by having a
look at our website (link at top of page) and join our mailing
lists by sending an empty message to

   sanselan-dev-subscribe     :at: incubator.apache.org
and
   sanselan-commits-subscribe :at: incubator.apache.org

and the dev mailing list archives can be found at

   http://incubator.apache.org/mail/sanselan-dev/



On Dec 18, 2007 6:41 AM, Endre Stølsvik <On...@stolsvik.com> wrote:
> Charles Matthew Chen wrote:
> >
> > ps. I hope you don't mind if I cc the list at apache.
>
> Actually, I do - quite a bit: There was some rather intense critique of
> a completely different project there, meant exclusively for you (at
> least not meant for the public at large) - related to Sanselan only as a
> heads-up of how maybe NOT to do some aspects of handling the metadata. Had
> I wanted it on a public list, with public archives, I'd have known how to put
> it there myself.
>
> PS: There was no place on The Internet stating how to get onto the
> Sanselan mailing lists, at least that I found.
>
> Endre.
>

Re: Sanselan: capabilities, questions, thoughts

Posted by Endre Stølsvik <On...@stolsvik.com>.

Charles Matthew Chen wrote:
> 
> ps. I hope you don't mind if I cc the list at apache.

Actually, I do - quite a bit: There was some rather intense critique of 
a completely different project there, meant exclusively for you (at 
least not meant for the public at large) - related to Sanselan only as a 
heads-up of how maybe NOT to do some aspects of handling the metadata. Had
I wanted it on a public list, with public archives, I'd have known how to put
it there myself.

PS: There was no place on The Internet stating how to get onto the 
Sanselan mailing lists, at least that I found.

Endre.