Posted to dev@tika.apache.org by "daniel schmidt (JIRA)" <ji...@apache.org> on 2018/02/27 21:05:00 UTC

[jira] [Comment Edited] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr

    [ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379137#comment-16379137 ] 

daniel schmidt edited comment on TIKA-2591 at 2/27/18 9:04 PM:
---------------------------------------------------------------

It is a bit of a "useful hack" as they say.

But it's also kind of weird: the code depends on TarArchiveInputStream throwing an exception to decide that a stream is "not a tar". In this case, for these images, "success" is essentially failure.

It does seem odd to declare tiff a sub-type of tar, but that is where the code led me, since Tika sees the file as a tar first and only later sees it as a tiff.
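
To make the effect of the sub-type declaration concrete, here is a minimal sketch of the selection loop described in the issue below. It is a simplified stand-in, not Tika's actual API: the class SpecializationSketch, its string-based types, and the declareSubClass method are all hypothetical, and the real registry logic may differ in detail.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the "keep the most specialized type" loop.
// It does not use or reproduce Tika's MediaTypeRegistry.
final class SpecializationSketch {
    private static final String OCTET_STREAM = "application/octet-stream";

    // child type -> declared parent type (the role played by sub-class-of)
    private final Map<String, String> parents = new HashMap<>();

    void declareSubClass(String child, String parent) {
        parents.put(child, parent);
    }

    // Assumption of this sketch: every other type specializes octet-stream.
    boolean isSpecializationOf(String candidate, String base) {
        if (candidate.equals(base)) {
            return false;
        }
        if (base.equals(OCTET_STREAM)) {
            return true;
        }
        // Walk up the declared parent chain looking for the base type.
        for (String p = parents.get(candidate); p != null; p = parents.get(p)) {
            if (p.equals(base)) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the loop: a later result only wins if it specializes the current one.
    String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");

        SpecializationSketch unpatched = new SpecializationSketch();
        System.out.println(unpatched.detect(results));

        SpecializationSketch patched = new SpecializationSketch();
        patched.declareSubClass("image/tiff", "application/x-tar");
        System.out.println(patched.detect(results));
    }
}
```

In this model the tar result arrives first, so the tiff result can only replace it once image/tiff is declared a sub-type of application/x-tar.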

Another option I considered was, in the ArchiveStreamFactory class (see code below), guarding the construction of TarArchiveInputStream with a conditional that checks whether the tarHeader variable starts with one of the TIFF magic numbers (II: 49 49 2A 00, MM: 4D 4D 00 2A).
For tiffs, those bytes are present at the start of tarHeader, so you can check them and go straight to the "simply not a tar" case without relying on the TarArchiveInputStream constructor or the getNextTarEntry method to throw an exception. That also seemed a little goofy, but it also worked.

        // add the magic number checks to this if statement to skip straight to the exception throw:
        if (signatureLength >= TAR_HEADER_SIZE) {
            TarArchiveInputStream tais = null;
            try {
                tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
                // COMPRESS-191 - verify the header checksum
                if (tais.getNextTarEntry().isCheckSumOK()) {
                    return TAR;
                }
            } catch (final Exception e) { // NOPMD // NOSONAR
                // can generate IllegalArgumentException as well
                // as IOException
                // autodetection, simply not a TAR
                // ignored
            } finally {
                IOUtils.closeQuietly(tais);
            }
        }
        throw new ArchiveException("No Archiver found for the stream signature");
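
The magic number check itself could look roughly like the sketch below. The TiffMagic class and its startsWithTiffMagic method are hypothetical names, not part of Commons Compress or Tika, and the 512-byte array in main only stands in for the tarHeader buffer.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Commons Compress or Tika.
final class TiffMagic {
    // Byte-order mark followed by the TIFF version number 42.
    private static final byte[] LITTLE_ENDIAN = {0x49, 0x49, 0x2A, 0x00}; // "II"
    private static final byte[] BIG_ENDIAN = {0x4D, 0x4D, 0x00, 0x2A};    // "MM"

    private TiffMagic() {
    }

    /** Returns true if the buffer starts with either TIFF magic number. */
    static boolean startsWithTiffMagic(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, 4);
        return Arrays.equals(prefix, LITTLE_ENDIAN) || Arrays.equals(prefix, BIG_ENDIAN);
    }

    public static void main(String[] args) {
        // Stand-in for tarHeader: a zeroed buffer with the big-endian TIFF magic.
        byte[] tarHeader = new byte[512];
        tarHeader[0] = 0x4D;
        tarHeader[1] = 0x4D;
        tarHeader[3] = 0x2A;
        System.out.println(startsWithTiffMagic(tarHeader));
    }
}
```

Under that assumption, the guard would become something like `if (signatureLength >= TAR_HEADER_SIZE && !TiffMagic.startsWithTiffMagic(tarHeader))`, so tiffs never reach the TarArchiveInputStream constructor.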



was (Author: schmiddc):
It is a bit of a "useful hack" as they say.

But it's also kind of weird: the code depends on TarArchiveInputStream throwing an exception to decide that a stream is "not a tar". In this case, for these images, "success" is essentially failure.

It does seem odd to declare tiff a sub-type of tar, but that is where the code led me, since Tika sees the file as a tar first and only later sees it as a tiff.
Another option I considered was guarding the construction of TarArchiveInputStream with a conditional that checks the header for the TIFF magic numbers (II: 49 49 2A 00, MM: 4D 4D 00 2A). They are there, and you can check them and go to the "simply not a tar" case without even throwing an exception. That also seemed a little goofy, but it also worked.

            try {
                tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
                // COMPRESS-191 - verify the header checksum
                if (tais.getNextTarEntry().isCheckSumOK()) {
                    return TAR;
                }
            } catch (final Exception e) { // NOPMD // NOSONAR
                // can generate IllegalArgumentException as well
                // as IOException
                // autodetection, simply not a TAR
                // ignored
            }


> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2591
>                 URL: https://issues.apache.org/jira/browse/TIKA-2591
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.16
>         Environment: Tika, running in a java application and a unit-test (windows and mac environments)
>            Reporter: daniel schmidt
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.18
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reported as application/x-tar by Tika, where it was previously reported as a tiff (image/tiff).
> Observe this code in the detect method of ArchiveStreamFactory:
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
>         }
> What I find is that most tiffs fail with an exception when they get to tais.getNextTarEntry() (i.e. they fall into the "simply not a tar" case). However, this tiff does NOT fail here. This somewhat makes sense, as the internal structure of a fax-compressed tiff has a tar-like structure.
> Note that the CompositeDetector class does eventually recognize it as a proper tiff as it loops through its detectors in its detect method. It is detected as a tiff by the MimeTypes class, which is one of the implementations of the Detector interface.
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector
>                     && metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
>     }
> However, since image/tiff isn't a specialization of application/x-tar, the loop does not replace the type with tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" element to the definition of image/tiff in the tika-mimetypes.xml file.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)