
Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2011/12/31 22:38:06 UTC

... all major file formats

~
 http://projects.apache.org/projects/tika.html
~
 http://tika.apache.org/1.0/formats.html
~
 say that it: " ... easily detect(s) and extract(s) metadata and
content from all major file formats"
~
 I think "all major file formats" should be somehow functionally
specified through something like
~
 core.tika.formatHandlers.getAll[DefinedFormat]Handlers
~
 accessing some registry and (selectively) returning metadata in
XML-based RDF sections or a similar data structure. I also think that
registry should include a CMS-like interface (just for the
metadata) to the files in the repository, ideally searchable
through queries.
~
 The thing is that (I think) most people using tika will most probably
need it for large databanks/corpora and they would love to avail
themselves of such an interface to do some statistics or play with the
data. Say you have large amounts of MS Word documents you would like
to translate to ODT, but you don't want to lose any formatting and you
don't have the time to eyeball all of the files ...
~
 lbrtchx

Re: ... all major file formats

Posted by Albretch Mueller <lb...@gmail.com>.

~
 and as we know, the semantics of all pieces of metadata are (fairly
well ;-)) defined in the RFC for the MIME types,
 and users should be able to extend (or constrain) the definitions of
their own metadata with extra annotations
~
 lbrtchx


On 1/2/12, Albretch Mueller <lb...@gmail.com> wrote:
> ~
>  For some reason I could not try the latest version (see below), but
> with version tika-app-0.9 I got the MIME file types for each of the
> parsers. Yet I am thinking about more than that, say, you work for
> some relatively large business that intensively uses text (any
> business ;-)) and you would like to consolidate all documents with an
> open document format instead of keeping them in the various formats
> people may use and build some archive/CMS/corpus with all those texts
> and include a metadata registry search. How can someone know that the
> heading for a PDF file corresponds to the heading of an MS Word and/or
> RTF file or the title on an HTML file corresponds to the title of a
> media file?
>
> ~
>
>  How could you run a query (based on the metadata you have for all
> files) "simply" asking: tell me the location, owner and last modified
> date of all data files whose title is (or contains) "brainstorming"
> for "so und so" project? The way I see it is as a registry of a large,
> indexed resolution of a many-many relationship among the metadata.
>
> ~
>
>  You may tell me, well this is not tika's business ;-), but I see it
> as some "natural" next-step or sister project to tika. In fact, to me
> that should be part of tika
> ~
>  lbrtchx
> ~
> $ java -jar tika-app-0.9.jar --list-parser-details
> org.apache.tika.parser.AutoDetectParser (Composite Parser):
>   org.apache.tika.parser.DefaultParser (Composite Parser):
>     org.apache.tika.parser.asm.ClassParser
>       application/java-vm
>     org.apache.tika.parser.audio.AudioParser
>       audio/x-aiff
>       audio/x-wav
>       audio/basic
>     org.apache.tika.parser.audio.MidiParser
>       audio/midi
>       application/x-midi
> ...
>     org.apache.tika.parser.pdf.PDFParser
>       application/pdf
>     org.apache.tika.parser.pkg.PackageParser
>       application/x-tar
>       application/x-archive
>       application/x-gtar
>       application/x-gzip
>       application/x-cpio
>       application/x-bzip2
>       application/zip
>       application/x-bzip
>     org.apache.tika.parser.rtf.RTFParser
>       application/rtf
>     org.apache.tika.parser.txt.TXTParser
>       text/plain
>     org.apache.tika.parser.video.FLVParser
>       video/x-flv
>     org.apache.tika.parser.xml.DcXMLParser
>       application/xml
>       image/svg+xml
> ~
> $ wget http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
> --2012-01-01 18:16:26--
> http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
> Resolving www.apache.org... 140.211.11.131
> Connecting to www.apache.org|140.211.11.131|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `tika-app-1.0.jar'
>
>     [  <=>
>       ] 26,255       101K/s   in 0.3s
>
> 2012-01-01 18:16:27 (101 KB/s) - `tika-app-1.0.jar' saved [26255]
>
> $ java -jar tika-app-1.0.jar --list-parser-details
> Invalid or corrupt jarfile tika-app-1.0.jar
>
> $ ls -l tika-app-1.*
> -rwxrwxrwx 1 knoppix knoppix 26255 Jan  1 18:16 tika-app-1.0.jar
>
> $ md5sum tika-app-1.0.jar
> 321e83759e39e64817553f7eef15b5a8  tika-app-1.0.jar
>
> $ wget http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
> --2012-01-01 18:18:59--
> http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
> Resolving www.apache.org... 140.211.11.131
> Connecting to www.apache.org|140.211.11.131|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `apache-tika-1.0-src.zip'
>
>     [  <=>
>                      ] 26,967       103K/s   in 0.3s
>
> 2012-01-01 18:19:00 (103 KB/s) - `apache-tika-1.0-src.zip' saved [26967]
>
> $ md5sum -b apache-tika-1.0-src.zip
> 56d7028c50259b10705ef4d16ba08737 *apache-tika-1.0-src.zip
>
> $ unzip apache-tika-1.0-src.zip
> Archive:  apache-tika-1.0-src.zip
>   End-of-central-directory signature not found.  Either this file is not
>   a zipfile, or it constitutes one disk of a multi-part archive.  In the
>   latter case the central directory and zipfile comment will be found on
>   the last disk(s) of this archive.
> note:  apache-tika-1.0-src.zip may be a plain executable, not an archive
> unzip:  cannot find zipfile directory in one of apache-tika-1.0-src.zip or
>         apache-tika-1.0-src.zip.zip, and cannot find
> apache-tika-1.0-src.zip.ZIP, period.
> ~
>
>
> On 1/1/12, Nick Burch <ni...@alfresco.com> wrote:
>> On Sat, 31 Dec 2011, Albretch Mueller wrote:
>>> I think "all major file formats" should be somehow functionally
>>> specified through something like
>>> ~
>>> core.tika.formatHandlers.getAll[DefinedFormat]Handlers
>>
>> In code:
>>    TikaConfig config = TikaConfig.getDefaultConfig();
>>    Set<MediaType> supported =
>>      config.getParser().getSupportedTypes(new ParseContext());
>>
>> With the Tika App:
>>    java -jar tika-app-1.1-SNAPSHOT.jar --list-parser-details
>>
>>
>> Do these not already provide all the info required?
>>
>> Nick
>>
>

Re: ... all major file formats

Posted by Nick Burch <ni...@alfresco.com>.

On Mon, 2 Jan 2012, Albretch Mueller wrote:
> How can someone know that the heading for a PDF file corresponds to the 
> heading of an MS Word and/or RTF file or the title on an HTML file
> corresponds to the title of a media file?

They can't - both formats allow you to make something look like a heading 
text without semantically marking it as such. Tika will give you all the 
info it can, but it can't (usually!) help when you've done something odd 
with your doc and thrown away information...

> How could you run a query (based on the metadata you have for all
> files) "simply" asking: tell me the location, owner and last modified
> date of all data files whose title is (or contains) "brainstorming"
> for "so und so" project?

You need a CMS that integrates with Tika (there are several!), then you
ask it to find documents matching the extracted metadata.
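
To make the shape of such a query concrete, here is a small in-memory sketch. It assumes some earlier step (Tika, or a CMS built on it) has already extracted the metadata into plain strings. The `MetadataQuery` and `DocRecord` classes, the field names (`title`, `owner`, `modified`, `project`) and the sample values are all invented for illustration; none of this is a Tika or CMS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy stand-in for the metadata store a CMS would keep.
class MetadataQuery {

    // One extracted document: where it lives plus its metadata fields.
    static class DocRecord {
        final String location;
        final Map<String, String> metadata = new HashMap<>();

        DocRecord(String location, String title, String owner,
                  String modified, String project) {
            this.location = location;
            metadata.put("title", title);
            metadata.put("owner", owner);
            metadata.put("modified", modified);
            metadata.put("project", project);
        }
    }

    // Returns the records whose title contains the word (case-insensitive)
    // and whose project field equals the given project.
    static List<DocRecord> titleContains(List<DocRecord> docs,
                                         String word, String project) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<DocRecord> hits = new ArrayList<>();
        for (DocRecord d : docs) {
            String title = d.metadata.getOrDefault("title", "");
            if (title.toLowerCase(Locale.ROOT).contains(needle)
                    && project.equals(d.metadata.get("project"))) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<DocRecord> docs = Arrays.asList(
            new DocRecord("/share/a.doc", "Brainstorming notes", "alice",
                          "2011-12-30", "so-und-so"),
            new DocRecord("/share/b.pdf", "Budget", "bob",
                          "2011-12-31", "so-und-so"));
        // Print location, owner and last-modified date of each hit.
        for (DocRecord d : titleContains(docs, "brainstorming", "so-und-so")) {
            System.out.println(d.location + " " + d.metadata.get("owner")
                    + " " + d.metadata.get("modified"));
        }
    }
}
```

A real CMS would run this against an index rather than scanning every record, but the question being asked is the same.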

Nick

Re: ... all major file formats

Posted by Alex Ott <al...@gmail.com>.

Regarding the download problem: if you look at the Content-Type, this
is an HTML page that is served by closer.cgi and contains a list of
mirrors. You need to select which mirror to use to perform the actual
download.
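
One way to catch this before running `java -jar` or `unzip` is to look at the first bytes of the saved file: a non-empty ZIP (and therefore JAR) archive normally begins with the signature `PK\003\004`, while a mirror-selection page begins with HTML markup. A minimal sketch using only the JDK follows; the class and method names are made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: checks whether a download
// looks like a ZIP/JAR archive rather than, say, an HTML page.
class ZipSignatureCheck {

    // True if the buffer starts with the ZIP local file header "PK\003\004".
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }

    // Reads up to the first four bytes of the file and tests them.
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;  // file is shorter than four bytes
                }
                n += r;
            }
        }
        return n == head.length && looksLikeZip(head);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            boolean ok = looksLikeZip(Paths.get(name));
            System.out.println(name + ": "
                    + (ok ? "ZIP/JAR signature found"
                          : "no ZIP/JAR signature (possibly an HTML page)"));
        }
    }
}
```

Run it with the downloaded file names as arguments, for example `java ZipSignatureCheck tika-app-1.0.jar`. This only checks the leading signature; it does not verify that the archive is complete.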




-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Re: ... all major file formats

Posted by Albretch Mueller <lb...@gmail.com>.

~
 For some reason I could not try the latest version (see below), but
with version tika-app-0.9 I got the MIME file types for each of the
parsers. Yet I am thinking about more than that, say, you work for
some relatively large business that intensively uses text (any
business ;-)) and you would like to consolidate all documents with an
open document format instead of keeping them in the various formats
people may use and build some archive/CMS/corpus with all those texts
and include a metadata registry search. How can someone know that the
heading for a PDF file corresponds to the heading of an MS Word and/or
RTF file or the title on an HTML file corresponds to the title of a
media file?

~

 How could you run a query (based on the metadata you have for all
files) "simply" asking: tell me the location, owner and last modified
date of all data files whose title is (or contains) "brainstorming"
for "so und so" project? The way I see it is as a registry of a large,
indexed resolution of a many-many relationship among the metadata.
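
As a rough illustration of what an "indexed resolution of a many-many relationship" could look like, the sketch below keeps an inverted index from (field, word) pairs to file locations: one file can appear under many words and one word can point to many files. Everything here (the `MetadataIndex` class, its methods, the field names and paths) is hypothetical and not something Tika provides.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical inverted index over already-extracted metadata values.
class MetadataIndex {

    // Key "field=word" -> locations of the files whose field contains that word.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits the value into lower-cased words and records each one.
    void add(String location, String field, String value) {
        for (String word : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;  // split() can yield a leading empty string
            }
            postings.computeIfAbsent(field + "=" + word, k -> new TreeSet<>())
                    .add(location);
        }
    }

    // Returns the locations indexed under the given field and word.
    Set<String> lookup(String field, String word) {
        return postings.getOrDefault(
                field + "=" + word.toLowerCase(Locale.ROOT),
                Collections.emptySet());
    }

    public static void main(String[] args) {
        MetadataIndex index = new MetadataIndex();
        // Made-up sample entries.
        index.add("/share/a.doc", "title", "Brainstorming notes");
        index.add("/share/c.odt", "title", "More brainstorming");
        System.out.println(index.lookup("title", "brainstorming"));
    }
}
```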

~

 You may tell me, well this is not tika's business ;-), but I see it
as some "natural" next-step or sister project to tika. In fact, to me
that should be part of tika
~
 lbrtchx
~
$ java -jar tika-app-0.9.jar --list-parser-details
org.apache.tika.parser.AutoDetectParser (Composite Parser):
  org.apache.tika.parser.DefaultParser (Composite Parser):
    org.apache.tika.parser.asm.ClassParser
      application/java-vm
    org.apache.tika.parser.audio.AudioParser
      audio/x-aiff
      audio/x-wav
      audio/basic
    org.apache.tika.parser.audio.MidiParser
      audio/midi
      application/x-midi
...
    org.apache.tika.parser.pdf.PDFParser
      application/pdf
    org.apache.tika.parser.pkg.PackageParser
      application/x-tar
      application/x-archive
      application/x-gtar
      application/x-gzip
      application/x-cpio
      application/x-bzip2
      application/zip
      application/x-bzip
    org.apache.tika.parser.rtf.RTFParser
      application/rtf
    org.apache.tika.parser.txt.TXTParser
      text/plain
    org.apache.tika.parser.video.FLVParser
      video/x-flv
    org.apache.tika.parser.xml.DcXMLParser
      application/xml
      image/svg+xml
~
$ wget http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
--2012-01-01 18:16:26--
http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
Resolving www.apache.org... 140.211.11.131
Connecting to www.apache.org|140.211.11.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `tika-app-1.0.jar'

    [  <=>
      ] 26,255       101K/s   in 0.3s

2012-01-01 18:16:27 (101 KB/s) - `tika-app-1.0.jar' saved [26255]

$ java -jar tika-app-1.0.jar --list-parser-details
Invalid or corrupt jarfile tika-app-1.0.jar

$ ls -l tika-app-1.*
-rwxrwxrwx 1 knoppix knoppix 26255 Jan  1 18:16 tika-app-1.0.jar

$ md5sum tika-app-1.0.jar
321e83759e39e64817553f7eef15b5a8  tika-app-1.0.jar

$ wget http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
--2012-01-01 18:18:59--
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
Resolving www.apache.org... 140.211.11.131
Connecting to www.apache.org|140.211.11.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `apache-tika-1.0-src.zip'

    [  <=>
                     ] 26,967       103K/s   in 0.3s

2012-01-01 18:19:00 (103 KB/s) - `apache-tika-1.0-src.zip' saved [26967]

$ md5sum -b apache-tika-1.0-src.zip
56d7028c50259b10705ef4d16ba08737 *apache-tika-1.0-src.zip

$ unzip apache-tika-1.0-src.zip
Archive:  apache-tika-1.0-src.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
note:  apache-tika-1.0-src.zip may be a plain executable, not an archive
unzip:  cannot find zipfile directory in one of apache-tika-1.0-src.zip or
        apache-tika-1.0-src.zip.zip, and cannot find
apache-tika-1.0-src.zip.ZIP, period.
~


On 1/1/12, Nick Burch <ni...@alfresco.com> wrote:
> On Sat, 31 Dec 2011, Albretch Mueller wrote:
>> I think "all major file formats" should be somehow functionally
>> specified through something like
>> ~
>> core.tika.formatHandlers.getAll[DefinedFormat]Handlers
>
> In code:
>    TikaConfig config = TikaConfig.getDefaultConfig();
>    Set<MediaType> supported =
>      config.getParser().getSupportedTypes(new ParseContext());
>
> With the Tika App:
>    java -jar tika-app-1.1-SNAPSHOT.jar --list-parser-details
>
>
> Do these not already provide all the info required?
>
> Nick
>

Re: ... all major file formats

Posted by Nick Burch <ni...@alfresco.com>.

On Sat, 31 Dec 2011, Albretch Mueller wrote:
> I think "all major file formats" should be somehow functionally
> specified through something like
> ~
> core.tika.formatHandlers.getAll[DefinedFormat]Handlers

In code:
   TikaConfig config = TikaConfig.getDefaultConfig();
   Set<MediaType> supported =
     config.getParser().getSupportedTypes(new ParseContext());

With the Tika App:
   java -jar tika-app-1.1-SNAPSHOT.jar --list-parser-details


Do these not already provide all the info required?
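
For instance, the text printed by `--list-parser-details` can be turned into a MIME type to parser lookup table with a few lines of plain Java. This is only a sketch: the `ParserListing` class is made up, and it assumes the output looks like the listing pasted in this thread, where MIME type lines contain a `/` and parser lines do not.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing of the --list-parser-details text output.
class ParserListing {

    // Maps each MIME type to the parser line it was listed under.
    // If a type is listed more than once, the last parser wins.
    static Map<String, String> mimeToParser(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        String currentParser = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.equals("...")) {
                continue;  // skip blanks and the elision marker
            }
            if (line.contains("/")) {
                // Assumed to be a MIME type such as application/pdf.
                if (currentParser != null) {
                    result.put(line, currentParser);
                }
            } else {
                // Parser class name; drop a " (Composite Parser):" suffix.
                int paren = line.indexOf(" (");
                currentParser = paren >= 0 ? line.substring(0, paren) : line;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A few lines copied from the listing in this thread.
        List<String> sample = Arrays.asList(
            "org.apache.tika.parser.AutoDetectParser (Composite Parser):",
            "    org.apache.tika.parser.pdf.PDFParser",
            "      application/pdf",
            "    org.apache.tika.parser.rtf.RTFParser",
            "      application/rtf");
        mimeToParser(sample).forEach(
            (mime, parser) -> System.out.println(mime + " -> " + parser));
    }
}
```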

Nick