Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2011/12/31 22:38:06 UTC
... all major file formats
~
http://projects.apache.org/projects/tika.html
~
http://tika.apache.org/1.0/formats.html
~
say that it: " ... easily detect(s) and extract(s) metadata and
content from all major file formats"
~
I think "all major file formats" should be somehow functionally
specified through something like
~
core.tika.formatHandlers.getAll[DefinedFormat]Handlers
~
accessing some registry and (selectively) returning metadata in
XML-based RDF sections or a similar data structure. I also think that
registry should include some CMS-like interface (just for the
metadata) of the files in the repository, with some searchable
(ideally through queries) interface
~
The thing is that (I think) most people using tika will most probably
need it for large databanks/corpora and they would love to avail
themselves of such an interface to do some statistics or play with the
data. Say you have large amounts of MS Word documents you would like
to translate to ODT, but you don't want to lose any formatting and you
don't have the time to eyeball all of the files ...
~
lbrtchx
Re: ... all major file formats
Posted by Albretch Mueller <lb...@gmail.com>.
~
and as we know the semantics of all pieces of metadata is (fairly
well ;-)) defined in the RFC for the MIME types
and users should be able to extend (or constrain) the definitions of
their own metadata with extra annotations
~
lbrtchx
On 1/2/12, Albretch Mueller <lb...@gmail.com> wrote:
> ~
> For some reason I could not try the latest version (see below), but
> with version tika-app-0.9 I got the MIME file types for each of the
> parsers. Yet I am thinking about more than that, say, you work for
> some relatively large business that intensively uses text (any
> business ;-)) and you would like to consolidate all documents with an
> open document format instead of keeping them in the various formats
> people may use and build some archive/CMS/corpus with all those texts
> and include a metadata registry search. How can someone know that the
> heading for a PDF file corresponds to the heading of a MS Word and/or
> RTF file or the title on an HTML file corresponds to the title of a
> media file?
>
> ~
>
> How could you run a query (based on the metadata you have for all
> files) "simply" asking: tell me the location, owner and last modified
> date of all data files whose title is (or contains) "brainstorming"
> for "so und so" project? The way I see it is as a registry of a large,
> indexed resolution of a many-many relationship among the metadata.
>
> ~
>
> You may tell me, well this is not tika's business ;-), but I see it
> as some "natural" next-step or sister project to tika. In fact, to me
> that should be part of tika
> ~
> lbrtchx
> ~
> $ java -jar tika-app-0.9.jar --list-parser-details
> org.apache.tika.parser.AutoDetectParser (Composite Parser):
> org.apache.tika.parser.DefaultParser (Composite Parser):
> org.apache.tika.parser.asm.ClassParser
> application/java-vm
> org.apache.tika.parser.audio.AudioParser
> audio/x-aiff
> audio/x-wav
> audio/basic
> org.apache.tika.parser.audio.MidiParser
> audio/midi
> application/x-midi
> ...
> org.apache.tika.parser.pdf.PDFParser
> application/pdf
> org.apache.tika.parser.pkg.PackageParser
> application/x-tar
> application/x-archive
> application/x-gtar
> application/x-gzip
> application/x-cpio
> application/x-bzip2
> application/zip
> application/x-bzip
> org.apache.tika.parser.rtf.RTFParser
> application/rtf
> org.apache.tika.parser.txt.TXTParser
> text/plain
> org.apache.tika.parser.video.FLVParser
> video/x-flv
> org.apache.tika.parser.xml.DcXMLParser
> application/xml
> image/svg+xml
> ~
> $ wget http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
> --2012-01-01 18:16:26--
> http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
> Resolving www.apache.org... 140.211.11.131
> Connecting to www.apache.org|140.211.11.131|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `tika-app-1.0.jar'
>
> [ <=>
> ] 26,255 101K/s in 0.3s
>
> 2012-01-01 18:16:27 (101 KB/s) - `tika-app-1.0.jar' saved [26255]
>
> $ java -jar tika-app-1.0.jar --list-parser-details
> Invalid or corrupt jarfile tika-app-1.0.jar
>
> $ ls -l tika-app-1.*
> -rwxrwxrwx 1 knoppix knoppix 26255 Jan 1 18:16 tika-app-1.0.jar
>
> $ md5sum tika-app-1.0.jar
> 321e83759e39e64817553f7eef15b5a8 tika-app-1.0.jar
>
> $ wget http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
> --2012-01-01 18:18:59--
> http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
> Resolving www.apache.org... 140.211.11.131
> Connecting to www.apache.org|140.211.11.131|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `apache-tika-1.0-src.zip'
>
> [ <=>
> ] 26,967 103K/s in 0.3s
>
> 2012-01-01 18:19:00 (103 KB/s) - `apache-tika-1.0-src.zip' saved [26967]
>
> $ md5sum -b apache-tika-1.0-src.zip
> 56d7028c50259b10705ef4d16ba08737 *apache-tika-1.0-src.zip
>
> $ unzip apache-tika-1.0-src.zip
> Archive: apache-tika-1.0-src.zip
> End-of-central-directory signature not found. Either this file is not
> a zipfile, or it constitutes one disk of a multi-part archive. In the
> latter case the central directory and zipfile comment will be found on
> the last disk(s) of this archive.
> note: apache-tika-1.0-src.zip may be a plain executable, not an archive
> unzip: cannot find zipfile directory in one of apache-tika-1.0-src.zip or
> apache-tika-1.0-src.zip.zip, and cannot find
> apache-tika-1.0-src.zip.ZIP, period.
> ~
>
>
> On 1/1/12, Nick Burch <ni...@alfresco.com> wrote:
>> On Sat, 31 Dec 2011, Albretch Mueller wrote:
>>> I think "all major file formats" should be somehow functionally
>>> specified through something like
>>> ~
>>> core.tika.formatHandlers.getAll[DefinedFormat]Handlers
>>
>> In code:
>> TikaConfig config = TikaConfig.getDefaultConfig();
>> Set<MediaType> supported =
>> config.getParser().getSupportedTypes(new ParseContext());
>>
>> With the Tika App:
>> java -jar tika-app-1.1-SNAPSHOT.jar --list-parser-details
>>
>>
>> Do these not already provide all the info required?
>>
>> Nick
>>
>
Re: ... all major file formats
Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 2 Jan 2012, Albretch Mueller wrote:
> How can someone know that the heading for a PDF file corresponds to the
> heading of a MS Word and/or RTF file or the title on an HTML file
> corresponds to the title of a media file?
They can't - both formats allow you to make something look like a heading
text without semantically marking it as such. Tika will give you all the
info it can, but it can't (usually!) help when you've done something odd
with your doc and thrown away information...
> How could you run a query (based on the metadata you have for all
> files) "simply" asking: tell me the location, owner and last modified
> date of all data files whose title is (or contains) "brainstorming"
> for "so und so" project?
You need a CMS that integrates with Tika (there are several!), then you
ask that to find documents matching that extracted metadata
Nick
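The kind of query asked about above (location, owner and last modified date of every file whose title contains "brainstorming" for a given project) could be sketched as follows, once the metadata has been extracted and stored somewhere. This is only an in-memory illustration in plain Java: DocMeta, MetadataIndex, findByTitle and the field names are hypothetical, and none of them are part of Tika or of any particular CMS.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical holder for metadata already extracted from one file.
// The field names are illustrative; they are not Tika metadata keys.
final class DocMeta {
    final String location;
    final String owner;
    final LocalDate lastModified;
    final String title;
    final String project;

    DocMeta(String location, String owner, LocalDate lastModified,
            String title, String project) {
        this.location = location;
        this.owner = owner;
        this.lastModified = lastModified;
        this.title = title;
        this.project = project;
    }
}

// Hypothetical in-memory registry; a real CMS would use a search index.
final class MetadataIndex {
    private final List<DocMeta> docs = new ArrayList<>();

    void add(DocMeta doc) {
        docs.add(doc);
    }

    // Case-insensitive "title contains" match, restricted to one project.
    List<DocMeta> findByTitle(String fragment, String project) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        List<DocMeta> hits = new ArrayList<>();
        for (DocMeta d : docs) {
            boolean titleMatches = d.title != null
                    && d.title.toLowerCase(Locale.ROOT).contains(needle);
            if (titleMatches && project.equals(d.project)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        MetadataIndex index = new MetadataIndex();
        index.add(new DocMeta("/archive/notes.odt", "alice",
                LocalDate.of(2011, 12, 30), "Brainstorming notes", "so und so"));
        for (DocMeta d : index.findByTitle("brainstorming", "so und so")) {
            System.out.println(d.location + " " + d.owner + " " + d.lastModified);
        }
    }
}
```

A linear scan like this is fine for a handful of files; for a large corpus the same lookup would normally be delegated to whatever index the CMS provides.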
Re: ... all major file formats
Posted by Alex Ott <al...@gmail.com>.
Regarding the download problem: if you look at the Content-Type
(text/html), you will see that what you saved is an HTML page served by
closer.cgi that contains a list of mirrors. You need to select a mirror
and perform the actual download from it.
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
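The download problem in this thread (an HTML mirror page saved under a .jar or .zip name) can be caught before running java -jar or unzip by looking at the first bytes of the file. The sketch below is plain Java and only checks for the ZIP local file header signature "PK\003\004", which a non-empty ZIP or JAR starts with; DownloadCheck and looksLikeZip are hypothetical names, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: tells a real ZIP/JAR apart from
// something else (for example an HTML page) saved under the same name.
final class DownloadCheck {

    // True if the first four bytes are the ZIP signature 0x50 0x4B 0x03 0x04.
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 0x50 && head[1] == 0x4B
                && head[2] == 0x03 && head[3] == 0x04;
    }

    // Reads up to four bytes from the start of the file and tests them.
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] head = new byte[4];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read < 0) {
                    break; // file is shorter than four bytes
                }
                filled += read;
            }
        }
        return filled == head.length && looksLikeZip(head);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + (looksLikeZip(file)
                ? ": looks like a ZIP/JAR"
                : ": not a ZIP/JAR (possibly an HTML mirror page)"));
    }
}
```

Run against the 26,255-byte tika-app-1.0.jar from the transcript, a check like this would be expected to report that the file is not a ZIP, which matches the "Invalid or corrupt jarfile" error.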
Re: ... all major file formats
Posted by Albretch Mueller <lb...@gmail.com>.
~
For some reason I could not try the latest version (see below), but
with version tika-app-0.9 I got the MIME file types for each of the
parsers. Yet I am thinking about more than that, say, you work for
some relatively large business that intensively uses text (any
business ;-)) and you would like to consolidate all documents with an
open document format instead of keeping them in the various formats
people may use and build some archive/CMS/corpus with all those texts
and include a metadata registry search. How can someone know that the
heading for a PDF file corresponds to the heading of a MS Word and/or
RTF file or the title on an HTML file corresponds to the title of a
media file?
~
How could you run a query (based on the metadata you have for all
files) "simply" asking: tell me the location, owner and last modified
date of all data files whose title is (or contains) "brainstorming"
for "so und so" project? The way I see it is as a registry of a large,
indexed resolution of a many-many relationship among the metadata.
~
You may tell me, well this is not tika's business ;-), but I see it
as some "natural" next-step or sister project to tika. In fact, to me
that should be part of tika
~
lbrtchx
~
$ java -jar tika-app-0.9.jar --list-parser-details
org.apache.tika.parser.AutoDetectParser (Composite Parser):
org.apache.tika.parser.DefaultParser (Composite Parser):
org.apache.tika.parser.asm.ClassParser
application/java-vm
org.apache.tika.parser.audio.AudioParser
audio/x-aiff
audio/x-wav
audio/basic
org.apache.tika.parser.audio.MidiParser
audio/midi
application/x-midi
...
org.apache.tika.parser.pdf.PDFParser
application/pdf
org.apache.tika.parser.pkg.PackageParser
application/x-tar
application/x-archive
application/x-gtar
application/x-gzip
application/x-cpio
application/x-bzip2
application/zip
application/x-bzip
org.apache.tika.parser.rtf.RTFParser
application/rtf
org.apache.tika.parser.txt.TXTParser
text/plain
org.apache.tika.parser.video.FLVParser
video/x-flv
org.apache.tika.parser.xml.DcXMLParser
application/xml
image/svg+xml
~
$ wget http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
--2012-01-01 18:16:26--
http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar
Resolving www.apache.org... 140.211.11.131
Connecting to www.apache.org|140.211.11.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `tika-app-1.0.jar'
[ <=>
] 26,255 101K/s in 0.3s
2012-01-01 18:16:27 (101 KB/s) - `tika-app-1.0.jar' saved [26255]
$ java -jar tika-app-1.0.jar --list-parser-details
Invalid or corrupt jarfile tika-app-1.0.jar
$ ls -l tika-app-1.*
-rwxrwxrwx 1 knoppix knoppix 26255 Jan 1 18:16 tika-app-1.0.jar
$ md5sum tika-app-1.0.jar
321e83759e39e64817553f7eef15b5a8 tika-app-1.0.jar
$ wget http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
--2012-01-01 18:18:59--
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
Resolving www.apache.org... 140.211.11.131
Connecting to www.apache.org|140.211.11.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `apache-tika-1.0-src.zip'
[ <=>
] 26,967 103K/s in 0.3s
2012-01-01 18:19:00 (103 KB/s) - `apache-tika-1.0-src.zip' saved [26967]
$ md5sum -b apache-tika-1.0-src.zip
56d7028c50259b10705ef4d16ba08737 *apache-tika-1.0-src.zip
$ unzip apache-tika-1.0-src.zip
Archive: apache-tika-1.0-src.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
note: apache-tika-1.0-src.zip may be a plain executable, not an archive
unzip: cannot find zipfile directory in one of apache-tika-1.0-src.zip or
apache-tika-1.0-src.zip.zip, and cannot find
apache-tika-1.0-src.zip.ZIP, period.
~
On 1/1/12, Nick Burch <ni...@alfresco.com> wrote:
> On Sat, 31 Dec 2011, Albretch Mueller wrote:
>> I think "all major file formats" should be somehow functionally
>> specified through something like
>> ~
>> core.tika.formatHandlers.getAll[DefinedFormat]Handlers
>
> In code:
> TikaConfig config = TikaConfig.getDefaultConfig();
> Set<MediaType> supported =
> config.getParser().getSupportedTypes(new ParseContext());
>
> With the Tika App:
> java -jar tika-app-1.1-SNAPSHOT.jar --list-parser-details
>
>
> Do these not already provide all the info required?
>
> Nick
>
Re: ... all major file formats
Posted by Nick Burch <ni...@alfresco.com>.
On Sat, 31 Dec 2011, Albretch Mueller wrote:
> I think "all major file formats" should be somehow functionally
> specified through something like
> ~
> core.tika.formatHandlers.getAll[DefinedFormat]Handlers
In code:
  TikaConfig config = TikaConfig.getDefaultConfig();
  Set<MediaType> supported =
      config.getParser().getSupportedTypes(new ParseContext());

With the Tika App:
  java -jar tika-app-1.1-SNAPSHOT.jar --list-parser-details

Do these not already provide all the info required?
Nick
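For anyone who wants the --list-parser-details output from earlier in the thread as a data structure rather than as text, a rough sketch follows. It is plain Java with no Tika dependency, and it assumes the format seen in the 0.9 listing above: a parser class name on one line (starting with "org."), followed by its MIME types on the next lines (each containing a "/"). ParserListing and parse are hypothetical names, and other Tika versions may print the listing differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups the MIME types printed by
// "--list-parser-details" under the parser class that precedes them.
final class ParserListing {

    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> byParser = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.contains("/")) {
                // Assumed to be a MIME type such as "application/pdf".
                if (current != null) {
                    current.add(line);
                }
            } else if (line.startsWith("org.")) {
                // Parser line; drop any " (Composite Parser):" suffix.
                String name = line.split("\\s+", 2)[0];
                if (name.endsWith(":")) {
                    name = name.substring(0, name.length() - 1);
                }
                current = new ArrayList<>();
                byParser.put(name, current);
            }
            // Anything else (for example the "..." elision) is skipped.
        }
        return byParser;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        parse(lines).forEach((parser, types) -> System.out.println(parser + " -> " + types));
    }
}
```

The listing can be piped straight into it, for example: java -jar tika-app-0.9.jar --list-parser-details | java ParserListing. Composite parsers end up with an empty list, since their MIME types are listed under the parsers they wrap.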