Posted to user@tika.apache.org by Kamil Żyta <ka...@pwr.edu.pl> on 2014/10/14 12:55:36 UTC
External parser
Hi,
I want to use the external parser, but I can't find a complete howto/tutorial on the web.
I only found the parser/external/tika-external-parsers.xml sample configuration,
but I don't know how to register/enable this parser among the Tika parsers.
I would be thankful for any help.
regards,
KŻ
Re: External parser
Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 04:01:43PM +0200, Kamil Żyta wrote:
> On Tue, Oct 14, 2014 at 02:46:07PM +0100, Nick Burch wrote:
> > On Tue, 14 Oct 2014, Kamil Żyta wrote:
> > >> You'd basically need to do something like
> > >> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
> > >>
> > >
> > > http://pastebin.com/wSgwFva3
> >
> > Key there is
> > <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
> >
> > Can you try asking the Tika CLI what it detects your file as, and what
> > parsers it thinks are present?
>
> <meta name="Content-Type" content="video/x-matroska"/>
>
> # java -classpath tika-app-1.6.jar:. org.apache.tika.cli.TikaCLI --list-parsers
> org.apache.tika.parser.AutoDetectParser (Composite Parser):
> org.apache.tika.parser.DefaultParser (Composite Parser):
> org.apache.tika.parser.asm.ClassParser
> org.apache.tika.parser.audio.AudioParser
> org.apache.tika.parser.audio.MidiParser
> org.apache.tika.parser.chm.ChmParser
> org.apache.tika.parser.code.SourceCodeParser
> org.apache.tika.parser.crypto.Pkcs7Parser
> org.apache.tika.parser.dwg.DWGParser
> org.apache.tika.parser.epub.EpubParser
> org.apache.tika.parser.executable.ExecutableParser
> org.apache.tika.parser.feed.FeedParser
> org.apache.tika.parser.font.AdobeFontMetricParser
> org.apache.tika.parser.font.TrueTypeParser
> org.apache.tika.parser.hdf.HDFParser
> org.apache.tika.parser.html.HtmlParser
> org.apache.tika.parser.image.ImageParser
> org.apache.tika.parser.image.PSDParser
> org.apache.tika.parser.image.TiffParser
> org.apache.tika.parser.iptc.IptcAnpaParser
> org.apache.tika.parser.iwork.IWorkPackageParser
> org.apache.tika.parser.jpeg.JpegParser
> org.apache.tika.parser.mail.RFC822Parser
> org.apache.tika.parser.mat.MatParser
> org.apache.tika.parser.mbox.MboxParser
> org.apache.tika.parser.mbox.OutlookPSTParser
> org.apache.tika.parser.microsoft.OfficeParser
> org.apache.tika.parser.microsoft.TNEFParser
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> org.apache.tika.parser.mp3.Mp3Parser
> org.apache.tika.parser.mp4.MP4Parser
> org.apache.tika.parser.netcdf.NetCDFParser
> org.apache.tika.parser.odf.OpenDocumentParser
> org.apache.tika.parser.pdf.PDFParser
> org.apache.tika.parser.pkg.CompressorParser
> org.apache.tika.parser.pkg.PackageParser
> org.apache.tika.parser.rtf.RTFParser
> org.apache.tika.parser.txt.TXTParser
> org.apache.tika.parser.video.FLVParser
> org.apache.tika.parser.xml.DcXMLParser
> org.apache.tika.parser.xml.FictionBookParser
> org.gagravarr.tika.FlacParser
> org.gagravarr.tika.OggParser
> org.gagravarr.tika.OpusParser
> org.gagravarr.tika.SpeexParser
> org.gagravarr.tika.VorbisParser
Can anyone help me with this?
K
Re: External parser
Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 02:46:07PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> >> You'd basically need to do something like
> >> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
> >>
> >
> > http://pastebin.com/wSgwFva3
>
> Key there is
> <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
>
> Can you try asking the Tika CLI what it detects your file as, and what
> parsers it thinks are present?
<meta name="Content-Type" content="video/x-matroska"/>
# java -classpath tika-app-1.6.jar:. org.apache.tika.cli.TikaCLI --list-parsers
org.apache.tika.parser.AutoDetectParser (Composite Parser):
org.apache.tika.parser.DefaultParser (Composite Parser):
org.apache.tika.parser.asm.ClassParser
org.apache.tika.parser.audio.AudioParser
org.apache.tika.parser.audio.MidiParser
org.apache.tika.parser.chm.ChmParser
org.apache.tika.parser.code.SourceCodeParser
org.apache.tika.parser.crypto.Pkcs7Parser
org.apache.tika.parser.dwg.DWGParser
org.apache.tika.parser.epub.EpubParser
org.apache.tika.parser.executable.ExecutableParser
org.apache.tika.parser.feed.FeedParser
org.apache.tika.parser.font.AdobeFontMetricParser
org.apache.tika.parser.font.TrueTypeParser
org.apache.tika.parser.hdf.HDFParser
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.image.ImageParser
org.apache.tika.parser.image.PSDParser
org.apache.tika.parser.image.TiffParser
org.apache.tika.parser.iptc.IptcAnpaParser
org.apache.tika.parser.iwork.IWorkPackageParser
org.apache.tika.parser.jpeg.JpegParser
org.apache.tika.parser.mail.RFC822Parser
org.apache.tika.parser.mat.MatParser
org.apache.tika.parser.mbox.MboxParser
org.apache.tika.parser.mbox.OutlookPSTParser
org.apache.tika.parser.microsoft.OfficeParser
org.apache.tika.parser.microsoft.TNEFParser
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
org.apache.tika.parser.mp3.Mp3Parser
org.apache.tika.parser.mp4.MP4Parser
org.apache.tika.parser.netcdf.NetCDFParser
org.apache.tika.parser.odf.OpenDocumentParser
org.apache.tika.parser.pdf.PDFParser
org.apache.tika.parser.pkg.CompressorParser
org.apache.tika.parser.pkg.PackageParser
org.apache.tika.parser.rtf.RTFParser
org.apache.tika.parser.txt.TXTParser
org.apache.tika.parser.video.FLVParser
org.apache.tika.parser.xml.DcXMLParser
org.apache.tika.parser.xml.FictionBookParser
org.gagravarr.tika.FlacParser
org.gagravarr.tika.OggParser
org.gagravarr.tika.OpusParser
org.gagravarr.tika.SpeexParser
org.gagravarr.tika.VorbisParser
K
Re: External parser
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
>> You'd basically need to do something like
>> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
>>
>
> http://pastebin.com/wSgwFva3
Key there is
<meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
Can you try asking the Tika CLI what it detects your file as, and what
parsers it thinks are present?
Nick
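The first check described above can be read straight from Tika's XHTML output: an X-Parsed-By value of org.apache.tika.parser.EmptyParser means no real parser handled the file. Below is a minimal Java sketch of that check, using only the standard library. The class and method names (MetaCheck, metaTags, fellBackToEmptyParser) are hypothetical and not part of Tika, and the regex assumes meta tags in exactly the form quoted in this thread. If a name is repeated, only its last value is kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: inspects <meta> tags in XHTML output.
final class MetaCheck {
    // Assumes the attribute order name="..." content="..." as shown in the thread.
    private static final Pattern META =
        Pattern.compile("<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/>");

    /** Collects name -> content pairs; a repeated name keeps its last value. */
    static Map<String, String> metaTags(String xhtml) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = META.matcher(xhtml);
        while (m.find()) {
            tags.put(m.group(1), m.group(2));
        }
        return tags;
    }

    /** True when the output says the file was handled by EmptyParser. */
    static boolean fellBackToEmptyParser(String xhtml) {
        return "org.apache.tika.parser.EmptyParser"
            .equals(metaTags(xhtml).get("X-Parsed-By"));
    }

    public static void main(String[] args) {
        // The two meta lines quoted in this thread.
        String output =
            "<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.EmptyParser\"/>\n"
          + "<meta name=\"Content-Type\" content=\"video/x-matroska\"/>";
        System.out.println("detected type: " + metaTags(output).get("Content-Type"));
        System.out.println("no parser matched: " + fellBackToEmptyParser(output));
    }
}
```

A detected type combined with EmptyParser suggests that detection worked but no registered parser claims that media type, which is why the next step is to list the parsers.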
Re: External parser
Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 01:34:00PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> >> project conventions
> >
> > I'm not familiar with java. So I need to create org/apache/tika/parser/external/ dir
> > and copy there tika-external-parsers.xml?
>
> + add that to your classpath, so java finds it. (Long term putting it in a
> jar might be best)
>
> >> tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java
> >
> > I use version 1.5 and this directory is missing
>
> I'd suggest trying a newer version of Tika
In 1.6 this file is missing too.
>
> >> Not on my version of Tika it doesn't...
> >>
> >> If you edit that file to do so, you still need to provide the external
> >> parsers xml file to Tika in the right place, so Tika will find it
> >>
> >
> > http://pastebin.com/Ug1ebdWd
>
> You're not adding the current directory to your classpath, so when you run
> the tika app it isn't picking up the xml file
>
> Running a jar with additional things on the classpath is a bit fiddly:
> http://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
>
> You'd basically need to do something like
> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
>
http://pastebin.com/wSgwFva3
K
Re: External parser
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
>> project conventions
>
> I'm not familiar with java. So I need to create org/apache/tika/parser/external/ dir
> and copy there tika-external-parsers.xml?
+ add that to your classpath, so java finds it. (Long term putting it in a
jar might be best)
>> tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java
>
> I use version 1.5 and this directory is missing
I'd suggest trying a newer version of Tika
>> Not on my version of Tika it doesn't...
>>
>> If you edit that file to do so, you still need to provide the external
>> parsers xml file to Tika in the right place, so Tika will find it
>>
>
> http://pastebin.com/Ug1ebdWd
You're not adding the current directory to your classpath, so when you run
the tika app it isn't picking up the xml file
Running a jar with additional things on the classpath is a bit fiddly:
http://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
You'd basically need to do something like
java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
Nick
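To illustrate why the directory has to be on the classpath, here is a small Java sketch using only the standard library. It builds a class loader over a temporary directory and looks the resource up before and after the file is placed under org/apache/tika/parser/external/. The class name ClasspathDirDemo and the placeholder file content are assumptions for illustration; the content is not a real Tika configuration, and the sketch does not call Tika at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Tika.
final class ClasspathDirDemo {
    static final String RESOURCE =
        "org/apache/tika/parser/external/tika-external-parsers.xml";

    /** True if RESOURCE can be found when `dir` is the only classpath entry. */
    static boolean visibleFrom(Path dir) {
        // Parent is null, so only bootstrap classes and `dir` are searched.
        try (URLClassLoader loader =
                 new URLClassLoader(new URL[] { dir.toUri().toURL() }, null)) {
            return loader.getResource(RESOURCE) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {visible before the file exists, visible after it is created}. */
    static boolean[] demo() {
        try {
            Path dir = Files.createTempDirectory("tika-cp");
            boolean before = visibleFrom(dir);
            Path xml = dir.resolve(RESOURCE);
            Files.createDirectories(xml.getParent());
            // Placeholder content only; a real file would describe the external tool.
            Files.write(xml, "<placeholder/>".getBytes(StandardCharsets.UTF_8));
            boolean after = visibleFrom(dir);
            return new boolean[] { before, after };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("before: " + result[0]);
        System.out.println("after: " + result[1]);
    }
}
```

The same rule applies to the command line: with "-classpath tika-app.jar:." the current directory plays the role of the temporary directory here, so the XML file must sit at ./org/apache/tika/parser/external/ relative to where the command is run.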
Re: External parser
Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 01:05:03PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> > On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote:
> >> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> >> All you should need to do is provide a tika-external-parsers.xml file on
> >> your classpath (in the appropriate directory), which defines how to talk
> >> to your command line tool. Tika will find that and wire it up to the
> >> external parser for you
> >
> > where is the appropriate directory?
> > # find . -name tika-external-parsers.xml
> > ./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml
> > ./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml
>
> It's org/apache/tika/parser/external/ - the rest before is just maven
> project conventions
I'm not familiar with Java. So I need to create the org/apache/tika/parser/external/ directory
and copy tika-external-parsers.xml there?
>
> >> Tika has tests for the external parser included in it, you can try
> >> looking at those for inspiration
> >
> > I can not find it
>
> tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java
I use version 1.5 and this directory is missing
>
> > sample tika-external-parsers.xml include ffmpeg for 'video/x-matroska'
>
> Not on my version of Tika it doesn't...
>
> If you edit that file to do so, you still need to provide the external
> parsers xml file to Tika in the right place, so Tika will find it
>
http://pastebin.com/Ug1ebdWd
K
Re: External parser
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
> On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote:
>> On Tue, 14 Oct 2014, Kamil Żyta wrote:
>> All you should need to do is provide a tika-external-parsers.xml file on
>> your classpath (in the appropriate directory), which defines how to talk
>> to your command line tool. Tika will find that and wire it up to the
>> external parser for you
>
> where is the appropriate directory?
> # find . -name tika-external-parsers.xml
> ./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml
> ./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml
It's org/apache/tika/parser/external/ - the rest before is just maven
project conventions
>> Tika has tests for the external parser included in it, you can try
>> looking at those for inspiration
>
> I can not find it
tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java
> sample tika-external-parsers.xml include ffmpeg for 'video/x-matroska'
Not on my version of Tika it doesn't...
If you edit that file to do so, you still need to provide the external
parsers xml file to Tika in the right place, so Tika will find it
Nick
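As a sketch of the point about Maven conventions: the classpath name of a resource is what remains after the Maven source or output root is stripped from the path. The Java helper below is hypothetical (MavenResourcePath and toClasspathName are names invented for this example) and assumes the standard Maven roots listed in the array.

```java
// Hypothetical helper, not part of Tika or Maven.
final class MavenResourcePath {
    // Standard Maven resource and output roots (assumed default layout).
    private static final String[] MAVEN_ROOTS = {
        "src/main/resources/",
        "src/test/resources/",
        "target/classes/",
        "target/test-classes/"
    };

    /** Strips everything up to and including a Maven root, if one is present. */
    static String toClasspathName(String path) {
        String normalized = path.replace('\\', '/');
        for (String root : MAVEN_ROOTS) {
            int i = normalized.indexOf(root);
            if (i >= 0) {
                return normalized.substring(i + root.length());
            }
        }
        return normalized; // no known root found, return the path unchanged
    }

    public static void main(String[] args) {
        // The two paths reported by the find command in this thread.
        System.out.println(toClasspathName(
            "./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml"));
        System.out.println(toClasspathName(
            "./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml"));
    }
}
```

Both paths reduce to org/apache/tika/parser/external/tika-external-parsers.xml, which is the location the file needs relative to a classpath entry.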
Re: javax.mail and WADL gen dependencies in Tika JAX-RS server
Posted by Sergey Beryozkin <sb...@gmail.com>.
On 14/10/14 13:11, Sergey Beryozkin wrote:
> Hi
>
> I've updated the server to depend on CXF 3.0.2.
>
> The JAX-RS frontend in CXF 3.0.2 does not have a javax.mail dependency
> so I added it directly to the tika-server POM. javax.mail dependency is
> used only for a basic check of Content-Disposition header - and we have
> a utility code for that in CXF too. I can drop this dependency and
> replace with a CXF equivalent.
I guess it can be kept as is; the code would be a bit more portable.
>
> The other minor issue is that there's a test there checking if the
> server can generate a WADL document. Do we really need it given that
> Tika Server has its own mechanism for describing the server ? If no then
> I can delete the WADL test and drop one more dependency (on the CXF
> module which generates WADL)
>
I've given the CXF rt/rs/service/descriptions dependency test scope, so the
server has one less dependency. It can always be moved back to the
runtime scope should WADL generation become of interest. That module
also has some initial Swagger support, but it is pretty basic on its own
for now.
Thanks, Sergey
> Cheers, Sergey
--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/
Blog: http://sberyozkin.blogspot.com
javax.mail and WADL gen dependencies in Tika JAX-RS server
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
I've updated the server to depend on CXF 3.0.2.
The JAX-RS frontend in CXF 3.0.2 does not have a javax.mail dependency
so I added it directly to the tika-server POM. javax.mail dependency is
used only for a basic check of the Content-Disposition header - and we have
utility code for that in CXF too. I can drop this dependency and
replace with a CXF equivalent.
The other minor issue is that there's a test there checking if the
server can generate a WADL document. Do we really need it given that
Tika Server has its own mechanism for describing the server? If not, then
I can delete the WADL test and drop one more dependency (on the CXF
module which generates WADL)
Cheers, Sergey
Re: External parser
Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> > I want to use external parser but on web there isn't complex
> > howto/tutorial. I only found parser/external/tika-external-parsers.xml
> > sample configuration but I don't know how to register/enable this parser
> > in tika parsers.
>
> All you should need to do is provide a tika-external-parsers.xml file on
> your classpath (in the appropriate directory), which defines how to talk
> to your command line tool. Tika will find that and wire it up to the
> external parser for you
Where is the appropriate directory?
# find . -name tika-external-parsers.xml
./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml
./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml
>
> Tika has tests for the external parser included in it, you can try looking
> at those for inspiration
I cannot find them.
>
> You can also look at the ffmpeg plugin for another example, that's based
> on the external parser - https://github.com/AlfrescoLabs/tika-ffmpeg
>
The sample tika-external-parsers.xml includes ffmpeg for 'video/x-matroska', but when I run:
# java -jar tika-app/target/tika-app-*.jar ~/xvid_480p_as_l5_1mbps_he-aac_foreign_subs_matrix.mkv
I get only resourceName, Content-Length and Content-Type.
K
Re: External parser
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
> I want to use external parser but on web there isn't complex
> howto/tutorial. I only found parser/external/tika-external-parsers.xml
> sample configuration but I don't know how to register/enable this parser
> in tika parsers.
All you should need to do is provide a tika-external-parsers.xml file on
your classpath (in the appropriate directory), which defines how to talk
to your command line tool. Tika will find that and wire it up to the
external parser for you
Tika has tests for the external parser included in it, you can try looking
at those for inspiration
You can also look at the ffmpeg plugin for another example, that's based
on the external parser - https://github.com/AlfrescoLabs/tika-ffmpeg
Nick
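One way to confirm that the XML file really is on the classpath, before involving Tika at all, is to ask the class loader for it. The following Java sketch uses only the standard library; the class name ExternalParsersConfigCheck is hypothetical, and the resource name is the one given in this thread.

```java
import java.net.URL;

// Hypothetical diagnostic class, not part of Tika.
final class ExternalParsersConfigCheck {
    static final String RESOURCE =
        "org/apache/tika/parser/external/tika-external-parsers.xml";

    /** Returns the URL of the resource on the current classpath, or null. */
    static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // Resource names passed to a ClassLoader have no leading slash.
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        URL found = locate(RESOURCE);
        if (found == null) {
            System.out.println("not on classpath: " + RESOURCE);
        } else {
            System.out.println("found at: " + found);
        }
    }
}
```

Running this with the same -classpath value intended for TikaCLI shows whether the file is reachable and from which entry it would be loaded. A result that points inside the Tika jar rather than the local directory would suggest the local copy is not the one being picked up.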