Posted to user@tika.apache.org by Kamil Żyta <ka...@pwr.edu.pl> on 2014/10/14 12:55:36 UTC

External parser

Hi,
I want to use the external parser, but I can't find a complete howto/tutorial on the web.
I only found the sample configuration parser/external/tika-external-parsers.xml,
but I don't know how to register/enable this parser among the Tika parsers.

I would be thankful for any help.

regards,
KŻ


Re: External parser

Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 04:01:43PM +0200, Kamil Żyta wrote:
> On Tue, Oct 14, 2014 at 02:46:07PM +0100, Nick Burch wrote:
> > On Tue, 14 Oct 2014, Kamil Żyta wrote:
> > >> You'd basically need to do something like
> > >> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
> > >>
> > >
> > > http://pastebin.com/wSgwFva3
> > 
> > Key there is
> >    <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
> > 
> > Can you try asking the Tika CLI what it detects your file as, and what 
> > parsers it thinks are present?
> 
> <meta name="Content-Type" content="video/x-matroska"/>
> 
> # java -classpath tika-app-1.6.jar:. org.apache.tika.cli.TikaCLI --list-parsers
>    org.apache.tika.parser.AutoDetectParser (Composite Parser):
>       org.apache.tika.parser.DefaultParser (Composite Parser):
>          org.apache.tika.parser.asm.ClassParser
>          org.apache.tika.parser.audio.AudioParser
>          org.apache.tika.parser.audio.MidiParser
>          org.apache.tika.parser.chm.ChmParser
>          org.apache.tika.parser.code.SourceCodeParser
>          org.apache.tika.parser.crypto.Pkcs7Parser
>          org.apache.tika.parser.dwg.DWGParser
>          org.apache.tika.parser.epub.EpubParser
>          org.apache.tika.parser.executable.ExecutableParser
>          org.apache.tika.parser.feed.FeedParser
>          org.apache.tika.parser.font.AdobeFontMetricParser
>          org.apache.tika.parser.font.TrueTypeParser
>          org.apache.tika.parser.hdf.HDFParser
>          org.apache.tika.parser.html.HtmlParser
>          org.apache.tika.parser.image.ImageParser
>          org.apache.tika.parser.image.PSDParser
>          org.apache.tika.parser.image.TiffParser
>          org.apache.tika.parser.iptc.IptcAnpaParser
>          org.apache.tika.parser.iwork.IWorkPackageParser
>          org.apache.tika.parser.jpeg.JpegParser
>          org.apache.tika.parser.mail.RFC822Parser
>          org.apache.tika.parser.mat.MatParser
>          org.apache.tika.parser.mbox.MboxParser
>          org.apache.tika.parser.mbox.OutlookPSTParser
>          org.apache.tika.parser.microsoft.OfficeParser
>          org.apache.tika.parser.microsoft.TNEFParser
>          org.apache.tika.parser.microsoft.ooxml.OOXMLParser
>          org.apache.tika.parser.mp3.Mp3Parser
>          org.apache.tika.parser.mp4.MP4Parser
>          org.apache.tika.parser.netcdf.NetCDFParser
>          org.apache.tika.parser.odf.OpenDocumentParser
>          org.apache.tika.parser.pdf.PDFParser
>          org.apache.tika.parser.pkg.CompressorParser
>          org.apache.tika.parser.pkg.PackageParser
>          org.apache.tika.parser.rtf.RTFParser
>          org.apache.tika.parser.txt.TXTParser
>          org.apache.tika.parser.video.FLVParser
>          org.apache.tika.parser.xml.DcXMLParser
>          org.apache.tika.parser.xml.FictionBookParser
>          org.gagravarr.tika.FlacParser
>          org.gagravarr.tika.OggParser
>          org.gagravarr.tika.OpusParser
>          org.gagravarr.tika.SpeexParser
>          org.gagravarr.tika.VorbisParser

Can anyone help me with this?

K

Re: External parser

Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 02:46:07PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> >> You'd basically need to do something like
> >> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
> >>
> >
> > http://pastebin.com/wSgwFva3
> 
> Key there is
>    <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
> 
> Can you try asking the Tika CLI what it detects your file as, and what 
> parsers it thinks are present?

<meta name="Content-Type" content="video/x-matroska"/>

# java -classpath tika-app-1.6.jar:. org.apache.tika.cli.TikaCLI --list-parsers
   org.apache.tika.parser.AutoDetectParser (Composite Parser):
      org.apache.tika.parser.DefaultParser (Composite Parser):
         org.apache.tika.parser.asm.ClassParser
         org.apache.tika.parser.audio.AudioParser
         org.apache.tika.parser.audio.MidiParser
         org.apache.tika.parser.chm.ChmParser
         org.apache.tika.parser.code.SourceCodeParser
         org.apache.tika.parser.crypto.Pkcs7Parser
         org.apache.tika.parser.dwg.DWGParser
         org.apache.tika.parser.epub.EpubParser
         org.apache.tika.parser.executable.ExecutableParser
         org.apache.tika.parser.feed.FeedParser
         org.apache.tika.parser.font.AdobeFontMetricParser
         org.apache.tika.parser.font.TrueTypeParser
         org.apache.tika.parser.hdf.HDFParser
         org.apache.tika.parser.html.HtmlParser
         org.apache.tika.parser.image.ImageParser
         org.apache.tika.parser.image.PSDParser
         org.apache.tika.parser.image.TiffParser
         org.apache.tika.parser.iptc.IptcAnpaParser
         org.apache.tika.parser.iwork.IWorkPackageParser
         org.apache.tika.parser.jpeg.JpegParser
         org.apache.tika.parser.mail.RFC822Parser
         org.apache.tika.parser.mat.MatParser
         org.apache.tika.parser.mbox.MboxParser
         org.apache.tika.parser.mbox.OutlookPSTParser
         org.apache.tika.parser.microsoft.OfficeParser
         org.apache.tika.parser.microsoft.TNEFParser
         org.apache.tika.parser.microsoft.ooxml.OOXMLParser
         org.apache.tika.parser.mp3.Mp3Parser
         org.apache.tika.parser.mp4.MP4Parser
         org.apache.tika.parser.netcdf.NetCDFParser
         org.apache.tika.parser.odf.OpenDocumentParser
         org.apache.tika.parser.pdf.PDFParser
         org.apache.tika.parser.pkg.CompressorParser
         org.apache.tika.parser.pkg.PackageParser
         org.apache.tika.parser.rtf.RTFParser
         org.apache.tika.parser.txt.TXTParser
         org.apache.tika.parser.video.FLVParser
         org.apache.tika.parser.xml.DcXMLParser
         org.apache.tika.parser.xml.FictionBookParser
         org.gagravarr.tika.FlacParser
         org.gagravarr.tika.OggParser
         org.gagravarr.tika.OpusParser
         org.gagravarr.tika.SpeexParser
         org.gagravarr.tika.VorbisParser

K

Re: External parser

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
>> You'd basically need to do something like
>> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
>>
>
> http://pastebin.com/wSgwFva3

Key there is
   <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>

Can you try asking the Tika CLI what it detects your file as, and what 
parsers it thinks are present?
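
The two checks asked for here can be sketched as shell functions. This is only a sketch: the jar name is an assumption taken from elsewhere in this thread, and the function names are made up for illustration. `--detect` and `--list-parsers` are Tika CLI options. An X-Parsed-By value of EmptyParser generally means that no parser claimed the detected type.

```shell
#!/bin/sh
# Assumed jar location; override TIKA_JAR to match your setup.
TIKA_JAR="${TIKA_JAR:-tika-app-1.6.jar}"

# Ask Tika which media type it detects for a file.
# The current directory is on the classpath so a local config tree is visible.
tika_detect() {
    java -classpath "$TIKA_JAR:." org.apache.tika.cli.TikaCLI --detect "$1"
}

# Ask Tika which parsers it has registered.
tika_list_parsers() {
    java -classpath "$TIKA_JAR:." org.apache.tika.cli.TikaCLI --list-parsers
}
```

Usage would be `tika_detect movie.mkv` (file name hypothetical) followed by `tika_list_parsers`.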

Nick

Re: External parser

Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 01:34:00PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> >> project conventions
> >
> > I'm not familiar with java. So I need to create org/apache/tika/parser/external/ dir
> > and copy there tika-external-parsers.xml?
> 
> + add that to your classpath, so java finds it. (Long term putting it in a 
> jar might be best)
> 
> >> tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java
> >
> > I use version 1.5 and this directory is missing
> 
> I'd suggest trying a newer version of Tika

In 1.6 this file is missing too.

> 
> >> Not on my version of Tika it doesn't...
> >>
> >> If you edit that file to do so, you still need to provide the external
> >> parsers xml file to Tika in the right place, so Tika will find it
> >>
> >
> > http://pastebin.com/Ug1ebdWd
> 
> You're not adding the current directory to your classpath, so when you run 
> the tika app it isn't picking up the xml file
> 
> Running a jar with additional things on the classpath is a bit fiddly:
> http://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
> 
> You'd basically need to do something like
> java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI
> 

http://pastebin.com/wSgwFva3

K

Re: External parser

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
>> project conventions
>
> I'm not familiar with java. So I need to create org/apache/tika/parser/external/ dir
> and copy there tika-external-parsers.xml?

+ add that to your classpath, so java finds it. (Long term putting it in a 
jar might be best)
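
A minimal sketch of that setup, assuming you are in the directory you will run Tika from and that the sample config sits at the path found earlier in this thread. The SAMPLE variable and the jar name in the closing comment are hypothetical.

```shell
#!/bin/sh
# Assumption: SAMPLE points at the sample config inside a Tika source checkout.
SAMPLE="${SAMPLE:-tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml}"

# The directory must mirror the Java package name so the classloader finds it.
PKG_DIR="org/apache/tika/parser/external"
mkdir -p "$PKG_DIR"

# Copy the sample config into place if it is available.
if [ -f "$SAMPLE" ]; then
    cp "$SAMPLE" "$PKG_DIR/tika-external-parsers.xml"
fi

# Longer term the same tree could be packed into a jar, for example:
#   jar cf my-external-parsers.jar org
```

With the tree in place, putting `.` on the classpath should make the file visible to Java. Note that `java -jar` ignores `-classpath`, which is why the main class has to be named explicitly.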

>> tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java
>
> I use version 1.5 and this directory is missing

I'd suggest trying a newer version of Tika

>> Not on my version of Tika it doesn't...
>>
>> If you edit that file to do so, you still need to provide the external
>> parsers xml file to Tika in the right place, so Tika will find it
>>
>
> http://pastebin.com/Ug1ebdWd

You're not adding the current directory to your classpath, so when you run 
the tika app it isn't picking up the xml file

Running a jar with additional things on the classpath is a bit fiddly:
http://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option

You'd basically need to do something like
java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI

Nick

Re: External parser

Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 01:05:03PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> > On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote:
> >> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> >> All you should need to do is provide a tika-external-parsers.xml file on
> >> your classpath (in the appropriate directory), which defines how to talk
> >> to your command line tool. Tika will find that and wire it up to the
> >> external parser for you
> >
> > where is the appropriate directory?
> > # find . -name tika-external-parsers.xml
> > ./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml
> > ./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml
> 
> It's org/apache/tika/parser/external/ - the rest before is just maven 
> project conventions

I'm not familiar with Java. So I need to create an org/apache/tika/parser/external/ directory
and copy tika-external-parsers.xml there?

> 
> >> Tika has tests for the external parser included in it, you can try 
> >> looking at those for inspiration
> >
> > I can not find it
> 
> tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java

I use version 1.5, and this directory is missing there.

> 
> > sample tika-external-parsers.xml include ffmpeg for 'video/x-matroska'
> 
> Not on my version of Tika it doesn't...
> 
> If you edit that file to do so, you still need to provide the external 
> parsers xml file to Tika in the right place, so Tika will find it
> 

http://pastebin.com/Ug1ebdWd

K

Re: External parser

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
> On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote:
>> On Tue, 14 Oct 2014, Kamil Żyta wrote:
>> All you should need to do is provide a tika-external-parsers.xml file on
>> your classpath (in the appropriate directory), which defines how to talk
>> to your command line tool. Tika will find that and wire it up to the
>> external parser for you
>
> where is the appropriate directory?
> # find . -name tika-external-parsers.xml
> ./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml
> ./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml

It's org/apache/tika/parser/external/ - the rest before is just maven 
project conventions

>> Tika has tests for the external parser included in it, you can try 
>> looking at those for inspiration
>
> I can not find it

tika-core/src/test/java/org/apache/tika/parser/external/ExternalParserTest.java

> sample tika-external-parsers.xml include ffmpeg for 'video/x-matroska'

Not on my version of Tika it doesn't...

If you edit that file to do so, you still need to provide the external 
parsers xml file to Tika in the right place, so Tika will find it
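
As a rough sketch of such an edit, the script below writes a minimal config that claims video/x-matroska. The element names follow the sample tika-external-parsers.xml; the ffmpeg command line and the regular expression are assumptions for illustration and would need checking against your ffmpeg output.

```shell
#!/bin/sh
# Sketch only: write a minimal external-parser config into the package
# directory under the current directory (overwrites any existing file).
PKG_DIR="org/apache/tika/parser/external"
mkdir -p "$PKG_DIR"

# The quoted 'EOF' keeps the shell from expanding ${INPUT}.
cat > "$PKG_DIR/tika-external-parsers.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<external-parsers>
  <parser>
    <!-- Run first; these exit codes are taken to mean the tool is missing -->
    <check>
      <command>ffmpeg -version</command>
      <error-codes>126,127</error-codes>
    </check>
    <!-- ${INPUT} stands for the path of the file being parsed -->
    <command>ffmpeg -i ${INPUT}</command>
    <mime-types>
      <mime-type>video/x-matroska</mime-type>
    </mime-types>
    <metadata>
      <!-- Assumed pattern: capture group 1 becomes the metadata value -->
      <match key="xmpDM:audioSampleRate">\s*Stream.*:.+,\s+(\d+)\s+Hz,.*</match>
    </metadata>
  </parser>
</external-parsers>
EOF
```

The file still has to be reachable on the classpath under org/apache/tika/parser/external/ for Tika to pick it up.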

Nick

Re: javax.mail and WADL gen dependencies in Tika JAX-RS server

Posted by Sergey Beryozkin <sb...@gmail.com>.
On 14/10/14 13:11, Sergey Beryozkin wrote:
> Hi
>
> I've updated the server to depend on CXF 3.0.2.
>
> The JAX-RS frontend in CXF 3.0.2 does not have a javax.mail dependency
> so I added it directly to the tika-server POM. javax.mail dependency is
> used only for a basic check of Content-Disposition header - and we have
> a utility code for that in CXF too. I can drop this dependency and
> replace with a CXF equivalent.
I guess it can be kept as is, the code would be a bit more portable
>
> The other minor issue is that there's a test there checking if the
> server can generate a WADL document. Do we really need it given that
> Tika Server has its own mechanism for describing the server ? If no then
> I can delete the WADL test and drop one more dependency (on the CXF
> module which generates WADL)
>
I've moved the CXF rt/rs/service/descriptions module to test scope, so the
server has one less dependency. It can always be moved back to runtime
scope should WADL generation become of interest. That module also has some
initial Swagger support, but it is pretty basic on its own for now.

Thanks, Sergey

> Cheers, Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Blog: http://sberyozkin.blogspot.com

javax.mail and WADL gen dependencies in Tika JAX-RS server

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

I've updated the server to depend on CXF 3.0.2.

The JAX-RS frontend in CXF 3.0.2 does not have a javax.mail dependency,
so I added it directly to the tika-server POM. The javax.mail dependency is
used only for a basic check of the Content-Disposition header, and we have
utility code for that in CXF too. I can drop this dependency and
replace it with a CXF equivalent.

The other minor issue is that there's a test checking whether the
server can generate a WADL document. Do we really need it, given that
Tika Server has its own mechanism for describing the server? If not, then
I can delete the WADL test and drop one more dependency (on the CXF
module which generates WADL).

Cheers, Sergey

Re: External parser

Posted by Kamil Żyta <ka...@pwr.edu.pl>.
On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote:
> On Tue, 14 Oct 2014, Kamil Żyta wrote:
> > I want to use external parser but on web there isn't complex 
> > howto/tutorial. I only found parser/external/tika-external-parsers.xml 
> > sample configuration but I don't know how to register/enable this parser 
> > in tika parsers.
> 
> All you should need to do is provide a tika-external-parsers.xml file on 
> your classpath (in the appropriate directory), which defines how to talk 
> to your command line tool. Tika will find that and wire it up to the 
> external parser for you

Where is the appropriate directory?
# find . -name tika-external-parsers.xml
./tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml
./tika-parsers/target/classes/org/apache/tika/parser/external/tika-external-parsers.xml

> 
> Tika has tests for the external parser included in it, you can try looking 
> at those for inspiration

I can't find them.

> 
> You can also look at the ffmpeg plugin for another example, that's based 
> on the external parser - https://github.com/AlfrescoLabs/tika-ffmpeg
> 

The sample tika-external-parsers.xml includes ffmpeg for 'video/x-matroska', but when I run:

# java -jar tika-app/target/tika-app-*.jar ~/xvid_480p_as_l5_1mbps_he-aac_foreign_subs_matrix.mkv

I only get resourceName, Content-Length and Content-Type.

K

Re: External parser

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 14 Oct 2014, Kamil Żyta wrote:
> I want to use external parser but on web there isn't complex 
> howto/tutorial. I only found parser/external/tika-external-parsers.xml 
> sample configuration but I don't know how to register/enable this parser 
> in tika parsers.

All you should need to do is provide a tika-external-parsers.xml file on 
your classpath (in the appropriate directory), which defines how to talk 
to your command line tool. Tika will find that and wire it up to the 
external parser for you

Tika has tests for the external parser included in it, you can try looking 
at those for inspiration

You can also look at the ffmpeg plugin for another example, that's based 
on the external parser - https://github.com/AlfrescoLabs/tika-ffmpeg

Nick