Posted to dev@tika.apache.org by Arturo Beltran <ar...@uji.es> on 2010/06/17 16:39:17 UTC
Getting started
Hi all,
Some of you already know that I'm working on a new parser
(https://issues.apache.org/jira/browse/TIKA-443). After spending all day
trying to set up a workspace for Eclipse, I implemented the typical "hello
world" class as a Tika Parser. My problem now is how to
configure Tika to call my new parser when a file with a specific
extension (e.g. *.shp) is found. I read something about a configuration
file (tika-config.xml) but I couldn't find it in the source code.
Greetings and thanks in advance
Arturo
--
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es
Re: Getting started
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Jul 2010, Arturo Beltran wrote:
> It might be interesting to write a small manual: "How to create a new Tika
> Parser for Dummies". Simply including the three steps that I have finally
> figured out (new Parser, tika-mimetypes.xml, list the new parser).
The 3rd step is only needed if you want to use the auto detect parser. If
you figure out the correct parser a different way, it isn't needed.
It sounds like a very helpful short document though. The wiki is at
http://wiki.apache.org/tika/ if you fancy writing it up :)
Nick
Re: Getting started
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Arturo,
Working on committing it right now, thanks!
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Re: Getting started
Posted by Arturo Beltran <ar...@uji.es>.
The guide is ready.
It can be found attached at: https://issues.apache.org/jira/browse/TIKA-464
Greetings and have a nice weekend
Arturo
Re: Getting started
Posted by Arturo Beltran <ar...@uji.es>.
No problem, I'll do it.
On 13/07/2010 16:01, Mattmann, Chris A (388J) wrote:
> Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.
>
> Thanks!
>
> Cheers,
> Chris
>
> [1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
> [2] http://maven.apache.org/doxia/references/apt-format.html
--
Arturo Beltran Fonollosa
Geographic Information research group: http://www.geoinfo.uji.es
Centro de Visualización Interactiva (CeVI) http://www.cevi.uji.es
Departamento de Lenguajes y Sistemas Informáticos (LSI)
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es
Re: Getting started
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.
Thanks!
Cheers,
Chris
[1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
[2] http://maven.apache.org/doxia/references/apt-format.html
Re: Getting started
Posted by Arturo Beltran <ar...@uji.es>.
That was my "big" problem all this time; I almost went crazy. Now it
works perfectly. Thank you very much for your help.
It might be interesting to write a small manual: "How to create a new
Tika Parser for Dummies", simply including the three steps that I have
finally figured out (new Parser, tika-mimetypes.xml, list the new parser).
Greetings, and thanks, Nick; it has been a great help.
Re: Getting started
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Jul 2010, Arturo Beltran wrote:
> I'm calling my parser using the Tika-app included, so I think I'm using
> AutoDetectParser.
You have to explicitly tell the AutoDetectParser to try your parser, in
addition to the mime type definition
List your new parser in:
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
and I think it should then be picked up
Nick
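For readers unfamiliar with the mechanism: the path Nick gives follows the standard Java provider-configuration convention, in which a file under META-INF/services/ is named after an interface and lists one fully qualified implementation class per line, with "#" starting a comment. The sketch below only illustrates that file format and the reflective instantiation a service loader performs. It does not use Tika, the names ServiceFileSketch, parseProviderNames and instantiate are made up for this illustration, and JDK classes stand in for parser implementations.

```java
import java.util.ArrayList;
import java.util.List;

final class ServiceFileSketch {

    /** Extracts class names from provider-configuration text, skipping comments and blank lines. */
    static List<String> parseProviderNames(String fileText) {
        List<String> names = new ArrayList<>();
        for (String line : fileText.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    /** Instantiates each listed class through its public no-argument constructor. */
    static List<Object> instantiate(List<String> classNames) {
        List<Object> providers = new ArrayList<>();
        for (String name : classNames) {
            try {
                providers.add(Class.forName(name).getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                // A class that is listed but missing or not instantiable ends up here.
                throw new IllegalStateException("Cannot load provider " + name, e);
            }
        }
        return providers;
    }

    public static void main(String[] args) {
        // JDK classes stand in for parser implementations in this illustration.
        String fileText = "# one provider per line\n"
                + "java.util.ArrayList\n"
                + "\n"
                + "java.lang.StringBuilder  # trailing comment\n";
        for (Object provider : instantiate(parseProviderNames(fileText))) {
            System.out.println(provider.getClass().getName());
        }
    }
}
```

The point for this thread is that a class which compiles into the JAR but is not named in the listing is never instantiated by this kind of lookup, which would match the symptom Arturo describes.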
Re: Getting started
Posted by Arturo Beltran <ar...@uji.es>.
Hi Chris and all,
On 07/07/2010 16:04, Mattmann, Chris A (388J) wrote:
> Hi Arturo,
>
> How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can you put some print statements in the public void parse(...) method of CompositeParser? Specifically, add a line right after:
>
I'm calling my parser using the Tika-app included, so I think I'm using
AutoDetectParser.
>
> Parser parser = getParser(metadata);
> // print out the returned parser
> System.out.println("Parser returned is: ["+parser.getClass().getName()+"]");
>
> What does that return? Also, have you done the work to map your incoming document type in the tika-mimetypes.xml file?
Yes, sure.
> That is, if you're using AutoDetectParser or anything that extends CompositeParser, the mime type of the incoming document is used to determine which parser gets called. Is the mime type being detected appropriately? You can check this by putting a println right before getParser in the parse(...) method:
>
Yes, it returns "application/shp"
> // print the mime type
> System.out.println("The MIME type is: [" + metadata.get(Metadata.CONTENT_TYPE) + "]");
> Parser parser = getParser(metadata);
>
> What does that print out?
>
> Finally if both of these printlns check out, you should check and make sure that your new parser is correctly mapped to the media type it supports, in other words what Ken said below. Does your parser declare that it supports your expected MIME type?
>
Yes, I declared this MIME type in my parser, but the
getSupportedTypes(context) method is never called.
I uploaded a file with the Tika source code that includes my modified
tika-mimetypes.xml file and my new parser, GeoParser.java. Perhaps
one of you could try it and find out where I went wrong.
Here is the link: http://elcano.dlsi.uji.es/arturo/tika_geo.zip
Greetings and thanks in advance for your help,
Arturo
Re: Getting started
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Arturo,
How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can you put some print statements in the public void parse(...) method of CompositeParser? Specifically, add a line right after:
Parser parser = getParser(metadata);
// print out the returned parser
System.out.println("Parser returned is: [" + parser.getClass().getName() + "]");
What does that return? Also, have you done the work to map your incoming document type in the tika-mimetypes.xml file? That is, if you're using AutoDetectParser or anything that extends CompositeParser, the mime type of the incoming document is used to determine which parser gets called. Is the mime type being detected appropriately? You can check this by putting a println right before getParser in the parse(...) method:
// print the mime type
System.out.println("The MIME type is: [" + metadata.get(Metadata.CONTENT_TYPE) + "]");
Parser parser = getParser(metadata);
What does that print out?
Finally, if both of these printlns check out, you should check and make sure that your new parser is correctly mapped to the media type it supports, in other words what Ken said below. Does your parser declare that it supports your expected MIME type?
Let me know and thanks!
Cheers,
Chris
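The lookup Chris describes, where the detected mime type decides which parser runs, can be pictured with a small stand-alone sketch. This is not Tika's CompositeParser code: MiniParser, index and choose are hypothetical names invented here, and media types are plain strings. It only shows the idea of a type-to-parser table built from the parsers that were actually registered.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DispatchSketch {

    /** Hypothetical stand-in for a parser: a name plus the media types it claims. */
    interface MiniParser {
        String name();
        Set<String> supportedTypes();
    }

    /** Creates a stand-in parser that claims exactly one media type. */
    static MiniParser parser(final String name, final String type) {
        return new MiniParser() {
            public String name() { return name; }
            public Set<String> supportedTypes() { return Collections.singleton(type); }
        };
    }

    /** Builds the type-to-parser table from the registered parsers only. */
    static Map<String, MiniParser> index(List<MiniParser> registered) {
        Map<String, MiniParser> byType = new HashMap<>();
        for (MiniParser p : registered) {
            for (String type : p.supportedTypes()) {
                byType.put(type, p);
            }
        }
        return byType;
    }

    /** Returns the name of the parser chosen for a detected type, or "fallback" if none claims it. */
    static String choose(Map<String, MiniParser> byType, String detectedType) {
        MiniParser p = byType.get(detectedType);
        return p == null ? "fallback" : p.name();
    }

    public static void main(String[] args) {
        // Only an mbox parser is registered; nothing claims application/shp.
        Map<String, MiniParser> table =
                index(Arrays.asList(parser("MboxParser", "application/mbox")));
        System.out.println(choose(table, "application/mbox"));
        System.out.println(choose(table, "application/shp"));
    }
}
```

In this model a correctly detected type is not enough: if no registered parser claims it, the lookup falls through, which is why both printlns above are worth checking.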
Re: Getting started
Posted by Arturo Beltran <ar...@uji.es>.
Hi,
I still have the same problem.
I think everything is in order: I run "mvn install" and my new class is
included in the generated JAR, but it is never called.
It should be very simple. I feel a little silly. I don't know how to
get Tika to find my new parser.
Thanks in advance
Arturo
On 21/06/2010 19:04, Ken Krugler wrote:
> Are you sure your new parser is on the classpath?
>
> E.g. put a break on getSupportedTypes() and make sure that's getting
> called - if not, then the parser isn't being "found" by Tika.
>
> -- Ken
>
> On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
>
>> Hi Ken,
>>
>> First of all, thanks for your quick response.
>> This is exactly what I'm doing, but although Tika recognizes the
>> new MIME type, my new parser is not called.
>>
>> I added this to tika-mimetypes.xml:
>>
>>   <mime-type type="application/shp">
>>     <!--sub-class-of type="application/octet-stream"/-->
>>     <glob pattern="*.shp"/>
>>   </mime-type>
>>
>> I created a new class GeoParser:
>>
>> public class GeoParser implements Parser {
>>
>>     private static final Set<MediaType> SUPPORTED_TYPES =
>>         Collections.singleton(MediaType.application("shp"));
>>     public static final String SHP_MIME_TYPE = "application/shp";
>>
>>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>>         return SUPPORTED_TYPES;
>>     }
>>
>>     public void parse(
>>             InputStream stream, ContentHandler handler,
>>             Metadata metadata, ParseContext context)
>>             throws IOException, SAXException, TikaException {
>>
>>         metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>>         metadata.set("Hello", "World");
>>
>>         System.out.println("HELLO WORLD");
>>         System.err.println("ERR Hello world");
>>
>>         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
>>         xhtml.startDocument();
>>         xhtml.endDocument();
>>     }
>>     ...
>> }
>>
>> And that's the result:
>>
>> Content-Length: 755072
>> Content-Type: application/shp
>> resourceName: comarques250.shp
>>
>> I don't know what exactly is failing, but I can't make it work.
>>
>> Greetings and thanks in advance for your help.
>> Arturo
>>
>>
>> On 17/06/2010 18:25, Ken Krugler wrote:
>>> Hi Arturo,
>>>
>>>> Some of you already know that I'm working on a new parser
>>>> (https://issues.apache.org/jira/browse/TIKA-443). After spending all
>>>> day trying to set up a workspace for Eclipse, I implemented the typical
>>>> "hello world" class as a Tika Parser. My problem now is
>>>> how to configure Tika to call my new parser when a file
>>>> with a specific extension (e.g. *.shp) is found. I read something
>>>> about a configuration file (tika-config.xml) but I couldn't find it
>>>> in the source code.
>>>
>>> You first need to modify
>>> tika-core/src/main/resources/tika-mimetypes.xml.
>>>
>>> E.g. something like this was done for mailbox files.
>>>
>>> <mime-type type="application/mbox">
>>>   <sub-class-of type="text/plain"/>
>>>   <glob pattern="*.mbox"/>
>>> </mime-type>
>>>
>>> That maps the suffix to the mime-type.
>>>
>>> Then you define the SUPPORTED_TYPES static class field in your
>>> parser class that defines what mime-types it supports.
>>>
>>> E.g. for MboxParser:
>>>
>>> public class MboxParser implements Parser {
>>>
>>>     private static final Set<MediaType> SUPPORTED_TYPES =
>>>         Collections.singleton(MediaType.application("mbox"));
>>>
>>>
>>> -- Ken
Re: Getting started
Posted by Ken Krugler <kk...@transpac.com>.
Are you sure your new parser is on the classpath?
E.g. put a breakpoint on getSupportedTypes() and make sure that's getting
called - if not, then the parser isn't being "found" by Tika.
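One quick way to check the classpath question without a debugger is a small standalone class like the one below. It is only a sketch: ClasspathCheck and the class name org.example.GeoParser are placeholders made up for this example, not anything from Tika. It only tells you whether the class can be loaded at all; it says nothing about whether Tika has been told to use it.

```java
// Hypothetical helper, not part of Tika: checks whether a class can be
// loaded by name from the current classpath.
final class ClasspathCheck {

    // Returns true if the named class is visible to this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "org.example.GeoParser" is a placeholder; pass your parser's real name instead.
        String name = args.length > 0 ? args[0] : "org.example.GeoParser";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

Run it with the same classpath you use to start Tika, passing the fully qualified name of your parser class as the only argument.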
-- Ken
On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
> Hi Ken,
>
> First of all, thanks for your quick response.
> This is exactly what I'm doing, but although Tika recognizes the
> new MIME type, my new parser is not called.
>
> I added to tika-mimetypes.xml:
>
> <mime-type type="application/shp">
>   <!--sub-class-of type="application/octet-stream"/-->
>   <glob pattern="*.shp"/>
> </mime-type>
>
> I created a new class GeoParser:
>
> public class GeoParser implements Parser {
>
>     private static final Set<MediaType> SUPPORTED_TYPES =
>         Collections.singleton(MediaType.application("shp"));
>     public static final String SHP_MIME_TYPE = "application/shp";
>
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>     }
>
>     public void parse(
>             InputStream stream, ContentHandler handler,
>             Metadata metadata, ParseContext context)
>             throws IOException, SAXException, TikaException {
>
>         metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>         metadata.set("Hello", "World");
>
>         System.out.println("HELLO WORLD");
>         System.err.println("ERR Hello world");
>
>         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
>         xhtml.startDocument();
>         xhtml.endDocument();
>     }
>     ...
> }
>
> And that's the result:
>
> Content-Length: 755072
> Content-Type: application/shp
> resourceName: comarques250.shp
>
> I don't know what exactly is failing, but I can't make it work.
>
> Greetings and thanks in advance for your help.
> Arturo
>
>
> On 17/06/2010 18:25, Ken Krugler wrote:
>> Hi Arturo,
>>
>>> Some of you already know that I'm working on a new parser
>>> (https://issues.apache.org/jira/browse/TIKA-443). After spending
>>> all day trying to set up a workspace for Eclipse, I implemented a
>>> typical "hello world" class as a Tika Parser. My problem now is
>>> how to configure Tika to call my new parser when a file with a
>>> specific extension (e.g. *.shp) is found. I read something about
>>> a configuration file (tika-config.xml) but I couldn't find it in
>>> the source code.
>>
>> You first need to modify
>> tika-core/src/main/resources/tika-mimetypes.xml.
>>
>> E.g. something like this was done for mailbox files.
>>
>> <mime-type type="application/mbox">
>>   <sub-class-of type="text/plain"/>
>>   <glob pattern="*.mbox"/>
>> </mime-type>
>>
>> That maps the suffix to the mime-type.
>>
>> Then you define the SUPPORTED_TYPES static class field in your
>> parser class that defines what mime-types it supports.
>>
>> E.g. for MboxParser:
>>
>> public class MboxParser implements Parser {
>>
>>     private static final Set<MediaType> SUPPORTED_TYPES =
>>         Collections.singleton(MediaType.application("mbox"));
>>
>>
>> -- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Getting started
Posted by Arturo Beltran <ar...@uji.es>.
Hi Ken,
First of all, thanks for your quick response.
This is exactly what I'm doing, but although Tika recognizes the new
MIME type, my new parser is not called.
I added to tika-mimetypes.xml:
<mime-type type="application/shp">
  <!--sub-class-of type="application/octet-stream"/-->
  <glob pattern="*.shp"/>
</mime-type>
I created a new class GeoParser:
public class GeoParser implements Parser {

    private static final Set<MediaType> SUPPORTED_TYPES =
        Collections.singleton(MediaType.application("shp"));
    public static final String SHP_MIME_TYPE = "application/shp";

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {

        metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
        metadata.set("Hello", "World");

        System.out.println("HELLO WORLD");
        System.err.println("ERR Hello world");

        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.endDocument();
    }
    ...
}
And that's the result:
Content-Length: 755072
Content-Type: application/shp
resourceName: comarques250.shp
I don't know what exactly is failing, but I can't make it work.
Greetings and thanks in advance for your help.
Arturo
On 17/06/2010 18:25, Ken Krugler wrote:
> Hi Arturo,
>
>> Some of you already know that I'm working on a new parser
>> (https://issues.apache.org/jira/browse/TIKA-443). After spending all
>> day trying to set up a workspace for Eclipse, I implemented a typical
>> "hello world" class as a Tika Parser. My problem now is how to
>> configure Tika to call my new parser when a file with a specific
>> extension (e.g. *.shp) is found. I read something about a
>> configuration file (tika-config.xml) but I couldn't find it in the
>> source code.
>
> You first need to modify tika-core/src/main/resources/tika-mimetypes.xml.
>
> E.g. something like this was done for mailbox files.
>
> <mime-type type="application/mbox">
>   <sub-class-of type="text/plain"/>
>   <glob pattern="*.mbox"/>
> </mime-type>
>
> That maps the suffix to the mime-type.
>
> Then you define the SUPPORTED_TYPES static class field in your parser
> class that defines what mime-types it supports.
>
> E.g. for MboxParser:
>
> public class MboxParser implements Parser {
>
>     private static final Set<MediaType> SUPPORTED_TYPES =
>         Collections.singleton(MediaType.application("mbox"));
>
>
> -- Ken
Re: Getting started
Posted by Ken Krugler <kk...@transpac.com>.
Hi Arturo,
> Some of you already know that I'm working on a new parser
> (https://issues.apache.org/jira/browse/TIKA-443). After spending all
> day trying to set up a workspace for Eclipse, I implemented a typical
> "hello world" class as a Tika Parser. My problem now is how to
> configure Tika to call my new parser when a file with a specific
> extension (e.g. *.shp) is found. I read something about a
> configuration file (tika-config.xml) but I couldn't find it in the
> source code.
You first need to modify tika-core/src/main/resources/tika-mimetypes.xml.
E.g. something like this was done for mailbox files.
<mime-type type="application/mbox">
  <sub-class-of type="text/plain"/>
  <glob pattern="*.mbox"/>
</mime-type>
That maps the suffix to the mime-type.
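As a rough illustration of what such a glob entry does, here is a toy lookup table in plain Java. GlobTable is a made-up name and this is not how Tika implements it internally; it only shows the idea of going from a file suffix to a mime-type. The application/octet-stream fallback for unknown suffixes is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: a toy glob table, not Tika's own mime-type registry.
final class GlobTable {
    // Maps a lower-cased suffix such as ".shp" to a mime-type string.
    private final Map<String, String> suffixToType = new LinkedHashMap<>();

    // Registers a "*.ext" style glob for a mime-type.
    void addGlob(String glob, String mediaType) {
        if (!glob.startsWith("*.")) {
            throw new IllegalArgumentException("only *.ext globs are handled here: " + glob);
        }
        suffixToType.put(glob.substring(1).toLowerCase(Locale.ROOT), mediaType);
    }

    // Returns the mime-type of the first matching glob, or the fallback type.
    String typeFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : suffixToType.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream"; // assumed fallback for this sketch
    }

    public static void main(String[] args) {
        GlobTable table = new GlobTable();
        table.addGlob("*.mbox", "application/mbox");
        table.addGlob("*.shp", "application/shp");
        System.out.println(table.typeFor("comarques250.shp"));
    }
}
```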
Then you define the SUPPORTED_TYPES static class field in your parser
class that defines what mime-types it supports.
E.g. for MboxParser:
public class MboxParser implements Parser {
    private static final Set<MediaType> SUPPORTED_TYPES =
        Collections.singleton(MediaType.application("mbox"));
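To see why the supported types matter, it can help to picture the lookup as a map from mime-type to parser. The sketch below is a toy model in plain Java, with ToyRegistry and ToyParser as hypothetical names rather than Tika classes. It only illustrates the general idea that a parser gets called for a type once something has registered it under that type, which may be worth checking if the type is detected but the parser never runs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: ToyRegistry and ToyParser are hypothetical, not Tika classes.
final class ToyRegistry {
    private final Map<String, ToyParser> byType = new HashMap<>();

    // Makes the parser reachable for every type it declares as supported.
    void register(ToyParser parser) {
        for (String type : parser.getSupportedTypes()) {
            byType.put(type, parser);
        }
    }

    // Returns the parser's output, or null when no parser is registered for the type.
    String parse(String mediaType, String input) {
        ToyParser parser = byType.get(mediaType);
        return parser == null ? null : parser.parse(input);
    }

    public static void main(String[] args) {
        ToyParser shp = new ToyParser() {
            public Set<String> getSupportedTypes() {
                return Collections.singleton("application/shp");
            }
            public String parse(String input) {
                return "HELLO WORLD";
            }
        };
        ToyRegistry registry = new ToyRegistry();
        // Before registration the type is known to the caller but nothing handles it.
        System.out.println(registry.parse("application/shp", "data"));
        registry.register(shp);
        System.out.println(registry.parse("application/shp", "data"));
    }
}

// Minimal stand-in for a parser: declares its types and turns input into text.
interface ToyParser {
    Set<String> getSupportedTypes();
    String parse(String input);
}
```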
-- Ken
--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g