Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/06/17 16:14:24 UTC

[jira] Commented: (TIKA-443) Geographic Information Parser

    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879788#action_12879788 ] 

Chris A. Mattmann commented on TIKA-443:
----------------------------------------

Hi Arturo,

Thanks for reporting this issue; it sounds awesome! I'm definitely interested in this topic and will be sure to help however I can.

Cheers,
Chris


> Geographic Information Parser
> -----------------------------
>
>                 Key: TIKA-443
>                 URL: https://issues.apache.org/jira/browse/TIKA-443
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Arturo Beltran
>
> I'm working on the automatic description of geospatial resources, and I think it might be interesting to add one or more new parsers to Tika in order to manage and describe some geo-formats. These geo-formats include files, services and databases.
> If anyone is interested in this issue or wants to collaborate, do not hesitate to contact me. Any help is welcome.



Re: Getting started

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Jul 2010, Arturo Beltran wrote:
> It might be interesting to write a small manual: "How to create a new Tika 
> Parser for Dummies". Simply including the three steps that I have finally 
> figured out (new Parser, tika-mimetypes.xml, list the new parser).

The 3rd step is only needed if you want to use the AutoDetectParser. If
you select the correct parser some other way, it isn't needed.

It sounds like a very helpful short document though. The wiki is at
http://wiki.apache.org/tika/ if you fancy writing it up :)

Nick
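
To make the point about the third step concrete, here is a minimal,
self-contained Java sketch. It does not use Tika's classes: MiniParser,
ShapefileStub and MiniAutoDetect are hypothetical stand-ins that only model
the idea that a type-based dispatcher can see a parser once it has been
listed, while a direct call works without any listing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ListingSketch {
    // Hypothetical stand-in for a parser interface; this is NOT Tika's Parser.
    interface MiniParser {
        Set<String> supportedTypes();
        String parse(String input);
    }

    // Hypothetical parser claiming the "application/shp" type from this thread.
    static class ShapefileStub implements MiniParser {
        public Set<String> supportedTypes() {
            return Collections.singleton("application/shp");
        }
        public String parse(String input) {
            return "shp:" + input;
        }
    }

    // Simplified dispatcher: it only knows the parsers it was given.
    static class MiniAutoDetect {
        private final Map<String, MiniParser> byType = new HashMap<>();

        MiniAutoDetect(List<MiniParser> listedParsers) {
            for (MiniParser p : listedParsers) {
                // supportedTypes() is consulted only for listed parsers
                for (String type : p.supportedTypes()) {
                    byType.put(type, p);
                }
            }
        }

        // Returns "" when no listed parser claims the type.
        String parse(String mimeType, String input) {
            MiniParser p = byType.get(mimeType);
            return p == null ? "" : p.parse(input);
        }
    }

    public static void main(String[] args) {
        MiniAutoDetect unlisted =
                new MiniAutoDetect(Collections.<MiniParser>emptyList());
        MiniAutoDetect listed = new MiniAutoDetect(
                Collections.<MiniParser>singletonList(new ShapefileStub()));

        // Not listed: the dispatcher never reaches the parser.
        System.out.println("[" + unlisted.parse("application/shp", "data") + "]");
        // Listed: the dispatcher finds it by MIME type.
        System.out.println("[" + listed.parse("application/shp", "data") + "]");
        // Direct call: no listing needed at all.
        System.out.println("[" + new ShapefileStub().parse("data") + "]");
    }
}
```

In this model the unlisted case yields an empty result even though the MIME
type is known, which is the same symptom described elsewhere in this thread.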

Re: Getting started

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Arturo,

Working on committing it right now, thanks!

Cheers,
Chris


On 7/16/10 4:17 AM, "Arturo Beltran" <ar...@uji.es> wrote:

The guide is ready.
It can be found attached at: https://issues.apache.org/jira/browse/TIKA-464


Greetings, and have a nice weekend
      Arturo



El 13/07/2010 16:01, Mattmann, Chris A (388J) escribió:
> Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.
>
> Thanks!
>
> Cheers,
> Chris
>
> [1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
> [2] http://maven.apache.org/doxia/references/apt-format.html
>
>
> On 7/13/10 3:54 AM, "Arturo Beltran"<ar...@uji.es>  wrote:
>
> That was my "big" problem all this time, I almost went crazy. Now it
> works perfectly, thank you very much for your help.
>
> It might be interesting to write a small manual: "How to create a new
> Tika Parser for Dummies". Simply including the three steps that I have
> finally figured out (new Parser, tika-mimetypes.xml, list the new parser).
>
> Greetings and thanks Nick it has been a great help
>
>
>
> El 13/07/2010 12:37, Nick Burch escribió:
>
>> On Tue, 13 Jul 2010, Arturo Beltran wrote:
>>
>>> I'm calling my parser using the Tika-app included, so I think I'm
>>> using AutoDetectParser.
>>>
>> You have to explicitly tell the AutoDetectParser to try your parser,
>> in addition to the mime type definition
>>
>> List your new parser in:
>> tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
>>
>> and I think it should then be picked up
>>
>> Nick
>>
>>
>
> --
> Arturo Beltran Fonollosa
> Institute of New Imaging Technologies (INIT): http://www.init.uji.es
> Geographic Information research group: http://www.geoinfo.uji.es
> Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
> E-12071, Castellón, Spain
> mailto: arturo.beltran@uji.es
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>


--
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
The guide is ready.
It can be found attached at: https://issues.apache.org/jira/browse/TIKA-464


Greetings, and have a nice weekend
      Arturo



El 13/07/2010 16:01, Mattmann, Chris A (388J) escribió:
> Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.
>
> Thanks!
>
> Cheers,
> Chris
>
> [1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
> [2] http://maven.apache.org/doxia/references/apt-format.html
>
>
> On 7/13/10 3:54 AM, "Arturo Beltran"<ar...@uji.es>  wrote:
>
> That was my "big" problem all this time, I almost went crazy. Now it
> works perfectly, thank you very much for your help.
>
> It might be interesting to write a small manual: "How to create a new
> Tika Parser for Dummies". Simply including the three steps that I have
> finally figured out (new Parser, tika-mimetypes.xml, list the new parser).
>
> Greetings and thanks Nick it has been a great help
>
>
>
> El 13/07/2010 12:37, Nick Burch escribió:
>    
>> On Tue, 13 Jul 2010, Arturo Beltran wrote:
>>      
>>> I'm calling my parser using the Tika-app included, so I think I'm
>>> using AutoDetectParser.
>>>        
>> You have to explicitly tell the AutoDetectParser to try your parser,
>> in addition to the mime type definition
>>
>> List your new parser in:
>> tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
>>
>> and I think it should then be picked up
>>
>> Nick
>>
>>      
>
>
>
>    


-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
No problem, I'll do it.


El 13/07/2010 16:01, Mattmann, Chris A (388J) escribió:
> Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.
>
> Thanks!
>
> Cheers,
> Chris
>
> [1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
> [2] http://maven.apache.org/doxia/references/apt-format.html
>
>
> On 7/13/10 3:54 AM, "Arturo Beltran"<ar...@uji.es>  wrote:
>
> That was my "big" problem all this time, I almost went crazy. Now it
> works perfectly, thank you very much for your help.
>
> It might be interesting to write a small manual: "How to create a new
> Tika Parser for Dummies". Simply including the three steps that I have
> finally figured out (new Parser, tika-mimetypes.xml, list the new parser).
>
> Greetings and thanks Nick it has been a great help
>
>
>
> El 13/07/2010 12:37, Nick Burch escribió:
>    
>> On Tue, 13 Jul 2010, Arturo Beltran wrote:
>>      
>>> I'm calling my parser using the Tika-app included, so I think I'm
>>> using AutoDetectParser.
>>>        
>> You have to explicitly tell the AutoDetectParser to try your parser,
>> in addition to the mime type definition
>>
>> List your new parser in:
>> tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
>>
>> and I think it should then be picked up
>>
>> Nick
>>
>>      
>
>
>
>    


-- 
Arturo Beltran Fonollosa
Geographic Information research group: http://www.geoinfo.uji.es
Centro de Visualización Interactiva (CeVI) http://www.cevi.uji.es
Departamento de Lenguajes y Sistemas Informáticos (LSI)
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.

Thanks!

Cheers,
Chris

[1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
[2] http://maven.apache.org/doxia/references/apt-format.html


On 7/13/10 3:54 AM, "Arturo Beltran" <ar...@uji.es> wrote:

That was my "big" problem all this time, I almost went crazy. Now it
works perfectly, thank you very much for your help.

It might be interesting to write a small manual: "How to create a new
Tika Parser for Dummies". Simply including the three steps that I have
finally figured out (new Parser, tika-mimetypes.xml, list the new parser).

Greetings and thanks Nick it has been a great help



El 13/07/2010 12:37, Nick Burch escribió:
> On Tue, 13 Jul 2010, Arturo Beltran wrote:
>> I'm calling my parser using the Tika-app included, so I think I'm
>> using AutoDetectParser.
>
> You have to explicitly tell the AutoDetectParser to try your parser,
> in addition to the mime type definition
>
> List your new parser in:
> tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
>
> and I think it should then be picked up
>
> Nick
>


--
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
That was my "big" problem all this time; I almost went crazy. Now it
works perfectly. Thank you very much for your help.

It might be interesting to write a small manual: "How to create a new
Tika Parser for Dummies", simply covering the three steps that I have
finally figured out (new Parser, tika-mimetypes.xml, list the new parser).

Greetings, and thanks, Nick; it has been a great help.



El 13/07/2010 12:37, Nick Burch escribió:
> On Tue, 13 Jul 2010, Arturo Beltran wrote:
>> I'm calling my parser using the Tika-app included, so I think I'm 
>> using AutoDetectParser.
>
> You have to explicitly tell the AutoDetectParser to try your parser, 
> in addition to the mime type definition
>
> List your new parser in:
> tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 
>
> and I think it should then be picked up
>
> Nick
>


-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Jul 2010, Arturo Beltran wrote:
> I'm calling my parser using the Tika-app included, so I think I'm using 
> AutoDetectParser.

You have to explicitly tell the AutoDetectParser to try your parser, in 
addition to the mime type definition

List your new parser in:
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
and I think it should then be picked up

Nick
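
For reference, a file under META-INF/services follows Java's standard
service-provider convention: one fully qualified class name per line, with
'#' starting a comment. The sketch below only parses such text to show the
format. The name org.example.geo.GeoParser is a hypothetical placeholder,
since the thread does not give GeoParser's package, and how Tika itself
loads the file is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class ServiceFileSketch {
    // Extracts provider class names from the text of a META-INF/services file.
    static List<String> providerNames(String fileContents) {
        List<String> names = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String contents =
              "# existing entries stay as they are\n"
            + "org.apache.tika.parser.mbox.MboxParser\n"
            + "\n"
            + "org.example.geo.GeoParser  # hypothetical package for the new parser\n";
        System.out.println(providerNames(contents));
    }
}
```

So adding a parser to the list amounts to appending one line with its
fully qualified class name to the file named above.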

Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi Chris and all,

El 07/07/2010 16:04, Mattmann, Chris A (388J) escribió:
> Hi Arturo,
>
> How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can you put some print statements in the public void parse(...) method of CompositeParser? Specifically, add a line right after:
>    
I'm calling my parser through the included tika-app, so I think I'm using
AutoDetectParser.

>
> Parser parser = getParser(metadata);
> // print out the returned parser
> System.out.println("Parser returned is: ["+parser.getClass().getName()+"]");
>
> What does that return? Also, have you done the work to map your incoming document type in the tika-mimetypes.xml file?
Yes, sure.
>   That is, if you're using AutoDetectParser or anything that extends CompositeParser, the mime type of the incoming document is used to determine which parser gets called. Is the mime type being detected appropriately? You can check this by putting a println right before getParser in the parse(...) method:
>    
Yes, it returns "application/shp"
> // print the mime type
> System.out.println("The MIME type is: [" + metadata.get(Metadata.CONTENT_TYPE) + "]");
> Parser parser = getParser(metadata);
>
> What does that print out?
>
> Finally if both of these printlns check out, you should check and make sure that your new parser is correctly mapped to the media type it supports, in other words what Ken said below. Does your parser declare that it supports your expected MIME type?
>    
Yes, I declared this MIME type in my parser, but the
getSupportedTypes(context) method is never called.

I uploaded a file with the Tika source code that includes my modified
tika-mimetypes.xml file and my new parser, GeoParser.java. Perhaps
one of you can try it and find out where I went wrong.
Here is the link: http://elcano.dlsi.uji.es/arturo/tika_geo.zip


Greetings and thanks in advance for your help,
      Arturo
> Let me know and thanks!
>
> Cheers,
> Chris
>
>
>
>
> On 7/7/10 4:25 AM, "Arturo Beltran"<ar...@uji.es>  wrote:
>
> Hi,
>
> I still have the same problem.
> I think everything is in place: I run "mvn install" and my new class is
> included in the generated JAR, but it is never called.
> It should be very simple. I feel a little silly. I don't know how to
> get Tika to find my new parser.
>
> Thanks in advance
>        Arturo
>
>
> El 21/06/2010 19:04, Ken Krugler escribió:
>    
>> Are you sure your new parser is on the classpath?
>>
>> E.g. put a break on getSupportedTypes() and make sure that's getting
>> called - if not, then the parser isn't being "found" by Tika.
>>
>> -- Ken
>>
>> On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
>>
>>      
>>> Hi Ken,
>>>
>>> First of all, thanks for your quick response.
>>> This is exactly what I'm doing, but although Tika recognizes the
>>> new MIME type, my new parser is not called.
>>>
>>> I added to tika-mimetypes.xml:
>>>
>>> <mime-type type="application/shp">
>>> <!--sub-class-of type="application/octet-stream"/-->
>>> <glob pattern="*.shp"/>
>>> </mime-type>
>>>
>>> I created a new class GeoParser:
>>>
>>> public class GeoParser implements Parser {
>>>
>>>     private static final Set<MediaType>  SUPPORTED_TYPES =
>>> Collections.singleton(MediaType.application("shp"));
>>>     public static final String SHP_MIME_TYPE = "application/shp";
>>>
>>>     public Set<MediaType>  getSupportedTypes(ParseContext context) {
>>>         return SUPPORTED_TYPES;
>>>     }
>>>
>>>     public void parse(
>>>             InputStream stream, ContentHandler handler,
>>>             Metadata metadata, ParseContext context)
>>>             throws IOException, SAXException, TikaException {
>>>
>>>         metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>>>         metadata.set("Hello", "World");
>>>
>>>         System.out.println("HELLO WORLD");
>>>         System.err.println("ERR Hello world");
>>>
>>>         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler,
>>> metadata);
>>>         xhtml.startDocument();
>>>         xhtml.endDocument();
>>>     }
>>> ...
>>> }
>>>
>>> And that's the result:
>>>
>>> Content-Length:  755072
>>> Content-Type:  application/shp
>>> resourceName:  comarques250.shp
>>>
>>> I don't know what exactly is failing, but I can't make it work.
>>>
>>> Greetings and thanks in advance for your help.
>>>      Arturo
>>>
>>>
>>> El 17/06/2010 18:25, Ken Krugler escribió:
>>>        
>>>> Hi Arturo,
>>>>
>>>>          
>>>>> Some of you already know that I'm working on a new parser
>>>>> (https://issues.apache.org/jira/browse/TIKA-443). After all day
>>>>> trying to set up a workspace for Eclipse, I implemented the typical
>>>>> "hello world" class, in the Tika Parser version. My problem now, is
>>>>> how to configure Tika in order to call my new parser when a file
>>>>> with a specific extension (e.g. *.shp) is found. I read something
>>>>> about a configuration file (tika-config.xml) but I couldn't find it
>>>>> in the source code.
>>>>>            
>>>> You first need to modify
>>>> tika-core/src/main/resources/tika-mimetypes.xml.
>>>>
>>>> E.g. something like this was done for mailbox files.
>>>>
>>>> <mime-type type="application/mbox">
>>>> <sub-class-of type="text/plain"/>
>>>> <glob pattern="*.mbox"/>
>>>> </mime-type>
>>>>
>>>> That maps the suffix to the mime-type.
>>>>
>>>> Then you define the SUPPORTED_TYPES static class field in your
>>>> parser class that defines what mime-types it supports.
>>>>
>>>> E.g. for MboxParser:
>>>>
>>>> public class MboxParser implements Parser {
>>>>
>>>>     private static final Set<MediaType>  SUPPORTED_TYPES =
>>>>         Collections.singleton(MediaType.application("mbox"));
>>>>
>>>>
>>>> -- Ken
>>>>
>>>> --------------------------------------------
>>>> <http://ken-blog.krugler.org>
>>>> +1 530-265-2225
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --------------------------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://bixolabs.com
>>>> e l a s t i c   w e b   m i n i n g
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>          
>>>
>>> --
>>> Arturo Beltran Fonollosa
>>> Institute of New Imaging Technologies (INIT): http://www.init.uji.es
>>> Geographic Information research group: http://www.geoinfo.uji.es
>>> Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
>>> E-12071, Castellón, Spain
>>> mailto: arturo.beltran@uji.es
>>>
>>>        
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>>
>>
>>      
>
>
>
>    


-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Arturo,

How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can you put some print statements in the public void parse(...) method of CompositeParser? Specifically, add a line right after:


Parser parser = getParser(metadata);
// print out the returned parser
System.out.println("Parser returned is: ["+parser.getClass().getName()+"]");

What does that return? Also, have you done the work to map your incoming document type in the tika-mimetypes.xml file? That is, if you're using AutoDetectParser or anything that extends CompositeParser, the mime type of the incoming document is used to determine which parser gets called. Is the mime type being detected appropriately? You can check this by putting a println right before getParser in the parse(...) method:

// print the mime type
System.out.println("The MIME type is: [" + metadata.get(Metadata.CONTENT_TYPE) + "]");
Parser parser = getParser(metadata);

What does that print out?

Finally, if both of these printlns check out, make sure that your new parser is correctly mapped to the media type it supports; in other words, what Ken said below. Does your parser declare that it supports your expected MIME type?

Let me know and thanks!

Cheers,
Chris
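
A quick way to reason about the detection step: in this thread,
tika-mimetypes.xml maps a glob such as *.shp to a MIME type. The sketch
below models only that mapping, using the JDK's own glob matcher. It is not
Tika's detector; GlobDetectSketch, its table and the fallback value are
assumptions for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class GlobDetectSketch {
    // glob pattern -> MIME type, mirroring the <glob pattern="..."/>
    // entries quoted in this thread
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put("*.shp", "application/shp");
        GLOBS.put("*.mbox", "application/mbox");
    }

    // Returns the MIME type of the first glob that matches the file name.
    static String detect(String fileName) {
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            PathMatcher matcher = FileSystems.getDefault()
                    .getPathMatcher("glob:" + e.getKey());
            if (matcher.matches(Paths.get(fileName))) {
                return e.getValue();
            }
        }
        return "application/octet-stream"; // assumed fallback for unknown names
    }

    public static void main(String[] args) {
        // Same style of diagnostic as the println suggested above.
        System.out.println("The MIME type is: [" + detect("comarques250.shp") + "]");
    }
}
```

If this step reports the expected type but the parser still does not run,
the mapping is fine and the problem lies in how the parser is found.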




On 7/7/10 4:25 AM, "Arturo Beltran" <ar...@uji.es> wrote:

Hi,

I still have the same problem.
I think everything is in place: I run "mvn install" and my new class is
included in the generated JAR, but it is never called.
It should be very simple. I feel a little silly. I don't know how to
get Tika to find my new parser.

Thanks in advance
      Arturo


El 21/06/2010 19:04, Ken Krugler escribió:
> Are you sure your new parser is on the classpath?
>
> E.g. put a break on getSupportedTypes() and make sure that's getting
> called - if not, then the parser isn't being "found" by Tika.
>
> -- Ken
>
> On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
>
>> Hi Ken,
>>
>> First of all, thanks for your quick response.
>> This is exactly what I'm doing, but although Tika recognizes the
>> new MIME type, my new parser is not called.
>>
>> I added to tika-mimetypes.xml:
>>
>> <mime-type type="application/shp">
>> <!--sub-class-of type="application/octet-stream"/-->
>> <glob pattern="*.shp"/>
>> </mime-type>
>>
>> I created a new class GeoParser:
>>
>> public class GeoParser implements Parser {
>>
>>    private static final Set<MediaType> SUPPORTED_TYPES =
>> Collections.singleton(MediaType.application("shp"));
>>    public static final String SHP_MIME_TYPE = "application/shp";
>>
>>    public Set<MediaType> getSupportedTypes(ParseContext context) {
>>        return SUPPORTED_TYPES;
>>    }
>>
>>    public void parse(
>>            InputStream stream, ContentHandler handler,
>>            Metadata metadata, ParseContext context)
>>            throws IOException, SAXException, TikaException {
>>
>>        metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>>        metadata.set("Hello", "World");
>>
>>        System.out.println("HELLO WORLD");
>>        System.err.println("ERR Hello world");
>>
>>        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler,
>> metadata);
>>        xhtml.startDocument();
>>        xhtml.endDocument();
>>    }
>> ...
>> }
>>
>> And that's the result:
>>
>> Content-Length:  755072
>> Content-Type:  application/shp
>> resourceName:  comarques250.shp
>>
>> I don't know what exactly is failing, but I can't make it work.
>>
>> Greetings and thanks in advance for your help.
>>     Arturo
>>
>>
>> El 17/06/2010 18:25, Ken Krugler escribió:
>>> Hi Arturo,
>>>
>>>> Some of you already know that I'm working on a new parser
>>>> (https://issues.apache.org/jira/browse/TIKA-443). After all day
>>>> trying to set up a workspace for Eclipse, I implemented the typical
>>>> "hello world" class, in the Tika Parser version. My problem now, is
>>>> how to configure Tika in order to call my new parser when a file
>>>> with a specific extension (e.g. *.shp) is found. I read something
>>>> about a configuration file (tika-config.xml) but I couldn't find it
>>>> in the source code.
>>>
>>> You first need to modify
>>> tika-core/src/main/resources/tika-mimetypes.xml.
>>>
>>> E.g. something like this was done for mailbox files.
>>>
>>> <mime-type type="application/mbox">
>>> <sub-class-of type="text/plain"/>
>>> <glob pattern="*.mbox"/>
>>> </mime-type>
>>>
>>> That maps the suffix to the mime-type.
>>>
>>> Then you define the SUPPORTED_TYPES static class field in your
>>> parser class that defines what mime-types it supports.
>>>
>>> E.g. for MboxParser:
>>>
>>> public class MboxParser implements Parser {
>>>
>>>    private static final Set<MediaType> SUPPORTED_TYPES =
>>>        Collections.singleton(MediaType.application("mbox"));
>>>
>>>
>>> -- Ken
>>>
>>> --------------------------------------------
>>> <http://ken-blog.krugler.org>
>>> +1 530-265-2225
>>>
>>>
>>>
>>>
>>>
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Arturo Beltran Fonollosa
>> Institute of New Imaging Technologies (INIT): http://www.init.uji.es
>> Geographic Information research group: http://www.geoinfo.uji.es
>> Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
>> E-12071, Castellón, Spain
>> mailto: arturo.beltran@uji.es
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


--
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi,

I still have the same problem.
I think everything is in place: I run "mvn install" and my new class is
included in the generated JAR, but it is never called.
It should be very simple. I feel a little silly. I don't know how to
get Tika to find my new parser.

Thanks in advance
      Arturo


El 21/06/2010 19:04, Ken Krugler escribió:
> Are you sure your new parser is on the classpath?
>
> E.g. put a break on getSupportedTypes() and make sure that's getting 
> called - if not, then the parser isn't being "found" by Tika.
>
> -- Ken
>
> On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
>
>> Hi Ken,
>>
>> First of all, thanks for your quick response.
>> This is exactly what I'm doing, but although Tika recognizes the
>> new MIME type, my new parser is not called.
>>
>> I added to tika-mimetypes.xml:
>>
>> <mime-type type="application/shp">
>> <!--sub-class-of type="application/octet-stream"/-->
>> <glob pattern="*.shp"/>
>> </mime-type>
>>
>> I created a new class GeoParser:
>>
>> public class GeoParser implements Parser {
>>
>>    private static final Set<MediaType> SUPPORTED_TYPES = 
>> Collections.singleton(MediaType.application("shp"));
>>    public static final String SHP_MIME_TYPE = "application/shp";
>>
>>    public Set<MediaType> getSupportedTypes(ParseContext context) {
>>        return SUPPORTED_TYPES;
>>    }
>>
>>    public void parse(
>>            InputStream stream, ContentHandler handler,
>>            Metadata metadata, ParseContext context)
>>            throws IOException, SAXException, TikaException {
>>
>>        metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>>        metadata.set("Hello", "World");
>>
>>        System.out.println("HELLO WORLD");
>>        System.err.println("ERR Hello world");
>>
>>        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, 
>> metadata);
>>        xhtml.startDocument();
>>        xhtml.endDocument();
>>    }
>> ...
>> }
>>
>> And that's the result:
>>
>> Content-Length:  755072
>> Content-Type:  application/shp
>> resourceName:  comarques250.shp
>>
>> I don't know wht exactly is failing, but I can't make it work.
>>
>> Greetings and thanks in advance for your help.
>>     Arturo
>>
>>
>> El 17/06/2010 18:25, Ken Krugler escribió:
>>> Hi Arturo,
>>>
>>>> Some of you already know that I'm working on a new parser 
>>>> (https://issues.apache.org/jira/browse/TIKA-443). After all day 
>>>> trying to set up a workspace for Eclipse, I implemented the typical 
>>>> "hello world" class, in the Tika Parser version. My problem now, is 
>>>> how to configure Tika in order to call my new parser when a file 
>>>> with especific extension (p.e. *.shp) is found. I read something 
>>>> about a configuration file (tika-config.xml) but I couldn't find it 
>>>> in the source code.
>>>
>>> You first need to modify 
>>> tika-core/src/main/resources/tika-mimetypes.xml.
>>>
>>> E.g. something like this was done for mailbox files.
>>>
>>> <mime-type type="application/mbox">
>>> <sub-class-of type="text/plain"/>
>>> <glob pattern="*.mbox"/>
>>> </mime-type>
>>>
>>> That maps the suffix to the mime-type.
>>>
>>> Then you define the SUPPORTED_TYPES static class field in your 
>>> parser class that defines what mime-types it supports.
>>>
>>> E.g. for MboxParser:
>>>
>>> public class MboxParser implements Parser {
>>>
>>>    private static final Set<MediaType> SUPPORTED_TYPES =
>>>        Collections.singleton(MediaType.application("mbox"));
>>>
>>>
>>> -- Ken
>>>
>>> --------------------------------------------
>>> <http://ken-blog.krugler.org>
>>> +1 530-265-2225
>>>
>>>
>>>
>>>
>>>
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>>
>>>
>>>
>>>
>>>
>>
>>
>> -- 
>> Arturo Beltran Fonollosa
>> Institute of New Imaging Technologies (INIT): http://www.init.uji.es
>> Geographic Information research group: http://www.geoinfo.uji.es
>> Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
>> E-12071, Castellón, Spain
>> mailto: arturo.beltran@uji.es
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Ken Krugler <kk...@transpac.com>.
Are you sure your new parser is on the classpath?

E.g. put a break on getSupportedTypes() and make sure that's getting  
called - if not, then the parser isn't being "found" by Tika.
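
If attaching a debugger is inconvenient, a rough alternative is to ask the class loader directly whether the parser class can be found. The sketch below uses only the JDK; the class name ClasspathCheck and the package org.example are assumptions made for illustration, so substitute the real fully qualified name of your parser.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable (e.g. a missing dependency)
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default name; pass the real one as the first argument.
        String name = args.length > 0 ? args[0] : "org.example.GeoParser";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

Run it with the same classpath you use to run Tika. A "false" result points to a packaging or classpath problem; a "true" result means the class is visible and the problem is more likely in how the parser is registered.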

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi Ken,

First of all, thanks for your quick response.
That is exactly what I'm doing, but although Tika recognizes the new 
MIME type, my new parser is not called.

I added to tika-mimetypes.xml:

<mime-type type="application/shp">
<!--sub-class-of type="application/octet-stream"/-->
<glob pattern="*.shp"/>
</mime-type>

I created a new class GeoParser:

public class GeoParser implements Parser {

     private static final Set<MediaType> SUPPORTED_TYPES =
             Collections.singleton(MediaType.application("shp"));
     public static final String SHP_MIME_TYPE = "application/shp";

     public Set<MediaType> getSupportedTypes(ParseContext context) {
         return SUPPORTED_TYPES;
     }

     public void parse(
             InputStream stream, ContentHandler handler,
             Metadata metadata, ParseContext context)
             throws IOException, SAXException, TikaException {

         metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
         metadata.set("Hello", "World");

         System.out.println("HELLO WORLD");
         System.err.println("ERR Hello world");

         XHTMLContentHandler xhtml =
                 new XHTMLContentHandler(handler, metadata);
         xhtml.startDocument();
         xhtml.endDocument();
     }
  ...
}

And that's the result:

Content-Length:  755072
Content-Type:  application/shp
resourceName:  comarques250.shp

I don't know what exactly is failing, but I can't make it work.

Greetings and thanks in advance for your help.
      Arturo



-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Ken Krugler <kk...@transpac.com>.
Hi Arturo,

> Some of you already know that I'm working on a new parser
> (https://issues.apache.org/jira/browse/TIKA-443). After spending all day
> trying to set up a workspace for Eclipse, I implemented the typical
> "hello world" class, in the Tika Parser version. My problem now is
> how to configure Tika to call my new parser when a file with a
> specific extension (e.g. *.shp) is found. I read something about a
> configuration file (tika-config.xml) but I couldn't find it in the
> source code.

You first need to modify
tika-core/src/main/resources/tika-mimetypes.xml.

E.g. something like this was done for mailbox files.

   <mime-type type="application/mbox">
     <sub-class-of type="text/plain"/>
     <glob pattern="*.mbox"/>
   </mime-type>

That maps the suffix to the mime-type.

Then you define the SUPPORTED_TYPES static class field in your parser  
class that defines what mime-types it supports.

E.g. for MboxParser:

public class MboxParser implements Parser {

     private static final Set<MediaType> SUPPORTED_TYPES =
         Collections.singleton(MediaType.application("mbox"));
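
To make the two steps above concrete, here is a toy model of the lookup in plain Java. It is not Tika's implementation and uses no Tika classes; DispatchSketch and its methods are hypothetical names. It only illustrates that the suffix-to-type mapping and the type-to-parser mapping are separate tables, so a type can be detected even when no parser is registered for it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DispatchSketch {
    // Step 1: suffix -> MIME type (the role played by tika-mimetypes.xml)
    private final Map<String, String> suffixToType = new LinkedHashMap<>();
    // Step 2: MIME type -> parser name (the role played by SUPPORTED_TYPES)
    private final Map<String, String> typeToParser = new LinkedHashMap<>();

    public void addGlob(String suffix, String mimeType) {
        suffixToType.put(suffix, mimeType);
    }

    public void registerParser(String mimeType, String parserName) {
        typeToParser.put(mimeType, parserName);
    }

    /** Returns the MIME type for a file name, or a generic fallback type. */
    public String detect(String fileName) {
        for (Map.Entry<String, String> e : suffixToType.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    /** Returns the parser registered for the detected type, if any. */
    public String parserFor(String fileName) {
        return typeToParser.getOrDefault(detect(fileName), "(no parser registered)");
    }

    public static void main(String[] args) {
        DispatchSketch d = new DispatchSketch();
        d.addGlob(".mbox", "application/mbox");
        d.addGlob(".shp", "application/shp");
        d.registerParser("application/mbox", "MboxParser");

        System.out.println(d.detect("comarques250.shp"));    // application/shp
        System.out.println(d.parserFor("comarques250.shp")); // (no parser registered)

        d.registerParser("application/shp", "GeoParser");
        System.out.println(d.parserFor("comarques250.shp")); // GeoParser
    }
}
```

In this model, a correct Content-Type with no parser output means step 1 worked and step 2 found nothing, which is why it helps to check that the parser is actually visible to whatever does the dispatching.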


-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225






--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi all,

Some of you already know that I'm working on a new parser 
(https://issues.apache.org/jira/browse/TIKA-443). After spending all day 
trying to set up a workspace for Eclipse, I implemented the typical "hello 
world" class, in the Tika Parser version. My problem now is how to 
configure Tika to call my new parser when a file with a specific 
extension (e.g. *.shp) is found. I read something about a configuration 
file (tika-config.xml) but I couldn't find it in the source code.

Greetings and thanks in advance
      Arturo

-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es