Posted to dev@tika.apache.org by "Arturo Beltran (JIRA)" <ji...@apache.org> on 2010/06/17 11:26:24 UTC

[jira] Created: (TIKA-443) Geographic Information Parser

Geographic Information Parser
-----------------------------

                 Key: TIKA-443
                 URL: https://issues.apache.org/jira/browse/TIKA-443
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Arturo Beltran


I'm working on the automatic description of geospatial resources, and I think it might be interesting to add one or more new parsers to Tika to handle and describe some geo-formats. These geo-formats include files, services and databases.

If anyone is interested in this issue or wants to collaborate, do not hesitate to contact me. Any help is welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883196#action_12883196 ] 

Nick Burch commented on TIKA-443:
---------------------------------

I've opened TIKA-445 and uploaded a first stab at a patch to implement it. Feedback appreciated!

> Geographic Information Parser
> -----------------------------
>
>                 Key: TIKA-443
>                 URL: https://issues.apache.org/jira/browse/TIKA-443
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Arturo Beltran
>         Attachments: getFDOMetadata.xml
>
>
> I'm working on the automatic description of geospatial resources, and I think it might be interesting to add one or more new parsers to Tika to handle and describe some geo-formats. These geo-formats include files, services and databases.
> If anyone is interested in this issue or wants to collaborate, do not hesitate to contact me. Any help is welcome.



[jira] Issue Comment Edited: (TIKA-443) Geographic Information Parser

Posted by "Mayank Singh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880722#action_12880722 ] 

Mayank Singh edited comment on TIKA-443 at 6/21/10 5:15 AM:
------------------------------------------------------------

Hi Arturo
I would like to collaborate on this issue. I have also sent you an e-mail regarding the same.
Thanks and regards
Mayank

      was (Author: singhmayank):
    Hi Arturo
I would like to collaborate on this issue. I have also sent you a mal regarding the same.
Thanks and regards
Mayank
  



Re: Getting started

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Jul 2010, Arturo Beltran wrote:
> It might be interesting to write a small manual: "How to create a new Tika 
> Parser for Dummies". Simply including the three steps that I have finally 
> figured out (new Parser, tika-mimetypes.xml, list the new parser).

The 3rd step is only needed if you want to use the auto-detect parser. If 
you figure out the correct parser a different way, it isn't needed.

It sounds like a very helpful short document though. The wiki is at
http://wiki.apache.org/tika/ if you fancy writing it up :)

Nick
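
As a rough illustration of the point above, here is a minimal sketch of choosing a parser yourself instead of relying on auto-detection. The SimpleParser interface, the GEO and PLAIN instances and the choose() helper are all hypothetical stand-ins invented for this sketch; they are not Tika classes. With real Tika you would instead call parse(stream, handler, metadata, context) on your own parser instance.

```java
import java.util.Locale;

public class DirectParserChoice {
    /** Hypothetical stand-in for a parser; NOT Tika's Parser interface. */
    interface SimpleParser {
        String parse(String resourceName);
    }

    // Two illustrative parsers; the names and outputs are assumptions for this sketch.
    static final SimpleParser GEO = name -> "geo metadata for " + name;
    static final SimpleParser PLAIN = name -> "plain text for " + name;

    /** The caller picks the parser itself, so no parser registry is consulted. */
    static SimpleParser choose(String resourceName) {
        String lower = resourceName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".shp") ? GEO : PLAIN;
    }

    public static void main(String[] args) {
        String file = "comarques250.shp";
        System.out.println(choose(file).parse(file));
    }
}
```

Because the calling code decides which parser to use, nothing needs to be listed anywhere for lookup; only the auto-detect route depends on the parser being registered.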

Re: Getting started

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Arturo,

Working on committing it right now, thanks!

Cheers,
Chris


On 7/16/10 4:17 AM, "Arturo Beltran" <ar...@uji.es> wrote:

The guide is ready.
It is attached to this issue: https://issues.apache.org/jira/browse/TIKA-464


Greetings, and have a nice weekend
      Arturo



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
The guide is ready.
It is attached to this issue: https://issues.apache.org/jira/browse/TIKA-464


Greetings, and have a nice weekend
      Arturo





-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
No problem, I'll do it.


On 13/07/2010 16:01, Mattmann, Chris A (388J) wrote:
> Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.
>
> Thanks!
>
> Cheers,
> Chris
>
> [1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
> [2] http://maven.apache.org/doxia/references/apt-format.html
>
>


-- 
Arturo Beltran Fonollosa
Geographic Information research group: http://www.geoinfo.uji.es
Centro de Visualización Interactiva (CeVI) http://www.cevi.uji.es
Departamento de Lenguajes y Sistemas Informáticos (LSI)
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Thanks Nick and thanks Arturo, for the offer to write a small guide to getting started with parsing. It might be good to create a JIRA issue for this? Arturo, can you head over to JIRA and create an issue to contribute a "get Tika parsing up and running in 5 minutes" quick start guide? Then, you could write the guide in APT format (see here [1] for an example and [2] for more detailed information), add your new guide file to your local SVN checkout, create a patch and then attach it to your new issue. I'd be happy to get it into the documentation sources.

Thanks!

Cheers,
Chris

[1] http://svn.apache.org/repos/asf/tika/trunk/src/site/apt/formats.apt
[2] http://maven.apache.org/doxia/references/apt-format.html




Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
That was my "big" problem all this time; I almost went crazy. Now it 
works perfectly. Thank you very much for your help.

It might be interesting to write a small manual, "How to create a new 
Tika Parser for Dummies", simply covering the three steps that I have 
finally figured out (new Parser, tika-mimetypes.xml, list the new parser).

Greetings, and thanks Nick, it has been a great help.



On 13/07/2010 12:37, Nick Burch wrote:
> On Tue, 13 Jul 2010, Arturo Beltran wrote:
>> I'm calling my parser using the Tika-app included, so I think I'm 
>> using AutoDetectParser.
>
> You have to explicitly tell the AutoDetectParser to try your parser, 
> in addition to the mime type definition
>
> List your new parser in:
> tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 
>
> and I think it should then be picked up
>
> Nick
>




Re: Getting started

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Jul 2010, Arturo Beltran wrote:
> I'm calling my parser using the Tika-app included, so I think I'm using 
> AutoDetectParser.

You have to explicitly tell the AutoDetectParser to try your parser, in 
addition to defining the mime type.

List your new parser in:
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
and I think it should then be picked up.

Nick
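
To make the step above more concrete: files under META-INF/services conventionally follow the standard Java service-provider format, with one fully qualified class name per line, '#' starting a comment, and blank lines ignored. The sketch below reads text in that format. The package org.example.geo is purely hypothetical, since the thread never states GeoParser's package, and the providerNames() helper is written for illustration only; it is not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;

public class ServiceFileSketch {
    /** Extracts provider class names from text in the META-INF/services file format. */
    static List<String> providerNames(String fileText) {
        List<String> names = new ArrayList<>();
        for (String line : fileText.split("\\R")) {
            int hash = line.indexOf('#');      // '#' starts a comment
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();                // surrounding whitespace is ignored
            if (!line.isEmpty()) {
                names.add(line);               // one fully qualified class name per line
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical entry for the new parser; the real package name may differ.
        String fileText = String.join("\n",
                "# parsers already listed in the file stay as they are",
                "",
                "org.example.geo.GeoParser   # hypothetical new entry");
        System.out.println(providerNames(fileText));
    }
}
```

In practice the only change needed would be one extra line in the existing file, naming the new parser class with its real package.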

Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi Chris and all,

On 07/07/2010 16:04, Mattmann, Chris A (388J) wrote:
> Hi Arturo,
>
> How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can you put some print statements in the public void parse(...) method of CompositeParser? Specifically, add a line right after:
>    
I'm calling my parser using the Tika-app included, so I think I'm using 
AutoDetectParser.

>
> Parser parser = getParser(metadata);
> // print out the returned parser
> System.out.println("Parser returned is: ["+parser.getClass().getName()+"]");
>
> What does that return? Also, have you done the work to map your incoming document type in the tika-mimetypes.xml file?
Yes, sure.
>   That is, if you're using AutoDetectParser or anything that extends CompositeParser, the mime type of the incoming document is used to determine what parser gets called. Is the mime type being detected appropriately? You can check this by putting a println right before getParser in the parse(...) method:
>    
Yes, it returns "application/shp"
> // print the mime type
> System.out.println("The MIME type is: ["+ metadata.get(Metadata.CONTENT_TYPE)+"]");
> Parser parser = getParser(metadata);
>
> What does that print out?
>
> Finally if both of these printlns check out, you should check and make sure that your new parser is correctly mapped to the media type it supports, in other words what Ken said below. Does your parser declare that it supports your expected MIME type?
>    
Yes, I declared this MIME type in my parser, but the 
getSupportedTypes(context) method is never called.

I uploaded an archive of the Tika source code that includes my modified 
tika-mimetypes.xml file and my new parser, GeoParser.java. Perhaps 
one of you could try it and find out where I went wrong.
Here is the link: http://elcano.dlsi.uji.es/arturo/tika_geo.zip


Greetings and thanks in advance for your help,
      Arturo
> Let me know and thanks!
>
> Cheers,
> Chris
>
>
>
>


Re: Getting started

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Arturo,

How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can you put some print statements in the public void parse(...) method of CompositeParser? Specifically, add a line right after:


Parser parser = getParser(metadata);
// print out the returned parser
System.out.println("Parser returned is: ["+parser.getClass().getName()+"]");

What does that return? Also, have you done the work to map your incoming document type in the tika-mimetypes.xml file? That is, if you're using AutoDetectParser or anything that extends CompositeParser, the mime type of the incoming document is used to determine what parser gets called. Is the mime type being detected appropriately? You can check this by putting a println right before getParser in the parse(...) method:

// print the mime type
System.out.println("The MIME type is: ["+ metadata.get(Metadata.CONTENT_TYPE)+"]");
Parser parser = getParser(metadata);

What does that print out?

Finally, if both of these printlns check out, you should make sure that your new parser is correctly mapped to the media type it supports, in other words what Ken said below. Does your parser declare that it supports your expected MIME type?

Let me know and thanks!

Cheers,
Chris
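
The checks described above can be pictured with a simplified model of MIME-type dispatch. This is only a sketch under stated assumptions: DispatchSketch, SimpleParser, register() and the "fallback" parser are names made up for illustration and do not reproduce Tika's CompositeParser code. It shows why a parser that declares the right type can still go uncalled if the dispatcher was never told about it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class DispatchSketch {
    /** Simplified stand-in for a parser: a name plus the MIME types it claims. */
    static final class SimpleParser {
        final String name;
        final Set<String> supportedTypes;

        SimpleParser(String name, Set<String> supportedTypes) {
            this.name = name;
            this.supportedTypes = supportedTypes;
        }
    }

    // Only parsers registered here are ever considered (hypothetical registry).
    private final List<SimpleParser> registered = new ArrayList<>();
    private final SimpleParser fallback =
            new SimpleParser("fallback", Collections.<String>emptySet());

    void register(SimpleParser parser) {
        registered.add(parser);
    }

    /** Returns the first registered parser claiming the type, else the fallback. */
    SimpleParser getParser(String mimeType) {
        for (SimpleParser parser : registered) {
            if (parser.supportedTypes.contains(mimeType)) {
                return parser;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        SimpleParser geo =
                new SimpleParser("GeoParser", Collections.singleton("application/shp"));
        DispatchSketch dispatcher = new DispatchSketch();

        // The type is detected correctly, but the parser is not registered yet.
        System.out.println("Parser returned is: ["
                + dispatcher.getParser("application/shp").name + "]");

        dispatcher.register(geo);
        System.out.println("Parser returned is: ["
                + dispatcher.getParser("application/shp").name + "]");
    }
}
```

In this model the first lookup returns the fallback and the second returns GeoParser, which matches the symptom in the thread: a correct Content-Type in the output, but the new parser's methods never being invoked.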




On 7/7/10 4:25 AM, "Arturo Beltran" <ar...@uji.es> wrote:

Hi,

I still have the same problem.
I think everything is right: I run "mvn install" and my new class is
included in the generated JAR, but it is never called.
It should be very simple. I feel a little silly. I don't know how to
make Tika find my new parser.

Thanks in advance
      Arturo


On 21/06/2010 19:04, Ken Krugler wrote:
> Are you sure your new parser is on the classpath?
>
> E.g. put a break on getSupportedTypes() and make sure that's getting
> called - if not, then the parser isn't being "found" by Tika.
>
> -- Ken
>
> On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
>
>> Hi Ken,
>>
>> First of all, thanks for your quick response.
>> This is exactly what I'm doing, but although Tika recognizes the
>> new MIME type, my new parser is not called.
>>
>> I added to tika-mimetypes.xml:
>>
>> <mime-type type="application/shp">
>> <!--sub-class-of type="application/octet-stream"/-->
>> <glob pattern="*.shp"/>
>> </mime-type>
>>
>> I created a new class GeoParser:
>>
>> public class GeoParser implements Parser {
>>
>>    private static final Set<MediaType> SUPPORTED_TYPES =
>> Collections.singleton(MediaType.application("shp"));
>>    public static final String SHP_MIME_TYPE = "application/shp";
>>
>>    public Set<MediaType> getSupportedTypes(ParseContext context) {
>>        return SUPPORTED_TYPES;
>>    }
>>
>>    public void parse(
>>            InputStream stream, ContentHandler handler,
>>            Metadata metadata, ParseContext context)
>>            throws IOException, SAXException, TikaException {
>>
>>        metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>>        metadata.set("Hello", "World");
>>
>>        System.out.println("HELLO WORLD");
>>        System.err.println("ERR Hello world");
>>
>>        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler,
>> metadata);
>>        xhtml.startDocument();
>>        xhtml.endDocument();
>>    }
>> ...
>> }
>>
>> And that's the result:
>>
>> Content-Length:  755072
>> Content-Type:  application/shp
>> resourceName:  comarques250.shp
>>
>> I don't know what exactly is failing, but I can't make it work.
>>
>> Greetings and thanks in advance for your help.
>>     Arturo
>>
>>
>> On 17/06/2010 18:25, Ken Krugler wrote:
>>> Hi Arturo,
>>>
>>>> Some of you already know that I'm working on a new parser
>>>> (https://issues.apache.org/jira/browse/TIKA-443). After spending all day
>>>> trying to set up a workspace for Eclipse, I implemented the typical
>>>> "hello world" class, in the Tika Parser version. My problem now is
>>>> how to configure Tika to call my new parser when a file
>>>> with a specific extension (e.g. *.shp) is found. I read something
>>>> about a configuration file (tika-config.xml) but I couldn't find it
>>>> in the source code.
>>>
>>> You first need to modify
>>> tika-core/src/main/resources/tika-mimetypes.xml.
>>>
>>> E.g. something like this was done for mailbox files.
>>>
>>> <mime-type type="application/mbox">
>>> <sub-class-of type="text/plain"/>
>>> <glob pattern="*.mbox"/>
>>> </mime-type>
>>>
>>> That maps the suffix to the mime-type.
>>>
>>> Then you define the SUPPORTED_TYPES static class field in your
>>> parser class that defines what mime-types it supports.
>>>
>>> E.g. for MboxParser:
>>>
>>> public class MboxParser implements Parser {
>>>
>>>    private static final Set<MediaType> SUPPORTED_TYPES =
>>>        Collections.singleton(MediaType.application("mbox"));
>>>
>>>
>>> -- Ken
>>>
>>> --------------------------------------------
>>> <http://ken-blog.krugler.org>
>>> +1 530-265-2225
>>>
>>>
>>>
>>>
>>>
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Arturo Beltran Fonollosa
>> Institute of New Imaging Technologies (INIT): http://www.init.uji.es
>> Geographic Information research group: http://www.geoinfo.uji.es
>> Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
>> E-12071, Castellón, Spain
>> mailto: arturo.beltran@uji.es
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>




Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi,

I still have the same problem.
I think everything is set up correctly: I run "mvn install" and my new
class is included in the generated JAR, but it is never called.
It should be very simple, and I feel a little silly, but I don't know how
to make Tika find my new parser.

Thanks in advance
      Arturo




-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Ken Krugler <kk...@transpac.com>.
Are you sure your new parser is on the classpath?

E.g. put a breakpoint on getSupportedTypes() and make sure that's getting
called - if not, then the parser isn't being "found" by Tika.
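
A quick way to test the classpath question without a debugger is to ask the JVM directly. The following is only a sketch in plain Java: the class name ClasspathCheck and its methods are hypothetical helpers, not part of Tika, and the package used in the usage note is an assumption.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Tika: checks whether a class is visible
// on the classpath and reports where it was loaded from.
final class ClasspathCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the JAR or directory a class came from, or a note if unknown. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source, e.g. a JDK class)";
        }
        return source.getLocation().toString();
    }
}
```

For example, ClasspathCheck.isLoadable("org.example.GeoParser") would be called with the real fully qualified name of the new parser (the package shown here is made up). If it returns true but the parser is still never invoked, the class is visible and the problem lies in how the parser is registered rather than in the classpath.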

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi Ken,

First of all, thanks for your quick response.
This is exactly what I'm doing, but although Tika recognizes the new
MIME type, my new parser is not called.

I added this to tika-mimetypes.xml:

<mime-type type="application/shp">
<!--sub-class-of type="application/octet-stream"/-->
<glob pattern="*.shp"/>
</mime-type>

I created a new class GeoParser:

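import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class GeoParser implements Parser {

    public static final String SHP_MIME_TYPE = "application/shp";

    // Media types this parser claims to handle.
    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.application("shp"));

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {

        metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
        metadata.set("Hello", "World");

        // Debug output to confirm that the parser is actually invoked.
        System.out.println("HELLO WORLD");
        System.err.println("ERR Hello world");

        // Emit an empty XHTML document.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.endDocument();
    }
  ...
}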

And this is the result:

Content-Length:  755072
Content-Type:  application/shp
resourceName:  comarques250.shp

I don't know what exactly is failing, but I can't make it work.

Greetings and thanks in advance for your help.
      Arturo




-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


Re: Getting started

Posted by Ken Krugler <kk...@transpac.com>.
Hi Arturo,

> Some of you already know that I'm working on a new parser
> (https://issues.apache.org/jira/browse/TIKA-443). After a whole day
> trying to set up a workspace for Eclipse, I implemented the typical
> "hello world" class, in the Tika Parser version. My problem now is
> how to configure Tika in order to call my new parser when a file with
> a specific extension (e.g. *.shp) is found. I read something about a
> configuration file (tika-config.xml) but I couldn't find it in the
> source code.

You first need to modify
tika-core/src/main/resources/tika-mimetypes.xml.

E.g. something like this was done for mailbox files.

   <mime-type type="application/mbox">
     <sub-class-of type="text/plain"/>
     <glob pattern="*.mbox"/>
   </mime-type>

That maps the suffix to the mime-type.

Then you define the SUPPORTED_TYPES static class field in your parser  
class that defines what mime-types it supports.

E.g. for MboxParser:

public class MboxParser implements Parser {

     private static final Set<MediaType> SUPPORTED_TYPES =
         Collections.singleton(MediaType.application("mbox"));
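
The two steps above (suffix to mime-type, then mime-type to parser) can be pictured with a toy model in plain Java. This is only a sketch with made-up names (DispatchSketch and its methods); it is not how Tika is implemented internally, and it says nothing about how Tika discovers parser classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the two lookups described above. All names are hypothetical.
final class DispatchSketch {

    // Step 1: file suffix -> media type (the role of tika-mimetypes.xml).
    private final Map<String, String> suffixToType = new LinkedHashMap<>();

    // Step 2: media type -> parser name (the role of each parser's supported types).
    private final Map<String, String> typeToParser = new LinkedHashMap<>();

    void addGlob(String suffix, String mediaType) {
        suffixToType.put(suffix, mediaType);
    }

    void registerParser(String parserName, String mediaType) {
        typeToParser.put(mediaType, parserName);
    }

    /** Returns the media type for a file name, or a generic fallback. */
    String detect(String fileName) {
        for (Map.Entry<String, String> entry : suffixToType.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    /** Returns the parser registered for the detected type, if any. */
    String parserFor(String fileName) {
        return typeToParser.getOrDefault(detect(fileName), "(no parser registered)");
    }
}
```

In this model, a type can be detected correctly by step 1 while step 2 still finds nothing, which is why both halves need checking when a new parser is added.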


-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225






--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Getting started

Posted by Arturo Beltran <ar...@uji.es>.
Hi all,

Some of you already know that I'm working on a new parser
(https://issues.apache.org/jira/browse/TIKA-443). After a whole day
trying to set up a workspace for Eclipse, I implemented the typical
"hello world" class, in the Tika Parser version. My problem now is how
to configure Tika in order to call my new parser when a file with a
specific extension (e.g. *.shp) is found. I read something about a
configuration file (tika-config.xml) but I couldn't find it in the
source code.

Greetings and thanks in advance
      Arturo

-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es


[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879788#action_12879788 ] 

Chris A. Mattmann commented on TIKA-443:
----------------------------------------

Hi Arturo,

Thanks for reporting this issue and it sounds awesome! I'm definitely interested in this topic and will be sure to help however I can.

Cheers,
Chris




[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883171#action_12883171 ] 

Nick Burch commented on TIKA-443:
---------------------------------

I was thinking that making sure you put in the right matching pairs, and remove them again, is a little fiddly, but that's nothing that a little wrapper library wouldn't fix for you. With that in mind, I think your proposed solution is likely to be much better than changing Tika to support composite values, with the problems that would bring.

Any objections to creating a new Metadata keyspace called Geographic, starting with LATITUDE = geo:latitude and LONGITUDE = geo:longitude? I can think of a few others we might want in future (height, bearing, etc.), which makes me think its own space might make sense.
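
As a rough illustration of the keyspace and the "little wrapper" idea, here is a sketch in plain Java. The two key strings come from the proposal above; everything else (the class name GeoKeys, the helper methods, and the use of a Map of lists in place of Tika's Metadata class) is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper: keeps latitude and longitude values in matching pairs.
// A Map<String, List<String>> stands in for a multi-valued metadata object.
final class GeoKeys {

    static final String LATITUDE = "geo:latitude";
    static final String LONGITUDE = "geo:longitude";

    /** Appends one point, always adding to both keys so the lists stay aligned. */
    static void addPoint(Map<String, List<String>> metadata, double lat, double lon) {
        metadata.computeIfAbsent(LATITUDE, k -> new ArrayList<>()).add(Double.toString(lat));
        metadata.computeIfAbsent(LONGITUDE, k -> new ArrayList<>()).add(Double.toString(lon));
    }

    /** Removes the point at the given position from both keys; false if absent. */
    static boolean removePoint(Map<String, List<String>> metadata, int index) {
        List<String> lats = metadata.get(LATITUDE);
        List<String> lons = metadata.get(LONGITUDE);
        if (lats == null || lons == null
                || index < 0 || index >= lats.size() || index >= lons.size()) {
            return false;
        }
        lats.remove(index);
        lons.remove(index);
        return true;
    }
}
```

Because callers only ever go through addPoint and removePoint, they cannot add a latitude without its longitude, which is the fiddly part the wrapper is meant to hide.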



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880836#action_12880836 ] 

Chris A. Mattmann commented on TIKA-443:
----------------------------------------

Hi Guys,

Thanks for the effort here. Please try hard to keep the discussions on list as the community will benefit from them and can help provide feedback incrementally.

Thanks,
Chris




[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Arturo Beltran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881146#action_12881146 ] 

Arturo Beltran commented on TIKA-443:
-------------------------------------

I'm not convinced about using OGDI. From what I understand from reading the documentation, OGDI offers an API in C, so we would run into the same problem integrating it with Java. In addition, the project has not been updated since 2008, so newer geographic formats (e.g. KML) are not supported. Also, I think OGDI does not support databases or services.

However, you can do some proof of concept to see if it would be very difficult to integrate with Java and see exactly what metadata can be extracted using OGDI. Then we can compare these results with mine and decide. 

As you can see, I've attached a sample XML file (getFDOMetadata.xml) that contains the information extracted from a SHP file by my proof-of-concept server based on FDO. This is the result of a simple HTTP call (http://localhost:12345/getFDOMetadata?source=C:\ExampleData\shp_world_countries\country.shp&provider=SHP)
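
For a parser that calls such a server, the request URL has to be assembled from the file path and provider name. The sketch below shows one way to do that in plain Java. The endpoint and the parameter names "source" and "provider" are taken from the call above; the class name FdoRequest is hypothetical, and percent-encoding the parameter values is an assumption about what the server accepts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds the proof-of-concept request URL.
final class FdoRequest {

    /** Builds baseUrl?source=...&provider=... with percent-encoded values. */
    static String buildUrl(String baseUrl, String source, String provider) {
        try {
            return baseUrl
                    + "?source=" + URLEncoder.encode(source, "UTF-8")
                    + "&provider=" + URLEncoder.encode(provider, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }
}
```

Encoding matters here mainly because Windows paths contain backslashes and colons, which are not safe to place in a query string as-is.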

For now, I'll keep trying to get my "Hello world" Tika parser running.

Regards,
     Arturo



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883159#action_12883159 ] 

Chris A. Mattmann commented on TIKA-443:
----------------------------------------

Hey Nick,

I think we need to support both cases (single lat/lon per document as well as many lat/lon pairs per document). In the case of the former, it's easy, we have:

key: Metadata.LATITUDE
val:  some lat

key: Metadata.LONGITUDE
val:  some lon

And, in the case of the latter, we have:

key: Metadata.LATITUDE
val:  some lat, some lat2, some lat3, some lat n...

key: Metadata.LONGITUDE
val:  some lon, some lon2, some lon3, some lon n...

Because the keys are ordered in the Metadata object, I think that we can make sure they match up and treat single points the same as for multiple points. It's great to have support for both on a per Metadata object basis too since many scientific data formats have both scenarios in them (e.g., NetCDF and HDF typically have arrays of lats and lons, and sometimes, single point values as well). 
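
To make the matching-up concrete, here is a small sketch in plain Java that pairs the i-th latitude with the i-th longitude. It assumes the values stored under each key keep their insertion order; the class name LatLonPairs is hypothetical, and plain lists of strings stand in for the values read back from a Metadata object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader: turns two parallel value lists into (lat, lon) points.
final class LatLonPairs {

    /** Pairs values by position; a single point is just lists of size one. */
    static List<double[]> pair(List<String> lats, List<String> lons) {
        if (lats.size() != lons.size()) {
            throw new IllegalArgumentException(
                    "Mismatched counts: " + lats.size() + " lats, " + lons.size() + " lons");
        }
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < lats.size(); i++) {
            points.add(new double[] {
                Double.parseDouble(lats.get(i)),
                Double.parseDouble(lons.get(i))
            });
        }
        return points;
    }
}
```

The same call handles both scenarios: one value per key yields one point, and n values per key yield n points, with a mismatch in counts reported instead of silently producing wrong pairs.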

The reason we need to support both is that distance computation (point/radius, bounding box, and polygon) would require both scenarios to be supported. I've been thinking that once this work is prototyped, to integrate Tika with the work in SIS to build out a computational spatial library. I think Tika could be used to feed in lats/lons into SIS.
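
As a rough idea of what such computations look like once lats and lons are available, here is a sketch in plain Java of a point/radius test and a bounding box. It is not taken from SIS or Tika: the class name GeoDistance is hypothetical, and the distance uses the haversine formula on a sphere with an assumed mean Earth radius of 6371 km.

```java
// Hypothetical helper for the point/radius and bounding-box cases.
final class GeoDistance {

    // Assumed mean Earth radius; a spherical approximation, not a geodetic model.
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometres between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** Point/radius test: is the second point within radiusKm of the first? */
    static boolean withinRadius(double lat1, double lon1,
                                double lat2, double lon2, double radiusKm) {
        return haversineKm(lat1, lon1, lat2, lon2) <= radiusKm;
    }

    /** Returns {minLat, minLon, maxLat, maxLon} for non-empty parallel arrays. */
    static double[] boundingBox(double[] lats, double[] lons) {
        double minLat = lats[0], maxLat = lats[0];
        double minLon = lons[0], maxLon = lons[0];
        for (int i = 1; i < lats.length; i++) {
            minLat = Math.min(minLat, lats[i]);
            maxLat = Math.max(maxLat, lats[i]);
            minLon = Math.min(minLon, lons[i]);
            maxLon = Math.max(maxLon, lons[i]);
        }
        return new double[] {minLat, minLon, maxLat, maxLon};
    }
}
```

The point/radius test needs only a single lat/lon, while the bounding box needs the whole array, which is one way to see why both scenarios matter. The simple min/max box does not handle data that crosses the 180-degree meridian.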

Cheers,
Chris




[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Mayank Singh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880722#action_12880722 ] 

Mayank Singh commented on TIKA-443:
-----------------------------------

Hi Arturo
I would like to collaborate on this issue. I have also sent you a mail regarding the same.
Thanks and regards
Mayank



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Arturo Beltran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880762#action_12880762 ] 

Arturo Beltran commented on TIKA-443:
-------------------------------------

Hi all,

I am pleased by the interest shown by the community on my proposal. As I said, any help is welcome.
I have sent Mayank all the details about my work on this issue. If anyone else is interested in collaborating, or simply wants to share ideas or comments, do not hesitate to contact me.

Cheers,
Arturo



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Arturo Beltran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880844#action_12880844 ] 

Arturo Beltran commented on TIKA-443:
-------------------------------------

You are right, Chris. From now on, I will try to keep the discussions on the list or here.

I will try to explain briefly what exactly I'm working on so that you can get involved.
The first piece is what allows us to access resources: we need a platform that gives access to heterogeneous resources in the most homogeneous way possible. The best approach I've found is FDO (http://fdo.osgeo.org/). In short, FDO is an API for manipulating, defining and analyzing geospatial information regardless of where it is stored.

So it looks simple: I only have to integrate FDO as a Tika parser and I'm done. The problem appeared when trying to connect this C++ API with Java. I have worked with SWIG and directly with JNI, but I have not gotten it to work.
Finally, as a temporary measure and to serve as a proof of concept, I implemented a simple HTTP server in .NET that offers resource descriptions using FDO. And now I'm trying to create a dummy parser for Tika that makes calls to that server.

I hope I explained well and that you could understand something, otherwise, feel free to ask again.

Greetings and thanks for your interest:
     Arturo



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Mayank Singh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880921#action_12880921 ] 

Mayank Singh commented on TIKA-443:
-----------------------------------

Arturo, I am not very comfortable with C++ and have no knowledge of the .NET platform (I'm a Java guy), so my help will be very limited if you plan on using FDO. However, I was looking around for alternatives and found OGDI (http://ogdi.sourceforge.net/), which can act as a middle layer between various data sources and has almost the same capabilities for data dissemination over the network as FDO (more info here: http://www.gisdevelopment.net/technology/gis/techgi0057b.htm).
   So what I am suggesting is that we look into it, and once we get the heterogeneous data into the OGDI-supported uniform data structure, we can use Java to integrate it with Tika.
    I'll keep searching for more info. Do tell me your views on this.
Regards
Mayank



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Arturo Beltran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883822#action_12883822 ] 

Arturo Beltran commented on TIKA-443:
-------------------------------------

As I commented in the issue TIKA-445, after a few days off I found a pleasant surprise. Good job. 
 
Greetings and thanks for your work



[jira] Updated: (TIKA-443) Geographic Information Parser

Posted by "Arturo Beltran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arturo Beltran updated TIKA-443:
--------------------------------

    Attachment: getFDOMetadata.xml

Example XML containing the information extracted from a SHP (shapefile) by my proof-of-concept server based on FDO.



[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883182#action_12883182 ] 

Chris A. Mattmann commented on TIKA-443:
----------------------------------------

Hey Nick,

Yep, +1 on having a new namespace called "Geographic" with the given two fields as a starting point. We should probably track and commit it in a new issue.

Thanks for your thoughts on this!

Cheers,
Chris




[jira] Commented: (TIKA-443) Geographic Information Parser

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883153#action_12883153 ] 

Nick Burch commented on TIKA-443:
---------------------------------

I was wondering about extracting geo data from JPEG EXIF tags. For this, we'd probably want dedicated metadata properties for lat and long.

(Other files can have a single lat+long in them too, e.g. HTML pages with the ICBM meta tags.)

Not sure how well that might integrate with this work though, since shapefiles will typically contain a large number of lats+longs (or similar geographic points).

Anyone have any ideas about a single created-at position vs a stream of locations from geo formats?
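
To make the "single lat+long" case concrete, here is a minimal sketch of pulling one position out of the content of an ICBM meta tag (e.g. content="50.167958, -97.133185") and storing it under two dedicated keys. It is an illustration only: the class name, the key names "geo:lat" and "geo:long", and the use of a plain Map in place of a metadata object are all assumptions made for this example, not an existing Tika API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: class and key names are hypothetical, not Tika API. */
final class GeoPointSketch {
    // Hypothetical metadata keys for a single "created-at" position.
    static final String LATITUDE = "geo:lat";
    static final String LONGITUDE = "geo:long";

    // Two signed decimal numbers separated by a comma or a semicolon.
    private static final Pattern POSITION = Pattern.compile(
        "^\\s*([-+]?\\d+(?:\\.\\d+)?)\\s*[,;]\\s*([-+]?\\d+(?:\\.\\d+)?)\\s*$");

    /**
     * Parses the content attribute of an ICBM meta tag.
     * Returns an empty map when the value is missing, malformed,
     * or outside the valid latitude/longitude ranges.
     */
    static Map<String, String> fromIcbm(String content) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (content == null) {
            return metadata;
        }
        Matcher m = POSITION.matcher(content);
        if (!m.matches()) {
            return metadata;
        }
        double lat = Double.parseDouble(m.group(1));
        double lon = Double.parseDouble(m.group(2));
        if (lat < -90 || lat > 90 || lon < -180 || lon > 180) {
            return metadata;
        }
        // Store the values as written in the document, one per key.
        metadata.put(LATITUDE, m.group(1));
        metadata.put(LONGITUDE, m.group(2));
        return metadata;
    }
}
```

Two flat keys fit a file that carries exactly one position. They do not address the stream-of-locations case raised above, which this sketch deliberately leaves open.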
