Posted to dev@tika.apache.org by Joe White <wh...@gmail.com> on 2012/02/26 18:39:30 UTC

Gdal Integration (TIKA 605)

Hi,
I'm looking into implementing a bridge/link between Tika and GDAL so that geospatial information can be saved from georeferenced images and vector types.  One thing that I have noticed while going through the code is that the code only defines geographic coordinate types, using latitudes and longitudes.  Is this by design?  If GDAL is wrapped into Tika, and a projected image is imported, are the geospatial extents meant to be held in the metadata as geographic points, possibly as WGS 84?  

Thanks

Joe White

Re: Gdal Integration (TIKA 605)

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Joe,

On Feb 26, 2012, at 11:06 AM, Joe White wrote:

> Hi, Chris,
> I would agree that we probably should come up with a more comprehensive solution for this wrt the metadata object and the resulting XHTML.  That would make the geospatial stuff feel more like a first-class citizen in the metadata hierarchy.

+1.

> 
> We will probably need to support more coordinate systems than just WGS 84, as there are a number of systems that have no transformation to WGS 84.

+1, agreed, WGS84 was just the first one that came to mind.


> The encoding of the WKT is also pretty important.  Would you rather break it down to its component parts, probably datum and projection for starters, or leave it whole?  Obviously, the more metadata we have, the more powerful Tika becomes, but there is a point where you have too much data that is not as useful.

Let's start out with its component parts, datum and projection, and encode those as metadata fields. So we'd likely update the existing Geographic metadata interface
with these new keys as a starter.
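
A minimal sketch of what "datum and projection as metadata fields" could look like, assuming OGC WKT 1 style input. The key names geo:datum and geo:projection and the class WktFields are hypothetical, not existing Tika keys, and a plain multi-valued Map stands in for the Tika metadata object:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull the datum and projection names out of a WKT 1 string. */
final class WktFields {
    // Hypothetical key names, chosen for illustration only.
    static final String DATUM_KEY = "geo:datum";
    static final String PROJECTION_KEY = "geo:projection";

    // Capture the quoted name that follows DATUM[ or PROJECTION[.
    // \b keeps DATUM from matching inside VERT_DATUM.
    private static final Pattern DATUM =
            Pattern.compile("\\bDATUM\\s*\\[\\s*\"([^\"]*)\"");
    private static final Pattern PROJECTION =
            Pattern.compile("\\bPROJECTION\\s*\\[\\s*\"([^\"]*)\"");

    /** Returns a key to multi-value map; a key is absent if the WKT lacks that part. */
    static Map<String, List<String>> extract(String wkt) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        addFirstMatch(fields, DATUM_KEY, DATUM, wkt);
        addFirstMatch(fields, PROJECTION_KEY, PROJECTION, wkt);
        return fields;
    }

    private static void addFirstMatch(Map<String, List<String>> fields,
                                      String key, Pattern pattern, String wkt) {
        Matcher m = pattern.matcher(wkt);
        if (m.find()) {
            fields.computeIfAbsent(key, k -> new ArrayList<>()).add(m.group(1));
        }
    }

    public static void main(String[] args) {
        String wkt = "PROJCS[\"WGS 84 / UTM zone 11N\","
                + "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\","
                + "SPHEROID[\"WGS 84\",6378137,298.257223563]],"
                + "PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433]],"
                + "PROJECTION[\"Transverse_Mercator\"],UNIT[\"metre\",1]]";
        System.out.println(extract(wkt));
    }
}
```

A purely geographic WKT (GEOGCS only) has no PROJECTION node, so only the datum key would be set in that case. A real implementation would more likely ask GDAL's spatial reference classes for these values than use regular expressions.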

> 
> On another note, I took a look at the code for your 605 patch, and I have a suggestion. Reading the notes on the checkins for the patch, I noticed that no one had suggested using the in-memory Dataset as the default type.  There is no reason why the stream used to open the Tika parser could not be used to fill a buffer with the file data, and then use that to create a dataset.

Hmm, so your suggestion is to use the in-memory Dataset API and that would be streamable via Tika? That would be great, I just wasn't familiar enough with GDAL
to know how to do that, so a coding example in Java, if you have one, would help me wrap my head around it.
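
A rough sketch of the buffering half of that idea, using only the JDK: drain the parser's InputStream into a byte array. The class name StreamBuffer is hypothetical. The GDAL half appears only as comments, on the assumption that the installed GDAL Java bindings expose the /vsimem/ helpers named there; that should be checked against the bindings actually in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch: copy a stream into memory so it can back an in-memory dataset. */
final class StreamBuffer {
    /** Reads the stream to its end and returns everything that was read. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream a Tika parser would receive.
        InputStream stream = new ByteArrayInputStream(new byte[] {1, 2, 3});
        byte[] data = readFully(stream);
        System.out.println(data.length + " bytes buffered");

        // Assumed next step (not compiled here, needs the GDAL Java bindings):
        //   gdal.FileFromMemBuffer("/vsimem/tika-input", data);
        //   Dataset ds = gdal.Open("/vsimem/tika-input");
        //   ... read projection and metadata from ds ...
        //   ds.delete();
        //   gdal.Unlink("/vsimem/tika-input");
    }
}
```

The trade-off is that the whole file sits in memory at once, which may be a concern for large rasters.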

> 
> As it is, I'm trying to get GDAL to cooperate with me on my Mac.  Being a newcomer to Mac seems to be a drawback when trying to be productive.  It just takes a little more fight to get the bits to do what I really want.
> 

Heh, yeah I was trying to do this too. At one point I had it running but a few OS upgrades have nixed that. Let's see if I can get it up
and running again too so we can co-develop this.

> In any case, once I get GDAL whipped into shape, I'll see if I can't get a test file to recognize any geospatial data, and then we will be off and running.

Great!

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Gdal Integration (TIKA 605)

Posted by Joe White <wh...@gmail.com>.
Hi, Chris,
I would agree that we probably should come up with a more comprehensive solution for this wrt the metadata object and the resulting XHTML.  That would make the geospatial stuff feel more like a first-class citizen in the metadata hierarchy.

We will probably need to support more coordinate systems than just WGS 84, as there are a number of systems that have no transformation to WGS 84.  The encoding of the WKT is also pretty important.  Would you rather break it down to its component parts, probably datum and projection for starters, or leave it whole?  Obviously, the more metadata we have, the more powerful Tika becomes, but there is a point where you have too much data that is not as useful.

On another note, I took a look at the code for your 605 patch, and I have a suggestion. Reading the notes on the checkins for the patch, I noticed that no one had suggested using the in-memory Dataset as the default type.  There is no reason why the stream used to open the Tika parser could not be used to fill a buffer with the file data, and then use that to create a dataset.

As it is, I'm trying to get GDAL to cooperate with me on my Mac.  Being a newcomer to Mac seems to be a drawback when trying to be productive.  It just takes a little more fight to get the bits to do what I really want.

In any case, once I get GDAL whipped into shape, I'll see if I can't get a test file to recognize any geospatial data, and then we will be off and running.

Thanks

Joe 
On Feb 26, 2012, at 1:10 PM, Mattmann, Chris A (388J) wrote:

> Hi Joe,
> 
> Awesome! Thanks for picking this up and getting interested in this work. Right now, the only use case we've had so far
> is to represent lats and lons (WGS84). It would be great to extract more information and come up with a policy for representing
> more WKTs and so forth. We should probably start by coming up with a scheme for encoding the extracted information in the
> Tika metadata object and in its output XHTML. Do you have any ideas about how to do that? Right now in the existing patch
> on TIKA-605, I simply intended to use the met object and its key-multi-value structure to represent the extracted information,
> but to take advantage of streaming and of content handlers, we ought to encode this information in the output XHTML.
> 
> Thoughts?
> 
> Cheers,
> Chris


Re: Gdal Integration (TIKA 605)

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
+1 to making it work for vector formats too -- geospatial imagery was just the first notch to tackle... :)

Cheers,
Chris

On Feb 26, 2012, at 11:09 AM, Joe White wrote:

> Chris,
> One other thing occurred to me while looking at this.  All of the discussion I've seen thus far revolves around geospatial imagery.  Has there been any discussion about using Tika on any of the geospatial vector formats?  I would think they would go hand in hand, and OGR recognizes many of them.
> 
> Joe




Re: Gdal Integration (TIKA 605)

Posted by Joe White <wh...@gmail.com>.
Chris,
One other thing occurred to me while looking at this.  All of the discussion I've seen thus far revolves around geospatial imagery.  Has there been any discussion about using Tika on any of the geospatial vector formats?  I would think they would go hand in hand, and OGR recognizes many of them.

Joe

On Feb 26, 2012, at 1:10 PM, Mattmann, Chris A (388J) wrote:

> Hi Joe,
> 
> Awesome! Thanks for picking this up and getting interested in this work. Right now, the only use case we've had so far
> is to represent lats and lons (WGS84). It would be great to extract more information and come up with a policy for representing
> more WKTs and so forth. We should probably start by coming up with a scheme for encoding the extracted information in the
> Tika metadata object and in its output XHTML. Do you have any ideas about how to do that? Right now in the existing patch
> on TIKA-605, I simply intended to use the met object and its key-multi-value structure to represent the extracted information,
> but to take advantage of streaming and of content handlers, we ought to encode this information in the output XHTML.
> 
> Thoughts?
> 
> Cheers,
> Chris


Re: Gdal Integration (TIKA 605)

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Joe,

Awesome! Thanks for picking this up and getting interested in this work. Right now, the only use case we've had so far
is to represent lats and lons (WGS84). It would be great to extract more information and come up with a policy for representing
more WKTs and so forth. We should probably start by coming up with a scheme for encoding the extracted information in the
Tika metadata object and in its output XHTML. Do you have any ideas about how to do that? Right now in the existing patch
on TIKA-605, I simply intended to use the met object and its key-multi-value structure to represent the extracted information,
but to take advantage of streaming and of content handlers, we ought to encode this information in the output XHTML.
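
One way the key-multi-value structure could be pushed through a content handler is sketched below, using only the JDK's SAX classes. Each value becomes an XHTML meta element. The class name MetaEmitter is hypothetical, a plain Map stands in for the Tika metadata object, and the keys and values in main are illustrative only:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: emit one XHTML meta element per metadata value as SAX events. */
final class MetaEmitter {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static void emit(Map<String, List<String>> metadata, ContentHandler handler)
            throws SAXException {
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            for (String value : entry.getValue()) {
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "name", "name", "CDATA", entry.getKey());
                atts.addAttribute("", "content", "content", "CDATA", value);
                handler.startElement(XHTML, "meta", "meta", atts);
                handler.endElement(XHTML, "meta", "meta");
            }
        }
    }

    public static void main(String[] args) throws SAXException {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("geo:lat", Arrays.asList("34.2"));       // illustrative values
        metadata.put("geo:long", Arrays.asList("-118.17"));

        // A trivial handler that prints each event; a real parser would be
        // handed the caller's handler instead.
        emit(metadata, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                System.out.println(qName + " name=" + atts.getValue("name")
                        + " content=" + atts.getValue("content"));
            }
        });
    }
}
```

Because the values are sent as events rather than collected first, a consumer can act on them as they arrive.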

Thoughts?

Cheers,
Chris

On Feb 26, 2012, at 9:39 AM, Joe White wrote:

> Hi,
> I'm looking into implementing a bridge/link between Tika and GDAL so that geospatial information can be saved from georeferenced images and vector types.  One thing that I have noticed while going through the code is that the code only defines geographic coordinate types, using latitudes and longitudes.  Is this by design?  If GDAL is wrapped into Tika, and a projected image is imported, are the geospatial extents meant to be held in the metadata as geographic points, possibly as WGS 84?  
> 
> Thanks
> 
> Joe White

