
Posted to dev@sis.apache.org by Martin Desruisseaux <ma...@geomatys.fr> on 2013/11/26 05:51:46 UTC

Current state of work (2013-11-25)

Hello all

The datum package should be complete for now, except for the problem of
datum aliases. This issue exists because the name of a geodetic datum is
the only way to distinguish two otherwise identical datums, but the
choice of name varies among producers. JIRA task [1] gives more details.
SIS partially addresses this issue by implementing heuristic rules in a
"nameMatches(String)" method [2].

I would like a better method name for "nameMatches". If possible, I
would like something that contains the word "heuristic" or "lenient", or
anything else that conveys the heuristic nature of this method. Does
anyone have suggestions? I do not know whether
"nameMatchesHeuristically" or "heuristicNameMatches" would be correct
English.

The current work is now on the "coordinate system" (CS) package [3]. 
After that package, the last package to create before a SIS 0.4 
candidate would be "coordinate reference system" (CRS).

     Martin


[1] https://issues.apache.org/jira/browse/SIS-145
[2] https://builds.apache.org/job/sis-jdk7/site/apidocs/org/apache/sis/referencing/AbstractIdentifiedObject.html#nameMatches%28java.lang.String%29
[3] https://builds.apache.org/job/sis-jdk7/site/apidocs/org/apache/sis/referencing/cs/package-summary.html

Re: Current state of work (2013-11-25)

Posted by Martin Desruisseaux <ma...@geomatys.fr>.

Hello Adam and Joe

Thanks for the feedback. I was not aware of the OGRSpatialReference API;
it is interesting to know about.

     Martin



Re: Current state of work (2013-11-25)

Posted by Joe White <wh...@gmail.com>.

Hi, Adam,
Lucene would be way overkill for the type of matching that Martin is talking about. The work he's doing now corresponds both to alias lookups in the EPSG database and to the morphFromESRI call in OGRSpatialReference. Unfortunately, datum matching is limited to the name, as Martin said, and everyone seems to have their own.

Joe


Re: Current state of work (2013-11-25)

Posted by Adam Estrada <es...@gmail.com>.

Understood! Thanks, Martin. I think isHeuristicMatchForName() sounds great!

Adam


Re: Current state of work (2013-11-25)

Posted by Martin Desruisseaux <ma...@geomatys.fr>.

Hello Adam

Thanks for the links, I was not aware of them. There is currently no
probability value for matching strings. The current heuristic rules are
based on known practices, such as ESRI adding the "D_" prefix to datum
names, spaces being replaced by '_', and non-alphanumeric characters
being ignored. I have not yet found a need to match strings that are
only similar. So far I have seen either exact matches under the above
rules, or completely different names (e.g. "International 1924" and
"Hayford 1909" are the same ellipsoid).
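
The rules described above could be sketched roughly as follows. This is
only an illustration, not the SIS implementation: the class name, the
"normalize" helper, the alias list parameter and the case-insensitive
comparison are all assumptions made for this example.

```java
import java.util.List;

/** Hypothetical helper illustrating the heuristic rules; not the actual SIS code. */
final class HeuristicNameMatch {
    private HeuristicNameMatch() {}

    /**
     * Reduces a name to a comparison key: drops a leading "D_" prefix (ESRI practice),
     * then keeps only letters and digits, so that spaces, '_' and punctuation are ignored.
     * Lower-casing is an assumption of this sketch, not a rule stated in the thread.
     */
    static String normalize(String name) {
        String s = name.trim();
        if (s.regionMatches(true, 0, "D_", 0, 2)) {
            s = s.substring(2);
        }
        StringBuilder key = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            if (Character.isLetterOrDigit(c)) {
                key.appendCodePoint(Character.toLowerCase(c));
            }
            i += Character.charCount(c);
        }
        return key.toString();
    }

    /**
     * Returns true if the candidate equals the primary name or any alias
     * after normalization. There is no similarity score: a name either matches or not.
     */
    static boolean isHeuristicMatchForName(String primaryName, List<String> aliases, String candidate) {
        String key = normalize(candidate);
        if (key.equals(normalize(primaryName))) {
            return true;
        }
        for (String alias : aliases) {
            if (key.equals(normalize(alias))) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, a string such as "D_North_American_1983" matches
"North American 1983", while "Hayford 1909" matches "International 1924"
only if it has been registered as an alias.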

Lucene of course has a role, and we actually do use it, but rather in
some layers on top of metadata. I think it will come to SIS later,
presumably in a separate module...

     Martin




Re: Current state of work (2013-11-25)

Posted by Adam Estrada <es...@gmail.com>.

Martin,

Is there a probability value that is returned for the matching
string(s)? I actually just came across a blog post [1] that does
something similar to what you are working towards. They use the
term "best partial" for comparing strings of noticeably
different lengths. This appears to be similar to using a Jaccard
index [2] for string comparison, but on smaller bodies of text like the
titles of said aliases. Would this be an application for a
Lucene index, which already has all the information retrieval goodness
built in?
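
For reference, a Jaccard index over word tokens might be sketched as
below. This is only an illustration under stated assumptions: the class
name, the tokenization on non-alphanumeric characters and the
lower-casing are choices made for this example, and it is not what the
blog post or Lucene actually implement.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: Jaccard index |A ∩ B| / |A ∪ B| over the word tokens of two names. */
final class JaccardSimilarity {
    private JaccardSimilarity() {}

    /** Splits on runs of non-alphanumeric characters and lower-cases (assumed tokenization). */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Returns a value in [0, 1]; two empty token sets are treated as identical. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        int unionSize = ta.size() + tb.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

Under these assumptions, jaccard("North American 1983",
"D_North_American_1983") gives 0.75, while two names that share no
token, such as "International 1924" and "Hayford 1909", give 0.0.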

Adam

[1] http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
[2] http://en.wikipedia.org/wiki/Jaccard_index


Re: Current state of work (2013-11-25)

Posted by Martin Desruisseaux <ma...@geomatys.fr>.

On 25/11/13 23:51, Martin Desruisseaux wrote:
> I would like a better method name for "nameMatches". If possible, I
> would like something that contains the word "heuristic" or "lenient",
> or anything else that conveys the heuristic nature of this method.
> Does anyone have suggestions? I do not know whether
> "nameMatchesHeuristically" or "heuristicNameMatches" would be correct
> English.

I'm trying "isHeuristicMatchForName(String)" [1]. A search on the
internet found a few hits for "is heuristic match". If anyone has
another idea, please let us know.

This particular method may need to be revisited as we try to handle data
from a larger range of data producers, so I think it is worth making
its purpose easy to spot.

     Martin


[1] https://builds.apache.org/job/sis-jdk7/site/apidocs/org/apache/sis/referencing/AbstractIdentifiedObject.html#isHeuristicMatchForName%28java.lang.String%29