Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/12/23 05:59:43 UTC
[Tika Wiki] Update of "GeoTopicParser" by MadhavSharan
The "GeoTopicParser" page has been changed by MadhavSharan:
https://wiki.apache.org/tika/GeoTopicParser?action=diff&rev1=10&rev2=11
  usage: lucene-geo-gazetteer
   -b,--build <gazetteer file>           The Path to the Geonames
                                         allCountries.txt
+  -c,--count <number of results>        Number of best results to be
+                                        returned for one location
   -h,--help                             Print this message.
   -i,--index <directoryPath>            The path to the Lucene index
                                         directory to either create or read
+  -json,--json                          Formats output in well defined json
+                                        structure
   -s,--search <set of location names>   Location names to search the
                                         Gazetteer for
+  -server,--server                      Launches Geo Gazetteer Service
+
}}}
You will now need to build a Gazetteer using the Geonames.org dataset. Instructions are provided below. Note that you will need at least 1.2 GB of disk space to build the Lucene index for the Gazetteer.
@@ -44, +50 @@
You can verify that the Gazetteer build worked by searching, e.g., for Pasadena and/or Texas:
{{{
- $ lucene-geo-gazetteer -s Pasadena Texas
+ $ lucene-geo-gazetteer -s Pasadena Texas -json
+ {"Texas":[{"name":"Texas","countryCode":"US","admin1Code":"TX","admin2Code":"","latitude":31.25044,"longitude":-99.25061}],"Pasadena":[{"name":"Pasadena","countryCode":"US","admin1Code":"CA","admin2Code":"037","latitude":34.14778,"longitude":-118.14452}]}
- [
- {"Texas" : [
- "Texas",
- "-91.92139",
- "18.05333"
- ]},
- {"Pasadena" : [
- "Pasadena",
- "-74.06446",
- "4.6964"
- ]}
- ]
}}}
+ Now you need to start the REST service of lucene-geo-gazetteer. Tika uses this service internally.
+
+ {{{
+ $ lucene-geo-gazetteer -server
+ }}}
+
+ You can verify that the REST API is responding by searching, e.g., for Pasadena and/or Texas:
+
+ {{{
+ $ curl "http://localhost:8765/api/search?s=Pasadena&s=Texas"
+ {"Texas":[{"name":"Texas","countryCode":"US","admin1Code":"TX","admin2Code":"","latitude":31.25044,"longitude":-99.25061}],"Pasadena":[{"name":"Pasadena","countryCode":"US","admin1Code":"CA","admin2Code":"037","latitude":34.14778,"longitude":-118.14452}]}
+ }}}
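
The same check can be scripted. The following is a minimal Python sketch, assuming the service listens at `http://localhost:8765/api/search` and accepts one `s` query parameter per location name, as in the curl example above. The helper names `build_search_url` and `search_gazetteer` are hypothetical and are not part of lucene-geo-gazetteer or Tika.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint, taken from the curl example; adjust if your service differs.
DEFAULT_ENDPOINT = "http://localhost:8765/api/search"


def build_search_url(names, endpoint=DEFAULT_ENDPOINT):
    """Build a search URL with one `s` parameter per location name."""
    query = urlencode([("s", name) for name in names])
    return f"{endpoint}?{query}"


def search_gazetteer(names, endpoint=DEFAULT_ENDPOINT, timeout=10):
    """Query a running gazetteer service and return the decoded JSON body."""
    with urlopen(build_search_url(names, endpoint), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `search_gazetteer(["Pasadena", "Texas"])` should return the same structure that the curl call prints. `urlencode` also escapes names that contain spaces or other reserved characters.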
+
- Note that we used the convenience script `lucene-geo-gazetteer`, which assumes that you created an index named geoIndex in the $HOME/src/lucene-geo-gazetter/geoIndex directory. We could have also used the pure Java command line to search. The return from the Gazetteer is a JSON List of JSON Object structures in which the structure is a key->JSON List map. The key is the location name given and the List is a list of closest match (by Edit Distance) in the Gazetteer for that name, followed by Latitude, and Longitude of that location.
+ Note that we used the convenience script `lucene-geo-gazetteer`, which assumes that you created an index named geoIndex in the $HOME/src/lucene-geo-gazetter/geoIndex directory. We could also have used the pure Java command line to search. The return from the Gazetteer is a JSON object that maps each key to a list of objects. The key is the location name given, and the list holds the most popular location objects in the Gazetteer for that name.
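
To make the response structure concrete, here is a small Python sketch that reads the sample output shown above and picks one location per queried name. It assumes that the first entry of each list is the preferred match and that a name without matches yields an empty list; neither is confirmed by this page. The function name `best_matches` is hypothetical.

```python
import json

# Sample response copied from the search output shown above.
SAMPLE = (
    '{"Texas":[{"name":"Texas","countryCode":"US","admin1Code":"TX",'
    '"admin2Code":"","latitude":31.25044,"longitude":-99.25061}],'
    '"Pasadena":[{"name":"Pasadena","countryCode":"US","admin1Code":"CA",'
    '"admin2Code":"037","latitude":34.14778,"longitude":-118.14452}]}'
)


def best_matches(response_text):
    """Map each queried name to (name, latitude, longitude) of its first result."""
    results = json.loads(response_text)
    best = {}
    for query, candidates in results.items():
        # Assumption: the first candidate is the preferred one; skip empty lists.
        if candidates:
            top = candidates[0]
            best[query] = (top["name"], top["latitude"], top["longitude"])
    return best


for query, (name, lat, lon) in best_matches(SAMPLE).items():
    print(f"{query}: {name} at ({lat}, {lon})")
```

Passing `-c` with a larger count would presumably give longer lists per name; the loop above still reads only the first entry of each.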
== Installing and downloading an NER model ==