Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2010/12/29 18:05:07 UTC
[DIH] and XML Namespaces
All,
I am indexing some RSS feeds that are bound to specific namespaces. See
below...
<dataConfig>
  <dataSource type="HttpDataSource"
              encoding="UTF-8"
              connectionTimeout="500000"
              readTimeout="500000"/>
  <document>
    <entity name="filedatasource"
            processor="FileListEntityProcessor"
            baseDir="C:/Apache/Solr-Nightly/solr/example/solr/conf/dataimporthandler"
            fileName="^.*xml$"
            recursive="true"
            rootEntity="false"
            dataSource="null">
      <entity name="CBP"
              pk="link"
              datasource="filedatasource"
              url="http://ws.geonames.org/rssToGeoRSS?geoRSS=simple&feedUrl=http://www.cbp.gov/xp/cgov/admin/rss/?rssUrl=/home.xml"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="DateFormatTransformer,HTMLStripTransformer">
        <field column="source" xpath="/rss/channel/title"
               commonField="true" />
        <field column="source-link" xpath="/rss/channel/link"
               commonField="true" />
        <field column="subject" xpath="/rss/channel/description"
               commonField="true" />
        <field column="title" xpath="/rss/channel/item/title" />
        <field column="link" xpath="/rss/channel/item/link" />
        <field column="description" xpath="/rss/channel/item/description"
               stripHTML="true" />
        <field column="creator" xpath="/rss/channel/item/dc:creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author" xpath="/rss/channel/item/author" />
        <field column="comments" xpath="/rss/channel/item/comments" />
        <field column="pubdate" xpath="/rss/channel/item/pubDate"
               dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="dcdate" xpath="/rss/channel/item/dc:date"
               dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="store" xpath="/rss/channel/item/georss:point" />
      </entity>
The import completely skips any field whose xpath contains a colon,
e.g. /rss/channel/item/georss:point. Any ideas on how to get around this
using the DIH?
Thanks to Chris Mattmann for the heads up on the geocoding services.
Adam
Re: [DIH] and XML Namespaces
Posted by Adam Estrada <es...@gmail.com>.
Piece of cake!
http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example
"Our XPath support has its limitations (no wildcards, only fullpath etc) but we
have tried to make sure that common use-cases are covered and since it's
based on a streaming parser, it is extremely fast and consumes constant
amount of memory even for large XMLs. It does not support namespaces, but
it can handle xmls with namespaces. When you provide the xpath, just drop
the namespace and give the rest (eg if the tag is '<dc:subject>' the mapping
should just contain 'subject'). Easy, isn't it? And you didn't need to write
one line of code! Enjoy :)"
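Applied to the config from my first message, the three namespaced fields
would presumably become something like the following. This is only a sketch
based on the wiki passage above: the column names and date format are carried
over unchanged from the original config, and it has not been checked against
the live feed.

```xml
<!-- Sketch: same fields as before, with the namespace prefix
     (dc:, georss:) dropped from the last step of each xpath. -->
<field column="creator" xpath="/rss/channel/item/creator" />
<field column="dcdate" xpath="/rss/channel/item/date"
       dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
<field column="store" xpath="/rss/channel/item/point" />
```

The other fields have no prefix in their xpaths, so they should not need to
change.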