Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2010/12/29 18:05:07 UTC

[DIH] and XML Namespaces

All,

I am indexing some RSS feeds that are bound to specific namespaces. See
below...

<dataConfig>
  <dataSource type="HttpDataSource"
              encoding="UTF-8"
              connectionTimeout="500000"
              readTimeout="500000"/>
  <document>
    <entity name="filedatasource"
            processor="FileListEntityProcessor"
            baseDir="C:/Apache/Solr-Nightly/solr/example/solr/conf/dataimporthandler"
            fileName="^.*xml$"
            recursive="true"
            rootEntity="false"
            dataSource="null">

      <entity name="CBP"
              pk="link"
              datasource="filedatasource"
              url="http://ws.geonames.org/rssToGeoRSS?geoRSS=simple&amp;feedUrl=http://www.cbp.gov/xp/cgov/admin/rss/?rssUrl=/home.xml"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="DateFormatTransformer,HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title"               commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link"                commonField="true" />
        <field column="subject"      xpath="/rss/channel/description"         commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description"    stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/dc:creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate"        dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="dcdate"       xpath="/rss/channel/item/dc:date"        dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="store"        xpath="/rss/channel/item/georss:point" />
      </entity>
    </entity>
  </document>
</dataConfig>

The import completely skips any xpath that contains a colon (that is, a
namespace prefix), e.g. /rss/channel/item/georss:point. Any ideas on how to
get around this using the DIH?

Thanks to Chris Mattmann for the heads up on the geocoding services.

Adam

Re: [DIH] and XML Namespaces

Posted by Adam Estrada <es...@gmail.com>.
Piece of cake!

http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example

"Our XPath support has its limitations (no wildcards, only fullpath etc) but
we have tried to make sure that common use-cases are covered and since it's
based on a streaming parser, it is extremely fast and consumes constant
amount of memory even for large XMLs. It does not support namespaces, but
it can handle xmls with namespaces. When you provide the xpath, just drop
the namespace and give the rest (eg if the tag is '<dc:subject>' the mapping
should just contain 'subject'). Easy, isn't it? And you didn't need to write
one line of code! Enjoy :)"
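
Applied to the config in the original message, that advice would presumably
mean rewriting the three prefixed mappings as below. This is only a sketch
based on the wiki text quoted above and has not been run against this feed.
The column names and date format are copied unchanged from the original
config; only the xpath values differ.

```xml
<!-- Sketch: same fields as in the original config, with the namespace
     prefix (dc:, georss:) dropped from each xpath, per the wiki note. -->
<field column="creator" xpath="/rss/channel/item/creator" />
<field column="dcdate"  xpath="/rss/channel/item/date"
       dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
<field column="store"   xpath="/rss/channel/item/point" />
```

One thing to watch: with the prefix dropped, a mapping such as
/rss/channel/item/subject would presumably match a <dc:subject> element as
well as an unprefixed <subject>, so elements that share a local name across
namespaces may end up in the same column.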
