Posted to solr-user@lucene.apache.org by Derek Werthmuller <dw...@ctg.albany.edu> on 2011/01/13 19:47:26 UTC

DataimportHandler development issue

We're just getting started with Solr and are very interested in using Solr
for search applications.

I've got the RSS example working. On 1.4.1 it didn't work out of the box, but we
figured it out, then found the fixes in svn. Anyway, we are learning how
to load data from RSS and Atom feeds into the Solr index. We are trying to
modify the rss-data-config.xml file so that we can import Atom feeds as well,
but for some reason they don't load. Our configuration is included at the
end of this message.

We've been using the DataImportHandler Development Console
http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/rssimport to
look at the status and the DocsNum, but only the RSS feed works.
If we remove the slashdot RSS entity entirely, the Atom example still doesn't
work. We've also tried creating a separate atom-data-config.xml file and adding
the proper entry to solrconfig.xml to support the extra dataimport. That
gave us the same results:
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <lst name="initArgs">
        <lst name="defaults">
          <str name="config">atom-data-config.xml</str>
        </lst>
      </lst>
      <str name="command">status</str>
      <str name="status">idle</str>
      <str name="importResponse"/>
      <lst name="statusMessages">
        <str name="Total Requests made to DataSource">1</str>
        <str name="Total Rows Fetched">0</str>
        <str name="Total Documents Skipped">0</str>
        <str name="Full Dump Started">2011-01-13 08:42:53</str>
        <str name="Total Documents Processed">0</str>
        <str name="Time taken ">0:0:0.519</str>
      </lst>
      <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
    </response>
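
(For reference, an extra DataImportHandler is typically registered in
solrconfig.xml along the following lines. This is only a sketch: the handler
name /atomimport is an assumption chosen for illustration, and the config file
name is the one reported under initArgs in the status response.)

```xml
<!-- Sketch of a second import handler; the name "/atomimport" is hypothetical -->
<requestHandler name="/atomimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- DIH configuration file, resolved relative to the core's conf directory -->
    <str name="config">atom-data-config.xml</str>
  </lst>
</requestHandler>
```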


It's not clear why it's not working. Any advice?
Also, is this the best way to load data? We intend to load several
thousand DocBook documents once we understand how this all works. We stuck
with the RSS/Atom example since we didn't want to deal with schema changes
yet.
Thanks
	Derek

example-DIH/solr/rss/conf/rss-data-config.xml  modified source:
<dataConfig>
<dataSource type="URLDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://twitter.com/statuses/user_timeline/existdb.rss"
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/channel/item"
transformer="DateFormatTransformer">

<field column="source" xpath="/rss/channel/title" commonField="true" />
<field column="source-link" xpath="/rss/channel/link" commonField="true" />
<field column="subject" xpath="/rss/channel/subject" commonField="true" />

<field column="title" xpath="/rss/channel/item/title" />
<field column="link" xpath="/rss/channel/item/link" />
<field column="description" xpath="/rss/channel/item/description" />
<field column="creator" xpath="/rss/channel/item/creator" />
<field column="item-subject" xpath="/rss/channel/item/subject" />
<field column="date" xpath="/rss/channel/item/date"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
<field column="slash-department" xpath="/rss/channel/item/department" />
<field column="slash-section" xpath="/rss/channel/item/section" />
<field column="slash-comments" xpath="/rss/channel/item/comments" />
</entity>

<entity name="twitter"
pk="link"
url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
processor="XPathEntityProcessor"
forEach="/feed | /feed/entry"
transformer="DateFormatTransformer">

<field column="source" xpath="/feed/title" commonField="true" />
<field column="source-link" xpath="/feed/link" commonField="true" />
<field column="subject" xpath="/feed/subtitle" commonField="true" />

<field column="title" xpath="/feed/entry/title" />
<field column="link" xpath="/feed/entry/link" />
<field column="description" xpath="/feed/entry/description" />
<field column="creator" xpath="/feed/entry/creator" />
<field column="item-subject" xpath="/feed/entry/subject" />
<field column="date" xpath="/rss/channel/item/date"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
<field column="slash-department" xpath="/feed/entry/department" />
<field column="slash-section" xpath="/feed/entry/section" />
<field column="slash-comments" xpath="/feed/entry/comments" />
</entity>
</document>
</dataConfig>


Re: DataimportHandler development issue

Posted by Gora Mohanty <go...@mimirtech.com>.
On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller
<dw...@ctg.albany.edu> wrote:

> It's not clear why it's not working. Any advice?
> Also, is this the best way to load data? We intend to load several
> thousand DocBook documents once we understand how this all works. We stuck
> with the RSS/Atom example since we didn't want to deal with schema changes
> yet.
> Thanks
>        Derek
>
> example-DIH/solr/rss/conf/rss-data-config.xml  modified source:
> <dataConfig>
> <dataSource type="URLDataSource" />
> <document>
> <entity name="slashdot"
> pk="link"
> url="http://twitter.com/statuses/user_timeline/existdb.rss"
> processor="XPathEntityProcessor"
> forEach="/rss/channel | /rss/channel/item"
> transformer="DateFormatTransformer">
>
> <field column="source" xpath="/rss/channel/title" commonField="true" />
> <field column="source-link" xpath="/rss/channel/link" commonField="true" />
> <field column="subject" xpath="/rss/channel/subject" commonField="true" />
>
> <field column="title" xpath="/rss/channel/item/title" />
> <field column="link" xpath="/rss/channel/item/link" />
> <field column="description" xpath="/rss/channel/item/description" />
> <field column="creator" xpath="/rss/channel/item/creator" />
> <field column="item-subject" xpath="/rss/channel/item/subject" />
> <field column="date" xpath="/rss/channel/item/date"
> dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
> <field column="slash-department" xpath="/rss/channel/item/department" />
> <field column="slash-section" xpath="/rss/channel/item/section" />
> <field column="slash-comments" xpath="/rss/channel/item/comments" />
> </entity>
>
> <entity name="twitter"
> pk="link"
> url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
> processor="XPathEntityProcessor"
> forEach="/feed | /feed/entry"
> transformer="DateFormatTransformer">
>
> <field column="source" xpath="/feed/title" commonField="true" />
> <field column="source-link" xpath="/feed/link" commonField="true" />
> <field column="subject" xpath="/feed/subtitle" commonField="true" />
>
> <field column="title" xpath="/feed/entry/title" />
> <field column="link" xpath="/feed/entry/link" />
> <field column="description" xpath="/feed/entry/description" />
> <field column="creator" xpath="/feed/entry/creator" />
> <field column="item-subject" xpath="/feed/entry/subject" />
> <field column="date" xpath="/rss/channel/item/date"
> dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
> <field column="slash-department" xpath="/feed/entry/department" />
> <field column="slash-section" xpath="/feed/entry/section" />
> <field column="slash-comments" xpath="/feed/entry/comments" />
> </entity>
> </document>
> </dataConfig>

Your problem is the second entity in the DIH configuration file. The
Solr schema defines the unique key to be the field "link". As noted in
the comments in schema.xml, this means that this field is required.
Solr is not able to populate the "link" field from the Atom feed. I have
not tracked down why this is so, but it is probably because there is
more than one link node under /feed/entry, and the "link" field is not
multi-valued. Change the xpath to, say, "/feed/entry/id", and the
import works. Also, while this is not necessarily an issue, please
note that several other fields have incorrect xpaths for this entity.
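
A minimal sketch of the second entity with that change applied is given
here. Only the "link" xpath ("/feed/entry/id") comes from the test described
above; the xpaths for description, creator, and date are assumptions based on
the standard Atom element names (content, author/name, updated) and have not
been checked against this particular feed. The dateTimeFormat is carried over
unchanged from the original configuration.

```xml
<!-- Sketch only: the "twitter" entity with the unique key taken from <id> -->
<entity name="twitter"
        pk="link"
        url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
        processor="XPathEntityProcessor"
        forEach="/feed | /feed/entry"
        transformer="DateFormatTransformer">

  <field column="source" xpath="/feed/title" commonField="true" />
  <field column="subject" xpath="/feed/subtitle" commonField="true" />

  <!-- Unique key: an Atom entry carries exactly one <id>, so "link" gets a single value -->
  <field column="link" xpath="/feed/entry/id" />
  <field column="title" xpath="/feed/entry/title" />

  <!-- Assumed Atom counterparts of the RSS description, creator, and date elements -->
  <field column="description" xpath="/feed/entry/content" />
  <field column="creator" xpath="/feed/entry/author/name" />
  <field column="date" xpath="/feed/entry/updated"
         dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
</entity>
```

The fields that exist only in the Slashdot RSS feed (slash-department,
slash-section, slash-comments) are left out of this sketch, since Atom defines
no elements by those names.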

To answer your other question, this way of importing data should
work fine.

Regards,
Gora