Posted to solr-user@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2008/11/01 06:00:58 UTC

DIH Http input bug - problem with two-level RSS walker

I wrote a nested HttpDataSource RSS poller. The outer loop reads an rss feed
which contains N links to other rss feeds. The nested loop then reads each
one of those to create documents. (Yes, this is an obnoxious thing to do.)
Let's say the outer RSS feed gives 10 items. Both feeds use the same
structure: /rss/channel with a <title> node and then N <item> nodes inside
the channel. This should create two separate XML streams with two separate
Xpath iterators, right?

<entity name="outer" http stuff>
    <field column="name" xpath="/rss/channel/title" />
    <field column="url" xpath="/rss/channel/item/link"/>

    <entity name="inner" http stuff url="${outer.url}" pk="title" >
        <field column="title" xpath="/rss/channel/item/title" />
    </entity>
</entity>

This does indeed walk each url from the outer feed and then fetch the inner
rss feed. Bravo! 

However, I found two separate problems in xpath iteration. They may be
related. The first problem is that it only stores the first document from
each "inner" feed. Each feed has several documents with different title
fields but it only grabs the first.

The other is an off-by-one bug. The outer loop iterates through the 10 items
and then tries to pull an 11th.  It then gives this exception trace:

INFO: Created URL to:  [inner url]
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: null/account.rss
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
 ...
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)






Re: DIH Http input bug - problem with two-level RSS walker

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
Hi Jon,
Using a CachedSqlEntityProcessor as the root entity is of no use; it
can only be as good as using a SqlEntityProcessor. For classes
belonging to the package 'org.apache.solr.handler.dataimport' the
package name can be omitted (for better readability).
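
For example, your entity declaration below could be shortened to
something like this (same config, just without the package prefixes):

<entity dataSource="db" name="db" query="SELECT id FROM table"
        processor="CachedSqlEntityProcessor">
    <entity dataSource="feeds" name="feeds" pk="link"
            url="http://${db.id}.somedomain.com/feed.xml"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item"
            transformer="TemplateTransformer,DateFormatTransformer">
        ... same field declarations ...
    </entity>
</entity>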


On Sun, Nov 2, 2008 at 8:08 AM, Jon Baer <jo...@gmail.com> wrote:
> Another idea is to create the logic you need, dump it to a temp MySQL
> table, and then fetch the feeds; that has worked pretty nicely for me, as
> it removes the need for the outer feed to do the work. At first I could
> not figure out if this was a bug or a feature ... Something like ...
>
>        <entity dataSource="db" name="db" query="SELECT id FROM table"
> processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">
>                        <entity dataSource="feeds"
> url="http://{$db.id}.somedomain.com/feed.xml" name="feeds" pk="link"
> processor="org.apache.solr.handler.dataimport.XPathEntityProcessor"
> forEach="/rss/channel/item"
> transformer="org.apache.solr.handler.dataimport.TemplateTransformer,
> org.apache.solr.handler.dataimport.DateFormatTransformer">
>                                <field column="title"
> xpath="/rss/channel/item/title"/>
>                                <field column="link"
> xpath="/rss/channel/item/link"/>
>                                <field column="docid"
> template="DOC-${feeds.link}"/>
>                                <field column="doctype" template="video"/>
>                                <field column="description"
> xpath="/rss/channel/item/description"/>
>                                <field column="thumbnail"
> xpath="/rss/channel/item/enclosure/@url"/>
>                                <field column="pubdate"
> xpath="/rss/channel/item/pubDate"
> dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
>                        </entity>
>                </entity>
>
> - Jon
>
> [rest of quoted thread snipped]



-- 
--Noble Paul

Re: DIH Http input bug - problem with two-level RSS walker

Posted by Jon Baer <jo...@gmail.com>.
Another idea is to create the logic you need, dump it to a temp MySQL
table, and then fetch the feeds; that has worked pretty nicely for me,
as it removes the need for the outer feed to do the work. At first I
could not figure out if this was a bug or a feature ... Something like ...

	<entity dataSource="db" name="db" query="SELECT id FROM table"  
processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">
			<entity dataSource="feeds" url="http://{$db.id}.somedomain.com/ 
feed.xml" name="feeds" pk="link"  
processor="org.apache.solr.handler.dataimport.XPathEntityProcessor"  
forEach="/rss/channel/item"  
transformer="org.apache.solr.handler.dataimport.TemplateTransformer,  
org.apache.solr.handler.dataimport.DateFormatTransformer">
				<field column="title" xpath="/rss/channel/item/title"/>
				<field column="link" xpath="/rss/channel/item/link"/>
				<field column="docid" template="DOC-${feeds.link}"/>
				<field column="doctype" template="video"/>
				<field column="description" xpath="/rss/channel/item/description"/>
				<field column="thumbnail" xpath="/rss/channel/item/enclosure/@url"/>
				<field column="pubdate" xpath="/rss/channel/item/pubDate"  
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
			</entity>
		</entity>

- Jon

On Nov 1, 2008, at 3:26 PM, Norskog, Lance wrote:

> The inner entity drills down and gets more detail about each item in the
> outer loop. It creates one document.
>
> [rest of quoted thread snipped]


RE: DIH Http input bug - problem with two-level RSS walker

Posted by "Norskog, Lance" <la...@divvio.com>.
The inner entity drills down and gets more detail about each item in the
outer loop. It creates one document. 

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinmangar@gmail.com] 
Sent: Friday, October 31, 2008 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Http input bug - problem with two-level RSS walker

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <go...@gmail.com>
wrote:

> [original message quoted in full; snipped]

The idea behind nested entities is to join them together so that one
Solr document is created for each root entity and the child entities
provide more fields which are added to the parent document.

I guess you want to create separate Solr documents from the root entity
as well as the child entities. I don't think that is possible with
nested entities. Essentially, you are trying to crawl feeds, not join
them.

Perhaps an integration with Apache Droids could be considered:
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level, there may be a workaround.
However, it may be easier to implement all this with your own Java
program and just post results to Solr as usual.



> [rest of original message, including the exception trace, snipped]


--
Regards,
Shalin Shekhar Mangar.

Re: DIH Http input bug - problem with two-level RSS walker

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <go...@gmail.com> wrote:

> [original message quoted in full; snipped]

The idea behind nested entities is to join them together so that one Solr
document is created for each root entity and the child entities provide more
fields which are added to the parent document.

I guess you want to create separate Solr documents from the root entity as
well as the child entities. I don't think that is possible with nested
entities. Essentially, you are trying to crawl feeds, not join them.

Perhaps an integration with Apache Droids could be considered:
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level, there may be a workaround.
However, it may be easier to implement all this with your own Java program
and just post results to Solr as usual.
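
For instance, with SolrJ that would look roughly like the sketch below
(class names as in Solr 1.3's SolrJ; the feed-fetching and XML-parsing
code is elided):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FeedCrawler {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // fetch the outer feed, collect the item links, fetch each inner
        // feed, and then create one document per inner item:
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("name", "...channel title...");
        doc.addField("title", "...item title...");
        server.add(doc);
        server.commit();
    }
}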



> [rest of original message, including the exception trace, snipped]


-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH Http input bug - problem with two-level RSS walker

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
It may be fine to provide that, but what other benefit would you get
that you can't get from writing a simple DataSource in Java? Script is
just a convenience, right?
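
(For reference, a custom DataSource is just a subclass of DIH's
DataSource class; a minimal sketch, with a made-up class name:)

import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

public class MyCustomDataSource extends DataSource<Reader> {
    public void init(Context context, Properties initProps) {
        // read any attributes declared on <dataSource .../> here
    }
    public Reader getData(String query) {
        // run whatever logic you like for 'query' and return the result
        return new StringReader("<rss>...</rss>");
    }
    public void close() {
        // release resources, if any
    }
}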

--Noble

On Mon, Nov 3, 2008 at 11:41 AM, Jon Baer <jo...@gmail.com> wrote:
> On a side note ... it would be nice if your data source could also be the
> result of a script (instead of trying to hack around it w/ JdbcDataSource)
> ...
>
> Something similar to what ScriptTransformer does ...
> (http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9)
>
> An example would be:
>
> <dataSource type="ScriptDataSource" name="outerloop" script="outerloop.js"
> />
>
> (The script would basically contain just a callback - getData(String query)
> that results in an array set or might set values on it's children, etc)
>
> - Jon
>
> On Nov 3, 2008, at 12:40 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> [earlier messages quoted in full; snipped]



-- 
--Noble Paul

Re: DIH Http input bug - problem with two-level RSS walker

Posted by Jon Baer <jo...@gmail.com>.
On a side note ... it would be nice if your data source could also be
the result of a script (instead of trying to hack around it w/
JdbcDataSource) ...

Something similar to what ScriptTransformer does ...
(http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9)

An example would be:

<dataSource type="ScriptDataSource" name="outerloop" script="outerloop.js" />

(The script would basically contain just a callback - getData(String
query) - that results in an array set or might set values on its
children, etc)

- Jon

On Nov 3, 2008, at 12:40 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> [earlier messages quoted in full; snipped]


Re: DIH Http input bug - problem with two-level RSS walker

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
Hi Lance,
I guess I got your problem. So you wish to create docs for both
entities (as suggested by Jon Baer). The best solution would be to
create two root entities. The first one should be the outer feed; write
a transformer to store all the urls into the db (the JdbcDataSource can
do inserts/updates too - the method is the same getData()). The second
entity can read from the db and create docs (see Jon Baer's suggestion)
using the XPathEntityProcessor as a sub-entity.
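
Roughly like this (just a sketch - the entity names, the feed_urls
table, and the UrlToDbTransformer class are made up; you would write
that transformer yourself):

<document>
    <entity name="urlCollector" dataSource="feedsHttp"
            url="http://host/outer.rss"
            processor="XPathEntityProcessor" forEach="/rss/channel/item"
            transformer="com.example.UrlToDbTransformer">
        <field column="url" xpath="/rss/channel/item/link"/>
    </entity>
    <entity name="urls" dataSource="db" rootEntity="false"
            query="SELECT url FROM feed_urls">
        <entity name="inner" dataSource="feedsHttp" url="${urls.url}"
                processor="XPathEntityProcessor" forEach="/rss/channel/item">
            <field column="title" xpath="/rss/channel/item/title"/>
        </entity>
    </entity>
</document>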
--Noble

On Mon, Nov 3, 2008 at 9:44 AM, Noble Paul നോബിള്‍ नोब्ळ्
<no...@gmail.com> wrote:
> [earlier messages quoted in full; snipped]



-- 
--Noble Paul

Re: DIH Http input bug - problem with two-level RSS walker

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
Hi Lance,
Do a full import w/o debug and let us know if my suggestion worked
(rootEntity="false"). If it didn't, I can suggest something else
(writing a Transformer).
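
(i.e. hit the handler with the plain command, for example
http://localhost:8983/solr/dataimport?command=full-import
assuming the handler is registered at /dataimport, rather than running
it through the debug mode)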


On Sun, Nov 2, 2008 at 8:13 AM, Noble Paul നോബിള്‍ नोब्ळ्
<no...@gmail.com> wrote:
> [earlier messages quoted in full; snipped]



-- 
--Noble Paul

Re: DIH Http input bug - problem with two-level RSS walker

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Tue, Nov 4, 2008 at 1:31 AM, Lance Norskog <go...@gmail.com> wrote:
> Thank you for the "rootEntity" tip. Does this mean that the inner loop only walks the first item and breaks out of the loop? This is very good because it allows me to drill down a few levels without downloading 10,000 feeds. (Public API sites tend to dislike this behavior :)
>

Nope. It goes through each item in the inner loop and creates one
document for each item.

> The URL is wrong because the streaming parser is iterating past the end of the element entries. It is an off-by-one bug of some sort in the DIH code.
>
> Thanks,
>
> Lance
>
> [earlier messages quoted in full; snipped]



-- 
--Noble Paul

RE: DIH Http input bug - problem with two-level RSS walker

Posted by Lance Norskog <go...@gmail.com>.
Thank you for the "rootEntity" tip. Does this mean that the inner loop only walks the first item and breaks out of the loop? This is very good because it allows me to drill down a few levels without downloading 10,000 feeds. (Public API sites tend to dislike this behavior :)

The URL is wrong because the streaming parser is iterating past the end of the element entries. It is an off-by-one bug of some sort in the DIH code. 

Thanks,

Lance

-----Original Message-----
From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.paul@gmail.com] 
Sent: Saturday, November 01, 2008 7:44 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Http input bug - problem with two-level RSS walker

If you wish to create one doc per inner entity, then set rootEntity="false" for the entity 'outer'.
The exception is because the url is wrong.

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <go...@gmail.com> wrote:
> [original message quoted in full; snipped]



--
--Noble Paul


Re: DIH Http input bug - problem with two-level RSS walker

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
If you wish to create one doc per inner entity, then set
rootEntity="false" for the entity 'outer'.
The exception is because the url is wrong.
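
In your config that would be (keeping your "http stuff" placeholders):

<entity name="outer" rootEntity="false" http stuff>
    <field column="name" xpath="/rss/channel/title" />
    <field column="url" xpath="/rss/channel/item/link"/>

    <entity name="inner" http stuff url="${outer.url}" pk="title" >
        <field column="title" xpath="/rss/channel/item/title" />
    </entity>
</entity>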

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <go...@gmail.com> wrote:
> [original message quoted in full; snipped]



-- 
--Noble Paul