
Posted to solr-user@lucene.apache.org by Antonio Eggberg <an...@yahoo.se> on 2009/07/22 10:46:06 UTC

DIH example explanation

Hi, 

I am looking at the slashdot example and I am having a hard time understanding the following, from the wiki:

==

"You can use this feature for indexing from REST API's such as rss/atom feeds, XML data feeds, other Solr servers or even well formed xhtml documents. Our XPath support has its limitations (no wildcards, only fullpath etc) but we have tried to make sure that common use-cases are covered and since it's based on a streaming parser, it is extremely fast and consumes constant amount of memory even for large XMLs. It does not support namespaces, but it can handle xmls with namespaces. When you provide the xpath, just drop the namespace and give the rest (eg if the tag is '<dc:subject>' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy"
==

How does <dc:subject> become the field subject, and why is it mapped with xpath="/RDF/item/subject"? What is the secret?

I am trying to index Atom files and I need to understand the above because my feeds use namespaces and I am not sure how to proceed. Are there any Atom examples anywhere?

Thanks again for the clarification.
Anton



Re: DIH example explanation

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

Any string that can be templatized in DIH can contain variables of the form ${a.b}.

For instance, look at the following:

url="http://xyz.com/atom/${dataimporter.request.foo}"

If you pass a parameter foo=bar when you invoke the command, the URL
invoked becomes

http://xyz.com/atom/bar

The variable can come from many places; see
http://wiki.apache.org/solr/DataImportHandler#head-86408ce7721ea6f9a3f05b12ace8742fd41737d4
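
To make the substitution concrete, here is a rough Python sketch of the idea. It only mimics the behaviour described above and is not DIH's own code. The function names, the Solr base URL, and the /dataimport handler path are assumptions for illustration (the handler path is whatever you registered in solrconfig.xml), and leaving unknown placeholders untouched is a choice made for this sketch, not a statement about what DIH does.

```python
import re
from urllib.parse import urlencode

# Matches placeholders such as ${dataimporter.request.foo}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")
_REQUEST_PREFIX = "dataimporter.request."


def resolve(template, request_params):
    """Fill ${dataimporter.request.X} placeholders from request_params.

    Placeholders that cannot be resolved are left as they are
    (an assumption of this sketch).
    """
    def lookup(match):
        name = match.group(1)
        if name.startswith(_REQUEST_PREFIX):
            key = name[len(_REQUEST_PREFIX):]
            if key in request_params:
                return request_params[key]
        return match.group(0)

    return _PLACEHOLDER.sub(lookup, template)


def import_command_url(solr_base, **params):
    """Build the URL that invokes the import command with extra parameters.

    solr_base and the /dataimport path are hypothetical values here.
    """
    query = {"command": "full-import", **params}
    return f"{solr_base}/dataimport?{urlencode(query)}"


template = "http://xyz.com/atom/${dataimporter.request.foo}"
print(resolve(template, {"foo": "bar"}))  # -> http://xyz.com/atom/bar
print(import_command_url("http://localhost:8983/solr", foo="bar"))
```

With a list of feeds kept in a text file, an external script could issue one such command per line, passing a different value of foo each time, so no transformer would be needed for that part.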





-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DIH example explanation

Posted by Antonio Eggberg <an...@yahoo.se>.

:)

Thank you, Paul! It works! I have one more question about the wiki.

"url (required) : The url used to invoke the REST API. (Can be templatized)."

How do you templatize the URL? My URLs are updated all the time by an external program, i.e. the list of Atom sites is kept in a text file. Should I use some form of transformer to process it? Any hints?

Thanks.
Anton


Re: DIH example explanation

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

The point is that the namespace is ignored while DIH reads the XML. So
just use the part after the colon (:) in your XPath expressions and it
should just work.
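
As an illustration of why <dc:subject> ends up under /RDF/item/subject, here is a small Python sketch that walks a namespaced document and prints the prefix-free path of every element. It uses Python's ElementTree rather than the streaming parser DIH is based on, so treat it as a model of the rule "drop the namespace, keep the rest" and not as DIH's implementation. The sample document and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny RSS 1.0 style document, invented for this example.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item>
    <title>Example story</title>
    <dc:subject>search</dc:subject>
  </item>
</rdf:RDF>"""


def local_name(tag):
    """ElementTree stores tags as '{namespace-uri}name'; keep only 'name'."""
    return tag.rsplit("}", 1)[-1]


def prefix_free_paths(element, parent=""):
    """Yield (path, text) pairs where every path step ignores the namespace."""
    path = f"{parent}/{local_name(element.tag)}"
    yield path, (element.text or "").strip()
    for child in element:
        yield from prefix_free_paths(child, path)


paths = dict(prefix_free_paths(ET.fromstring(SAMPLE)))
for path, text in paths.items():
    print(path, repr(text))
```

The path for the dc:subject element comes out as /RDF/item/subject, which is the form the xpath attribute expects. By the same rule, an Atom feed (root element feed, entries in entry) would presumably be addressed with paths such as /feed/entry/title.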







