Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/09/12 20:45:08 UTC

How to combine RSS w/ Tika when using Data Import Handler (DIH)

Given an RSS raw feed source link such as the following:
http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn

I can easily get to the value of the description for an item like so:
<field column="description" xpath="/rss/item/description" />

But the content of "description" happens to be in HTML and sadly it is this
HTML chunk that has some pretty decent information that I would like to
import as well.
1) For example it has the image for the item:
<img src="http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg" ... />
2) It has the price for the item:
<span class="tgProductPrice">$13.99</span>
It also has many other useful pieces of data that aren't in a proper RSS
format; they are simply thrown together inside the HTML chunk that is served
as the value for the xpath="/rss/item/description"

So, how can I configure DIH to start importing this html information as
well?
Is Tika the way to go?
Can someone give a brief example of what a config file with both Tika config
and RSS config would/should look like?

Thanks!
- Pulkit

Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)

Posted by Chris Hostetter <ho...@fucit.org>.
: I've been investigating and I understand that using the RegexTransformer is
: an option that is open for identifying and extracting data to multiple
: fields from a single rss value source ... But rather than hack together
: something I once again wanted to check with the community: Is there another
: option for navigating the HTML DOM tree using some well-tested transformer
: or Tika or something?

I don't think so ... if it's a *really* well-formed feed, then the
description will actually be XHTML nodes (with the appropriate
namespace) that are already part of the Document's DOM.

But if it's just a blob of CDATA that happens to contain well-formed HTML,
then I think a regex is currently your best option -- you'll probably want
something tailor-made for the subtleties of the site whose RSS you're
scraping anyway, since things like "are & chars in the URLs HTML-escaped?"
are going to vary from site to site.
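
As a rough illustration of the regex route, here is a minimal sketch of a
DIH config that pulls the image URL and the price out of the description
column with RegexTransformer. The entity name, the imageUrl and price
columns, and both regular expressions are assumptions made for this example
(the columns would also need matching fields in schema.xml). The patterns
are guessed from the two HTML snippets quoted in the original question, so
expect to adjust them against the real feed.

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <!-- Entity name is hypothetical. The forEach and xpath values follow the
         path used in the question; standard RSS 2.0 nests items under
         channel, so /rss/channel/item may be what the feed actually needs. -->
    <entity name="amazonBluRay"
            processor="XPathEntityProcessor"
            url="http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn"
            forEach="/rss/item"
            transformer="RegexTransformer">

      <!-- Raw HTML chunk, read straight from the feed -->
      <field column="description" xpath="/rss/item/description" />

      <!-- Hypothetical column: capture group 1 is the value of the img src
           attribute, tolerating whitespace after the opening quote -->
      <field column="imageUrl" sourceColName="description"
             regex='&lt;img[^&gt;]*?\ssrc="\s*([^"\s]+)' />

      <!-- Hypothetical column: capture group 1 is the number that follows
           the dollar sign inside the tgProductPrice span -->
      <field column="price" sourceColName="description"
             regex='&lt;span class="tgProductPrice"&gt;\s*\$([0-9.,]+)' />
    </entity>
  </document>
</dataConfig>
```

With a single capture group and no replaceWith attribute, RegexTransformer
stores the captured text as the field value, so a row whose description
contains the price span quoted in the question would get price set to 13.99.
The regex attributes are single-quoted and use &lt; and &gt; so that the
HTML angle brackets and double quotes stay legal inside the XML.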

It would probably be possible to write a DIH Transformer based on
something like TagSoup to actually produce a DOM from an arbitrary HTML
string in an entity, so you could then treat it as a subentity and use the
XPathEntityProcessor -- but I don't think I've seen anyone talk about
doing anything like that before.

-Hoss

Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)

Posted by Pulkit Singhal <pu...@gmail.com>.
Hello Everyone,

I've been investigating and I understand that using the RegexTransformer is
an option that is open for identifying and extracting data to multiple
fields from a single rss value source ... But rather than hack together
something I once again wanted to check with the community: Is there another
option for navigating the HTML DOM tree using some well-tested transformer
or Tika or something?

Thanks!
- Pulkit
