Posted to solr-user@lucene.apache.org by lee carroll <le...@googlemail.com> on 2010/11/22 01:46:07 UTC

Can a URL based datasource in DIH return non xml

Hi,

Can a URL-based datasource in DIH return non-XML? The pages being indexed are
written by many authors and will often be invalid XHTML. Can DIH cope with this,
or will I need another approach?

thanks in advance Lee C

Re: Can a URL based datasource in DIH return non xml

Posted by lee carroll <le...@googlemail.com>.

Hi Erick,

Thank you for the response. Just for completeness of the thread: I'm going to
process the XHTML offline. Another approach could be to set up a web service
that DIH could call and that returns XML from an HTML parser. However, for my
purposes it's just as easy to use curl and Perl and then use DIH.

cheers Lee
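
The offline clean-up step described above could be sketched roughly as follows. This is only an illustration in Python, not what was actually run (the message mentions curl and Perl). It assumes the page has already been fetched, uses the standard library's lenient html.parser to pull out the title and visible text, and emits well-formed XML in Solr's <add><doc> update format. The field names id, title and text, and the names TextExtractor and html_to_solr_xml, are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects the title and visible text from possibly malformed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def html_to_solr_xml(doc_id, html):
    """Return a Solr-style <add><doc> XML string for one page.

    Unclosed or mis-nested tags are tolerated because only character
    data is kept; the tag structure itself is discarded.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace. Joining chunks with a space can split a
    # word that spans inline tags, which is acceptable for a sketch.
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("text", text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")
```

The resulting file is well-formed regardless of how broken the input markup was, so an XML-based import can then read it without tripping over the original markup.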

On 22 November 2010 12:59, Erick Erickson <er...@gmail.com> wrote:

> DIH does some good stuff, but it doesn't handle bad input very robustly
> (actually, how could it intuit what "the right thing" is?). I'd consider
> SolrJ coupled with a "forgiving" HTML parser, e.g.
> http://sourceforge.net/projects/nekohtml/
>
> Best
> Erick
>
> On Sun, Nov 21, 2010 at 7:46 PM, lee carroll
> <le...@googlemail.com> wrote:
>
> > Hi,
> >
> > Can a URL-based datasource in DIH return non-XML? The pages being indexed
> > are written by many authors and will often be invalid XHTML. Can DIH cope
> > with this, or will I need another approach?
> >
> > thanks in advance Lee C
> >
>

Re: Can a URL based datasource in DIH return non xml

Posted by Erick Erickson <er...@gmail.com>.

DIH does some good stuff, but it doesn't handle bad input very robustly
(actually, how could it intuit what "the right thing" is?). I'd consider
SolrJ coupled with a "forgiving" HTML parser, e.g.
http://sourceforge.net/projects/nekohtml/

Best
Erick
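
To make the suggestion above concrete, here is a rough sketch of the same idea: parse leniently in client code and send documents straight to Solr instead of going through DIH. It is written in Python with the standard library's html.parser rather than SolrJ and NekoHTML, so treat it as an analogy, not the recommended stack. It assumes a Solr version whose /update handler accepts a JSON array of documents. The names LenientText and build_update_request and the field names id and text are hypothetical.

```python
import json
import urllib.request
from html.parser import HTMLParser


class LenientText(HTMLParser):
    """Gathers character data and ignores tag structure, so unclosed or
    mis-nested tags do not raise errors. Script and style content is not
    filtered out in this minimal version."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def build_update_request(solr_update_url, pages):
    """Build (but do not send) a POST request carrying one doc per page.

    pages maps a document id to raw, possibly invalid, HTML.
    """
    docs = []
    for doc_id, html in pages.items():
        parser = LenientText()
        parser.feed(html)
        parser.close()
        # Collapse whitespace so the indexed text is a single clean string.
        text = " ".join(" ".join(parser.chunks).split())
        docs.append({"id": doc_id, "text": text})

    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would then be a call to urllib.request.urlopen() with the returned request, against whatever update URL the local Solr core exposes.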

On Sun, Nov 21, 2010 at 7:46 PM, lee carroll
<le...@googlemail.com> wrote:

> Hi,
>
> Can a URL-based datasource in DIH return non-XML? The pages being indexed
> are written by many authors and will often be invalid XHTML. Can DIH cope
> with this, or will I need another approach?
>
> thanks in advance Lee C
>