Posted to solr-user@lucene.apache.org by Jon Drukman <jd...@gmail.com> on 2012/08/15 23:03:59 UTC

Does DataImportHandler do any sanitizing?

I am pulling some fields from a MySQL database using DataImportHandler and
some of them contain text that is not valid XML.  Does DataImportHandler do
any kind of filtering/sanitizing to ensure that it will go in OK or is it
all on me?

Example bad data:  orphaned ampersands ("Peanut Butter & Jelly"), curly
quotes ("we’re")

-jsd-

Re: Does DataImportHandler do any sanitizing?

Posted by Lance Norskog <go...@gmail.com>.
If you want to sanitize them during indexing, the regular expression
tools can do this. You would write a regular expression that matches
the unwanted characters and replaces them. There is a regular expression
transformer in the DIH (RegexTransformer), and a regular expression
CharFilter (PatternReplaceCharFilterFactory) inside the Lucene text
analysis stack.
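
As a rough illustration of the regular-expression approach, here is a
minimal standalone Python sketch. It is not DIH configuration; the
function name `sanitize`, the patterns, and the replacement choices
("and" for a bare ampersand, straight quotes for curly ones) are all
assumptions made for this example. The same patterns could in principle
be carried over to a regex transformer or CharFilter.

```python
import re

# Hypothetical cleanup rules for illustration only; adjust to the data you see.

# An ampersand that does NOT begin an entity such as &amp; &#8217; or &#x2019;
ORPHAN_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

# Map curly (typographic) quotes to their plain ASCII counterparts.
CURLY_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
CURLY_RE = re.compile("[\u2018\u2019\u201c\u201d]")


def sanitize(text):
    """Replace orphaned ampersands and curly quotes in a field value."""
    text = ORPHAN_AMP.sub("and", text)
    return CURLY_RE.sub(lambda m: CURLY_QUOTES[m.group(0)], text)


print(sanitize("Peanut Butter & Jelly"))
print(sanitize("we\u2019re"))
```

Whether to rewrite the data at all is a design choice: replacing "&"
with "and" changes what gets indexed and stored, so it only makes sense
if the downstream consumer really cannot handle the original characters.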

-- 
Lance Norskog
goksron@gmail.com

Re: Does DataImportHandler do any sanitizing?

Posted by Michael Della Bitta <mi...@appinions.com>.
Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest
of Solr via XML so it shouldn't be a problem...
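
To illustrate why this matters, here is a small Python sketch (not Solr
code; the `<field>` wrapper and the helper name `parses` are made up for
this example). A bare ampersand only becomes a problem at the moment a
value is pasted into XML markup without escaping. As an in-memory
string, or once escaped by whatever serializes it, it is fine. Curly
quotes are legal XML characters either way.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

value = "Peanut Butter & Jelly"


def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Pasting the raw value into markup yields ill-formed XML ...
raw_doc = "<field>" + value + "</field>"
# ... while escaping it first round-trips the original text.
escaped_doc = "<field>" + escape(value) + "</field>"

print(parses(raw_doc))
print(parses(escaped_doc))
print(ET.fromstring(escaped_doc).text == value)
```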

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

