Posted to solr-user@lucene.apache.org by Amit Nithian <an...@gmail.com> on 2010/07/31 22:41:44 UTC

DIH, UTF8 and default DIH encoding value

All,

I am not sure whether this is obvious (it wasn't to me), but while trying to
index some international characters from XML files using the DIH, I found
that setting the encoding attribute on the dataSource element to "UTF-8"
fixed my problem.

<dataSource type="FileDataSource" encoding="UTF-8"/>
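
For context, here is a rough sketch of where that element might sit in a
complete data-config.xml. Only the dataSource line comes from my actual
setup; the entity name, file path, forEach expression and field names are
hypothetical placeholders, so treat this as an illustration and not a
reference config.

```xml
<dataConfig>
  <!-- Explicit encoding: the one setting this message is about -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Hypothetical entity: path, forEach and field names are placeholders -->
    <entity name="docs"
            processor="XPathEntityProcessor"
            url="/path/to/docs.xml"
            forEach="/docs/doc">
      <field column="id" xpath="/docs/doc/id"/>
      <field column="title" xpath="/docs/doc/title"/>
    </entity>
  </document>
</dataConfig>
```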

My question is: why isn't the default UTF-8? Or, if there is a good reason
for that, can the DIH wiki be made clearer that this encoding attribute can
affect the indexing of international characters? If I can get access to edit
the wiki page, I can add a section to that effect, perhaps under a
troubleshooting section.

Thanks!
Amit

Re: DIH, UTF8 and default DIH encoding value

Posted by Amit Nithian <an...@gmail.com>.
Thanks Otis. I went ahead and added this section. I hope that others can add
to it too, though of course the list should stay short :-)

- Amit

On Sun, Aug 1, 2010 at 12:00 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hi Amit,
>
> Anyone can edit any Solr Wiki page - just create an account (I think the
> link to that is in the page footer) and edit.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>

Re: DIH, UTF8 and default DIH encoding value

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Amit,

Anyone can edit any Solr Wiki page - just create an account (I think the link
to that is in the page footer) and edit.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


