Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/05/21 08:25:24 UTC

Posting Unicode data to Lucene not working during searching/retrieval!

How do I post UTF-8 encoded data to a Lucene index? Do we have to specify
something special, some sort of flag saying that we're posting Unicode data?
I tried to post some UTF-8 encoded data, but during retrieval I'm not able to
see that data; there are just "?" marks in all those places. Earlier I was
using Solr, posting with the same method, and retrieval also worked fine, but
I don't know what the issue is with Lucene; maybe I'm missing something. Can
someone tell me what could be the issue? Thank you.

KK,

RE: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi KK,

> right? And remove this conversion that I'm doing later:
>
> byte[] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
> 
> This will make sure I'm not depending on the platform encoding, right?

In principle, yes. In that conversion you encode the String to bytes using the
platform default encoding and then decode those bytes again as UTF-8. That
round trip only works as long as you do not lose characters through the
conversion, so it is better to drop it entirely!
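
To see why that round trip is fragile, here is a minimal, self-contained
sketch (the sample text and class name are illustrative only):

import java.nio.charset.Charset;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // Sample regional text written as Unicode escapes so the source file
        // encoding does not matter.
        String textOnly = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // getBytes() with no argument encodes with the platform default charset...
        byte[] platformBytes = textOnly.getBytes();
        // ...but here those bytes are decoded as if they were UTF-8.
        String roundTripped = new String(platformBytes, Charset.forName("UTF-8"));

        // true only if the platform default is UTF-8 (or the text is plain ASCII).
        System.out.println("default=" + Charset.defaultCharset()
                + " survived=" + textOnly.equals(roundTripped));

        // The safe form names the charset on both sides.
        byte[] utf8Bytes = textOnly.getBytes(Charset.forName("UTF-8"));
        String decoded = new String(utf8Bytes, Charset.forName("UTF-8"));
        System.out.println(textOnly.equals(decoded)); // always true
    }
}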

> This seems to fix my indexing issue. Now regarding searching, I don't need
> to mention any charset there since I'm using StandardAnalyzer, right? As I
> know, Lucene stores the chars as raw Unicode, so when I present my query in
> the same Unicode form Lucene will give me proper results. Currently I'm not
> setting the encoding for HTTP parameters; I'll use that and let you know.
> Thank you very much.
> 
> KK,
> 




Re: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by KK <di...@gmail.com>.
Thank you very much. As you suggested, I just added a single line to the JSP
page setting the charset to UTF-8, and it worked like a charm. Thank you.
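
That single line was presumably a JSP page directive along these lines (an
assumption; the thread does not quote the actual line):

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>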

KK


RE: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by Uwe Schindler <uw...@thetaphi.de>.
If you print the result e.g. to a webpage through the servlet API, the
output is done with ISO-8859-1 (which is the default for HTTP). If you want
to change this, you must tell the servlet layer the encoding before getting
a PrintWriter (response.setCharacterEncoding("UTF-8"),
response.setContentType("text/html; charset=UTF-8") or something like that).
Or just get the ServletOutputStream and convert using an OutputStreamWriter
as before. But you have to tell the browser the encoding as well (which is
done through the Content-Type header). None of this is Lucene specific, so
you should ask on a Tomcat/Jetty/whatever-container-you-use list.
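
A minimal sketch of that servlet-side setup (the class name and the sample
field value are illustrative, not taken from the thread):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchResultServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Must be set before getWriter(); the charset has no effect if set afterwards.
        resp.setContentType("text/html; charset=UTF-8");
        PrintWriter out = resp.getWriter();
        // Stands in for a stored field value, e.g. doc.get("content").
        String content = "\u0928\u092E\u0938\u094D\u0924\u0947";
        out.println("<html><body>" + content + "</body></html>");
    }
}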

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de






Re: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by KK <di...@gmail.com>.
I made all the changes but there is no improvement. The data is getting
indexed properly, I think, because I'm able to see the results through Luke,
and Luke has an option for viewing the results either as UTF-8 or with the
default string encoding. I tried both and saw no difference: in both cases I
can see the regional text, but not through the browser. How do I handle
decoding when fetching the search results through the searcher?

Thanks
KK


Re: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by KK <di...@gmail.com>.
Thanks @Uwe.
#To answer your last mail's query: textOnly is the output of the method
downloadPage(), i.e. the complete page text including all HTML tags etc.
#Instead of doing the encode/decode later, what I should do is specify the
charset as UTF-8 when downloading the page through the buffered reader, as
you mentioned in your last mail. So instead of
BufferedReader reader =
                    new BufferedReader(new InputStreamReader(
                    pageUrl.openStream()));

I should do this:
BufferedReader reader =
                    new BufferedReader(new InputStreamReader(
                    pageUrl.openStream(), Charset.forName("UTF-8")));

right? And remove this conversion that I'm doing later:

byte[] utfEncodeByteArray = textOnly.getBytes();
String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));

This will make sure I'm not depending on the platform encoding, right? This
seems to fix my indexing issue. Now regarding searching, I don't need to
mention any charset there since I'm using StandardAnalyzer, right? As I know,
Lucene stores the chars as raw Unicode, so when I present my query in the
same Unicode form Lucene will give me proper results. Currently I'm not
setting the encoding for HTTP parameters; I'll use that and let you know.
Thank you very much.

KK,


RE: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by Uwe Schindler <uw...@thetaphi.de>.
I forgot:

> byte[] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>
> here textOnly is the text extracted from the downloaded page

What is textOnly here? If it is a String, why encode it to bytes and then
decode it again? The important thing is:
Strings in Java are independent of any charset (internally they are UTF-16).
So if you convert a byte array to a String you have to specify a charset (as
you have done in the new String(...) call). If you convert a String to a byte
array, you must do the same.

As mentioned in the previous mail, the same is true when converting
InputStreams to Readers and Writers to OutputStreams (this is done with
InputStreamReader and OutputStreamWriter).

And: if you get a String from somewhere that already looks bad, you cannot
convert the String to another encoding; it was corrupted during the earlier
conversion to a String.

E.g. in a web application, use ServletRequest.setCharacterEncoding() to
specify the input encoding of the HTTP parameters, and so on.
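
A compact sketch of those conversions with the charset spelled out on both
sides (the file name and sample text are placeholders):

import java.io.*;
import java.nio.charset.Charset;

public class ExplicitCharsets {
    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");

        // String <-> byte[]: always name the charset.
        byte[] bytes = "some text".getBytes(utf8);
        String text = new String(bytes, utf8);

        // Writer -> OutputStream and InputStream -> Reader: same rule.
        File f = new File("page.txt");   // placeholder file
        Writer out = new OutputStreamWriter(new FileOutputStream(f), utf8);
        out.write(text);
        out.close();

        Reader in = new InputStreamReader(new FileInputStream(f), utf8);
        BufferedReader reader = new BufferedReader(in);
        System.out.println(reader.readLine());
        reader.close();
    }
}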

Uwe




RE: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hello KK,

> Thanks for your quick response. Let me explain the whole thing.
> I'm downloading the pages for given URLs, then extracting the text and
> converting it to Unicode (UTF-8) this way:
> 
> byte[] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>
> here textOnly is the text extracted from the downloaded page, and this is
> the way I'm downloading the pages:
> private String downloadPage(URL pageUrl) {
>         try {
>             // Open connection to URL for reading.
>             BufferedReader reader =
>                     new BufferedReader(new InputStreamReader(
>                     pageUrl.openStream()));
> 
>             // Read page into buffer.
>             String line;
>             StringBuffer pageBuffer = new StringBuffer();
>             while ((line = reader.readLine()) != null) {
>                 pageBuffer.append(line);
>             }
> 
>             return pageBuffer.toString();
>         } catch (Exception e) {
>         }
> 
>         return null;
> }
> 
> Am I going wrong anywhere? Do I have to specify the charset when opening
> the BufferedReader?

You have to specify the charset when converting the InputStream to a Reader,
so specify the charset in the InputStreamReader constructor [new
InputStreamReader(InputStream, charset)]! If you do not do this, the
constructor uses the default charset of your platform, which may not be
UTF-8!
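
A sketch of the download method with the charset applied at exactly that
point (this assumes the page really is served as UTF-8; ideally the charset
would be taken from the HTTP Content-Type response header):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;

public class PageDownloader {
    /** Downloads a page, decoding the bytes explicitly as UTF-8. */
    static String downloadPage(URL pageUrl) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    pageUrl.openStream(), Charset.forName("UTF-8")));
            StringBuilder pageBuffer = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                pageBuffer.append(line);
            }
            reader.close();
            return pageBuffer.toString();
        } catch (Exception e) {
            // Better than a silent catch: at least report the failure.
            e.printStackTrace();
            return null;
        }
    }
}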

...

> and for searcher this is the code:
> package solrSearch;
> 
> import java.io.FileReader;
> import org.stringtree.json.JSONWriter;
> import java.util.*;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.FilterIndexReader;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.HitCollector;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.TopDocCollector;
> 
> /** Simple searcher  */
> public class SimpleSearcher {
>     private static final String baseIndexPath = "/opt/lucene/index/" ;
>     private Map resultMap = new HashMap();
> 
>     public String searchIndex(String queryString, String coreId) throws
> Exception{
>         String result = "@#";
>         String trueIndexPath = baseIndexPath + "core" + coreId;
>         String searchField = "content";
>          IndexSearcher searcher = new IndexSearcher(trueIndexPath);
>         QueryParser queryParser = null;
>         try {
>             queryParser = new QueryParser(searchField, new
> StandardAnalyzer());
>         } catch (Exception ex) {
>              ex.printStackTrace();
>         }
> 
>         Query query = queryParser.parse(queryString);
> 
>         Hits hits = null;
>         try {
>              hits = searcher.search(query);
>         } catch (Exception ex) {
>              ex.printStackTrace();
>         }
> 
>         int hitCount = hits.length();
>         System.out.println("Results found :" + hitCount);
> 
>         for (int ix=0; (ix<hitCount && ix<10); ix++) {
>              Document doc = hits.doc(ix);
>             System.out.println(doc.get("id"));
>             System.out.println(doc.get("content"));
>             result = result + doc.get("id") + "," + doc.get("content");
>             resultMap.put(doc.get("id"), doc.get("content"));
>         }
>         JSONWriter writer = new JSONWriter();
>         return writer.write(resultMap);
>         //return result;
>     }
> 
>     public static void main(String args[]) throws Exception{
>          SimpleSearcher searcher = new SimpleSearcher();
>         String queryString = args[0];
>         System.out.println("Quering for :" + queryString);
>         searcher.searchIndex(queryString, "0");
>     }
> 
> }
> NB: Please ignore the improper naming conventions, indentation, etc.
> Can someone point out what's going wrong? And one more thing: when I tried
> to see the indexed docs using Luke, I found that the doc content contains a
> regional char followed by something like &#2367, but when I clicked "show"
> for that page it showed me the true regional content without any "?" or the
> &#... entities. It seems the indexing is fine but I have to modify my
> searcher.

Is the parameter queryString created using the correct encoding (e.g. when
converting a string coming from the HTTP request).
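
For the HTTP side, that usually means something like the following in the
servlet that receives the query (a sketch; the class and the parameter name
"q" are assumptions):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class QueryServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Must be called before the first parameter is read; applies to the
        // POST body (GET query strings follow the container's URI encoding
        // setting, e.g. the URIEncoding attribute of a Tomcat connector).
        req.setCharacterEncoding("UTF-8");
        String queryString = req.getParameter("q"); // "q" is an assumed name
        // ... hand queryString to SimpleSearcher.searchIndex(queryString, "0") ...
    }
}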

> How to do that, any hints? Thank you very much. One more thing: when
> searching through Luke I'm able to see many results, but through my
> SimpleSearcher class I'm not able to see all those results for the same
> query. What could be the reason?

Did you use the same analyzer in Luke when searching? If the query string is
incorrectly encoded, see above!

> Thanks,
> KK.
> 
> 
> 




Re: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by KK <di...@gmail.com>.
Thanks for your quick response. Let me explain the whole thing.
I'm downloading the pages for given URLs, then extracting the text and
converting it to Unicode (UTF-8) this way:

byte[] utfEncodeByteArray = textOnly.getBytes();
String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));

here textOnly is the text extracted from the downloaded page, and this is
the way I'm downloading the pages:
private String downloadPage(URL pageUrl) {
        try {
            // Open connection to URL for reading.
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(
                    pageUrl.openStream()));

            // Read page into buffer.
            String line;
            StringBuffer pageBuffer = new StringBuffer();
            while ((line = reader.readLine()) != null) {
                pageBuffer.append(line);
            }

            return pageBuffer.toString();
        } catch (Exception e) {
        }

        return null;
}

Am I going wrong anywhere? Do I have to specify the charset when opening the
BufferedReader?
And yes, for indexing I'm using:
package solrSearch;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer {

  // Base Path to the index directory
  private static final String baseIndexPath = "/opt/lucene/index/";


  public void createIndex(String pageContent, String pageId, String coreId)
throws Exception {
    String trueIndexPath = baseIndexPath + "core" + coreId ;
    String contentField = "content";
    String idField    = "id";

    // Create a writer
    IndexWriter writer = new IndexWriter(trueIndexPath, new
StandardAnalyzer());

    System.out.println("Adding page to lucene " + pageId);
    Document doc = new Document();
    doc.add(new Field(contentField, pageContent, Field.Store.YES,
Field.Index.TOKENIZED));
    doc.add(new Field(idField, pageId, Field.Store.YES,
Field.Index.TOKENIZED));

    // Add documents to the index
    writer.addDocument(doc);

    // Lucene recommends calling optimize upon completion of indexing
    writer.optimize();

    // clean up
    writer.close();
  }

  public static void main(String args[]) throws Exception{
       SimpleIndexer empIndex = new SimpleIndexer();
    empIndex.createIndex("this is sample test content", "test0", "core0");
    System.out.println("Data indexed by lucene");
  }

}

And for the searcher, this is the code:
package solrSearch;

import java.io.FileReader;
import org.stringtree.json.JSONWriter;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocCollector;

/** Simple searcher  */
public class SimpleSearcher {
    private static final String baseIndexPath = "/opt/lucene/index/" ;
    private Map resultMap = new HashMap();

    public String searchIndex(String queryString, String coreId) throws
Exception{
        String result = "@#";
        String trueIndexPath = baseIndexPath + "core" + coreId;
        String searchField = "content";
         IndexSearcher searcher = new IndexSearcher(trueIndexPath);
        QueryParser queryParser = null;
        try {
            queryParser = new QueryParser(searchField, new
StandardAnalyzer());
        } catch (Exception ex) {
             ex.printStackTrace();
        }

        Query query = queryParser.parse(queryString);

        Hits hits = null;
        try {
             hits = searcher.search(query);
        } catch (Exception ex) {
             ex.printStackTrace();
        }

        int hitCount = hits.length();
        System.out.println("Results found :" + hitCount);

        for (int ix=0; (ix<hitCount && ix<10); ix++) {
             Document doc = hits.doc(ix);
            System.out.println(doc.get("id"));
            System.out.println(doc.get("content"));
            result = result + doc.get("id") + "," + doc.get("content");
            resultMap.put(doc.get("id"), doc.get("content"));
        }
        JSONWriter writer = new JSONWriter();
        return writer.write(resultMap);
        //return result;
    }

    public static void main(String args[]) throws Exception{
         SimpleSearcher searcher = new SimpleSearcher();
        String queryString = args[0];
        System.out.println("Quering for :" + queryString);
        searcher.searchIndex(queryString, "0");
    }

}
NB: Please ignore the improper naming conventions, indentation, etc.
Can someone point out what's going wrong? And one more thing: when I tried to
see the indexed docs using Luke, I found that the doc content contains a
regional char followed by something like &#2367, but when I clicked "show"
for that page it showed me the true regional content without any "?" or the
&#... entities. It seems the indexing is fine but I have to modify my
searcher. How to do that, any hints? Thank you very much. One more thing:
when searching through Luke I'm able to see many results, but through my
SimpleSearcher class I'm not able to see all those results for the same
query. What could be the reason?

Thanks,
KK.



>

RE: Posting Unicode data to Lucene not working during searching/retrieval!

Posted by Uwe Schindler <uw...@thetaphi.de>.
Indexed data comes out the same way it was put in. Lucene works with Java
Strings, so encoding is irrelevant to Lucene itself. When you index your
values, you must be sure to construct your strings/char arrays correctly
using the UTF-8 encoding (e.g. by using a standard Java Reader, or new
String(byte[], charset), and so on). When you then print stored fields you
must do the same in the other direction. So the general rule is: always
specify the correct charset when converting strings to or from bytes.
For searching: it also depends on the Analyzer used during indexing and
searching. Analyzers written for specific languages often cannot correctly
handle characters from foreign languages. But e.g. StandardAnalyzer or
WhitespaceAnalyzer do not modify the tokens in any way (assuming lowercasing
is not a problem).
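
A compressed sketch of that whole round trip, reusing the same Lucene
2.x-era calls that appear in the code elsewhere in this thread (the index
path and sample text are placeholders):

import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        String indexPath = "/tmp/demo-index";   // placeholder path
        String sample = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // Index a Java String; no charset is involved at this point.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer());
        Document doc = new Document();
        doc.add(new Field("content", sample, Field.Store.YES,
                Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Search and print; the charset only matters when the bytes go out.
        IndexSearcher searcher = new IndexSearcher(indexPath);
        Query query = new QueryParser("content", new StandardAnalyzer())
                .parse(sample);
        Hits hits = searcher.search(query);
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(System.out, "UTF-8"), true);
        for (int i = 0; i < hits.length(); i++) {
            out.println(hits.doc(i).get("content"));
        }
        searcher.close();
    }
}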

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


