Posted to java-user@lucene.apache.org by John Cherouvim <jc...@eworx.gr> on 2005/09/29 12:45:47 UTC

lucene and UTF-8

Hello

I'm having some problems indexing my UTF-8 HTML pages. I am running
Lucene on Linux and I cannot understand why the generated index
depends on the locale of my operating system.
If I run set | grep LANG I get LANG=el_GR, which is Greek. If I set this
to en_US, the generated index is different. Why is this the case? My
HTML files are all UTF-8.

Also, is there a Lucene index browser? I am currently using Luke, which
is good, but it doesn't display the Greek UTF-8 text from the index
correctly. Is this a matter of a setting in Luke?

Regards,
J.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: lucene and UTF-8

Posted by John Haxby <jc...@scalix.com>.

John Cherouvim wrote:

> I'm having some problems indexing my UTF-8 HTML pages. I am running
> Lucene on Linux and I cannot understand why the generated index
> depends on the locale of my operating system.
> If I run set | grep LANG I get LANG=el_GR, which is Greek. If I set
> this to en_US, the generated index is different. Why is this the
> case? My HTML files are all UTF-8.

What version of Linux are you using?

On Fedora Core 4 (and probably other Fedoras and RHEL), LANG=el_GR sets
the character set to ISO 8859-7, e.g. (on my various machines):

    $ LANG=el_GR date | iconv -f iso88597
    Πεμ Σεπ 29 11:59:19 BST 2005
    $ LANG=el_GR.utf8 date
    Πεμ Σεπ 29 12:01:40 BST 2005

(Everything in FC4 is UTF-8, so it displays right, and it seems that the
Greek for "Sep" is "Σεπ" -- no surprises there I guess.)

In your case, running whatever command you use to generate the indexes
with LANG=el_GR.utf8, in place of "date" above, should do the right thing.

jch
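
The effect described above can be reproduced without touching the locale at all. The following is a minimal sketch, not part of the original reply: it decodes the same UTF-8 bytes once as UTF-8 and once as ISO-8859-7, which is roughly what happens when the platform default changes underneath a program that never names its charset. The class and method names are hypothetical, and it assumes the JVM ships the ISO-8859-7 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    /** Encodes text as UTF-8, then decodes those bytes with the assumed charset. */
    static String misread(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        // "Sep" in Greek, written with escapes so the source file encoding is irrelevant
        String original = "\u03A3\u03B5\u03C0";
        Charset greek = Charset.forName("ISO-8859-7");

        String asUtf8 = misread(original, StandardCharsets.UTF_8);
        String asGreek = misread(original, greek);

        System.out.println("decoded as UTF-8 matches original:      " + asUtf8.equals(original));
        System.out.println("decoded as ISO-8859-7 matches original: " + asGreek.equals(original));
        System.out.println("characters after ISO-8859-7 decoding:   " + asGreek.length());
    }
}
```

Each Greek letter occupies two bytes in UTF-8, so a single-byte charset turns three letters into six unrelated characters. An analyzer fed that text would produce different terms, which would explain an index that changes with LANG.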

Re: TermDocs.freq()

Posted by Yonik Seeley <ys...@gmail.com>.

See IndexWriter.setMaxFieldLength()

-Yonik
Now hiring -- http://tinyurl.com/7m67g
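
The cutoff reported in the quoted message below is what a per-field token limit would produce: tokens past the limit are never indexed, so they cannot contribute to freq(). The following is a plain-Java simulation, not Lucene code. The class and method names are hypothetical, and the limit of 10,000 tokens is an assumption chosen to match the reported cutoff.

```java
import java.util.Arrays;
import java.util.List;

class FieldLengthLimitDemo {
    /** Counts occurrences of term among the first maxFieldLength tokens only. */
    static int truncatedFreq(List<String> tokens, String term, int maxFieldLength) {
        int limit = Math.min(tokens.size(), maxFieldLength);
        int freq = 0;
        for (int i = 0; i < limit; i++) {
            if (tokens.get(i).equals(term)) {
                freq++;
            }
        }
        return freq;
    }

    /** Builds a document of filler tokens with term placed at the given positions. */
    static List<String> document(int length, String term, int... positions) {
        String[] tokens = new String[length];
        Arrays.fill(tokens, "filler");
        for (int position : positions) {
            tokens[position] = term;
        }
        return Arrays.asList(tokens);
    }

    public static void main(String[] args) {
        // Two occurrences fall inside the first 10,000 tokens, two fall after them.
        List<String> doc = document(20000, "lucene", 5, 9999, 10000, 15000);
        System.out.println(truncatedFreq(doc, "lucene", 10000));             // prints 2
        System.out.println(truncatedFreq(doc, "lucene", Integer.MAX_VALUE)); // prints 4
    }
}
```

With a real index, the remedy pointed to above would be to raise the limit with IndexWriter.setMaxFieldLength() before adding documents and then re-index, since documents indexed under the old limit keep their truncated postings.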

On 10/3/05, Tricia Williams <pg...@student.cs.uwaterloo.ca> wrote:
>
> To follow up on my post from Thursday. I have written a very basic test
> for TermPositions. This test allows me to identify that only the
> first 10001 tokens are considered to determine term frequency (ie with
> the searching term in a position greater than 10001 my test fails).
>
> Is this by design? Is there an obvious work-around so that the frequency
> that I receive is correct for my document?
>
> Thank you for your consideration,
> Tricia
>
> On Thu, 29 Sep 2005, Tricia Williams wrote:
>
> > I am finding that TermDocs.freq() method is returning an incorrect
> > value. I was wondering if anyone else had experienced this problem.
> >
> > I am using tp = IndexReader.termPositions( queryTerm ) to return an
> > object which implements TermPositions. I then use tp.skipTo( docid )
> > to go directly to the document from which I wish to retrieve term
> > positions. The following for loop adds the positions to my ArrayList
> > which I use later:
> >
> > int freq = tp.freq();
> > for( int k = 0; k < freq; k++ )
> > {
> > // nextPosition() should be called at most freq() times per document
> > positionMatches.add( new Integer( tp.nextPosition() ) );
> > }
> >
> > In a document which I know has 48 references to the term, a frequency of
> > 23 is returned. There doesn't seem to be a pattern to this as some other
> > documents have (frequency, actual): (25, 48), (36, 43), (30, 149).
> >
> > These frequencies are from results within my code and confirmed in Luke,
> > so I'm pretty certain that this isn't an error on my part.
> >
> > I've been trying to find out where the origin of this issue is without
> > luck thus far. Any help or advice would be appreciated.
> >
> > Thanks,
> > Tricia

Re: TermDocs.freq()

Posted by Tricia Williams <pg...@student.cs.uwaterloo.ca>.

To follow up on my post from Thursday.  I have written a very basic test
for TermPositions.  This test allows me to identify that only the
first 10001 tokens are considered to determine term frequency (ie with
the searching term in a position greater than 10001 my test fails).

Is this by design?  Is there an obvious work-around so that the frequency
that I receive is correct for my document?

Thank you for your consideration,
Tricia

On Thu, 29 Sep 2005, Tricia Williams wrote:

> I am finding that TermDocs.freq() method is returning an incorrect value.
> I was wondering if anyone else had experienced this problem.
>
> I am using tp = IndexReader.termPositions( queryTerm ) to return an object
> which implements TermPositions.  I then use tp.skipTo( docid ) to go
> directly to the document from which I wish to retrieve term positions. The
> following for loop adds the positions to my ArrayList which I use later:
>
> int freq = tp.freq();
> for( int k = 0; k < freq; k++ )
> {
> 	// nextPosition() should be called at most freq() times per document
> 	positionMatches.add( new Integer( tp.nextPosition() ) );
> }
>
> In a document which I know has 48 references to the term, a frequency of
> 23 is returned.  There doesn't seem to be a pattern to this as some other
> documents have (frequency, actual): (25, 48), (36, 43), (30, 149).
>
> These frequencies are from results within my code and confirmed in Luke,
> so I'm pretty certain that this isn't an error on my part.
>
> I've been trying to find out where the origin of this issue is without
> luck thus far.  Any help or advice would be appreciated.
>
> Thanks,
> Tricia

Re: TermDocs.freq()

Posted by Greg Gershman <gr...@yahoo.com>.

Save user queries in a database along with number of
results from last time queried, use that as suggestion
base.

Notice that Google's result count in Suggest differs
from the actual result count.  They are not computing
results on the fly.

Greg
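
The approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: a sorted in-memory map stands in for the database, suggestions are matched by prefix, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class QuerySuggestions {
    // query text -> result count recorded the last time that query was run
    private final TreeMap<String, Integer> lastCounts = new TreeMap<>();

    /** Remembers a user query and how many results it returned. */
    void record(String query, int resultCount) {
        lastCounts.put(query.toLowerCase(Locale.ROOT), resultCount);
    }

    /** Returns up to max stored queries starting with prefix, with their stored counts. */
    List<String> suggest(String prefix, int max) {
        String normalized = prefix.toLowerCase(Locale.ROOT);
        List<String> suggestions = new ArrayList<>();
        // tailMap starts at the first key >= prefix; matching keys are contiguous from there.
        for (Map.Entry<String, Integer> entry : lastCounts.tailMap(normalized, true).entrySet()) {
            if (!entry.getKey().startsWith(normalized) || suggestions.size() == max) {
                break;
            }
            suggestions.add(entry.getKey() + " (" + entry.getValue() + " results)");
        }
        return suggestions;
    }

    public static void main(String[] args) {
        QuerySuggestions store = new QuerySuggestions();
        store.record("lucene index", 120);
        store.record("lucene utf-8", 45);
        store.record("luke", 7);
        for (String suggestion : store.suggest("luc", 10)) {
            System.out.println(suggestion);
        }
    }
}
```

The counts shown are whatever was stored when the query last ran, so they can drift from the live result count, which matches the observation about Google's numbers.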

--- Jérôme BENOIS <be...@argia-engineering.fr> wrote:

> Hello everybody,
> 
> 	I would like to implement something like "Google
> Suggest"
> (http://www.google.com/webhp?complete=1&hl=en),
> but how do I get similar search criteria and the
> number of results for each?
> 
> 	Does anyone have an idea?
> 
> Thanks,
> Jérôme.
> 

Re: TermDocs.freq()

Posted by Jérôme BENOIS <be...@argia-engineering.fr>.

Hello everybody,

	I would like to implement something like "Google
Suggest" (http://www.google.com/webhp?complete=1&hl=en), but how do I
get similar search criteria and the number of results for each?

	Does anyone have an idea?

Thanks,
Jérôme.

TermDocs.freq()

Posted by Tricia Williams <pg...@student.cs.uwaterloo.ca>.

I am finding that TermDocs.freq() method is returning an incorrect value.
I was wondering if anyone else had experienced this problem.

I am using tp = IndexReader.termPositions( queryTerm ) to return an object
which implements TermPositions.  I then use tp.skipTo( docid ) to go
directly to the document from which I wish to retrieve term positions. The
following for loop adds the positions to my ArrayList which I use later:

int freq = tp.freq();
for( int k = 0; k < freq; k++ )
{
	// nextPosition() should be called at most freq() times per document
	positionMatches.add( new Integer( tp.nextPosition() ) );
}

In a document which I know has 48 references to the term, a frequency of
23 is returned.  There doesn't seem to be a pattern to this as some other
documents have (frequency, actual): (25, 48), (36, 43), (30, 149).

These frequencies are from results within my code and confirmed in Luke,
so I'm pretty certain that this isn't an error on my part.

I've been trying to find out where the origin of this issue is without
luck thus far.  Any help or advice would be appreciated.

Thanks,
Tricia

luke start problem

Posted by Dirk Hennig <dh...@rrzn.uni-hannover.de>.

Hello,

I downloaded lukeall.jar, put it in my classpath and tried to start it:
 > java org.getopt.luke.Luke

and I get:
------
Exception in thread "main" java.lang.SecurityException: class "org.apache.lucene.store.IndexInput"'s signer information does not match signer information of other classes in the same package
        at java.lang.ClassLoader.checkCerts(ClassLoader.java:575)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:503)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:246)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:54)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:193)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:186)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:265)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:262)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:322)
------

What's the problem?

Or am I using Luke the wrong way?
I couldn't find any documentation on how to use it. Is there something 
available?

Thanx,
Dirk
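
The exception message says that classes in one package were loaded with different signers. One situation that produces this is two copies of the same package on the classpath, for example a separate Lucene jar next to lukeall.jar; whether that is the cause here is an assumption, since the thread does not say. The following sketch, with hypothetical class and method names, lists every classpath location that provides a given class so duplicates can be spotted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class DuplicateClassFinder {
    /** Returns every location on the system classpath that contains the named class. */
    static List<URL> locate(String className) {
        String resource = className.replace('.', '/') + ".class";
        try {
            return Collections.list(
                    ClassLoader.getSystemClassLoader().getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.store.IndexInput";
        List<URL> locations = locate(name);
        System.out.println(name + " found in " + locations.size() + " location(s):");
        for (URL location : locations) {
            System.out.println("  " + location);
        }
    }
}
```

Run it with the same classpath used to start Luke. More than one location for the class would suggest removing the extra jar from the classpath.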


Re: lucene and UTF-8

Posted by Andrzej Bialecki <ab...@getopt.org>.

John Cherouvim wrote:
> Hello
> 
> I'm having some problems indexing my UTF-8 HTML pages. I am running
> Lucene on Linux and I cannot understand why the generated index
> depends on the locale of my operating system.
> If I run set | grep LANG I get LANG=el_GR, which is Greek. If I set this
> to en_US, the generated index is different. Why is this the case? My
> HTML files are all UTF-8.

I think the difference comes from the default character encoding: if the
page is NOT clearly marked as UTF-8, then the system has to guess, and
it guesses differently depending on the current locale.

> 
> Also, is there a Lucene index browser? I am currently using Luke, which
> is good, but it doesn't display the Greek UTF-8 text from the index
> correctly. Is this a matter of a setting in Luke?

It's a matter of setting the appropriate font in Settings.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: lucene and UTF-8

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm having some problems indexing my UTF-8 HTML pages. I am running
: Lucene on Linux and I cannot understand why the generated index
: depends on the locale of my operating system.
: If I run set | grep LANG I get LANG=el_GR, which is Greek. If I set this
: to en_US, the generated index is different. Why is this the case? My
: HTML files are all UTF-8.

To elaborate a little bit more on the comments other people have made, the
differences you are seeing are most likely related to your JVM using the
LANG variable to determine what the default charset will be when you open
readers.  You should look carefully at how you are opening the HTML files
and reading them in. If you aren't specifying the Charset explicitly in
your code, then you're getting the system default.


-Hoss
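
A minimal sketch of the advice above, assuming the HTML files really are UTF-8: name the charset when wrapping the input stream instead of relying on the JVM default. The class and method names are hypothetical, and the temporary file only exists to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetReader {
    /** Reads a whole file as UTF-8, regardless of LANG or the JVM default charset. */
    static String readUtf8(Path file) {
        try (Reader reader = new BufferedReader(new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_8))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temporary UTF-8 file and reads it back with readUtf8. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("page", ".html");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return readUtf8(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // What a reader opened without an explicit charset would use on this machine
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        String page = "<p>\u03A3\u03B5\u03C0</p>";
        System.out.println("round trip intact: " + roundTrip(page).equals(page));
    }
}
```

The same InputStreamReader construction can be handed to whatever HTML parser or analyzer feeds the index, so the decoded text no longer depends on the locale the indexer was started under.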

