Posted to java-user@lucene.apache.org by poeta simbolista <po...@gmail.com> on 2007/09/04 16:38:40 UTC

Look for strange encodings -- tokenization

Hi all,

I'd like to know the best way to look for strange encodings in a Lucene
index.
I have several inputs, and each input may have been encoded in a different
character set. I don't always know whether my guess about the encoding was
correct. Hence, I thought of querying the index for some typical strings
that would reveal bad encodings.
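To make that concrete, the kind of probe I mean would look roughly like
this (just a sketch: the index path, field name and probe token are
placeholders, and whether such a token was ever indexed at all is exactly
what I am unsure about):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class ProbeQuery {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            // U+FFFD is the replacement character that decoders often
            // insert when a byte sequence cannot be decoded.
            TermQuery probe = new TermQuery(new Term("contents", "\uFFFD"));
            Hits hits = searcher.search(probe);
            System.out.println(hits.length() + " documents contain the probe token");
            searcher.close();
        }
    }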

The whole index has already been built using the StandardAnalyzer. I have
read that searching with a different analyzer can yield unexpected
results... but I suppose that's acceptable for my purpose: testing the
quality of the index.

What do you think is the best way to tackle this issue? I've been taking a
look at the analyzers -- the StandardAnalyzer in particular. I thought about
creating a custom tokenizer that splits on letters, numbers, and spaces so
that it only leaves "weird" strings as tokens -- those would reveal bad
encodings. Still, possibly due to my lack of knowledge of Lucene :) I have
the feeling this can be done better somehow.
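Roughly, what I had in mind is something like this (just a sketch against
the Lucene 2.x CharTokenizer API; the class name is made up, and common
ASCII punctuation would still need extra filtering):

    import java.io.Reader;
    import org.apache.lucene.analysis.CharTokenizer;

    /**
     * Emits only runs of characters that are NOT letters, digits or
     * whitespace, so the resulting tokens are exactly the "weird" strings
     * that may point at charset-conversion problems.
     */
    public class WeirdCharTokenizer extends CharTokenizer {

        public WeirdCharTokenizer(Reader in) {
            super(in);
        }

        /** Keep a character only if it is not a letter, digit or whitespace. */
        protected boolean isTokenChar(char c) {
            return !Character.isLetterOrDigit(c) && !Character.isWhitespace(c);
        }
    }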

Thanks a lot in advance!




Re: Look for strange encodings -- tokenization

Posted by poeta simbolista <po...@gmail.com>.
Thank you Steven,

I am having problems running those searches; I think it is because the
StandardAnalyzer treats those badly encoded characters as separators and
therefore never created such tokens at index time...
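For reference, this is roughly what I mean (a sketch; the sample text and
field name stand in for my real data) -- running a suspect string through
the analyzer shows which characters survive tokenization at all:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            // Looks like a botched Latin-1 <-> UTF-8 round trip.
            String suspect = "caf\u00C3\u00A9s na\uFFFDve";
            TokenStream ts = new StandardAnalyzer()
                    .tokenStream("contents", new StringReader(suspect));
            // In the 2.x API, next() returns null when the stream is done.
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println("[" + t.termText() + "]");
            }
        }
    }

If the suspicious characters never show up in the printed tokens, a term
query can obviously never find them.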

Regarding the other idea you suggested: did you mean that if a document
contains many previously unseen terms, that may indicate encoding problems?

Also, what I would like is to at least be able to measure the impact of
such problems, so I can decide whether the effort will pay off :)
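One option I can think of is to walk the whole term dictionary with
IndexReader.terms() and count the terms that contain characters typical of
bad conversions. A sketch (the index path and the "suspicious" heuristic
are placeholders):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class SuspectTermScan {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            TermEnum terms = reader.terms();
            int suspicious = 0, total = 0;
            while (terms.next()) {
                Term t = terms.term();
                total++;
                if (looksSuspicious(t.text())) {
                    suspicious++;
                    System.out.println(t.field() + ":" + t.text()
                            + " (docFreq=" + terms.docFreq() + ")");
                }
            }
            terms.close();
            reader.close();
            System.out.println(suspicious + " of " + total + " terms look suspicious");
        }

        // Placeholder heuristic: the Unicode replacement character, plus
        // 'ã' -- what the lowercasing StandardAnalyzer makes of the 'Ã'
        // that appears when UTF-8 bytes are decoded as Latin-1.
        private static boolean looksSuspicious(String text) {
            return text.indexOf('\uFFFD') >= 0 || text.indexOf('\u00E3') >= 0;
        }
    }

But of course, characters that the analyzer discarded at index time will
never show up in such a scan, so it only measures what actually made it
into the index.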

Cheers
 P




Re: Look for strange encodings -- tokenization

Posted by Steven Rowe <sa...@syr.edu>.
poeta simbolista wrote:
> I'd like to know the best way to look for strange encodings in a Lucene
> index.
> I have several inputs, and each input may have been encoded in a different
> character set. I don't always know whether my guess about the encoding was
> correct. Hence, I thought of querying the index for some typical strings
> that would reveal bad encodings.

In my experience, the best thing to do first is to look at a random sample
of the data you suspect to be problematic and keep track of what you find.
Then decide, based on what you find, whether it's worth pursuing further.
(Data is messy, and sometimes it's not worth the effort to find and fix
everything, as long as you know that the probability of problems is
relatively low.)

If you do find that it's worth pursuing, I'd guess that the best spot to
find problems is at index time rather than query time, mostly because at
query time, you don't necessarily know what to look for.  If you did,
then you could already improve your guesser at index time, right?

One technique that you might find useful is to see if a document contains
too many previously unseen terms.  You could index documents in the same
language and subject domain as those which might have problematic charset
conversion issues, but which do not have those issues themselves, then
tokenize the potentially badly converted documents, checking for the
existence of each term in the index[1] and keeping track of the ratio of
previously unseen terms to the total number of terms.  If you compare this
ratio to that of the average known-good document (and/or the worst-case
near-last addition to the index), you can get an idea of whether or not
the document in question has issues.
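In rough code, that check could look something like this (a sketch against
the 2.2 API; the field name and analyzer are placeholders, and I'm using
docFreq() as a shortcut for the existence test instead of terms(Term)):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class UnseenTermRatio {

        /**
         * Tokenizes a document with the same analyzer used at index time
         * and returns the fraction of its terms that do not occur in the
         * reference index of known-good documents.
         */
        public static double unseenRatio(IndexReader goodIndex, Analyzer analyzer,
                                         String field, Reader doc) throws IOException {
            TokenStream ts = analyzer.tokenStream(field, doc);
            int unseen = 0, total = 0;
            for (Token t = ts.next(); t != null; t = ts.next()) {
                total++;
                // docFreq() is zero when the term is not in the index.
                if (goodIndex.docFreq(new Term(field, t.termText())) == 0) {
                    unseen++;
                }
            }
            return total == 0 ? 0.0 : (double) unseen / total;
        }
    }

A ratio much higher than what you see for documents you know are clean is
a hint of conversion problems.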

Steve

[1]
<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)>

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
