Posted to java-user@lucene.apache.org by Daan Hoogland <da...@asml.com> on 2004/10/07 08:00:31 UTC

indexing numeric entities?

Hello,

Does anyone index numeric entities for Japanese characters? I 
have (non-X)HTML containing those entities and need to index and search 
them.


-- 
The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. ASML is neither liable for the proper and complete transmission of the information contained in this communication, nor for any delay in its receipt.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: indexing numeric entities?

Posted by Damian Gajda <dg...@caltha.pl>.
Yes, you need to parse the entities yourself. I implemented an HTML
entity parser as part of the http://objectledge.org project. You may use
it if it fits your needs. It is in the ledge-components project
module. See http://objectledge.org/modules/ledge-components/index.html
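
As an illustration of what "parsing the entities yourself" could look
like, here is a minimal sketch in plain Java. It is not the objectledge
parser and not part of Lucene; the class name NumericEntityDecoder and
the method decode are hypothetical. It assumes the text has already been
extracted from the HTML, and it handles only numeric character
references (decimal and hexadecimal), not named entities such as &amp;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or objectledge.
public class NumericEntityDecoder {
    // Matches decimal (&#12501;) and hexadecimal (&#x30D5;) character references.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    // Replaces every numeric character reference with the character it names.
    public static String decode(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1))       // decimal form
                    : Integer.parseInt(m.group(2), 16);  // hexadecimal form
            // References outside the Unicode range are left as they were.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample string from this thread, decoded to real characters.
        System.out.println(decode("&#9679;&#20837;&#31038;"));
    }
}
```

The decoded string, rather than the raw entity text, would then be what
gets handed to the analyzer at indexing time, and the same decoding
would have to be applied to any query text that contains entities.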

Have fun,
-- 
Damian Gajda
Caltha Sp. j.
http://www.caltha.pl/






Re: indexing numeric entities?

Posted by Daan Hoogland <da...@asml.com>.
Maybe inline?

<html xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <head>
  <title>japan</title>
 </head>
 <body bgcolor="#FFFFFF" alink="black">
  <p>

&#12501;&#12451;&#12540;&#12523;&#12489;&#12469;&#12540;&#12499;&#12473;&#12456;&#12531;&#12472;&#12491;&#12450;

  </p>

</html>

When I index the above document using the HTMLParser demo and the 
CJKAnalyzer, only the term "japan" is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?


Sorry for the mess I sent before.




Re: indexing numeric entities?

Posted by Daan Hoogland <da...@asml.com>.
I guess something went wrong:

Daan Hoogland wrote:

>Daan Hoogland wrote:
>
>>Daan Hoogland wrote:
>>
>>>Hello,
>>>
>>>Does anyone index numeric entities for Japanese characters? I 
>>>have (non-X)HTML containing those entities and need to index and search 
>>>them.
>>
>>Can the CJKAnalyzer index a string like "&#9679;&#20837;&#31038;"? It 
>>seems to be ignored completely when used with the demo. There was talk 
>>on this list of fixes for the demo HTMLParser; do these address this 
>>issue? When I look at the code, it seems that the entities should have 
>>been interpreted before indexing. What am I missing?
>>
>>Any comments, please?
>>Or a pointer to a howto for dumm^H^H^H^H^H westerners?
>
>When I index the attached document using the HTMLParser demo and the 
>CJKAnalyzer, only the term "japan" is found in the content. This is not 
>correct, is it?
>Should I convert the entities by hand?
>
>>thanks,




Re: indexing numeric entities?

Posted by Daan Hoogland <da...@asml.com>.
Daan Hoogland wrote:

>Daan Hoogland wrote:
>
>>Hello,
>>
>>Does anyone index numeric entities for Japanese characters? I 
>>have (non-X)HTML containing those entities and need to index and search 
>>them.
>
>Can the CJKAnalyzer index a string like "&#9679;&#20837;&#31038;"? It 
>seems to be ignored completely when used with the demo. There was talk 
>on this list of fixes for the demo HTMLParser; do these address this 
>issue? When I look at the code, it seems that the entities should have 
>been interpreted before indexing. What am I missing?
>
>Any comments, please?
>Or a pointer to a howto for dumm^H^H^H^H^H westerners?

When I index the attached document using the HTMLParser demo and the 
CJKAnalyzer, only the term "japan" is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?

>thanks,





Re: indexing numeric entities?

Posted by Daan Hoogland <da...@asml.com>.
Daan Hoogland wrote:

>Hello,
>
>Does anyone index numeric entities for Japanese characters? I 
>have (non-X)HTML containing those entities and need to index and search 
>them.

Can the CJKAnalyzer index a string like "&#9679;&#20837;&#31038;"? It 
seems to be ignored completely when used with the demo. There was talk 
on this list of fixes for the demo HTMLParser; do these address this 
issue? When I look at the code, it seems that the entities should have 
been interpreted before indexing. What am I missing?

Any comments, please?
Or a pointer to a howto for dumm^H^H^H^H^H westerners?


Thanks,

