Posted to solr-user@lucene.apache.org by Developer <bb...@gmail.com> on 2013/11/19 23:46:45 UTC

How to index X™ as &#8482; (HTML decimal entity)

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field> 

I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;) 
in SOLR rather than storing the original value.

Is there a way to do this?



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-X-as-8482-HTML-decimal-entity-tp4102002.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Walter Underwood <wu...@wunderwood.org>.
I know all about formatted text -- I worked at MarkLogic. That is why I mentioned the XML Infoset.

Numeric entities are part of the final presentation, really, part of the encoding. They should never be stored. Always store the Unicode.

Numeric and named entities are a convenience for tools and encodings that can't handle Unicode. That is all they are.

wunder
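
A small sketch of the point above, in Python purely for illustration (nothing here is Solr-specific): the decimal, hexadecimal and named spellings all parse to the same code point, so they are alternative encodings of one character rather than different stored values. The variable names are made up for this example.

```python
import html

# Three spellings of the same character: decimal, hexadecimal and named.
spellings = ["&#8482;", "&#x2122;", "&trade;"]

# html.unescape turns each reference back into the character it stands for.
decoded = [html.unescape(s) for s in spellings]

# Every spelling decodes to U+2122 TRADE MARK SIGN.
same = all(ch == "\u2122" for ch in decoded)
print(decoded, same)
```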

On Nov 21, 2013, at 9:02 AM, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> Ah... now I understand your perspective - you have taken a narrow view of what "text" is. A broader view is that it can contain formatting and special "entities" as well, or rich text in general. My "read" is that it all depends on the nature of the application and its requirements, not a "one size fits all" approach. The four main approaches are pure ASCII, Unicode/UTF-8, SGML entities for non-ASCII characters, and full HTML for formatting and rich text. Let the app's needs determine which is most appropriate for each piece of text.
> 
> The goal of SGML and HTML is not to hard-wire the final presentation, but simply to preserve some level of source format and structure, and then apply final presentation formatting on top of that.
> 
> Some apps may opt to store the same information in multiple formats, such as one for raw text search, one for basic display, and one for "detail" display.
> 
> I'm more of a "platform" guy than an "app-specific" guy - give the app developer tools that they can blend to meet their own requirements (or interests or tastes.)
> 
> But Solr users should make no mistake, SGML entities are a perfectly valid intermediate format for rich text.
> 
> -- Jack Krupansky

--
Walter Underwood
wunder@wunderwood.org




Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Jack Krupansky <ja...@basetechnology.com>.
Ah... now I understand your perspective - you have taken a narrow view of 
what "text" is. A broader view is that it can contain formatting and special 
"entities" as well, or rich text in general. My "read" is that it all 
depends on the nature of the application and its requirements, not a "one 
size fits all" approach. The four main approaches are pure ASCII, 
Unicode/UTF-8, SGML entities for non-ASCII characters, and full HTML for 
formatting and rich text. Let the app's needs determine which is most 
appropriate for each piece of text.

The goal of SGML and HTML is not to hard-wire the final presentation, but 
simply to preserve some level of source format and structure, and then apply 
final presentation formatting on top of that.

Some apps may opt to store the same information in multiple formats, such as 
one for raw text search, one for basic display, and one for "detail" 
display.

I'm more of a "platform" guy than an "app-specific" guy - give the app 
developer tools that they can blend to meet their own requirements (or 
interests or tastes.)

But Solr users should make no mistake, SGML entities are a perfectly valid 
intermediate format for rich text.

-- Jack Krupansky

-----Original Message----- 
From: Walter Underwood
Sent: Thursday, November 21, 2013 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

And this is the exact problem. Some characters are stored as entities, some 
are not. When it is time to display, what else needs escaped? At a minimum, 
you would have to always store & as &amp; to avoid escaping the leading 
ampersand in the entities.

You could store every single character as a numeric entity. Or you could 
store every non-ASCII character as a numeric entity. Or every non-Latin1 
character. Plus ampersand, of course.

In these e-mails, we are distinguishing between ™ and &trade;. How would you 
do that? By storing "&trade;" as "&amp;trade;".

To avoid all this double-think, always store text as Unicode code points, 
encoded with a standard Unicode method (UTF-8, etc.).

When displaying, only make entities if the codepoints cannot be represented 
in the target character encoding. If you are sending things in US-ASCII, you 
will be sending lots of entities.

A good encoding library has callbacks for characters that cannot be 
represented. You can use these callbacks to format out-of-charset codepoints 
as entities. I've done this in product code, it really works.

Finally, if you don't believe me, believe the XML Infoset, where numeric 
entities are always interpreted as Unicode codepoints.

The other way to go insane is storing local time in the database. Always 
store UTC and convert at the edges.

wunder

--
Walter Underwood
wunder@wunderwood.org




Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Walter Underwood <wu...@wunderwood.org>.
And this is the exact problem. Some characters are stored as entities, some are not. When it is time to display, what else needs escaped? At a minimum, you would have to always store & as &amp; to avoid escaping the leading ampersand in the entities.

You could store every single character as a numeric entity. Or you could store every non-ASCII character as a numeric entity. Or every non-Latin1 character. Plus ampersand, of course.

In these e-mails, we are distinguishing between ™ and &trade;. How would you do that? By storing "&trade;" as "&amp;trade;".

To avoid all this double-think, always store text as Unicode code points, encoded with a standard Unicode method (UTF-8, etc.).

When displaying, only make entities if the codepoints cannot be represented in the target character encoding. If you are sending things in US-ASCII, you will be sending lots of entities.

A good encoding library has callbacks for characters that cannot be represented. You can use these callbacks to format out-of-charset codepoints as entities. I've done this in product code, it really works.

Finally, if you don't believe me, believe the XML Infoset, where numeric entities are always interpreted as Unicode codepoints.

The other way to go insane is storing local time in the database. Always store UTC and convert at the edges.

wunder
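
One way to picture "store Unicode, make entities only at the edge" is the sketch below. It is written in Python as an illustration only. The `render` helper and the sample string are assumptions, not anything from Solr. Python's built-in `xmlcharrefreplace` error handler plays the role of the encoding callback: it is invoked only for characters the target charset cannot represent.

```python
import html

def render(stored: str, charset: str) -> bytes:
    """Escape markup characters, then encode for the target charset.

    Characters the charset cannot represent are written as decimal
    character references by the "xmlcharrefreplace" error handler.
    """
    escaped = html.escape(stored, quote=False)  # escapes &, < and > only
    return escaped.encode(charset, errors="xmlcharrefreplace")

# What the index holds: plain Unicode, no entities (\u2122 is the TM sign).
stored = "AT&T X\u2122 - Black"

ascii_page = render(stored, "ascii")   # TM sign becomes &#8482;
utf8_page = render(stored, "utf-8")    # TM sign is sent as UTF-8 bytes
print(ascii_page)
print(utf8_page)
```

Because the ampersand is escaped at the same moment the entities are produced, the stored value never has to tell a literal "&trade;" apart from the character it names.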

On Nov 21, 2013, at 7:50 AM, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> "Would you store "a" as "&#65;" ?"
> 
> No, not in any case.
> 
> -- Jack Krupansky

--
Walter Underwood
wunder@wunderwood.org




Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Jack Krupansky <ja...@basetechnology.com>.
"there is not really anything special about "special" characters"

Well, the distinction was about "named entities", which are indeed special.

Besides, in general, for more sophisticated text processing, character 
"types" are a valid distinction.

But all of this leaves the original question open: "I need to store 
the HTML Entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather 
than storing the original value."

Maybe the original poster could clarify the nature of their need.

-- Jack Krupansky

-----Original Message----- 
From: Michael Sokolov
Sent: Thursday, November 21, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

OK - probably I should have said "A", or "&#97;" :)  My point was just
that there is not really anything special about "special" characters.


Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
OK - probably I should have said "A", or "&#97;" :)  My point was just 
that there is not really anything special about "special" characters.

On 11/21/2013 10:50 AM, Jack Krupansky wrote:
> "Would you store "a" as "&#65;" ?"
>
> No, not in any case.
>
> -- Jack Krupansky


Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Jack Krupansky <ja...@basetechnology.com>.
"Would you store "a" as "&#65;" ?"

No, not in any case.

-- Jack Krupansky

-----Original Message----- 
From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter.  Use unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store "a" as "&#65;" ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?

-Mike


Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I have to agree w/Walter.  Use unicode as a storage format.  The entity 
encodings are for transfer/interchange.  Encode/decode on the way in and 
out if you have to.  Would you store "a" as "&#65;" ?  It makes it 
impossible to search for, for one thing.  What if someone wants to 
search for the TM character?

-Mike
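
A rough illustration of "decode on the way in" and of the search problem, using Python and a plain substring test as a stand-in for a query. Real matching in Solr depends on the field's analysis chain, and the variable names are hypothetical.

```python
import html

# Entity-encoded input, as some upstream system might send it.
incoming = "X&#8482; - Black"

# Decode once, on the way in, and keep the Unicode text.
stored = html.unescape(incoming)

# A search for the literal TM character finds the decoded value...
found_in_stored = "\u2122" in stored
# ...but not the entity-encoded form, which contains only ASCII.
found_in_encoded = "\u2122" in incoming
print(stored, found_in_stored, found_in_encoded)
```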

On 11/20/13 12:07 PM, Jack Krupansky wrote:
> AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format 
> for storing text to be rendered. If you disagree - try explaining 
> yourself.
>
> But maybe TM should be encoded as "&trade;". Ditto for other named 
> SGML entities.
>
> -- Jack Krupansky


Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Jack Krupansky <ja...@basetechnology.com>.
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format for 
storing text to be rendered. If you disagree - try explaining yourself.

But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
entities.

-- Jack Krupansky

-----Original Message----- 
From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y problem. 
Storing Unicode characters as XML/HTML encoded character references is an 
extremely bad idea.

wunder

--
Walter Underwood
wunder@wunderwood.org




Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Walter Underwood <wu...@wunderwood.org>.
Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea.

wunder

On Nov 20, 2013, at 5:01 AM, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored.
> 
> -- Jack Krupansky

--
Walter Underwood
wunder@wunderwood.org




Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Jack Krupansky <ja...@basetechnology.com>.
Any analysis filtering affects the indexed value only, but the stored value 
would be unchanged from the original input value. An update processor lets 
you modify the original input value that will be stored.

-- Jack Krupansky
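
The distinction can be pictured with a toy model, sketched here in Python. This is not Solr code: `add_document` and `to_entity` are invented for illustration, and whitespace splitting stands in for a real tokenizer. The only point is the ordering: an update processor changes the value before it is stored, while analysis changes only the terms that go into the index.

```python
def add_document(value, update_processor=None, char_filter=None):
    """Toy indexing pipeline returning (stored value, indexed terms)."""
    # Update processors run first and change the value itself.
    if update_processor is not None:
        value = update_processor(value)
    stored = value
    # Analysis (char filters, tokenizer, ...) shapes only the index terms.
    analyzed = char_filter(value) if char_filter is not None else value
    indexed = analyzed.split()
    return stored, indexed

def to_entity(text):
    # Hypothetical mapping: TM sign (U+2122) to its decimal reference.
    return text.replace("\u2122", "&#8482;")

# Same mapping applied at two different stages.
stored_cf, indexed_cf = add_document("X\u2122 - Black", char_filter=to_entity)
stored_up, indexed_up = add_document("X\u2122 - Black", update_processor=to_entity)
print(stored_cf, indexed_cf)
print(stored_up, indexed_up)
```

With the char filter the stored value still contains the original character; only the update-processor variant returns the entity form from the stored field.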

-----Original Message----- 
From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charFilter in the analyzer chain for
indexing *and* searching? E.g.
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™"
replacement="&amp;#8482;" />
or
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-specials.txt" />

Uwe


Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Uwe Reh <re...@hebis.uni-frankfurt.de>.
What about having a simple charFilter in the analyzer chain for 
indexing *and* searching? E.g.
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" 
replacement="&amp;#8482;" />
or
<charFilter class="solr.MappingCharFilterFactory" 
mapping="mapping-specials.txt" />

Uwe

On 19.11.2013 23:46, Developer wrote:
> I have data coming in to SOLR as below.
>
> <field name="displayName">X™ - Black</field>
>
> I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
> in SOLR rather than storing the original value.
>
> Is there a way to do this?
>


Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Walter Underwood <wu...@wunderwood.org>.
Why do you want to do this? You can always do this transformation on the presentation side. Doing this on the search server could be a really bad idea.

wunder

On Nov 19, 2013, at 8:19 PM, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> You could use an update processor to map non-ASCII codes to SGML entities. You could code it as a JavaScript script and use the stateless script update processor.
> 
> -- Jack Krupansky

--
Walter Underwood
wunder@wunderwood.org




Re: How to index X™ as &#8482; (HTML decimal entity)

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could use an update processor to map non-ASCII codes to SGML entities. 
You could code it as a JavaScript script and use the stateless script update 
processor.

-- Jack Krupansky
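
The mapping such a processor would apply is small. Below is a minimal sketch in Python rather than as an actual Solr script; the function name `to_decimal_entities` is hypothetical and the logic would need to be ported to whatever scripting language the update processor is configured with.

```python
def to_decimal_entities(text: str) -> str:
    """Replace every non-ASCII character with its decimal reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};"
        for ch in text
    )

# \u2122 is the TM sign from the sample displayName field.
value = to_decimal_entities("X\u2122 - Black")
print(value)
```

Note that this leaves existing ampersands untouched, so the output becomes ambiguous if the input already contains text such as "&#8482;" or "&trade;".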

-----Original Message----- 
From: Developer
Sent: Tuesday, November 19, 2013 5:46 PM
To: solr-user@lucene.apache.org
Subject: How to index X™ as &#8482; (HTML decimal entity)

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>

I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
in SOLR rather than storing the original value.

Is there a way to do this?


