Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/05/26 15:20:49 UTC

Removing characters like '\n \n' from indexing

Hi,

Is there a way to remove special characters like \n when indexing rich
text documents?

I have quite a lot of leading \n \n in front of the indexed content of my
rich text documents, due to spaces and empty lines in the original
documents, so the content is flooded with '\n \n' at the start before the
actual content begins. This makes the content look ugly, and also takes up
unnecessary bandwidth in the system.


Regards,
Edwin

Re: Removing characters like '\n \n' from indexing

Posted by Erick Erickson <er...@gmail.com>.
The other alternative is to use SolrJ to parse the documents and do
your processing there. Here's an article on the pros/cons and an
example program.

https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, May 27, 2015 at 1:57 AM, Erik Hatcher <er...@gmail.com> wrote:
> Edwin -
>
> There’s a bunch of built-in update processors you can use, including a script one that allows you to code it dynamically in JavaScript (or other JVM scripting language).
>
> See https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors for an exhaustive list.  The RegexReplaceProcessorFactory probably will do what you need.
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com

Re: Removing characters like '\n \n' from indexing

Posted by Erik Hatcher <er...@gmail.com>.
Edwin -

There’s a bunch of built-in update processors you can use, including a script one that allows you to code it dynamically in JavaScript (or other JVM scripting language).

See https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors for an exhaustive list.  The RegexReplaceProcessorFactory probably will do what you need.
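
For reference, a minimal sketch of what such a chain might look like in
solrconfig.xml. The chain name and the field name "content" are assumptions
for illustration, not taken from Edwin's setup, and the fragment is untested
against Solr 5.1:

```xml
<!-- solrconfig.xml: the chain name and the "content" field are assumed -->
<updateRequestProcessorChain name="strip-leading-whitespace">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <!-- any run of whitespace (spaces, \n, \t) at the very start of the value -->
    <str name="pattern">^\s+</str>
    <!-- replace the match with nothing -->
    <str name="replacement"></str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- the chain must end with the processor that actually runs the update -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The pattern ^\s+ only matches whitespace at the very start of the value, so
line breaks inside the body are left alone. If trailing whitespace should go
as well, solr.TrimFieldUpdateProcessorFactory may be a simpler fit.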

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




> On May 27, 2015, at 3:36 AM, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
> 
> Hi Shawn,
> 
> Thanks for your reply.
> 
> So does that mean the only way is to write my own custom class to remove
> characters like '\n'?
> 
> 
> Regards,
> Edwin


Re: Removing characters like '\n \n' from indexing

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Shawn,

Thanks for your reply.

So does that mean the only way is to write my own custom class to remove
characters like '\n'?


Regards,
Edwin



On 27 May 2015 at 14:46, Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/26/2015 10:16 PM, Zheng Lin Edwin Yeo wrote:
> > I tried to follow the example here
> > https://wiki.apache.org/solr/UpdateRequestProcessor, by putting
> > the updateRequestProcessorChain in my solrconfig.xml
> >
> > But I'm getting the following error when I tried to reload the core.
> >
> > Caused by: org.apache.solr.common.SolrException: Error loading class
> > 'solr.CustomUpdateRequestProcessorFactory'
> >
> > Is there anything I might have missed out? I'm using Solr 5.1.
>
> CustomUpdateRequestProcessorFactory is not the name of an actual usable
> update processor.  On that wiki page, it is a placeholder for a custom
> class name.
>
> This class actually does exist within the Solr source code, but it is
> defined in the *TEST* code, not the main source code that is compiled
> into the Solr download.
>
> I've updated the wiki page to try making this more clear, by using an
> entirely fictional class name.
>
> Thanks,
> Shawn
>
>

Re: Removing characters like '\n \n' from indexing

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/26/2015 10:16 PM, Zheng Lin Edwin Yeo wrote:
> I tried to follow the example here
> https://wiki.apache.org/solr/UpdateRequestProcessor, by putting
> the updateRequestProcessorChain in my solrconfig.xml
> 
> But I'm getting the following error when I tried to reload the core.
> 
> Caused by: org.apache.solr.common.SolrException: Error loading class
> 'solr.CustomUpdateRequestProcessorFactory'
> 
> Is there anything I might have missed out? I'm using Solr 5.1.

CustomUpdateRequestProcessorFactory is not the name of an actual usable
update processor.  On that wiki page, it is a placeholder for a custom
class name.

This class actually does exist within the Solr source code, but it is
defined in the *TEST* code, not the main source code that is compiled
into the Solr download.

I've updated the wiki page to try making this more clear, by using an
entirely fictional class name.

Thanks,
Shawn


Re: Removing characters like '\n \n' from indexing

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
I tried to follow the example here
https://wiki.apache.org/solr/UpdateRequestProcessor, by putting
the updateRequestProcessorChain in my solrconfig.xml

But I'm getting the following error when I tried to reload the core.

Caused by: org.apache.solr.common.SolrException: Error loading class
'solr.CustomUpdateRequestProcessorFactory'

Is there anything I might have missed out? I'm using Solr 5.1.


Regards,
Edwin


On 27 May 2015 at 10:13, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:

> I'm using ExtractingRequestHandler to do the indexing. Do I have to
> implement the UpdateProcessor method at the ExtractingRequestHandler or
> as a separate method?
>
> Regards,
> Edwin

Re: Removing characters like '\n \n' from indexing

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
I'm using ExtractingRequestHandler to do the indexing. Do I have to
implement the UpdateProcessor method at the ExtractingRequestHandler or as
a separate method?

Regards,
Edwin

On 26 May 2015 at 23:42, Alessandro Benedetti <be...@gmail.com>
wrote:

> I think this is still on topic.
> Assuming we are using the Extract Update handler, I think the update
> processor approach still applies.
> But is it not possible to strip them directly with some extract request
> handler param?

Re: Removing characters like '\n \n' from indexing

Posted by Alessandro Benedetti <be...@gmail.com>.
I think this is still on topic.
Assuming we are using the Extract Update handler (ExtractingRequestHandler),
I think the update processor approach still applies.
But is it not possible to strip them directly with some extract request
handler param?
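
If that holds, one way to wire a chain into the extract handler is the
update.chain parameter, set either per request or as a handler default. A
sketch, assuming a chain named strip-leading-whitespace (a made-up name) is
already defined elsewhere in solrconfig.xml:

```xml
<!-- solrconfig.xml: route extracted documents through a named update chain -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- "strip-leading-whitespace" is a hypothetical chain name -->
    <str name="update.chain">strip-leading-whitespace</str>
  </lst>
</requestHandler>
```

The same parameter can be passed on an individual request instead, as
update.chain=strip-leading-whitespace, which leaves the handler defaults
untouched while testing.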


2015-05-26 16:33 GMT+01:00 Jack Krupansky <ja...@gmail.com>:

> Neither - it removes the characters before indexing. The distinction is
> that if you remove them during indexing they will still appear in the
> stored field values even if they are removed from the indexed values, but
> by removing them before indexing, they will not appear in the stored field
> values. Again, the distinction is between indexed field values and stored
> field values.
>
> -- Jack Krupansky



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Removing characters like '\n \n' from indexing

Posted by Jack Krupansky <ja...@gmail.com>.
Neither - it removes the characters before indexing. The distinction is
that if you remove them during indexing (in the field's analysis chain),
they will still appear in the stored field values even though they are
removed from the indexed values. By removing them before indexing, in an
update processor, they will not appear in the stored field values either.
Again, the distinction is between indexed field values and stored field
values.

-- Jack Krupansky

On Tue, May 26, 2015 at 10:25 AM, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> It is showing up in the search results. Just to confirm, does this
> UpdateProcessor method remove the characters during indexing or only after
> indexing has been done?
>
> Regards,
> Edwin

Re: Removing characters like '\n \n' from indexing

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
It is showing up in the search results. Just to confirm, does this
UpdateProcessor method remove the characters during indexing or only after
indexing has been done?

Regards,
Edwin

On 26 May 2015 at 21:30, Upayavira <uv...@odoko.co.uk> wrote:

>
>
> On Tue, May 26, 2015, at 02:20 PM, Zheng Lin Edwin Yeo wrote:
> > Hi,
> >
> > Is there a way to remove the special characters like \n during indexing
> > of
> > the rich text documents.
> >
> > I have quite a lot of leading \n \n in front of my indexed content of rich
> > text documents due to the space and empty lines with the original
> > documents, and it's causing the content to be flooded with '\n \n' at the
> > start before the actual content comes in. This causes the content to look
> > ugly, and also takes up unnecessary bandwidth in the system.
>
> Where is this showing up?
>
> If it is in search results, you must use an UpdateProcessor, as these
> run before fields are stored (e.g. RegexReplaceProcessorFactory).
>
> If you are concerned about facet results, then you can do it in an
> analysis chain, for example with a PatternReplaceFilterFactory.
>
> Upayavira
>

Re: Removing characters like '\n \n' from indexing

Posted by Upayavira <uv...@odoko.co.uk>.

On Tue, May 26, 2015, at 02:20 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
> 
> Is there a way to remove the special characters like \n during indexing
> of
> the rich text documents.
> 
> I have quite a lot of leading \n \n in front of my indexed content of rich
> text documents due to the space and empty lines with the original
> documents, and it's causing the content to be flooded with '\n \n' at the
> start before the actual content comes in. This causes the content to look
> ugly, and also takes up unnecessary bandwidth in the system.

Where is this showing up?

If it is in search results, you must use an UpdateProcessor, as these
run before fields are stored (e.g. RegexReplaceProcessorFactory).

If you are concerned about facet results, then you can do it in an
analysis chain, for example with a PatternReplaceFilterFactory.

Upayavira
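
For the facet case, a sketch of an analysis chain that strips leading
whitespace from the indexed term only. The field type name is made up for
illustration; solr.PatternReplaceFilterFactory is the pattern-based token
filter that ships with Solr:

```xml
<!-- schema.xml: "string_trimmed" is a hypothetical field type name -->
<fieldType name="string_trimmed" class="solr.TextField">
  <analyzer>
    <!-- keep the whole value as a single token, as a facet field would want -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- drop any whitespace (including \n) at the start of the token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\s+" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

This only changes the indexed terms. The stored value is untouched, so
search results would still show the leading '\n \n' unless an update
processor removes it first.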