Posted to solr-user@lucene.apache.org by marotosg <ma...@gmail.com> on 2016/11/14 16:19:05 UTC

RTF Rich text format

Hi,

I have a use case where I need to index data coming from a database
in which one field contains Rich Text Format (RTF) content. I would like to
convert that content into plain text, the same way Tika does when indexing
documents.

Is there any way to achieve that with a single field, where I send the rich
text and Solr then cleans the data? I can't find anything so far.

Thanks
Sergio



--
View this message in context: http://lucene.472066.n3.nabble.com/RTF-Rich-text-format-tp4305778.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: RTF Rich text format

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
The logical place to do that (if you cannot do it outside of Solr) would
be in an UpdateRequestProcessor.
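
For the "outside of Solr" option, here is a minimal sketch of stripping RTF
markup on the client before the document is sent. It is not what Tika does
internally and it is not a full RTF parser: it assumes cp1252 for hex-escaped
bytes, skips only a small hand-picked set of destinations, and drops all
formatting. The names rtf_to_text, SKIP_DESTINATIONS and SPECIAL are made up
for this illustration.

```python
import re

# Destinations whose content is metadata rather than text (partial list, an assumption).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

# Control words that stand for a literal character.
SPECIAL = {"par": "\n", "line": "\n", "tab": "\t"}

TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional delimiter space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte, e.g. \'e9
    r"|\\([^a-zA-Z])"            # control symbol, e.g. \{ \} \\ \*
    r"|([{}])"                   # group open / close
    r"|[\r\n]+"                  # raw line breaks carry no meaning in RTF
    r"|(.)"                      # anything else is document text
)


def rtf_to_text(rtf: str) -> str:
    """Return the plain text of a simple RTF string (best effort)."""
    out = []
    stack = []         # saved "ignoring" flag for each open group
    ignoring = False   # True while inside a skipped destination
    pending_skip = 0   # fallback characters to drop after a \uN escape
    for match in TOKEN.finditer(rtf):
        word, arg, hexbyte, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(ignoring)
        elif brace == "}":
            ignoring = stack.pop() if stack else False
        elif symbol is not None:
            if symbol == "*":
                ignoring = True          # \* marks an optional destination
            elif symbol in "{}\\" and not ignoring:
                out.append(symbol)       # escaped literal brace or backslash
        elif word is not None:
            if word in SKIP_DESTINATIONS:
                ignoring = True
            elif ignoring:
                continue
            elif word in SPECIAL:
                out.append(SPECIAL[word])
            elif word == "u" and arg is not None:
                code = int(arg)
                out.append(chr(code + 65536 if code < 0 else code))
                pending_skip = 1         # assumes the default \uc1 fallback length
        elif hexbyte is not None:
            if pending_skip:
                pending_skip -= 1
            elif not ignoring:
                # cp1252 is an assumption; real files declare a code page
                out.append(bytes.fromhex(hexbyte).decode("cp1252", errors="replace"))
        elif char is not None:
            if pending_skip:
                pending_skip -= 1
            elif not ignoring:
                out.append(char)
    return "".join(out)


sample = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0\fs24 Hello, {\b world}!\par Caf\'e9 \u8364?5}"
print(rtf_to_text(sample))
```

For anything beyond simple content (tables, embedded objects, other code
pages), running Tika itself on the client side is likely the more robust
choice.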

Unfortunately, there is no TikaExtract URP, though other similar ones
exist (e.g. for language guessing). The full list is here:
http://www.solr-start.com/info/update-request-processors/

But you could write one. You'd have to be very careful when using Tika
not to leak memory and to handle the failure states, but technically it
should be possible.

Regards,
   Alex.
----
Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/



Re: RTF Rich text format

Posted by Sergio García Maroto <ma...@gmail.com>.
Thanks for the response.

I am afraid I can't use the DataImportHandler. I do the indexing with an
Indexation Service that joins data from several places.

I have a final XML document with plenty of data, and one of its fields is
the RTF field. That's the XML I send to Solr using /update. I am wondering
whether Solr could do the conversion with a tokenizer filter or something
like that.
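
Since the indexing service already builds the XML that goes to /update, one
place to clean the field is right before that XML is posted. The following is
only a sketch under assumptions: the function name clean_update_xml and the
field name "notes_rtf" are hypothetical, the XML is assumed to be in the
standard <add><doc><field name="..."> update format, and the converter is
passed in as a plain function (a placeholder is used here; a real setup would
plug in an RTF-to-text converter such as Tika).

```python
import xml.etree.ElementTree as ET
from typing import Callable


def clean_update_xml(update_xml: str, field_name: str,
                     convert: Callable[[str], str]) -> str:
    """Apply `convert` to every <field name=field_name> in a Solr update XML."""
    root = ET.fromstring(update_xml)
    for field in root.iter("field"):
        # Only touch the named field, and only when it actually has content.
        if field.get("name") == field_name and field.text:
            field.text = convert(field.text)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: "notes_rtf" stands in for the real RTF field name.
update_xml = (
    '<add><doc>'
    '<field name="id">1</field>'
    '<field name="notes_rtf">RAW RTF CONTENT</field>'
    '</doc></add>'
)

# str.lower is just a placeholder converter for the demonstration.
cleaned = clean_update_xml(update_xml, "notes_rtf", str.lower)
print(cleaned)
```

The cleaned string is what would then be sent to /update in place of the
original XML. Passing the converter as a parameter keeps the XML handling
separate from the RTF handling, so either side can be swapped out.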


Re: RTF Rich text format

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I think DataImportHandler with a nested entity (JDBC, then Tika with
FieldReaderDataSource) should do the trick.

Have you tried that?

Regards,
   Alex.

