Posted to user@manifoldcf.apache.org by Marisol Redondo <ma...@gmail.com> on 2017/06/01 14:28:03 UTC

Re: UTF-8 Format from Confluence to Solr

I fixed the problem.

The cause is that the Confluence connector reads the entity returned for
each request using the default encoding (ISO-8859-1) instead of UTF-8.

To fix it, I changed the Confluence connector so that every place that reads
the entity calls EntityUtils.toString(entity, "UTF-8").
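
A minimal sketch of why the stray character appears, assuming the page
contains a non-breaking space (U+00A0). It uses only the JDK rather than the
connector's HttpClient code; CharsetDemo and decode are hypothetical names
and are not part of ManifoldCF. Decoding the UTF-8 bytes as ISO-8859-1 turns
the two-byte sequence C2 A0 into two characters, the first of which is Â.
Passing the charset explicitly, as the second argument of
EntityUtils.toString does in the connector, avoids that.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    // Hypothetical helper: decode a raw response body with an explicit charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "a", a non-breaking space, "b", encoded as UTF-8 (bytes 61 C2 A0 62).
        byte[] body = "a\u00A0b".getBytes(StandardCharsets.UTF_8);

        // Wrong: each byte becomes one character, so C2 shows up as "Â".
        String wrong = decode(body, StandardCharsets.ISO_8859_1);
        // Right: C2 A0 is read back as the single character U+00A0.
        String right = decode(body, StandardCharsets.UTF_8);

        System.out.println(wrong.length());           // 4
        System.out.println(right.length());           // 3
        System.out.println(wrong.contains("\u00C2")); // true
    }
}
```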

Thanks


On 31 May 2017 at 10:13, Marisol Redondo <ma...@gmail.com>
wrote:

> Hi.
>
> I'm having problems with the encoding when indexing content from a
> Confluence wiki into Solr 6 in standalone mode.
>
> I have ManifoldCF 2.5 with Tomcat 8.
>
> The job's repository connector takes the information from a Confluence
> wiki and the output connector is Solr, using the Tika transformation, a
> custom transformation and a Metadata Adjuster.
>
> When a document is indexed into Solr, its content contains some characters
> that should not be there because they are not in the Confluence page,
> mainly a Â character.
>
> I have checked that Confluence and the Tomcat server where ManifoldCF is
> running use UTF-8, that the HTTP request to Confluence has the
> Accept-Charset header set to UTF-8, and that the Solr server accepts UTF-8.
>
> In the log I can see that the content is fine when it is retrieved from
> Confluence, but it contains the extra character when it is sent to Solr.
> I have also tried without any transformer and get the same log entry.
>
> Is this a bug, or how can I resolve this?
>
> Thanks for your help

Re: UTF-8 Format from Confluence to Solr

Posted by Karl Wright <da...@gmail.com>.

Committed a fix.
Karl


On Mon, Jun 12, 2017 at 7:27 PM, Karl Wright <da...@gmail.com> wrote:

> There's already a ticket for this, assigned to me.  CONNECTORS-1251.  I'll
> freshen it up.
>
> Karl

Re: UTF-8 Format from Confluence to Solr

Posted by Karl Wright <da...@gmail.com>.

There's already a ticket for this, assigned to me.  CONNECTORS-1251.  I'll
freshen it up.

Karl




On Mon, Jun 12, 2017 at 2:52 PM, Furkan KAMACI <fu...@gmail.com>
wrote:

> Hi Marisol,
>
> You can create a ticket from here:
> https://issues.apache.org/jira/projects/CONNECTORS
>
> Kind Regards,
> Furkan KAMACI

Re: UTF-8 Format from Confluence to Solr

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi Marisol,

You can create a ticket from here:
https://issues.apache.org/jira/projects/CONNECTORS

Kind Regards,
Furkan KAMACI


On Mon, 12 Jun 2017 at 18:25, Marisol Redondo <
marisol.redondo.garcia@gmail.com> wrote:

> How can I do that?

Re: UTF-8 Format from Confluence to Solr

Posted by Marisol Redondo <ma...@gmail.com>.

How can I do that?

On 1 June 2017 at 16:43, Antonio David Pérez Morales <
adperezmorales@gmail.com> wrote:

> Hi Marisol
>
> Would you mind creating a ticket and providing a patch?
>
> This way we can test it on our end and include it in the next ManifoldCF
> release.
>
> Thanks
>
> Regards

Re: UTF-8 Format from Confluence to Solr

Posted by Antonio David Pérez Morales <ad...@gmail.com>.

Hi Marisol

Would you mind creating a ticket and providing a patch?

This way we can test it on our end and include it in the next ManifoldCF
release.

Thanks

Regards

2017-06-01 16:28 GMT+02:00 Marisol Redondo <
marisol.redondo.garcia@gmail.com>:

> I fixed the problem.
>
> The cause is that the Confluence connector reads the entity returned for
> each request using the default encoding (ISO-8859-1) instead of UTF-8.
>
> To fix it, I changed the Confluence connector so that every place that
> reads the entity calls EntityUtils.toString(entity, "UTF-8").
>
> Thanks