Posted to dev@manifoldcf.apache.org by Piergiorgio Lucidi <pi...@apache.org> on 2014/06/05 15:13:48 UTC
Solr Output Connector - Questions
Hi guys,
I'm wondering whether it is possible to disable URL encoding of field names
in this connector.
I'm trying to index content from a repository whose field names contain a
URI, similar to the following:
{http://www.alfresco.org/model/system}store_identifier
But the field is stored in Solr in this way:
http_3a_2f_2fwww_alfresco_org_2fmodel_2fsystem_2f1_0_7dstore_identifier
That is a problem: our customer has to create a copyField to work around it.
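For reference, here is a minimal sketch of what plain URL encoding does to a field name like the one above. The class and method names are made up for illustration and are not part of ManifoldCF. It covers only the URL-encoding step; the name stored in Solr shows underscores where the percent signs and dots would be, and the step that produces those is not shown in this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Illustrative only: shows what URL encoding does to a namespaced field name. */
class FieldNameEncodingDemo {

  /** URL-encodes a field name as UTF-8 (application/x-www-form-urlencoded rules). */
  static String encode(String fieldName) {
    try {
      return URLEncoder.encode(fieldName, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 support is mandatory on every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  /** Reverses encode(). */
  static String decode(String encoded) {
    try {
      return URLDecoder.decode(encoded, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String original = "{http://www.alfresco.org/model/system}store_identifier";
    String encoded = encode(original);
    // Braces, colons and slashes are percent-escaped; dots and underscores are kept.
    System.out.println(encoded);
    // The encoding is reversible, so the original name can be recovered.
    System.out.println(decode(encoded).equals(original));
  }
}
```

The point of the sketch is that the encoded form is reversible but is no longer the name a Solr schema author would expect, which is why a copyField ends up being needed.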
Is there a reason why we continue to use the URL encoding?
I saw in the code that it seems to be an issue related to SolrJ, but do you
think that we can find a workaround for this?
Thank you.
Cheers,
Piergiorgio
--
Piergiorgio Lucidi
Open Source ECM Specialist
http://www.open4dev.com
Re: Solr Output Connector - Questions
Posted by Piergiorgio Lucidi <pi...@apache.org>.
I have created the ticket to track this:
https://issues.apache.org/jira/browse/CONNECTORS-956
and now I'm trying to remove the preEncode invocation.
2014-06-05 15:26 GMT+02:00 Karl Wright <da...@gmail.com>:
> Ah, it seems I remember it backwards.
>
> SolrJ did *not* url-encode field names; that was the bug. Instead, it
> tried to send field names in unencoded form, which would mess up URLs and
> cause problems. See this method in SolrConnector.java:
>
> >>>>>>
>   /** Preprocess field name.
>   * SolrJ has a bug where it does not URL-escape field names. This causes carnage for
>   * ManifoldCF, because it results in IllegalArgumentExceptions getting thrown deep in SolrJ.
>   * See CONNECTORS-630.
>   * In order to get around this, we need to URL-encode argument names, at least until the underlying
>   * SolrJ issue is fixed.
>   */
>   protected static String preEncode(String fieldName)
>   {
>     return URLEncoder.encode(fieldName);
>   }
> <<<<<<
>
> It sounds like the SolrJ issue may have been fixed. Can you try, or have
> the customer try, changing this method to just "return fieldName;"? Make
> sure that you can still ingest documents that have funky field names that
> include international characters and punctuation; these come from some of
> our connectors.
>
> Thanks,
> Karl
Re: Solr Output Connector - Questions
Posted by Karl Wright <da...@gmail.com>.
Ah, it seems I remember it backwards.
SolrJ did *not* url-encode field names; that was the bug. Instead, it
tried to send field names in unencoded form, which would mess up URLs and
cause problems. See this method in SolrConnector.java:
>>>>>>
  /** Preprocess field name.
  * SolrJ has a bug where it does not URL-escape field names. This causes carnage for
  * ManifoldCF, because it results in IllegalArgumentExceptions getting thrown deep in SolrJ.
  * See CONNECTORS-630.
  * In order to get around this, we need to URL-encode argument names, at least until the underlying
  * SolrJ issue is fixed.
  */
  protected static String preEncode(String fieldName)
  {
    return URLEncoder.encode(fieldName);
  }
<<<<<<
It sounds like the SolrJ issue may have been fixed. Can you try, or have
the customer try, changing this method to just "return fieldName;"? Make
sure that you can still ingest documents that have funky field names that
include international characters and punctuation; these come from some of
our connectors.
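One hedged way to run that experiment without losing the old behaviour is to make the encoding switchable and compare both outputs on a few awkward names. The sketch below is an assumption for illustration only: FieldNamePreprocessor, its constructor flag, and the sample names are hypothetical and are not taken from SolrConnector.java. It also passes an explicit "UTF-8" charset, unlike the one-argument URLEncoder.encode call quoted above, which is deprecated and uses the platform default encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper, not part of ManifoldCF: makes field-name encoding switchable. */
class FieldNamePreprocessor {

  private final boolean urlEncode;

  /** @param urlEncode true keeps the legacy URL-encoding, false passes names through. */
  FieldNamePreprocessor(boolean urlEncode) {
    this.urlEncode = urlEncode;
  }

  /** Returns the field name to send to Solr. */
  String preprocess(String fieldName) {
    if (!urlEncode) {
      // Proposed behaviour: equivalent to "return fieldName;".
      return fieldName;
    }
    try {
      // Legacy behaviour, with an explicit charset instead of the platform default.
      return URLEncoder.encode(fieldName, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 support is mandatory on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Sample names with punctuation and international characters (made up for testing).
    String[] samples = {
      "{http://www.alfresco.org/model/system}store_identifier",
      "caf\u00e9 menu",
      "title&author=x",
      "plain_field"
    };
    FieldNamePreprocessor legacy = new FieldNamePreprocessor(true);
    FieldNamePreprocessor passThrough = new FieldNamePreprocessor(false);
    for (String name : samples) {
      System.out.println(passThrough.preprocess(name) + " -> " + legacy.preprocess(name));
    }
  }
}
```

This only shows what each mode would hand to SolrJ. Whether SolrJ and Solr accept the unencoded names still has to be checked by actually ingesting documents, as described above.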
Thanks,
Karl
On Thu, Jun 5, 2014 at 9:20 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Piergiorgio,
>
> I had a back-and-forth with Erik Hatcher about this issue about a year
> ago. Solr technically accepts only a limited set of characters, and SolrJ
> is therefore not well coded to deal with anything much out of the
> ordinary. I tried to get them to consider removing the url encoding, but
> they said no for that reason: "you are doing things which you shouldn't be
> doing anyway".
>
> We did work around this problem for field *values*. I'll review what was
> done to see if it can be applied to field *names* too. In the meantime,
> please open a ticket to track the issue.
>
> Thanks,
> Karl
Re: Solr Output Connector - Questions
Posted by Karl Wright <da...@gmail.com>.
Hi Piergiorgio,
I had a back-and-forth with Erik Hatcher about this issue about a year
ago. Solr technically accepts only a limited set of characters, and SolrJ
is therefore not well coded to deal with anything much out of the
ordinary. I tried to get them to consider removing the url encoding, but
they said no for that reason: "you are doing things which you shouldn't be
doing anyway".
We did work around this problem for field *values*. I'll review what was
done to see if it can be applied to field *names* too. In the meantime,
please open a ticket to track the issue.
Thanks,
Karl