Posted to dev@manifoldcf.apache.org by "Karl Wright (Jira)" <ji...@apache.org> on 2021/03/20 07:39:00 UTC

[jira] [Commented] (CONNECTORS-1666) ElasticSearch connector cannot use full URLs for IDs

    [ https://issues.apache.org/jira/browse/CONNECTORS-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305344#comment-17305344 ] 

Karl Wright commented on CONNECTORS-1666:
-----------------------------------------

{quote}
Hi there,

I've found another problem in the Elasticsearch connector.
The Elasticsearch output connector uses the URI string as the document ID.
Elasticsearch limits the length of an ID to 512 bytes.
If the URI is too long, the request fails with an HTTP 400 error.

I have prepared two solutions in the attached patch.
The first is URI decoding.
If the URI includes multibyte characters,
the ID ends up URL-encoded twice.
Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
This enlarges the ID unnecessarily.
So I added an option to decode the URI before it is encoded as the ID.

But the ID may still be longer than 512 bytes.
The second solution is hashing.
The newly added options are the following.
Raw) uses the URI string as is.
Hash) always hashes (SHA-1) the URI string.
Hash if long) hashes the URI only if its length exceeds 512 bytes.
The last one is provided for compatibility.

Both solutions cause a new problem.
If the URI is decoded or hashed,
the original URI can no longer be kept in each document.
So I added new fields.
URI field name) keeps the original URI string as is.
Decoded URI field name) keeps the decoded URI string.
The default settings leave these fields empty.


I sent the patch for Ingest-Attachment the other day.
This mail therefore attaches two patches.
apache-manifoldcf-2.18-elastic-id.patch.gz:
 The patch against 2.18, including the patch from the other day.
apache-manifoldcf-elastic-id.patch.gz:
 The patch against source that already has the other day's patch applied.

By the way, I tried to write the above up in the documentation.
But I found no suitable document in the ManifoldCF package.
The Elasticsearch documentation may have been written for an old specification.
Where can I describe these new specifications?
{quote}
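
The double-encoding effect described in the quoted message can be reproduced with the JDK alone. The following is only a sketch, not code from the attached patch: the class and method names are hypothetical, and java.net.URLEncoder applies form-style encoding (for example, a space becomes "+"), which may differ from what the connector actually uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper; not part of ManifoldCF or the attached patch.
class IdEncodingSketch {
    // Percent-encode a string as UTF-8 (form-style encoding, illustrative only).
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Reverse one level of percent-encoding.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Assumed behaviour of the "decode before encoding" option:
    // undo the encoding already present in the URI, then encode once.
    static String idWithDecoding(String alreadyEncodedUri) {
        return encode(decode(alreadyEncodedUri));
    }

    public static void main(String[] args) {
        String once = encode("\u3000");             // %E3%80%80
        String twice = encode(once);                // %25E3%2580%2580
        System.out.println(once + " -> " + twice);
        System.out.println(idWithDecoding(once));   // %E3%80%80
    }
}
```

Encoding an already encoded URI grows each multibyte character from 9 to 15 characters, which is why decoding first shortens the ID.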

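The three ID options named in the quoted message (Raw, Hash, Hash if long) might be sketched as follows. This is an illustration under stated assumptions, not the patch itself: the enum, the method names, the hex rendering of the SHA-1 digest, and measuring the length in UTF-8 bytes of the unencoded URI are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of ManifoldCF or the attached patch.
class IdHashingSketch {
    enum Mode { RAW, HASH, HASH_IF_LONG }

    // ID length limit quoted in the message above.
    static final int MAX_ID_BYTES = 512;

    // SHA-1 of the UTF-8 bytes, rendered as 40 lowercase hex characters (assumed format).
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is required on every Java platform
        }
    }

    // Choose the document ID for a URI according to the selected mode.
    static String documentId(String uri, Mode mode) {
        switch (mode) {
            case HASH:
                return sha1Hex(uri);
            case HASH_IF_LONG:
                int length = uri.getBytes(StandardCharsets.UTF_8).length;
                return length > MAX_ID_BYTES ? sha1Hex(uri) : uri;
            default:
                return uri; // RAW: use the URI as is
        }
    }
}
```

With "Hash if long", IDs of documents that were already indexed under short URIs stay unchanged, which matches the compatibility remark in the message.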

> ElasticSearch connector cannot use full URLs for IDs
> ----------------------------------------------------
>
>                 Key: CONNECTORS-1666
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1666
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector
>    Affects Versions: ManifoldCF 2.17
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: apache-manifoldcf-2.18-elastic-id.patch.gz, apache-manifoldcf-elastic-id-2.patch.gz, apache-manifoldcf-elastic-id.patch.gz
>
>
> The size of the ElasticSearch ID field is severely limited.  We therefore need to use a strategy to hash the ID when it gets too long so that ES doesn't fail on such documents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)