Posted to user@manifoldcf.apache.org by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp> on 2021/03/02 03:10:20 UTC

Another Elasticsearch patch to allow the long URI

Hi, there.

I've found another problem in the Elasticsearch connector.
The Elasticsearch output connector uses the URI string as the document ID.
Elasticsearch limits the ID length to 512 bytes.
If the URI is too long, indexing fails with an HTTP 400 error.

I've prepared two solutions in the attached patch.
The first is URI decoding.
If the URI includes multibyte characters,
the ID ends up URL-encoded twice.
Ex) U+3000 -> %E3%80%80 -> %25E3%2580%2580
This makes the ID longer than necessary.
So I've added an option to decode the URI before it is encoded as the ID.
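
As a rough illustration of the double encoding above, the following Java
sketch uses java.net.URLEncoder and java.net.URLDecoder as stand-ins for
whatever encoding the connector really applies. The class and method names
are hypothetical and are not taken from the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UriIdEncodingSketch {
  /** Encodes an already percent-encoded URI again, so every '%' becomes "%25". */
  static String encodeAsIs(String uri) {
    try {
      return URLEncoder.encode(uri, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a conforming JVM.
      throw new IllegalStateException(e);
    }
  }

  /** Decodes the URI first, so each character is percent-encoded only once. */
  static String decodeThenEncode(String uri) {
    try {
      return URLEncoder.encode(URLDecoder.decode(uri, "UTF-8"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String uri = "%E3%80%80"; // U+3000 as it appears in a crawled URI
    System.out.println(encodeAsIs(uri));       // %25E3%2580%2580 (15 bytes)
    System.out.println(decodeThenEncode(uri)); // %E3%80%80 (9 bytes)
  }
}
```

Note that URLDecoder also turns '+' into a space, so a real implementation
would need to decide how to treat URIs that contain a literal '+'.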

But the ID may still be longer than 512 bytes.
The second solution is hashing.
The newly added options are the following.
Raw) uses the URI string as is.
Hash) always hashes (SHA-1) the URI string.
Hash if long) hashes the URI only if its length exceeds 512 bytes.
The last one is provided for compatibility.
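
The three options above could be sketched in Java roughly as follows. This
is only an illustration, not the code in the patch: the enum, the method
names, and the hex output format are assumptions, and the real connector
may format the hash differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DocumentIdSketch {
  /** Hypothetical names for the Raw / Hash / Hash if long options. */
  enum IdMode { RAW, HASH, HASH_IF_LONG }

  /** Elasticsearch rejects document IDs longer than 512 bytes. */
  static final int MAX_ID_BYTES = 512;

  /** Returns the SHA-1 digest of the UTF-8 bytes of s as 40 lowercase hex characters. */
  static String sha1Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every conforming JDK is required to provide SHA-1.
      throw new IllegalStateException(e);
    }
  }

  /** Derives the document ID from the URI according to the selected mode. */
  static String toId(String uri, IdMode mode) {
    switch (mode) {
      case HASH:
        return sha1Hex(uri);
      case HASH_IF_LONG:
        boolean tooLong = uri.getBytes(StandardCharsets.UTF_8).length > MAX_ID_BYTES;
        return tooLong ? sha1Hex(uri) : uri;
      default:
        return uri; // RAW: keep the URI unchanged
    }
  }
}
```

With HASH_IF_LONG, documents whose URI already fits keep their existing ID,
which is why that mode preserves compatibility with an existing index.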

Both solutions cause a new problem.
If the URI is decoded or hashed,
the original URI can no longer be kept in each document.
So I've added new fields.
URI field name) keeps the original URI string as is.
Decoded URI field name) keeps the decoded URI string.
In the default settings, these fields are empty.


I sent the patch for Ingest-Attachment the other day,
so this mail attaches two patches.
apache-manifoldcf-2.18-elastic-id.patch.gz:
 The patch against 2.18, including the patch from the other day.
apache-manifoldcf-elastic-id.patch.gz:
 The patch against the source already patched the other day.

By the way, I tried to document the above somewhere,
but I couldn't find a suitable document in the ManifoldCF package.
The Elasticsearch documentation may have been written for an old specification.
Where can I describe these new specifications?

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Karl.

Karl Wright wrote:
>Backwards compatibility means that we very likely have to
>use the hash approach, and not use the decoding approach.

Do you object to the decoding?

It may be useless for users of alphabetic languages,
but it's useful for users of multibyte languages such as CJK.

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Karl Wright <da...@gmail.com>.
Hi - this is very helpful.  I would like you to officially create a ticket
in Jira: https://issues.apache.org/jira , project "CONNECTORS", and attach
these patches.  Backwards compatibility means that we very likely have to
use the hash approach, and not use the decoding approach.

Thanks,
Karl



Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, There.

Shirai Takashi/ 白井隆 wrote:
>I can use SHA-256 with Elasticsearch connector.

I've prepared a patch to support SHA-256.
It keeps the changes minimal, to avoid global effects.
Including the try-catch clause seems inelegant, though.

I can't decide which is better.

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Karl.

Karl Wright wrote:
>field).  I would like to know more about this.  Does the "types" field no
>longer work?  Should we send both, in order to be sure that the connector
>works with most versions of ElasticSearch?  Please help clarify so that I
>can finish this off.

The "types" field is meaningless in 6.x, and deprecated in 7.x.
Please see the following.
https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html

You shouldn't delete this field, for compatibility reasons.
But the latest Elasticsearch accepts only '_doc',
so the default value should be '_doc'.
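
A minimal sketch of this suggestion, assuming the connector builds a request
path of the form /<index>/<type>/<id>. The class and method names are
hypothetical, and the ID is assumed to be URL-encoded already.

```java
final class IndexPathSketch {
  /** Default mapping type; the only value accepted by recent Elasticsearch versions. */
  static final String DEFAULT_TYPE = "_doc";

  /** Builds the request path, falling back to "_doc" when no type is configured. */
  static String indexPath(String index, String type, String encodedId) {
    String effectiveType = (type == null || type.isEmpty()) ? DEFAULT_TYPE : type;
    return "/" + index + "/" + effectiveType + "/" + encodedId;
  }

  public static void main(String[] args) {
    System.out.println(indexPath("docs", null, "abc"));        // /docs/_doc/abc
    System.out.println(indexPath("docs", "generictype", "abc")); // /docs/generictype/abc
  }
}
```

Keeping the field configurable lets installations on older Elasticsearch
versions continue to send their existing type name.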

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Karl.

Karl Wright wrote:
>I have now updated (I think) everything that this patch actually has, save
>for one deprecated field substitution (the "types" field is now the "doc_"

I've checked the updated sources via git://git.apache.org/manifoldcf.git
and found a problem in the following code.

connectors/elasticsearch/connector/src/main/java/org/apache/manifoldcf/agents/output/elasticsearch/ElasticSearchIndex.java:
	if (useIngesterAttachment && inputStream != null) {
	...(Clause1)...
	}
	if (useMapperAttachments && inputStream != null) {
	...(Clause2)...
	}
	if (!useMapperAttachments && inputStream != null) {
	...(Clause3)...
	}

In the default case, only Clause3 is executed.
If useMapperAttachments is set, only Clause2 is executed.
(useMapperAttachments and useIngesterAttachment will never both be set.)
But if useIngesterAttachment is set, both Clause1 and Clause3 are executed.
These clauses must be mutually exclusive.
Clause1 and Clause3 each provide the contentAttributeName field.
If both of them are executed, this field will be duplicated.

Please fix it as follows.
	if (!useIngesterAttachment && !useMapperAttachments && inputStream != null) {
	...(Clause3)...
	}
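
The difference between the committed conditions and the proposed fix can be
checked with a small truth-table sketch in Java. The clause bodies are
replaced by labels, since the real bodies are elided above, and the class
and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AttachmentClauseSketch {
  /** Conditions as committed: Clause3 only checks useMapperAttachments. */
  static List<String> committed(boolean useIngesterAttachment,
                                boolean useMapperAttachments,
                                boolean hasStream) {
    List<String> executed = new ArrayList<>();
    if (useIngesterAttachment && hasStream) executed.add("Clause1");
    if (useMapperAttachments && hasStream) executed.add("Clause2");
    if (!useMapperAttachments && hasStream) executed.add("Clause3");
    return executed;
  }

  /** Conditions as proposed: Clause3 runs only when neither flag is set. */
  static List<String> proposed(boolean useIngesterAttachment,
                               boolean useMapperAttachments,
                               boolean hasStream) {
    List<String> executed = new ArrayList<>();
    if (useIngesterAttachment && hasStream) executed.add("Clause1");
    if (useMapperAttachments && hasStream) executed.add("Clause2");
    if (!useIngesterAttachment && !useMapperAttachments && hasStream) executed.add("Clause3");
    return executed;
  }

  public static void main(String[] args) {
    System.out.println(committed(true, false, true)); // [Clause1, Clause3]
    System.out.println(proposed(true, false, true));  // [Clause1]
    System.out.println(proposed(false, false, true)); // [Clause3]
  }
}
```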


P.S.
Where does "Ingester" come from?
The exact name of the plugin is "Ingest Attachment Processor Plugin".
https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

P.S.2
The exact name of the product is not "ElasticSearch" but "Elasticsearch".
https://www.elastic.co/what-is/elasticsearch

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Karl Wright <da...@gmail.com>.
I have now updated (I think) everything that this patch actually has, save
for one deprecated field substitution (the "types" field is now the "doc_"
field).  I would like to know more about this.  Does the "types" field no
longer work?  Should we send both, in order to be sure that the connector
works with most versions of ElasticSearch?  Please help clarify so that I
can finish this off.

The changes are committed to trunk; I would be very appreciative if Shirai
Takashi/ 白井隆 reviewed them there.  Thanks!
Karl


On Sat, Mar 20, 2021 at 4:32 AM Karl Wright <da...@gmail.com> wrote:

> Hi,
>
> Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .
>
> I did not commit the patches as given because I felt that the fix was a
> relatively narrow one and it could be implemented with no user
> involvement.  Adding control for the user was therefore beyond the scope of
> the repair.
>
> There are more changes in these patches than just the ID length issue.  I
> am working to add this functionality as well but without anything I would
> consider to be unneeded.
> Karl

Re: Another Elasticsearch patch to allow the long URI

Posted by Karl Wright <da...@gmail.com>.
Hi,

Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 .

I did not commit the patches as given because I felt that the fix was a
relatively narrow one and it could be implemented with no user
involvement.  Adding control for the user was therefore beyond the scope of
the repair.

There are more changes in these patches than just the ID length issue.  I
am working to add this functionality as well but without anything I would
consider to be unneeded.
Karl


On Fri, Mar 19, 2021 at 3:48 AM Karl Wright <da...@gmail.com> wrote:

> Thanks for the information.  I'll see what I can do.
> Karl

Re: Another Elasticsearch patch to allow the long URI

Posted by Karl Wright <da...@gmail.com>.
Thanks for the information.  I'll see what I can do.
Karl


On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>
wrote:

> Hi, Karl.
>
> Karl Wright wrote:
> >Hi - I'm still waiting for this patch to be attached to a ticket.  That is
> >the only way I believe we're allowed to accept it legally.
>
> Are you asking me to attach the patch to the JIRA ticket?
> I can't access JIRA because of our firewall.
> Sorry.
> What can I do without JIRA?
>
> ----
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shirai@nintendo.co.jp
>

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Karl.

Karl Wright wrote:
>Hi - I'm still waiting for this patch to be attached to a ticket.  That is
>the only way I believe we're allowed to accept it legally.

Are you asking me to attach the patch to the JIRA ticket?
I can't access JIRA because of our firewall.
Sorry.
What can I do without JIRA?

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Karl Wright <da...@gmail.com>.
Hi - I'm still waiting for this patch to be attached to a ticket.  That is
the only way I believe we're allowed to accept it legally.

Karl


On Thu, Mar 4, 2021 at 7:16 PM Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>
wrote:

> Hi, Karl.
>
> Karl Wright wrote:
> >I agree it is unlikely that the JDK will lose support for SHA-1 because it
> >is used commonly, as is MD5.  So please feel free to use it.
>
> I know.
> I think that SHA-1 is better on the whole.
> I don't mind if apache-manifoldcf-elastic-id-2.patch.gz is discarded.
>
> SHA-256 is surely safer against collisions.
> But the risk with SHA-1 is negligible unless someone creates a collision
> intentionally.
> It needs to be considered only when ManifoldCF is used on data from all
> over the world.
>
> ----
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shirai@nintendo.co.jp
>

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Karl.

Karl Wright wrote:
>I agree it is unlikely that the JDK will lose support for SHA-1 because it
>is used commonly, as is MD5.  So please feel free to use it.

I know.
I think that SHA-1 is better on the whole.
I don't mind if apache-manifoldcf-elastic-id-2.patch.gz is discarded.

SHA-256 is surely safer against collisions.
But the risk with SHA-1 is negligible unless someone creates a collision intentionally.
It needs to be considered only when ManifoldCF is used on data from all over the world.

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Karl Wright <da...@gmail.com>.
I agree it is unlikely that the JDK will lose support for SHA-1 because it
is used commonly, as is MD5.  So please feel free to use it.

Karl


On Wed, Mar 3, 2021 at 7:54 PM Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>
wrote:

> Hi, Jörn.
>
> Jörn Franke wrote:
> >Makes sense
>
> I don't think that it's easy.
>
>
> >>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when
> >>> it will be removed from JDK.
>
> I also know SHA-1 is dangerous.
> Someone can craft a string that hashes to the same SHA-1 value in order to
> impersonate another one.
> So SHA-1 should not be used for certificates.
> A future JDK may stop using SHA-1 for certificates.
> But the JDK will never stop supporting the SHA-1 algorithm itself.
>
> If SHA-1 were removed from the JDK,
> ManifoldCF could not be built, because SHA-1 is also used elsewhere.
> Some connectors already use SHA-1 for the ID value,
> so previously saved records would become inaccessible.
> I can use SHA-256 in the Elasticsearch connector.
> How should the other uses of SHA-1 be handled?
>
> ----
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shirai@nintendo.co.jp
>

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Jörn.

Jörn Franke wrote:
>Makes sense

I don't think that it's easy.


>>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will be removed from JDK.

I also know SHA-1 is dangerous.
Someone can craft a string that hashes to the same SHA-1 value in order to impersonate another one.
So SHA-1 should not be used for certificates.
A future JDK may stop using SHA-1 for certificates.
But the JDK will never stop supporting the SHA-1 algorithm itself.

If SHA-1 were removed from the JDK,
ManifoldCF could not be built, because SHA-1 is also used elsewhere.
Some connectors already use SHA-1 for the ID value,
so previously saved records would become inaccessible.
I can use SHA-256 in the Elasticsearch connector.
How should the other uses of SHA-1 be handled?

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Jörn Franke <jo...@gmail.com>.
Makes sense

> On 02.03.2021 at 08:33, Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp> wrote:
> 
> Hi, Jörn.
> 
> Jörn Franke wrote:
>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will be removed from JDK.
> 
> SHA-1 is used in an existing ManifoldCF class
> (org.apache.manifoldcf.core.system.ManifoldCF).
> If "SHA" is replaced with "SHA-256" in this class,
> the default algorithm changes everywhere.
> I've just followed the ManifoldCF convention.
> I also think SHA-256 or later is better.
> 
> Why does the current ManifoldCF use SHA-1?
> Depending on the reason, this case may have to use SHA-1 too.
> If the reason is only compatibility,
> I can redesign the method ManifoldCF.hash()
> to add an argument that specifies the algorithm.
> 
> ----
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Shirai Takashi/ 白井隆 <sh...@nintendo.co.jp>.
Hi, Jörn.

Jörn Franke wrote:
>Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will be removed from JDK.

SHA-1 is used in an existing ManifoldCF class
(org.apache.manifoldcf.core.system.ManifoldCF).
If "SHA" is replaced with "SHA-256" in this class,
the default algorithm changes everywhere.
I've just followed the ManifoldCF convention.
I also think SHA-256 or later is better.

Why does the current ManifoldCF use SHA-1?
Depending on the reason, this case may have to use SHA-1 too.
If the reason is only compatibility,
I can redesign the method ManifoldCF.hash()
to add an argument that specifies the algorithm.
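
One possible shape for such a redesign is sketched below in Java. This is
not the real ManifoldCF.hash() code: the class name, the hex output format,
and the exact signatures are assumptions. The one-argument overload keeps
SHA-1 as the default so that existing callers and stored IDs are unaffected.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashSketch {
  /** Existing behaviour: callers that pass no algorithm keep getting SHA-1. */
  static String hash(String input) {
    return hash(input, "SHA-1");
  }

  /** New overload: the caller names the algorithm, for example "SHA-256". */
  static String hash(String input, String algorithm) {
    try {
      MessageDigest md = MessageDigest.getInstance(algorithm);
      byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Surfaced as an unchecked exception so callers need no try-catch.
      throw new IllegalArgumentException("Unsupported algorithm: " + algorithm, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(hash("abc").length());            // 40
    System.out.println(hash("abc", "SHA-256").length()); // 64
  }
}
```

Only the Elasticsearch connector would then need to pass "SHA-256", while
connectors that already store SHA-1 IDs keep calling the old form.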

----
Nintendo, Co., Ltd.
Product Technology Dept.
Takashi SHIRAI
PHONE: +81-75-662-9600
mailto:shirai@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

Posted by Jörn Franke <jo...@gmail.com>.
Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will be removed from JDK.
