Posted to dev@manifoldcf.apache.org by "Nguyen Huu Nhat (Jira)" <ji...@apache.org> on 2022/08/30 02:35:00 UTC
[jira] [Created] (CONNECTORS-1729) The Confluence-v6 Repository Connector's attachment logic is incorrect

Nguyen Huu Nhat created CONNECTORS-1729:
-------------------------------------------

             Summary: The Confluence-v6 Repository Connector's attachment logic is incorrect
                 Key: CONNECTORS-1729
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1729
             Project: ManifoldCF
          Issue Type: Bug
            Reporter: Nguyen Huu Nhat


Hi there,

I ran into an issue in the Confluence Repository Connector that has not been handled yet, and I would like to suggest the following fix to its source code.
Details are given below:

h3. +*1. Connector Name*+

confluence-v6 \ Confluence Repository Connector

h3. +*2. Overview*+

 * In the Confluence Repository Connector, there is an error in the logic that determines whether the document has attachments or not.
 * Because of this error, attachments are not crawled.

※ This error only occurs when crawling documents from Confluence server; crawling documents from Confluence Cloud (SaaS) works normally.
 * When a file is attached, the document's ID has one of the following formats:
 ** Crawled from Confluence server: *<id of attachment>-<id of blog/page>*
 ** Crawled from Confluence Cloud (SaaS): *att<id of attachment>-<id of blog/page>*

h3. +*3. Reproduction*+

 * On Confluence server:
 ** Create a blog.
 ** Add attachments to the newly created blog.
 * On ManifoldCF:
 ** Create a Confluence Repository Connector with the aforementioned Confluence server information.
 ** Create a job using the connector created above with the following details:
 *** On the [Page] tab:
 **** Process Attachments: (Check).
 **** Type Specification: Blog.
 ** Start job.
 ** Check [Simple History Report].

h3. +*4. Cause*+

 * The logic that decides whether a document has a file attachment treats the document as having one only if its ID begins with *att*.
 * However, for documents crawled from Confluence server, the ID is not prefixed with *att* when a file is attached (see the formats in item 2).
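
The mismatch can be reproduced in isolation. The following is a minimal sketch, not the connector's code: the class name {{PrefixCheckDemo}} and the sample IDs are made up for illustration and only follow the two formats listed in item 2.

```java
public class PrefixCheckDemo {
    private static final String ATTACHMENT_ID_PREFIX = "att";

    // Same test as the release 2.22.1 logic: only IDs starting with "att" count.
    public static boolean isAttachmentByPrefix(String id) {
        return id.startsWith(ATTACHMENT_ID_PREFIX);
    }

    public static void main(String[] args) {
        // Hypothetical IDs, shaped like the formats in item 2.
        String cloudId = "att12345-67890";  // att<id of attachment>-<id of blog/page>
        String serverId = "12345-67890";    // <id of attachment>-<id of blog/page>

        System.out.println(isAttachmentByPrefix(cloudId));   // true
        System.out.println(isAttachmentByPrefix(serverId));  // false: attachment is missed
    }
}
```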

h3. +*5. Solution*+

My observations are as follows:
 * If a document has a file attachment, its ID consists of two parts joined by the *-* character.
 * If a document does not have a file attachment, its ID does not contain the *-* character.

Therefore, it is possible to judge whether a file is attached by checking whether the ID contains the *-* character.
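
As a rough illustration of this check, the sketch below applies it to one ID of each kind. The class name {{HyphenCheckDemo}} and all sample IDs are hypothetical, and the check is only as reliable as the observation above: it assumes that an ID without an attachment never contains *-*.

```java
public class HyphenCheckDemo {
    private static final String ATTACHMENT_ID_CHARACTER = "-";

    // Proposed test: an ID that contains "-" is treated as an attachment ID.
    public static boolean isAttachmentByHyphen(String id) {
        return id.contains(ATTACHMENT_ID_CHARACTER);
    }

    public static void main(String[] args) {
        // Hypothetical IDs, shaped like the formats in item 2.
        String[] ids = {
            "att12345-67890",  // Confluence Cloud, file attached
            "12345-67890",     // Confluence server, file attached
            "67890"            // no file attached
        };
        for (String id : ids) {
            System.out.println(id + " -> " + isAttachmentByHyphen(id));
        }
    }
}
```

Both attached formats pass this check because each contains *-*, so the Confluence Cloud behaviour should be unchanged while the Confluence server case starts to be recognized.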

h3. +*6. Suggested source code (based on release 2.22.1)*+

***Class: org.apache.manifoldcf.crawler.connectors.confluence.v6.util.ConfluenceUtil***

[https://github.com/apache/manifoldcf/blob/release-2.22.1/connectors/confluence-v6/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/confluence/v6/util/ConfluenceUtil.java#L28]
{code:java}
-  private static final String ATTACHMENT_ID_PREFIX = "att";
+  private static final String ATTACHMENT_ID_CHARACTER = "-";
{code}

[https://github.com/apache/manifoldcf/blob/release-2.22.1/connectors/confluence-v6/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/confluence/v6/util/ConfluenceUtil.java#L47]
{code:java}
   public static Boolean isAttachment(String id) {
-    return id.startsWith(ATTACHMENT_ID_PREFIX);
+    return id.contains(ATTACHMENT_ID_CHARACTER);
   }
{code}


