You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/05/19 19:44:38 UTC
[jira] [Comment Edited] (CONNECTORS-916) Amazon CloudSearch output connector

    [ https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002049#comment-14002049 ] 

Karl Wright edited comment on CONNECTORS-916 at 5/19/14 5:43 PM:
-----------------------------------------------------------------

Hi Takumi,

I see a couple of issues with this patch.

(1) There is already a poi download; you can't download two different versions of poi to the same place and have anything sensible happen.
(2) The patch includes a large number of new jars.  Somebody needs to do research to find out with the license is for each of these jars, and they will need to be mentioned in the appropriate license file.  If any of the jars is not an Apache-approved license, something else will need to be done.  If you can summarize the jars being added and their licenses, I can write the actual license update.
(3) There's some nonsense code in the build file I don't understand, for example:

{code}
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
{code}

Why does this need to be included 5 times? xercesImpl is also included twice too.

(4) I see a lot of diffs of this kind, which don't appear to be changing anything.  What is happening here?

{code}
-  {
-    // Establish a session
+  {
+    // Establish a session
     getSession();
     
     SpecPacker sp = new SpecPacker(outputDescription);
     
-    String jsondata = "";
+    String jsondata = "";
{code}

(5) I have some concerns about this:

{code}
+      // generate document id from document URI using md5 checksum.
+      // document id must be shorter than 128 characters because of AmazonCloudSearch.
+      try {
+        doc.setId(URLEncoder.encode(generateid(documentURI),"utf-8"));
+      } catch (NoSuchAlgorithmException e) {
+        throw new ManifoldCFException(e);
+      }
+      
{code}

Generally, since hashes have a possibility of colliding, it is better to send the URI as metadata, and then look up the document identifier by searching for the document metadata URI.  You can create a document identifier then only if you don't find the matching record.  If that is not possible, I'd still structure the code so that there's a single place where the hashing is done.  You might also look at and use:

http://manifoldcf.apache.org/release/trunk/api/framework/org/apache/manifoldcf/core/system/ManifoldCF.html#encrypt%28java.lang.String%29










was (Author: kwright@metacarta.com):
Hi Takumi,

I see a couple of issues with this patch.

(1) There is already a poi download; you can't download two different versions of poi to the same place and have anything sensible happen.
(2) The patch includes a large number of new jars.  Somebody needs to do research to find out with the license is for each of these jars, and they will need to be mentioned in the appropriate license file.  If any of the jars is not an Apache-approved license, something else will need to be done.  If you can summarize the jars being added and their licenses, I can write the actual license update.
(3) There's some nonsense code in the build file I don't understand, for example:

{code}
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
+            <include name="poi-ooxml*.jar"/>
{code}

(4) I see a lot of diffs of this kind, which don't appear to be changing anything.  What is happening here?

{code}
-  {
-    // Establish a session
+  {
+    // Establish a session
     getSession();
     
     SpecPacker sp = new SpecPacker(outputDescription);
     
-    String jsondata = "";
+    String jsondata = "";
{code}

(5) I have some concerns about this:

{code}
+      // generate document id from document URI using md5 checksum.
+      // document id must be shorter than 128 characters because of AmazonCloudSearch.
+      try {
+        doc.setId(URLEncoder.encode(generateid(documentURI),"utf-8"));
+      } catch (NoSuchAlgorithmException e) {
+        throw new ManifoldCFException(e);
+      }
+      
{code}

Generally, since hashes have a possibility of colliding, it is better to send the URI as metadata, and then look up the document identifier by searching for the document metadata URI.  You can create a document identifier then only if you don't find the matching record.  If that is not possible, I'd still structure the code so that there's a single place where the hashing is done.  You might also look at and use:

http://manifoldcf.apache.org/release/trunk/api/framework/org/apache/manifoldcf/core/system/ManifoldCF.html#encrypt%28java.lang.String%29




Why does this need to be included 5 times? xercesImpl is also included twice too.




> Amazon CloudSearch output connector
> -----------------------------------
>
>                 Key: CONNECTORS-916
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-916
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Takumi Yoshida
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: 0507.diff, 0520.diff, 1.patch, 2.diff, 3.diff, AmazonCloudSearchParam.java, AmazonCloudSearchSpecs.java, exception_handling.diff, exception_handling_2.diff
>
>
> I wrote some codes snipetts of output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl web site and feed HTML page to Amazon CloudSearch.
> but it is not perfectly completed followoing reason.
> - does not write any codes for configuration page.
> - supporting file type is only HTML
> Thank you for your time,
>  Takumi Yoshida



--
This message was sent by Atlassian JIRA
(v6.2#6252)