Posted to dev@manifoldcf.apache.org by "Shinichiro Abe (JIRA)" <ji...@apache.org> on 2012/05/24 07:56:40 UTC

[jira] [Created] (CONNECTORS-477) Support for fullWidthSpace against url

Shinichiro Abe created CONNECTORS-477:
-----------------------------------------

             Summary: Support for fullWidthSpace against url
                 Key: CONNECTORS-477
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-477
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
            Reporter: Shinichiro Abe
            Priority: Minor
             Fix For: ManifoldCF 0.6




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CONNECTORS-477) Support for full-width space against url

Posted by "Shinichiro Abe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinichiro Abe updated CONNECTORS-477:
--------------------------------------

    Attachment: CONNECTORS-477.patch

Here is a patch. It can encode full-width space.
                
> Support for full-width space against url
> ----------------------------------------
>
>                 Key: CONNECTORS-477
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-477
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>            Priority: Minor
>             Fix For: ManifoldCF 0.6
>
>         Attachments: CONNECTORS-477.patch
>
>
> When a URL includes a full-width space (U+3000, "　"), MCF can't ingest the document.
> e.g.
> 1.file name
>  http://server/site1/Shared%20Documents/test/aaa bbb.txt
> 2.path
>  http://localhost/aaa bbb/aaa.txt
> MCF's log says:
> {noformat}
> WEB: Can't use url '/site1/Shared%20Documents/test/aaa bbb.txt' because it is badly formed: Illegal character in path at index 34: /site1/Shared%20Documents/test/aaa bbb.txt
> {noformat}
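The rejection in the log above can be reproduced outside MCF. The following is a minimal sketch, not taken from the web connector: the class name `FullWidthSpaceDemo` and the helper `illegalIndex` are hypothetical, and it assumes the character at index 34 is the full-width space U+3000. It relies only on `java.net.URI`, which does not accept space characters (as defined by `Character.isSpaceChar`) in a path.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FullWidthSpaceDemo {
    // Hypothetical helper: returns the index that java.net.URI reports for the
    // first illegal character, or -1 if the string parses as a URI.
    static int illegalIndex(String url) {
        try {
            new URI(url);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // Same path as in the log, with the full-width space written as \u3000.
        String raw = "/site1/Shared%20Documents/test/aaa\u3000bbb.txt";
        // Same path with the full-width space percent-encoded as UTF-8 (E3 80 80).
        String encoded = "/site1/Shared%20Documents/test/aaa%E3%80%80bbb.txt";

        System.out.println("raw:     " + illegalIndex(raw));
        System.out.println("encoded: " + illegalIndex(encoded));
    }
}
```

The raw form should fail at the same index the log reports, while the percent-encoded form should parse.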


       

[jira] [Comment Edited] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282223#comment-13282223 ] 

Karl Wright edited comment on CONNECTORS-477 at 5/24/12 6:39 AM:
-----------------------------------------------------------------

Trying to crawl SharePoint using the web connector is not going to work for quite a number of reasons.  Microsoft is notorious for ignoring web standards in its URLs and has done this in multiple ways in SharePoint.  Also, there is no way to accurately "discover" SharePoint documents using the Web Connector.  So it is not a good idea to try to make Web Connector into a replacement for the SharePoint connector.

                
      was (Author: kwright@metacarta.com):
    Trying to crawl SharePoint using the web connector is not going to work for quite a number of reasons.  Microsoft is notorious for ignoring web standards in its URLs and has done this in multiple ways in SharePoint.  Also, there is no way to accurately "discover" SharePoint documents using the Web Connector.

                  


       

[jira] [Comment Edited] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282231#comment-13282231 ] 

Karl Wright edited comment on CONNECTORS-477 at 5/24/12 6:49 AM:
-----------------------------------------------------------------

About the fix itself: If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support this transformation, and others that are similar.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within a URL correctly.  Now, Microsoft modified that standard by supporting certain transformations within IE, and other browsers have copied those transformations.  If there is sufficient support across browsers, we should go ahead and provide a similar feature in the web connector.

In order to check whether your transformation of full-width space qualifies as a feature we should support, you would want to create a website locally (running under IIS probably), which has documents with names that include problematic characters such as full-width space.  Then, also create a page that has links to these documents, in the form <a href="...">...</a>, where the full-width space character is NOT properly URL encoded but is exposed.  Browse to the link page and click on the link.  Does the browser load the expected document, or not?  Which browsers work, and which do not?  If it does seem to be supported, are there other characters that work the same, or not?
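The link page for that browser test could be generated rather than typed by hand, which avoids an editor silently normalizing the character. This is only a sketch under assumptions: the class name `LinkPageGenerator`, the output file name `links.html`, and the sample hrefs (modeled on the examples in the issue description) are all hypothetical. The hrefs are written into the page unencoded on purpose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LinkPageGenerator {
    // Builds a small HTML page with one <a href="..."> per target.
    // The hrefs are emitted exactly as given, with no URL encoding, so the
    // browser (or crawler) sees the raw character. Not suitable for hrefs
    // containing '"', which would break the attribute.
    static String buildPage(String... hrefs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head><body>\n");
        for (String href : hrefs) {
            sb.append("<a href=\"").append(href).append("\">").append(href).append("</a><br>\n");
        }
        sb.append("</body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical targets: a file name and a directory name that each
        // contain a full-width space (U+3000).
        String page = buildPage("aaa\u3000bbb.txt", "aaa\u3000bbb/aaa.txt");
        Files.write(Paths.get("links.html"), page.getBytes(StandardCharsets.UTF_8));
    }
}
```

Place the generated page next to documents with the matching names on the test server, then open it in each browser under test.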


                
      was (Author: kwright@metacarta.com):
    About the fix itself: If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support this transformation, and others that are similar.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within a URL correctly.  Now, Microsoft modified that standard by supporting certain transformations within IE, and other browsers have copied those transformations.  If there is sufficient support across browsers, we should go ahead and provide a similar feature in the web connector.

In order to check whether your transformation of full-width space qualifies as a feature we should support, you would want to create a website locally (running under IIS probably), which has documents with names that include problematic characters such as full-width space.  Then, also create a page that has links to these documents, in the form <a href="...">...</a>, where the full-width space character is NOT properly URL encoded but is exposed.  Browse to the link page and click on the link.  Does the browser load the expected document, or not?  Which browsers work, and which do not?


                  


       

[jira] [Comment Edited] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282231#comment-13282231 ] 

Karl Wright edited comment on CONNECTORS-477 at 5/24/12 6:42 AM:
-----------------------------------------------------------------

About the fix itself: If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support this transformation, and others that are similar.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within a URL correctly.  Now, Microsoft modified that standard by supporting certain transformations within IE, and other browsers have copied those transformations.  If there is sufficient support across browsers, we should go ahead and provide a similar feature in the web connector.

In order to check whether your transformation of full-width space qualifies as a feature we should support, you would want to create a website locally (running under IIS probably), which has documents with names that include problematic characters such as full-width space.  Then, also create a page that has links to these documents, in the form <a href="...">...</a>, where the full-width space character is NOT properly URL encoded but is exposed.  Browse to the link page and click on the link.  Does the browser load the expected document, or not?  Which browsers work, and which do not?


                
      was (Author: kwright@metacarta.com):
    If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support that transformation.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

But, as I said before, I'd be very careful to avoid trying to make the Web Connector into a replacement for the SharePoint connector.  My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within the URL correctly.  Microsoft modified that standard by supporting certain transformations within IE, and other browsers copied those transformations.  SharePoint may be relying on such non-standard transformations to work correctly - either that, or it never presents non-standard URLs as links at all.

In order to check whether your transformation of full-width space qualifies as a feature we should support, you would want to create a website locally (running under IIS probably), which has documents with names that include problematic characters such as full-width space.  Then, also create a page that has links to these documents, in the form <a href="...">...</a>, where the full-width space character is NOT properly URL encoded but is exposed.  Browse to the link page and click on the link.  Does the browser load the expected document, or not?  Which browsers work, and which do not?


                  


       

[jira] [Commented] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396608#comment-13396608 ] 

Karl Wright commented on CONNECTORS-477:
----------------------------------------

bq. But there are sites in Japan that have improperly encoded URL links.
bq. I want to support this in the web connector, but I need some time to think of a better solution.

I agree, especially where there are browsers that accept the bad URLs.  We have to have something to evaluate our code against, to emulate.  And, the changes should go in the org.apache.manifoldcf.crawler.connectors.webconnector.WebURL class.



                


       

[jira] [Commented] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282231#comment-13282231 ] 

Karl Wright commented on CONNECTORS-477:
----------------------------------------

If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support that transformation.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

But, as I said before, I'd be very careful to avoid trying to make the Web Connector into a replacement for the SharePoint connector.  My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within the URL correctly.  Microsoft modified that standard by supporting certain transformations within IE, and other browsers copied those transformations.  SharePoint may be relying on such non-standard transformations to work correctly - either that, or it never presents non-standard URLs as links at all.



                


       

[jira] [Commented] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282223#comment-13282223 ] 

Karl Wright commented on CONNECTORS-477:
----------------------------------------

Trying to crawl SharePoint using the web connector is not going to work for quite a number of reasons.  Microsoft is notorious for ignoring web standards in its URLs and has done this in multiple ways in SharePoint.  Also, there is no way to accurately "discover" SharePoint documents using the Web Connector.

                


       

[jira] [Comment Edited] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282231#comment-13282231 ] 

Karl Wright edited comment on CONNECTORS-477 at 5/24/12 6:38 AM:
-----------------------------------------------------------------

If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support that transformation.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

But, as I said before, I'd be very careful to avoid trying to make the Web Connector into a replacement for the SharePoint connector.  My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within the URL correctly.  Microsoft modified that standard by supporting certain transformations within IE, and other browsers copied those transformations.  SharePoint may be relying on such non-standard transformations to work correctly - either that, or it never presents non-standard URLs as links at all.

In order to check whether your transformation of full-width space qualifies as a feature we should support, you would want to create a website locally (running under IIS probably), which has documents with names that include problematic characters such as full-width space.  Then, also create a page that has links to these documents, in the form <a href="...">...</a>, where the full-width space character is NOT properly URL encoded but is exposed.  Browse to the link page and click on the link.  Does the browser load the expected document, or not?  Which browsers work, and which do not?


                
      was (Author: kwright@metacarta.com):
    If you can find a document or standard that describes a standard for URL transformation that is not supported by the Web Connector, or you can show that such a transformation works in IE and in Firefox, then we should modify the WebURL class to support that transformation.  I created the WebURL class specifically for the purpose of providing support for URL forms that were unsupported by the Java implementation of URI, which is no longer up-to-date as far as standards are concerned.  So, if there was going to be a fix for this issue, I'd recommend that it be done there, and not in WebcrawlerConnector.

But, as I said before, I'd be very careful to avoid trying to make the Web Connector into a replacement for the SharePoint connector.  My understanding of how URL encoding was supposed to work was that a URL is encoded in links, NOT by the browser (or crawler).  This is necessary because the browser does not typically understand the context within the URL correctly.  Microsoft modified that standard by supporting certain transformations within IE, and other browsers copied those transformations.  SharePoint may be relying on such non-standard transformations to work correctly - either that, or it never presents non-standard URLs as links at all.



                  


       

[jira] [Updated] (CONNECTORS-477) Support for full-width space against url

Posted by "Shinichiro Abe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinichiro Abe updated CONNECTORS-477:
--------------------------------------

    Description: 
When a URL includes a full-width space (U+3000, "　"), MCF can't ingest the document.

e.g.
1.file name
 http://server/site1/Shared%20Documents/test/aaa bbb.txt
2.path
 http://localhost/aaa bbb/aaa.txt

MCF's log says:
{noformat}
WEB: Can't use url '/site1/Shared%20Documents/test/aaa bbb.txt' because it is badly formed: Illegal character in path at index 34: /site1/Shared%20Documents/test/aaa bbb.txt
{noformat}


       Assignee: Shinichiro Abe
        Summary: Support for full-width space against url  (was: Support for fullWidthSpace against url)
    


       

[jira] [Comment Edited] (CONNECTORS-477) Support for full-width space against url

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282223#comment-13282223 ] 

Karl Wright edited comment on CONNECTORS-477 at 5/24/12 6:43 AM:
-----------------------------------------------------------------

A warning: Trying to crawl SharePoint using the web connector is not going to work for quite a number of reasons.  Microsoft is notorious for ignoring web standards in its URLs and has done this in multiple ways in SharePoint.  Also, there is no way to accurately "discover" SharePoint documents using the Web Connector.  So it is not a good idea to try to make Web Connector into a replacement for the SharePoint connector.

                
      was (Author: kwright@metacarta.com):
    Trying to crawl SharePoint using the web connector is not going to work for quite a number of reasons.  Microsoft is notorious for ignoring web standards in its URLs and has done this in multiple ways in SharePoint.  Also, there is no way to accurately "discover" SharePoint documents using the Web Connector.  So it is not a good idea to try to make Web Connector into a replacement for the SharePoint connector.

                  


       

[jira] [Updated] (CONNECTORS-477) Support for full-width space against url

Posted by "Shinichiro Abe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinichiro Abe updated CONNECTORS-477:
--------------------------------------

    Fix Version/s:     (was: ManifoldCF 0.6)
                   ManifoldCF next
    


       

[jira] [Commented] (CONNECTORS-477) Support for full-width space against url

Posted by "Shinichiro Abe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396452#comment-13396452 ] 

Shinichiro Abe commented on CONNECTORS-477:
-------------------------------------------

bq. So it is not a good idea to try to make Web Connector into a replacement for the SharePoint connector.
I think so, too.

But there are sites in Japan that have improperly encoded URL links.
I want to support this in the web connector, but I need some time to think of a better solution.
I want to convert the characters in URLs that the java.net.URI class doesn't support, e.g. half-width space, full-width space, <, >, and so on. It may take more time to look into.
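One possible shape for such a conversion is sketched below. It is not the attached patch and not the WebURL implementation: the class name `LenientPathEncoder`, the method `escapeIllegal`, and the exact set of escaped characters are assumptions for illustration. It percent-encodes, as UTF-8, only the characters listed above plus other space and control characters, and leaves existing escapes such as %20 untouched.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class LenientPathEncoder {
    // Assumed set of characters to escape: '<', '>', '"', any Unicode space
    // character (covers both U+0020 and U+3000), and ISO control characters.
    private static boolean needsEscape(int cp) {
        return cp == '<' || cp == '>' || cp == '"'
            || Character.isSpaceChar(cp) || Character.isISOControl(cp);
    }

    // Percent-encodes the UTF-8 bytes of each offending code point and copies
    // everything else through unchanged. '%' is never touched, so sequences
    // that are already encoded are preserved.
    static String escapeIllegal(String raw) {
        StringBuilder sb = new StringBuilder();
        raw.codePoints().forEach(cp -> {
            if (needsEscape(cp)) {
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String fixed = escapeIllegal("http://localhost/aaa\u3000bbb/aaa.txt");
        // Throws URISyntaxException if the result is still not acceptable.
        URI uri = new URI(fixed);
        System.out.println(uri);
    }
}
```

Whether a crawler should apply this kind of repair at all depends on the browser behavior discussed elsewhere in this issue, so the character set here is a starting point rather than a recommendation.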
                
