Posted to dev@manifoldcf.apache.org by "Gene Liu (JIRA)" <ji...@apache.org> on 2012/10/23 23:13:12 UTC

[jira] [Created] (CONNECTORS-557) web crawler feed into solr with all html tag removed

Gene Liu created CONNECTORS-557:
-----------------------------------

             Summary: web crawler feed into solr with all html tag removed
                 Key: CONNECTORS-557
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-557
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
    Affects Versions: ManifoldCF 1.0
         Environment: ManifoldCF1.0 --> Solr4
            Reporter: Gene Liu


All HTML tags are removed when ManifoldCF feeds the content into Solr

I am new to Solr. I use ManifoldCF to crawl web pages and send the content to Solr for indexing. I found that all the HTML tags are missing from the query results I get back from Solr. I am not sure whether ManifoldCF removed them before sending the content to Solr, or whether Solr removed them.

P.S. I could not find a way to send email to the user list, so I opened a ticket here.

I would appreciate any suggestions or comments.

Gene

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CONNECTORS-557) web crawler feed into solr with all html tag removed

Posted by "Gene Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484128#comment-13484128 ] 

Gene Liu commented on CONNECTORS-557:
-------------------------------------

Thank you, Karl, for your quick answer! Your suspicion is right. I will try another handler.
                

[jira] [Commented] (CONNECTORS-557) web crawler feed into solr with all html tag removed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482718#comment-13482718 ] 

Karl Wright commented on CONNECTORS-557:
----------------------------------------

Hi Gene,

ManifoldCF does not alter documents in any way.  I suspect you have set up a Solr Cell (Tika) handler which is doing it.
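
To illustrate the kind of transformation a text-extracting handler performs, here is a minimal Python sketch. It is not Tika's or Solr's actual code; TextOnly and strip_tags are hypothetical names that use only the standard library's HTML parser to keep character data and discard markup, which is roughly the effect described in the report.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects character data and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never stored.
        self.chunks.append(data)


def strip_tags(html):
    """Return only the text content of an HTML string (illustrative only)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


raw = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(strip_tags(raw))
```

If a stored field in the index resembles the output of this sketch (text with no markup), the tags were most likely dropped during extraction on the Solr side rather than by the crawler.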

To subscribe to the ManifoldCF user list, please follow the instructions here:

http://manifoldcf.apache.org/en_US/mail.html


                

[jira] [Resolved] (CONNECTORS-557) web crawler feed into solr with all html tag removed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-557.
------------------------------------

       Resolution: Not A Problem
    Fix Version/s: ManifoldCF 1.1
         Assignee: Karl Wright
    