Posted to dev@manifoldcf.apache.org by "Kate McGonigal (JIRA)" <ji...@apache.org> on 2011/08/02 20:56:27 UTC

[jira] [Created] (CONNECTORS-235) item description element not indexed

item description element not indexed
------------------------------------

                 Key: CONNECTORS-235
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-235
             Project: ManifoldCF
          Issue Type: Improvement
          Components: RSS connector
    Affects Versions: ManifoldCF 0.2
            Reporter: Kate McGonigal


The RSS feed's *item* description is not written to any field in the Solr index. 

I have a typical RSS feed with the general structure:

<rss>
    <channel>
        <title></title>
        <link></link>
        <description></description>
        <item>
            <title></title>
            <link></link>
            <pubDate></pubDate>
            <description> *** the description I do want *** </description>
            <author></author>
            <category></category>
        </item>
    </channel>
</rss>

Example:
For the RSS feed: http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
the rss/channel/item/description field is not indexed into Solr.
Example notes:
  - what does get written to the Solr "description" field is the description metadata from the website, i.e. "Jazz radio show from Winnipeg on CKUW 95.9 FM, hosted by Maurice Hogue." in this case.
  - on the "Dechromed Content" tab of the job, "No dechromed content" is selected. I'm not sure if that is relevant.
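
To make the distinction between the two description elements concrete, here is a minimal Python sketch (not ManifoldCF code) that reads a feed with the structure shown above and separates the channel-level description from each item's description. The sample feed text and the function names `item_descriptions` and `channel_description` are made up for illustration; only the element paths come from this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed following the structure in the report.
SAMPLE_FEED = """<rss version="2.0">
    <channel>
        <title>Example channel</title>
        <link>http://example.com/</link>
        <description>Channel-level description</description>
        <item>
            <title>First item</title>
            <link>http://example.com/1</link>
            <description>Item-level description 1</description>
        </item>
        <item>
            <title>Second item</title>
            <link>http://example.com/2</link>
            <description>Item-level description 2</description>
        </item>
    </channel>
</rss>"""


def item_descriptions(feed_xml):
    """Map each item's link to its rss/channel/item/description text."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    result = {}
    for item in root.findall("./channel/item"):
        link = item.findtext("link", default="").strip()
        description = item.findtext("description", default="").strip()
        result[link] = description
    return result


def channel_description(feed_xml):
    """Return the rss/channel/description text, which is not the item text."""
    root = ET.fromstring(feed_xml)
    return root.findtext("./channel/description", default="").strip()


if __name__ == "__main__":
    print(channel_description(SAMPLE_FEED))
    for link, text in item_descriptions(SAMPLE_FEED).items():
        print(link, "->", text)
```

The per-item text returned by `item_descriptions` is the value this issue asks to have indexed for the document at each item's link.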


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Kate McGonigal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078466#comment-13078466 ] 

Kate McGonigal commented on CONNECTORS-235:
-------------------------------------------

I also tried setting "Dechromed Content" to "if present, in 'description' field", but that just seems to hang the ingestion process at the beginning: the job status gets to "Running", but it never finishes, nothing is ever sent to Solr, and the number of "Active" documents never decreases.


[jira] [Issue Comment Edited] (CONNECTORS-235) item description element not indexed

Posted by "Kate McGonigal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078466#comment-13078466 ] 

Kate McGonigal edited comment on CONNECTORS-235 at 8/2/11 10:35 PM:
--------------------------------------------------------------------

I also tried setting "Dechromed Content" to "if present, in 'description' field", but that just seems to hang the ingestion process at the beginning: the job status gets to "Running", but it never finishes, nothing is ever sent to Solr, and the number of "Active" documents never decreases.


The log file shows:
Error tossed: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
	at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:595)
	at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4274)
	at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1220)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:827)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:342)


[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079118#comment-13079118 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

One problem I found is that due to a rebuild I was not using PostgreSQL after all, so here's another check-in to fix its handling of streamed carrydown info.  r1153702.

> item description element not indexed
> ------------------------------------
>
>                 Key: CONNECTORS-235
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-235
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: RSS connector
>    Affects Versions: ManifoldCF 0.2
>            Reporter: Kate McGonigal
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.3

[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079137#comment-13079137 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

I switched the name to "summary".  r1153705.



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078502#comment-13078502 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

That's not good.  I'll have a look into this as well.




[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078521#comment-13078521 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

Same thing under HSQLDB.  So the trick is going to be to fix these two databases without breaking PostgreSQL.



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Kate McGonigal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079085#comment-13079085 ] 

Kate McGonigal commented on CONNECTORS-235:
-------------------------------------------

I'm afraid these problems still exist for me. 

A few hours ago I built the latest from trunk. It is running on PostgreSQL.

Just in case, I also started from a fresh install of Solr 3.3.0.  I'm using the example that comes with the distribution. It is thus running on Derby. I realize the schema is not optimal for RSS feeds, but it does include a "description"  field, which is what I'm interested in at the moment.

Problem 1) When I try running the example job with "Dechromed Content" set to "No dechromed content", what shows up in the description field (for all documents) is "Jazz radio show from Winnipeg on CKUW 95.9 FM, hosted by Maurice Hogue." which is not the item-description in the RSS feed's XML, but rather from the website's metadata description element in the HTML.  I have tried another RSS feed, with the same result.

Problem 2) When I try running the example job (see original post) with "Dechromed Content" set to "if present, in 'description' field" it still hangs with the log file showing:
FATAL 2011-08-03 16:08:21,703 (Worker thread '10') - Error tossed: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
	at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:611)
	at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4263)
	at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1221)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:824)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)

And just to be clear on what I am ultimately trying to do: I'd like to be able to show my searchers the "description" from the RSS feed for each of the documents that match their searches. I actually only need to index the item-description field (as opposed to what is at the item link) since my RSS feeds are of scientific papers that will have a detailed abstract in the item-description.


[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078536#comment-13078536 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

Fix for file-based carrydown data for HSQLDB and Derby.  r1153314.



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079134#comment-13079134 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

Ok, another mystery solved.  The RSS chromed data mode of "None" was not properly tried because of the inadvertent database switch, and I found that recrawling vs. crawling fresh generated incorrect version information.  I've fixed that problem but I can't check it in because it causes the following error against a plain-vanilla Solr installation:

ERROR: [http://www.onemansjazz.ca/content/view/330/50/] multiple values encountered for non multiValued field description: [Jazz radio show from Winnipeg on CKUW 95.9 FM, hosted by Maurice Hogue., I have created a Listener Survey and if you have the time to complete it, that would be terrific. I&#39;m trying to do an evaluation of One Man&#39;s Jazz as well as considering some new options that have arisen. Your feedback would be most appreciate.This survey is in two parts and is a total of twenty parts, most of them just require a click of your mouse. Click here (http://www.surveymonkey.com/s/C3DZ3JK) for Part One, and here (http://www.surveymonkey.com/s/C38FVH8) for Part Two. Thanks again for your input. ]

I'm not sure why Solr is interpreting this long field as multivalued, but clearly it would be much better if I used a metadata name that wasn't "description", since Solr's example configuration has dibs on that.  I'll experiment and post further.
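
The error message above lists two values for "description": the site-level text and the feed item's text. One possible reading, offered as an assumption and not as a confirmed diagnosis, is that two sources each supplied a value under the same field name. The Python sketch below models only that idea; it is not Solr or ManifoldCF code, and `SINGLE_VALUED`, `merge_fields`, `check_single_valued`, and the alternative name "summary" are illustrative choices.

```python
from collections import defaultdict

# Hypothetical set of fields that accept only one value, standing in for a
# schema where "description" is not multiValued.
SINGLE_VALUED = {"description"}


def merge_fields(*sources):
    """Collect (name, value) pairs from several sources into name -> [values]."""
    merged = defaultdict(list)
    for source in sources:
        for name, value in source:
            merged[name].append(value)
    return dict(merged)


def check_single_valued(fields):
    """Return the names of single-valued fields that received more than one value."""
    return sorted(
        name for name, values in fields.items()
        if name in SINGLE_VALUED and len(values) > 1
    )


# Values taken from the error message in this thread, shortened.
page_metadata = [("description", "Jazz radio show from Winnipeg on CKUW 95.9 FM")]
feed_as_description = [("description", "I have created a Listener Survey")]
feed_as_summary = [("summary", "I have created a Listener Survey")]

if __name__ == "__main__":
    # Same name from both sources: two values land on one field.
    print(check_single_valued(merge_fields(page_metadata, feed_as_description)))
    # A distinct name for the feed text avoids the collision.
    print(check_single_valued(merge_fields(page_metadata, feed_as_summary)))
```

Under this model, sending the feed text under a name the target schema does not already use keeps each single-valued field at one value.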



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078503#comment-13078503 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

What database are you using?  Is this PostgreSQL, or Derby, or HSQLDB?



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079097#comment-13079097 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

Hmm, I'm using the very same feed you are, with PostgreSQL, and seeing perfect results.
Can you attach a screen shot of the View Job page of the job in question?  Also, the View Connection page for both the output connection and the repository connection?



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078518#comment-13078518 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

I was able to reproduce this with Derby.  I am certain that it doesn't happen under PostgreSQL.  I'll check HSQLDB.




[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Kate McGonigal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078764#comment-13078764 ] 

Kate McGonigal commented on CONNECTORS-235:
-------------------------------------------

Thanks! For the record though, I was using PostgreSQL.


[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079141#comment-13079141 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

Just to be clear, here's an example of the Solr log line for indexing one of the documents from the above-mentioned feed.  You can, of course, configure the job to map the field names to whatever you like.  This is with no mapping whatsoever.

INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.summary=I+had+a+lot+of+fun+putting+this+show+together+this+week.+Hope+you+enjoy+it,+too.&literal.id=http://www.onemansjazz.ca/content/view/332/30/&literal.title=July+23,+2011+Playlist&literal.pubdate=1311339967000} status=0 QTime=13

I'm pretty certain you must have a metadata value set for "description" in your job, because there is absolutely no mechanism (and never was one) for picking up the channel description from the feed. So you will have to remove that in order to get all this to work for you.
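
As a rough illustration of how to read a log line like the one above, the following Python sketch pulls out the literal.* parameters so it is easy to see which metadata fields were actually sent to Solr. It is not part of ManifoldCF or Solr; the function name literal_fields is made up for this example, and it assumes the parameters appear inside a single params={...} group as they do in the line quoted above.

```python
import re
from urllib.parse import parse_qsl


def literal_fields(log_line):
    """Return {field: value} for every literal.* parameter in a Solr update log line."""
    # Assumption: the request parameters sit inside one "params={...}" group.
    match = re.search(r"params=\{(.*)\}", log_line)
    if match is None:
        return {}
    # parse_qsl splits on "&" and decodes "+" and %XX escapes in the values.
    pairs = parse_qsl(match.group(1), keep_blank_values=True)
    prefix = "literal."
    return {k[len(prefix):]: v for k, v in pairs if k.startswith(prefix)}


# The log line quoted above, split across string literals for readability.
line = (
    "INFO: [] webapp=/solr path=/update/extract params={"
    "literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&"
    "literal.category=Radio+-+Play+lists&"
    "literal.summary=I+had+a+lot+of+fun+putting+this+show+together+this+week.+Hope+you+enjoy+it,+too.&"
    "literal.id=http://www.onemansjazz.ca/content/view/332/30/&"
    "literal.title=July+23,+2011+Playlist&"
    "literal.pubdate=1311339967000} status=0 QTime=13"
)

fields = literal_fields(line)
for name in sorted(fields):
    print(name, "=", fields[name])
```

For this particular line the field names that come out are category, id, pubdate, source, summary and title, so any other field showing up in the index has to come from somewhere other than this request.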


[jira] [Resolved] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-235.
------------------------------------

       Resolution: Fixed
    Fix Version/s: ManifoldCF 0.3
         Assignee: Karl Wright

r1153361. The description field value will now come through as "description" metadata, when it is not being used for dechromed content.



[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078387#comment-13078387 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

It turns out that the "solr" description field is not coming from the RSS feed at all, but probably from a hard-wired parameter from either the solr-related job specification or the RSS-related job specification.  The RSS connector does not set any "description" metadata at all at this time.


[jira] [Issue Comment Edited] (CONNECTORS-235) item description element not indexed

Posted by "Kate McGonigal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079085#comment-13079085 ] 

Kate McGonigal edited comment on CONNECTORS-235 at 8/3/11 10:31 PM:
--------------------------------------------------------------------

I'm afraid these problems still exist for me. 

A few hours ago I built the latest from trunk. It is running on PostgreSQL.

Just in case, I also started from a fresh install of Solr 3.3.0.  I'm using the example that comes with the distribution. It is thus running on Derby. I realize the schema is not optimal for RSS feeds, but it does include a "description"  field, which is what I'm interested in at the moment.

Problem 1) When I try running the example job (see original post) with "Dechromed Content" set to "No dechromed content", what shows up in the description field (for all documents) is "Jazz radio show from Winnipeg on CKUW 95.9 FM, hosted by Maurice Hogue." which is not the item-description in the RSS feed's XML, but rather from the website's metadata description element in the HTML.  I have tried another RSS feed, with the same result.

Problem 2) When I try running the example job with "Dechromed Content" set to "if present, in 'description' field" it still hangs with the log file showing:
{quote}FATAL 2011-08-03 16:08:21,703 (Worker thread '10') - Error tossed: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.manifoldcf.core.interfaces.CharacterInput
	at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:611)
	at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4263)
	at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1221)
	at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:824)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321){quote}

And just to be clear on what I am ultimately trying to do: I'd like to be able to show my searchers the "description" from the RSS feed for each of the documents that match their searches. I actually only need to index the item-description field (as opposed to what is at the item link) since my RSS feeds are of scientific papers that will have a detailed abstract in the item-description.
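
To make the distinction between the two description elements concrete, here is a minimal Python sketch that reads only rss/channel/item/description and ignores the channel-level description. It is independent of ManifoldCF and uses only the standard library; the sample feed text and the function name item_descriptions are invented for illustration and follow the general structure given in the original report.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with the same shape as the structure in the original report.
SAMPLE_FEED = """<rss>
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <description>Channel-level description</description>
    <item>
      <title>First paper</title>
      <link>http://example.com/1</link>
      <description>Abstract of the first paper</description>
    </item>
    <item>
      <title>Second paper</title>
      <link>http://example.com/2</link>
      <description>Abstract of the second paper</description>
    </item>
  </channel>
</rss>"""


def item_descriptions(feed_xml):
    """Map each item's link to its own description, skipping the channel description."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    result = {}
    for item in root.findall("./channel/item"):
        link = item.findtext("link", default="")
        result[link] = item.findtext("description", default="")
    return result


for link, description in item_descriptions(SAMPLE_FEED).items():
    print(link, "->", description)
```

The path ./channel/item restricts the search to item elements, so the channel's own description never appears in the result.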


[jira] [Commented] (CONNECTORS-235) item description element not indexed

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078766#comment-13078766 ] 

Karl Wright commented on CONNECTORS-235:
----------------------------------------

Thanks for the info.  The fix, as structured, should generally apply to PostgreSQL too.  Please let me know if it works for you.  But I'll need to research how this problem could have gotten past the tests regardless.

