Posted to user@manifoldcf.apache.org by K McGonigal <km...@gmail.com> on 2011/08/02 17:41:58 UTC

Field mapping for RSS feed

Hi,

I'm trying to use ManifoldCF to index an RSS feed into Solr.  It sort of
works, but my main problem at the moment is that the *channel* description
from the RSS feed is written to the "description" field in Solr when I would
really like the *item* description to be written instead.

I have a typical RSS feed with the general structure:

<rss>
    <channel>
        <title></title>
        <link></link>
        <description> *** the description I don't want *** </description>
        <item>
            <title></title>
            <link></link>
            <pubDate></pubDate>
            <description> *** the description I do want *** </description>
            <author></author>
            <category></category>
        </item>
    </channel>
</rss>

I tried setting up the field mapping on the job with the XPath address of
the second description, i.e. "/rss/channel/item/description", as the source,
but that did not work.
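
For reference, the difference between the two description elements can be shown outside ManifoldCF with a small script. This is only a sketch of what the two paths select, using Python's standard ElementTree module; the sample feed text and the function name are made up for illustration, and nothing here reflects how the RSS connector itself reads the feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed mirroring the structure above; the text values are invented.
FEED = """<rss>
    <channel>
        <title>Example feed</title>
        <link>http://example.com/</link>
        <description>channel description</description>
        <item>
            <title>First item</title>
            <link>http://example.com/1</link>
            <description>item description</description>
        </item>
    </channel>
</rss>"""


def descriptions(feed_xml):
    """Return (channel description, list of item descriptions)."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    # ElementTree paths are relative to the root element, so
    # /rss/channel/description becomes "channel/description".
    channel_desc = root.findtext("channel/description")
    # /rss/channel/item/description matches one element per <item>.
    item_descs = [d.text for d in root.findall("channel/item/description")]
    return channel_desc, item_descs


channel_desc, item_descs = descriptions(FEED)
print(channel_desc)
print(item_descs)
```

The point is that the path is unambiguous on its own: it selects only the item-level elements, so the failure has to come from how the job interprets the field mapping source rather than from the path.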

I suspect I'm overlooking something simple, but I've spent 2 days trying to
solve it.  I would be grateful for any help.


Kate McGonigal

Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

All fixes for the ticket are complete.
Of course, in order to use them you will want to build and use trunk
instead of the 0.2-incubating release.  Let me know if this is a
problem.

Thanks!
Karl

On Tue, Aug 2, 2011 at 3:04 PM, Karl Wright <da...@gmail.com> wrote:
> Hi Kate,
>
> Many news RSS feeds put the full article in either the item
> description or the item content field, while the document described by
> the url field is not just straight content but contains navigation and
> advertising "chrome".  In such cases it's often preferable to generate
> an index based on the description or content field contents rather
> than the actual document with all of that chrome.  The Dechromed
> Content options allow you to set up that behavior for a specific job.
>
> Thanks for opening the ticket; I'll propose a solution shortly.
>
> Karl
>
>
> On Tue, Aug 2, 2011 at 2:56 PM, K McGonigal <km...@gmail.com> wrote:
>> Hi Karl,
>>
>> Thank you for your quick response. I've opened a Jira ticket for this,
>> though I don't really understand what sort of solution you had in mind so I
>> didn't propose anything.
>>
>> I'm afraid I don't understand exactly what the Dechromed Content options do
>> either. I read about them in the End User Documentation, but there wasn't
>> much there yet.
>>
>> I find it odd that I would be the first person to have this problem. You'd
>> think it would be very common.
>>
>>
>> Kate
>>
>>
>> On Tue, Aug 2, 2011 at 11:05 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> I just looked at the code.  It's not so much a bug as an oversight of
>>> sorts.  The "description" or "content" fields are indexed as the
>>> primary content of the document if the "chrome" mode is selected
>>> accordingly.  If "None" is the "chrome" mode, then the item-level
>>> description field is ignored even when present.
>>>
>>> So I recommend simply adding a new kind of "description" field for
>>> when the "chrome" mode is set to "None".  "item/description" may be
>>> its name, or maybe the full XPath, your choice.  Propose something in
>>> the ticket and I'll respond.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Tue, Aug 2, 2011 at 11:47 AM, Karl Wright <da...@gmail.com> wrote:
>>> > Hi Kate,
>>> >
>>> > The field mapping won't do the trick because the RSS connector is
>>> > currently very selective about what fields it extracts - it by no
>>> > means extracts all of them, so the ones that it *does* extract from
>>> > the feed are "special".
>>> >
>>> > The behavior you describe sounds like a bug to me.  I'll go spelunking
>>> > through the code at first opportunity.  In the meantime, could you
>>> > create a Jira ticket describing the behavior you see vs. the behavior
>>> > you want?
>>> >
>>> > Thanks!
>>> > Karl

Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

Hi Kate,

Many news RSS feeds put the full article in either the item
description or the item content field, while the document described by
the url field is not just straight content but contains navigation and
advertising "chrome".  In such cases it's often preferable to generate
an index based on the description or content field contents rather
than the actual document with all of that chrome.  The Dechromed
Content options allow you to set up that behavior for a specific job.
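
The choice described above can be pictured roughly as follows. This is a hedged sketch, not the connector's code: the mode names, the FeedItem class and the choose_source function are all hypothetical, and the fallback to fetching the link when the chosen field is missing is an assumption that the thread does not confirm.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical mode names; they are not ManifoldCF identifiers.
MODE_NONE = "none"
MODE_DESCRIPTION = "description"
MODE_CONTENT = "content"


@dataclass
class FeedItem:
    link: str
    description: Optional[str] = None
    content: Optional[str] = None


def choose_source(item: FeedItem, mode: str) -> Tuple[str, str]:
    """Return ("inline", text) to index feed text, or ("fetch", url) to index the linked page."""
    if mode == MODE_DESCRIPTION and item.description:
        return ("inline", item.description)
    if mode == MODE_CONTENT and item.content:
        return ("inline", item.content)
    # Assumption: with no dechromed mode, or when the chosen field is absent,
    # the document behind the link is fetched and indexed, chrome included.
    return ("fetch", item.link)


item = FeedItem(link="http://example.com/1", description="full article text")
print(choose_source(item, MODE_DESCRIPTION))
print(choose_source(item, MODE_NONE))
```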

Thanks for opening the ticket; I'll propose a solution shortly.

Karl



Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

I confirmed that the Solr requests actually do get through fine:

Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/328/30/]} 0 463
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.id=http://www.onemansjazz.ca/content/view/328/30/&literal.title=July+2,+2011+Playlist&literal.pubdate=1309523437000} status=0 QTime=463
Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 464
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=464
Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/331/30/]} 0 466
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.id=http://www.onemansjazz.ca/content/view/331/30/&literal.title=July+16,+2011+Playlist&literal.pubdate=1310718848000} status=0 QTime=466
Aug 4, 2011 5:24:15 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.onemansjazz.ca/content/view/329/30/]} 0 464
Aug 4, 2011 5:24:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=Radio+-+Play+lists&literal.id=http://www.onemansjazz.ca/content/view/329/30/&literal.title=July+9,+2011+Playlist&literal.pubdate=1310070625000} status=0 QTime=464

So I'm not sure what you are seeing.
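
To see which metadata fields such a request carries, the params={...} payload from one logged request can be decoded with a few lines of Python. This is just a sketch using the standard library; the literal_fields helper is a made-up name, and it only inspects the logged query parameters, since the document body itself does not appear in the log line.

```python
from urllib.parse import parse_qs

# The params={...} payload of the first logged request, with the wrapped lines rejoined.
PARAMS = (
    "literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/"
    "&literal.category=Radio+-+Play+lists"
    "&literal.id=http://www.onemansjazz.ca/content/view/328/30/"
    "&literal.title=July+2,+2011+Playlist"
    "&literal.pubdate=1309523437000"
)


def literal_fields(params):
    """Map each literal.<name> parameter to its decoded value."""
    parsed = parse_qs(params)  # decodes "+" to a space and splits on "&"
    prefix = "literal."
    return {
        key[len(prefix):]: values[0]
        for key, values in parsed.items()
        if key.startswith(prefix)
    }


fields = literal_fields(PARAMS)
for name in sorted(fields):
    print(name, "=", fields[name])
```

Run against this entry, the script lists the source, category, id, title and pubdate fields that were sent as literals alongside the content.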

Karl

On Thu, Aug 4, 2011 at 1:08 PM, Karl Wright <da...@gmail.com> wrote:
>>>>>>>
> I guess the only caveat is that, to use this, one has to know to add a
> "summary" field to their Solr schema. Long-term, I wonder if the
> "field mapping" feature could be used to let users map any RSS element
> (based on its XPath "address") to any Solr field?
> <<<<<<
>
> The problem is that there are three different kinds of feeds that the
> RSS connector supports, and they have different names for each kind of
> item element.  The RSS connector attempts to normalize all that mess
> into something more standard.
>
>>>>>>>
> But I wonder if it is working properly for Dechromed Content =
> "Dechromed content, if present, in 'description' field". When I use
> that, nothing is sent to Solr, although the job terminates OK and
> doesn't hang like it was doing before. Is that what is to be expected?
> <<<<<<
>
> I'll look at this.  The behavior you should see is an indexing
> operation per document but the content should just include the
> description string.
>
>>>>>>>
> I'm actually still confused by all the dechromed options because I
> thought that the item description was used as the dechromed content.
> So does "Dechromed content, if present, in 'description' field" mean
> that the contents of the item description element will be used for
> indexing instead of the web page specified by the link?
> <<<<<<
>
> Your understanding is correct.  When I tried this last night I looked
> at the Simple History and it looked like the description data was sent
> to the Solr index (based on the reported size).  I'll have to see
> whether it actually gets there though.
>
> Karl
>
> On Thu, Aug 4, 2011 at 1:00 PM, K McGonigal <km...@gmail.com> wrote:
>> Works great now (with  Dechromed Content = "No dechromed content").
>> Thanks!!
>>
>> I guess the only caveat is that, to use this, one has to know to add a
>> "summary" field to their Solr schema. Long-term, I wonder if the "field
>> mapping" feature could be used to let users map any RSS element (based on
>> its XPath "address") to any Solr field?
>>
>> But I wonder if it is working properly for Dechromed Content = "Dechromed
>> content, if present, in 'description' field". When I use that, nothing is
>> sent to Solr, although the job terminates OK and doesn't hang like it was
>> doing before. Is that what is to be expected?
>>
>> I'm actually still confused by all the dechromed options because I thought
>> that the item description was used as the dechromed content. So does
>> "Dechromed content, if present, in 'description' field" mean that the
>> contents of the item description element will be used for indexing instead
>> of the web page specified by the link?
>>
>> Sorry for all these questions. I appreciate your patience.
>>
>>
>> On Thu, Aug 4, 2011 at 5:11 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> Hi Kate,
>>>
>>> I did two additional check-ins yesterday evening.  Would you be so
>>> kind as to synch up from trunk and try again?  I apologize for the
>>> confusion.
>>>
>>> Karl
>>>
>>> On Wed, Aug 3, 2011 at 8:13 AM, Karl Wright <da...@gmail.com> wrote:
>>> >>>>>>>
>>> > I find it odd that I would be the first person to have this problem.
>>> > You'd think it would be very common.
>>> > <<<<<<
>>> >
>>> > Actually, I've not encountered this before even though the RSS
>>> > connector is one of the most widely used connectors.  The only
>>> > situation this ever came up in before was when some MetaCarta clients
>>> > wanted to use the description field as primary content, which is why
>>> > it is an option for the "Dechromed Content" tab.  But new feature
>>> > requests are always welcome.
>>> >
>>> > Also, as you might guess by the Derby and HSQLDB issue that you
>>> > encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
>>> > database support was added to simplify setup and allow tests to be
>>> > written that did not involve installing another package first.
>>> > However, each of these databases has known problems, some minor and
>>> > some more major.  Thus you might want to consider going to PostgreSQL
>>> > in the future if you plan on doing any serious crawling.
>>> >
>>> > Thanks again!
>>> > Karl
>>> >

Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

>>>>>>
I guess the only caveat is that, to use this, one has to know to add a
"summary" field to their Solr schema. Long-term, I wonder if the
"field mapping" feature could be used to let users map any RSS element
(based on its XPath "address") to any Solr field?
<<<<<<

The problem is that there are three different kinds of feeds that the
RSS connector supports, and they have different names for each kind of
item element.  The RSS connector attempts to normalize all that mess
into something more standard.
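
As a rough illustration of that kind of normalization, the sketch below maps format-specific element names onto one common set of field names. Everything in it is an assumption made for the example: the thread does not name the three feed kinds (RSS 2.0, RSS 1.0/RDF and Atom are guessed here), and the table and the normalize function are hypothetical rather than taken from the connector.

```python
# Assumption: the three feed kinds are RSS 2.0, RSS 1.0 (RDF) and Atom.
# Hypothetical table: per-format element name -> normalized field name.
FIELD_NAMES = {
    "rss2": {"title": "title", "description": "summary",
             "pubDate": "pubdate", "category": "category"},
    "rdf": {"title": "title", "description": "summary",
            "date": "pubdate", "subject": "category"},
    "atom": {"title": "title", "summary": "summary",
             "updated": "pubdate", "category": "category"},
}


def normalize(feed_kind, item):
    """Rename the known elements of one feed item; unknown elements are dropped."""
    mapping = FIELD_NAMES[feed_kind]
    return {mapping[name]: value for name, value in item.items() if name in mapping}


print(normalize("rss2", {"title": "Listener Survey", "description": "text", "author": "x"}))
print(normalize("atom", {"title": "Listener Survey", "summary": "text"}))
```

With a fixed table like this, a user-supplied XPath would only make sense for one of the formats, which is one way to read the difficulty described above.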

>>>>>>
But I wonder if it is working properly for Dechromed Content =
"Dechromed content, if present, in 'description' field". When I use
that, nothing is sent to Solr, although the job terminates OK and
doesn't hang like it was doing before. Is that what is to be expected?
<<<<<<

I'll look at this.  The behavior you should see is one indexing
operation per document, with the content consisting of just the
description string.

>>>>>>
I'm actually still confused by all the dechromed options because I
thought that the item description was used as the dechromed content.
So does "Dechromed content, if present, in 'description' field" mean
that the contents of the item description element will be used for
indexing instead of the web page specified by the link?
<<<<<<

Your understanding is correct.  When I tried this last night I looked
at the Simple History and it looked like the description data was sent
to the Solr index (based on the reported size).  I'll have to see
whether it actually gets there though.

Karl


Re: Field mapping for RSS feed

Posted by K McGonigal <km...@gmail.com>.

Works great now (with  Dechromed Content = "No dechromed content").
Thanks!!

I guess the only caveat is that, to use this, one has to know to add a
"summary" field to their Solr schema. Long-term, I wonder if the "field
mapping" feature could be used to let users map any RSS element (based on
its XPath "address") to any Solr field?

But I wonder if it is working properly for Dechromed Content = "Dechromed
content, if present, in 'description' field". When I use that, nothing is
sent to Solr, although the job terminates OK and doesn't hang like it was
doing before. Is that what is to be expected?

I'm actually still confused by all the dechromed options because I thought
that the item description was used as the dechromed content. So does
"Dechromed content, if present, in 'description' field" mean that the
contents of the item description element will be used for indexing instead
of the web page specified by the link?

Sorry for all these questions. I appreciate your patience.



Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

Hi Kate,

I did two additional check-ins yesterday evening.  Would you be so
kind as to synch up from trunk and try again?  I apologize for the
confusion.

Karl


Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

>>>>>>
I find it odd that I would be the first person to have this problem.
You'd think it would be very common.
<<<<<<

Actually, I've not encountered this before even though the RSS
connector is one of the most widely used connectors.  The only
situation this ever came up in before was when some MetaCarta clients
wanted to use the description field as primary content, which is why
it is an option for the "Dechromed Content" tab.  But new feature
requests are always welcome.

Also, as you might guess from the Derby and HSQLDB issue that you
encountered, most of our users use PostgreSQL.  The Derby and HSQLDB
database support was added to simplify setup and allow tests to be
written that did not involve installing another package first.
However, each of these databases has known problems, some minor and
some more serious.  Thus you might want to consider moving to
PostgreSQL in the future if you plan on doing any serious crawling.

Thanks again!
Karl


Re: Field mapping for RSS feed

Posted by K McGonigal <km...@gmail.com>.

Hi Karl,

Thank you for your quick response. I've opened a Jira ticket for this,
though I don't really understand what sort of solution you had in mind, so I
didn't propose anything.

I'm afraid I don't understand exactly what the Dechromed Content options do
either. I read about them in the End User Documentation, but there wasn't
much there yet.

I find it odd that I would be the first person to have this problem. You'd
think it would be very common.


Kate



Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

I just looked at the code.  It's not so much a bug as an oversight of
sorts.  The "description" or "content" fields are indexed as the
primary content of the document if the "chrome" mode is selected
accordingly.  If "None" is the "chrome" mode, then the item-level
description field is ignored even when present.

So I recommend simply adding a new kind of "description" field for
when the "chrome" mode is set to "None".  "item/description" may be
its name, or maybe the full XPath, your choice.  Propose something in
the ticket and I'll respond.
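
The behavior described above, together with the proposed new field, can
be sketched roughly as follows.  This is a hypothetical Python model, not
ManifoldCF code: the function name `build_document`, the mode strings,
the dict-based item, and the fallback to an empty string are all
assumptions made for illustration.

```python
def build_document(item, fetched_body, chrome_mode="none"):
    """Return (primary_content, metadata) for one feed item.

    item         -- dict that may hold 'description' and 'content' keys
    fetched_body -- text of the document the item's link points to
    chrome_mode  -- 'none', 'description', or 'content' (assumed names)
    """
    metadata = {}
    if chrome_mode in ("description", "content"):
        # Dechromed mode: the chosen feed field becomes the indexed content.
        primary = item.get(chrome_mode, "")
    else:
        # Mode "none": index the fetched document itself.  The proposal is
        # to stop dropping the item-level description and instead expose it
        # as a separate metadata field (the name here is hypothetical).
        primary = fetched_body
        if "description" in item:
            metadata["item/description"] = item["description"]
    return primary, metadata
```

With mode "none", the item description would then be available as a
metadata field that a job's field mapping could send to Solr.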

Thanks!
Karl



Re: Field mapping for RSS feed

Posted by Karl Wright <da...@gmail.com>.

Hi Kate,

The field mapping won't do the trick because the RSS connector is
currently very selective about what fields it extracts - it by no
means extracts all of them, so the ones that it *does* extract from
the feed are "special".
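
As a rough illustration of why the two description elements are distinct,
here is a small sketch using Python's standard xml.etree.ElementTree.  It
is not the RSS connector's code; the sample feed values and the function
name `descriptions` are made up for the example.  The channel-level and
item-level descriptions sit at different paths and have to be pulled out
separately:

```python
import xml.etree.ElementTree as ET

# Sample feed with the same general structure as in the original post.
FEED = """<rss>
    <channel>
        <title>Example feed</title>
        <link>http://example.com/</link>
        <description>channel description</description>
        <item>
            <title>First item</title>
            <link>http://example.com/1</link>
            <description>item description</description>
        </item>
    </channel>
</rss>"""


def descriptions(feed_xml):
    """Return (channel_description, [item_description, ...])."""
    root = ET.fromstring(feed_xml)  # root element is <rss>
    # Path relative to <rss>: matches only the channel's own description.
    channel_desc = root.findtext("channel/description")
    # One entry per <item>; None if an item has no <description>.
    item_descs = [item.findtext("description")
                  for item in root.findall("channel/item")]
    return channel_desc, item_descs
```

A tool that only reads the first path never sees the second, regardless
of what a downstream field mapping asks for.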

The behavior you describe sounds like a bug to me.  I'll go spelunking
through the code at first opportunity.  In the meantime, could you
create a Jira ticket describing the behavior you see vs. the behavior
you want?

Thanks!
Karl
