Posted to user@manifoldcf.apache.org by K McGonigal <km...@gmail.com> on 2011/08/12 22:42:16 UTC

Trouble indexing a Twitter search in RSS format

Sorry to bother everyone again but I'm having trouble with an RSS connector
job on a Twitter search. When I try to run a job on
http://search.twitter.com/search.rss?q=Campylobacter the fetch appears to
work OK, but the document ingestion does not occur.

I was wondering if it is just my setup, or could it be the redirection that
Twitter does on the links. For instance, a link shown in the RSS feed as
http://twitter.com/VashinkaInuiel/statuses/101493222852923393 redirects to
http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when it is
followed.

Any help is much appreciated.

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

Never mind on the ticket  - I created it.  CONNECTORS-239.

Karl


On Mon, Aug 15, 2011 at 6:05 PM, Karl Wright <da...@gmail.com> wrote:
>> Also, there appears to be a little bug in that if "Use chromed content if no
>> dechromed content found" is selected, when you go back to edit that job, it
>> is not selected (i.e. neither of the bottom two radio buttons are active).
>> Should I open a JIRA ticket for that?
>
> Yes please.
>
> For the rest, I suspect that you have been running the same job over
> and over again to get the results you describe.  However, you should
> be aware that ManifoldCF is an incremental crawler.  It will NOT
> reindex content that has not changed between job runs.
>
> So the only result that is definitely weird is:
>
>> case 4)  "Dechromed content, if present, in 'description' field" and
>>          "Never use chromed content"
>>          --> Ingests but both "description" and "summary" fields ARE EMPTY in Solr
>
> I'd like to play with this one here, if you can give me the URL in
> question that you are using.
>
> Karl
>
> On Mon, Aug 15, 2011 at 4:07 PM, K McGonigal <km...@gmail.com> wrote:
>> That makes sense, but my  RSS feed DOES have a "description" field within
>> the "item" field.
>>
>> Upon further experimentation with the two sets of dechromed radio buttons, I
>> found the following.
>>
>> case 1)  "No dechromed content" and
>>          "Use chromed content if no dechromed content found"
>>          --> Ingests to both "description" and "summary" fields in Solr
>> e.g.
>>>
>>> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]} 0 0
>>> 15-Aug-2011 2:51:26 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/update/extract params={literal.service=Twitter&literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.summary=<em>Campylobacter</em>+bacteria:+<em>Campylobacter</em>+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+<a+href%3D"http://t.co/0Bk8mTm">http://t.co/0Bk8mTm</a>&literal.id=http://twitter.com/MicrobeWorld/statuses/103102842524545025&literal.title=Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416607000} status=0 QTime=0
>>> 15-Aug-2011 2:51:31 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>>
>>
>> case 2)  "No dechromed content" and "Never use chromed content"
>>          --> didn't ingest
>>
>> case 3)  "Dechromed content, if present, in 'description' field" and
>>          "Use chromed content if no dechromed content found"
>>          --> didn't ingest
>>
>> case 4)  "Dechromed content, if present, in 'description' field" and
>>          "Never use chromed content"
>>          --> Ingests but both "description" and "summary" fields ARE EMPTY in Solr
>> e.g.
>>>
>>> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]} 0 0
>>> 15-Aug-2011 3:04:02 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>>
>>
>> I hope that is all to be expected.
>>
>> Also, there appears to be a little bug in that if "Use chromed content if no
>> dechromed content found" is selected, when you go back to edit that job, it
>> is not selected (i.e. neither of the bottom two radio buttons are active).
>> Should I open a JIRA ticket for that?
>>
>>
>> On Mon, Aug 15, 2011 at 11:49 AM, Karl Wright <da...@gmail.com> wrote:
>>>
>>> The behavior depends on the setting of the other pair of radio buttons
>>> on that tab.  You can select "Use chromed content if not found" or
>>> "Never use chromed content".  So, if the feed has no "description"
>>> field for the document, and the dechromed content setting is
>>> "description field", and the other setting is "Never use chromed
>>> content", no document will be indexed.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Aug 15, 2011 at 12:44 PM, K McGonigal <km...@gmail.com> wrote:
>>> > I deleted my twitter RSS job and created another one and now it works!
>>> >
>>> > Doing some experimentation, I see that when Dechromed Content is set to
>>> > "No dechromed content" it ingests fine, but when set to "if present, in
>>> > 'description' field" it doesn't do the ingestion (nothing is added to
>>> > Solr).  Is that to be expected?
>>> >
>>> >
>>> > Kate
>>> >
>>> >
>>> > On Mon, Aug 15, 2011 at 10:48 AM, Karl Wright <da...@gmail.com> wrote:
>>> >>
>>> >> Regardless of the twitter sign-in issue, I'd still expect the RSS
>>> >> connector to index whatever it finds at the redirected page, even if
>>> >> it's not very useful stuff.  Could you send me a screen shot of the
>>> >> view page for the RSS connection and for the RSS job?  Also, if you
>>> >> could delete the job that contains the twitter RSS feed and recreate
>>> >> it, then crawl, I'd like to see the simple history for that crawl.
>>> >>
>>> >> Thanks,
>>> >> Karl
>>> >>
>>> >> On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <km...@gmail.com> wrote:
>>> >> > Hmm, that's odd the URLs didn't work for you.  I've asked other people
>>> >> > here to try them and they had no problems.
>>> >> >
>>> >> > After your suggestion I tried the web connector (but still with no
>>> >> > access credentials) and it did pretty well ingesting the RSS feed, so I
>>> >> > might be able to just use that.
>>> >> >
>>> >> > I'm still mystified as to why the RSS connector couldn't handle it
>>> >> > though. I turned on DEBUG logging in Manifold, but that did not show
>>> >> > anything unusual.
>>> >> >
>>> >> > Thanks,
>>> >> > Kate
>>> >> >
>>> >> > On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com> wrote:
>>> >> >>
>>> >> >> When I drop any of these URLs into my browser, I get redirected to a
>>> >> >> login screen.  Therefore it looks to me like Twitter does some kind of
>>> >> >> session-based login, tracked with cookies.  That would require
>>> >> >> maintenance of session cookies which the RSS connector simply does not
>>> >> >> do, and the coding of a login sequence as well.
>>> >> >>
>>> >> >> This is not a straightforward feature to add to the RSS connector, by
>>> >> >> any means.
>>> >> >>
>>> >> >> The web connector does have support for login sequencing and cookie
>>> >> >> session maintenance, and it does know how to chase RSS feeds, so that
>>> >> >> might be an option for you to try.  The problem is that most login
>>> >> >> sequences are non-trivial to set up and you will need a lot of
>>> >> >> patience and web spelunking skills to get it right.  The documentation
>>> >> >> is of some help but really could use a good example.
>>> >> >>
>>> >> >>
>>> >> >> Hope this helps.
>>> >> >> Karl
>>> >> >>
>
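
Karl's description of how the two pairs of radio buttons interact can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the thread, not ManifoldCF code: the function name, the constants, and the idea of passing the feed's "description" text and the fetched page body as strings are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical names for the two radio-button groups discussed in the thread.
DECHROMED_NONE = "none"                # "No dechromed content"
DECHROMED_DESCRIPTION = "description"  # "Dechromed content, if present, in 'description' field"
CHROMED_USE = "use"                    # "Use chromed content if no dechromed content found"
CHROMED_NEVER = "never"                # "Never use chromed content"


def choose_content(dechromed_mode: str, chromed_mode: str,
                   description: Optional[str],
                   page_body: Optional[str]) -> Optional[str]:
    """Return the text to index, or None when nothing should be indexed."""
    # Dechromed content wins when it is requested and the feed item has it.
    if dechromed_mode == DECHROMED_DESCRIPTION and description:
        return description
    # Otherwise fall back to the fetched (chromed) page only if allowed.
    if chromed_mode == CHROMED_USE:
        return page_body
    # "Never use chromed content" and no dechromed content: index nothing.
    return None
```

Under these assumptions, a feed item with no "description", the dechromed setting on "description field", and "Never use chromed content" selected yields None, which matches the "no document will be indexed" case Karl describes.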

Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

Ah, I see. Thank you for this.


Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

Sorry, I was misaligned.  But it actually is true that the pages differ.  I
captured two fetches of the same document and diff'd them:

root@duck96:~# diff file1.txt file2.txt
408c408
< </html><!-- 1313589725 -->
\ No newline at end of file
---
> </html><!-- 1313589820 -->
\ No newline at end of file
root@duck96:~#

So that is indeed the correct explanation.
Karl
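
The effect of that trailing comment on an incremental crawler can be illustrated with a short Python sketch. It assumes, purely for illustration, that "has this document changed?" is decided by comparing a hash of the fetched text; the thread does not say how ManifoldCF actually computes its version information, and the page body and description strings below are made up. Only the two trailing comment values come from the diff above.

```python
import hashlib


def version_of(text: str) -> str:
    """Stand-in for a crawler's change-detection fingerprint (an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical page body; only the trailing comments are taken from the diff.
body = "<html><body>Same article text</body></html>"
fetch1 = body + "<!-- 1313589725 -->"
fetch2 = body + "<!-- 1313589820 -->"

# Hypothetical feed "description" text, identical on both runs.
description_run1 = "Same article text"
description_run2 = "Same article text"

page_changed = version_of(fetch1) != version_of(fetch2)
description_changed = version_of(description_run1) != version_of(description_run2)

print("page changed:", page_changed)
print("description changed:", description_changed)
```

Both fetches have the same length, yet the full page counts as changed on every run, while the feed description does not. That is consistent with the page being re-ingested each time in chromed mode and with repeated runs indexing nothing in a dechromed mode.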



Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

Thanks Karl.  But it looks to me like all the documents are the same size in
both runs. They are just indexed in a different order (for some unknown
reason).

Kate


On Tue, Aug 16, 2011 at 7:44 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Kate,
>
> I ran a job based on the same feed twice.  Here are the results, from the
> simple history:
>
> Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
> 08-16-2011 20:38:10.924 | job end | 1313541280969(jazz) | | 0 | 1 |
> 08-16-2011 20:37:57.179 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 18 |
> 08-16-2011 20:37:56.241 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 905 |
> 08-16-2011 20:37:52.117 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 15 |
> 08-16-2011 20:37:51.241 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 839 |
> 08-16-2011 20:37:47.292 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 19 |
> 08-16-2011 20:37:46.241 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 1003 |
> 08-16-2011 20:37:42.149 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 19 |
> 08-16-2011 20:37:41.241 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 887 |
> 08-16-2011 20:37:37.165 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 20 |
> 08-16-2011 20:37:36.241 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 898 |
> 08-16-2011 20:37:32.783 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 19 |
> 08-16-2011 20:37:31.241 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 922 |
> 08-16-2011 20:37:27.191 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 52 |
> 08-16-2011 20:37:26.241 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 912 |
> 08-16-2011 20:37:21.241 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 542 |
> 08-16-2011 20:37:20.970 | job start | 1313541280969(jazz) | | 0 | 1 |
> 08-16-2011 20:37:00.893 | job end | 1313541280969(jazz) | | 0 | 1 |
> 08-16-2011 20:36:49.123 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 17 |
> 08-16-2011 20:36:48.076 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 1028 |
> 08-16-2011 20:36:44.305 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 34 |
> 08-16-2011 20:36:43.076 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 1208 |
> 08-16-2011 20:36:39.175 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 23 |
> 08-16-2011 20:36:38.076 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 1087 |
> 08-16-2011 20:36:33.983 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 24 |
> 08-16-2011 20:36:33.076 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 896 |
> 08-16-2011 20:36:29.297 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 24 |
> 08-16-2011 20:36:28.774 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 35 |
> 08-16-2011 20:36:28.076 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 1204 |
> 08-16-2011 20:36:23.076 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 5679 |
> 08-16-2011 20:36:21.130 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 418 |
> 08-16-2011 20:36:18.076 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 2969 |
> 08-16-2011 20:36:13.094 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 1945 |
> 08-16-2011 20:36:10.870 | job start | 1313541280969(jazz) | | 0 | 1 |
>
> Note that on each run, the size of each document being indexed changes.
> This is likely due to "chrome" (advertisements, etc.) which are dynamically
> delivered by the site in a random way.  The RSS connector will, of course,
> not be able to recognize that the content you are interested in hasn't
> changed, because as far as it can tell it *has*.
>
> This is very different from the case where you use the "dechromed"
> content based on the "description" field, because it is the actual feed
> description field that is indexed, not the document contents, and therefore
> no chrome will be present.  Thus you are more likely to see repeated runs of
> a job index nothing if the job has a "dechromed" content mode set.
>
> Karl
>
>
>
> On Tue, Aug 16, 2011 at 5:07 PM, K McGonigal <km...@gmail.com> wrote:
>
>> Hmm. I will keep this in mind, but I'm confused again. I just ran this job
>> twice in a row and pretty much the same thing was sent to Solr.  The same
>> number of items (7) were "add"ed. I think they were the same items, just in
>> a different order. The second run also deleted an item from Solr that was
>> not in the RSS document.  I'm pretty sure the RSS feed document or the
>> linked documents did not change.
>>
>> A snippet from the first run:
>>
>>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
>>> 16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=16
>>> 16-Aug-2011 3:18:13 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>>>
>>
>> A snippet from the second run:
>>
>>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
>>> 16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=15
>>> 16-Aug-2011 3:28:00 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>>>
>>
>> I think they are identical.
>>
>>
>> View a Job
>>> ------------------------------
>>> Name: OMJ
>>> ------------------------------
>>> Output connection: Solr
>>> Repository connection: RSS
>>> ------------------------------
>>> Priority: 5
>>> Start method: Don't automatically start
>>> ------------------------------
>>> Schedule type: Scan every document once
>>> Minimum recrawl interval: Not applicable
>>> Expiration interval: Not applicable
>>> Reseed interval: Not applicable
>>> ------------------------------
>>> No scheduled run times
>>> ------------------------------
>>> Field mappings (columns: Metadata field name, Solr field name): No field mapping specified
>>> ------------------------------
>>> RSS urls: http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
>>> ------------------------------
>>> No url canonicalization specified; will reorder all urls and remove all sessions
>>> ------------------------------
>>> No mappings specified; will accept all urls
>>> ------------------------------
>>> Feed connection timeout (seconds): 60
>>> Default feed rescan interval (minutes): 60
>>> Minimum feed rescan interval (minutes): 15
>>> Bad feed rescan interval (minutes): (Default feed rescan value)
>>> ------------------------------
>>> Dechromed content source: none
>>> Chromed content: none
>>> ------------------------------
>>> No access tokens specified
>>> ------------------------------
>>> No metadata specified
>>
>>
>>
>> View Repository Connection Status
>> ------------------------------
>> Name: RSS
>> Description:
>> ------------------------------
>> Connection type: RSS
>> Max connections: 10
>> Authority: None (global authority)
>> ------------------------------
>> Throttling (columns: Bin regular expression, Description, Max avg fetches/min): No throttles
>> ------------------------------
>> Parameters:
>>   Proxy port=
>>   Proxy authentication password=********
>>   Max server connections=2
>>   Proxy host=
>>   KB per second=64
>>   Robots usage=none
>>   Proxy authentication user name=
>>   Max fetches per minute=12
>>   Email address=kmcgoniga@gmail.com
>>   Proxy authentication domain=
>>   Throttle group=
>> ------------------------------
>> Connection status: Connection working
>>
>>
>
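
Kate's point that the documents are the same size in both runs can be checked mechanically from the "fetch" rows of the simple history quoted in this message. The Python sketch below copies the byte counts from those rows; the dictionary layout and the helper name changed_sizes are assumptions made for the example, and the feed URL is written as it appears in the job definition.

```python
# Bytes reported by the "fetch" rows of the first run (started 20:36:10).
first_run = {
    "http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/": 3973,
    "http://www.onemansjazz.ca/content/view/333/30/": 17606,
    "http://www.onemansjazz.ca/content/view/330/50/": 22605,
    "http://www.onemansjazz.ca/content/view/329/30/": 17105,
    "http://www.onemansjazz.ca/content/view/331/30/": 16980,
    "http://www.onemansjazz.ca/content/view/336/30/": 17473,
    "http://www.onemansjazz.ca/content/view/332/30/": 17083,
    "http://www.onemansjazz.ca/content/view/334/30/": 16718,
}

# Bytes reported by the "fetch" rows of the second run (started 20:37:20).
second_run = {
    "http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/": 3973,
    "http://www.onemansjazz.ca/content/view/329/30/": 17105,
    "http://www.onemansjazz.ca/content/view/336/30/": 17473,
    "http://www.onemansjazz.ca/content/view/332/30/": 17083,
    "http://www.onemansjazz.ca/content/view/333/30/": 17606,
    "http://www.onemansjazz.ca/content/view/330/50/": 22605,
    "http://www.onemansjazz.ca/content/view/334/30/": 16718,
    "http://www.onemansjazz.ca/content/view/331/30/": 16980,
}


def changed_sizes(before: dict, after: dict) -> list:
    """Return the identifiers whose byte counts differ between two runs."""
    return sorted(url for url in before.keys() | after.keys()
                  if before.get(url) != after.get(url))


differences = changed_sizes(first_run, second_run)
print(differences)  # → []
```

An empty list confirms that every fetch returned the same number of bytes in both runs. Equal sizes do not prove equal content, though: a page whose only difference is a same-length value, such as a ten-digit timestamp, would pass this check and still count as changed.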

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

Hi Kate,

I ran a job based on the same feed twice.  Here are the results, from the
simple history:

Start Time               Activity                Identifier                                                                    Result Code  Bytes  Time  Result Description
08-16-2011 20:38:10.924  job end                 1313541280969(jazz)                                                                        0      1
08-16-2011 20:37:57.179  document ingest (solr)  http://www.onemansjazz.ca/content/view/331/30/                                200          16980  18
08-16-2011 20:37:56.241  fetch                   http://www.onemansjazz.ca/content/view/331/30/                                200          16980  905
08-16-2011 20:37:52.117  document ingest (solr)  http://www.onemansjazz.ca/content/view/334/30/                                200          16718  15
08-16-2011 20:37:51.241  fetch                   http://www.onemansjazz.ca/content/view/334/30/                                200          16718  839
08-16-2011 20:37:47.292  document ingest (solr)  http://www.onemansjazz.ca/content/view/330/50/                                200          22605  19
08-16-2011 20:37:46.241  fetch                   http://www.onemansjazz.ca/content/view/330/50/                                200          22605  1003
08-16-2011 20:37:42.149  document ingest (solr)  http://www.onemansjazz.ca/content/view/333/30/                                200          17606  19
08-16-2011 20:37:41.241  fetch                   http://www.onemansjazz.ca/content/view/333/30/                                200          17606  887
08-16-2011 20:37:37.165  document ingest (solr)  http://www.onemansjazz.ca/content/view/332/30/                                200          17083  20
08-16-2011 20:37:36.241  fetch                   http://www.onemansjazz.ca/content/view/332/30/                                200          17083  898
08-16-2011 20:37:32.783  document ingest (solr)  http://www.onemansjazz.ca/content/view/336/30/                                200          17473  19
08-16-2011 20:37:31.241  fetch                   http://www.onemansjazz.ca/content/view/336/30/                                200          17473  922
08-16-2011 20:37:27.191  document ingest (solr)  http://www.onemansjazz.ca/content/view/329/30/                                200          17105  52
08-16-2011 20:37:26.241  fetch                   http://www.onemansjazz.ca/content/view/329/30/                                200          17105  912
08-16-2011 20:37:21.241  fetch                   http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/  200          3973   542
08-16-2011 20:37:20.970  job start               1313541280969(jazz)                                                                        0      1
08-16-2011 20:37:00.893  job end                 1313541280969(jazz)                                                                        0      1
08-16-2011 20:36:49.123  document ingest (solr)  http://www.onemansjazz.ca/content/view/334/30/                                200          16718  17
08-16-2011 20:36:48.076  fetch                   http://www.onemansjazz.ca/content/view/334/30/                                200          16718  1028
08-16-2011 20:36:44.305  document ingest (solr)  http://www.onemansjazz.ca/content/view/332/30/                                200          17083  34
08-16-2011 20:36:43.076  fetch                   http://www.onemansjazz.ca/content/view/332/30/                                200          17083  1208
08-16-2011 20:36:39.175  document ingest (solr)  http://www.onemansjazz.ca/content/view/336/30/                                200          17473  23
08-16-2011 20:36:38.076  fetch                   http://www.onemansjazz.ca/content/view/336/30/                                200          17473  1087
08-16-2011 20:36:33.983  document ingest (solr)  http://www.onemansjazz.ca/content/view/331/30/                                200          16980  24
08-16-2011 20:36:33.076  fetch                   http://www.onemansjazz.ca/content/view/331/30/                                200          16980  896
08-16-2011 20:36:29.297  document ingest (solr)  http://www.onemansjazz.ca/content/view/329/30/                                200          17105  24
08-16-2011 20:36:28.774  document ingest (solr)  http://www.onemansjazz.ca/content/view/330/50/                                200          22605  35
08-16-2011 20:36:28.076  fetch                   http://www.onemansjazz.ca/content/view/329/30/                                200          17105  1204
08-16-2011 20:36:23.076  fetch                   http://www.onemansjazz.ca/content/view/330/50/                                200          22605  5679
08-16-2011 20:36:21.130  document ingest (solr)  http://www.onemansjazz.ca/content/view/333/30/                                200          17606  418
08-16-2011 20:36:18.076  fetch                   http://www.onemansjazz.ca/content/view/333/30/                                200          17606  2969
08-16-2011 20:36:13.094  fetch                   http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/  200          3973   1945
08-16-2011 20:36:10.870  job start               1313541280969(jazz)                                                                        0      1

Note that on each run, the size of each document being indexed changes.
This is likely due to "chrome" (advertisements, etc.), which the site
delivers dynamically and at random.  The RSS connector will, of course,
not be able to recognize that the content you are interested in hasn't
changed, because as far as it can tell it *has*.

This is very different from the case where you use the "dechromed"
content based on the "description" field.  There, it is the actual feed
description field that is indexed, not the document contents, and therefore
no chrome will be present.  Thus you are more likely to see repeated runs of
a job index nothing if the job has a "dechromed" content mode set.
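
To make the chrome effect concrete, here is a minimal sketch of hash-based change detection. It is only an illustration under stated assumptions: the helper names, the sample text, and the use of SHA-256 are all made up for this example, and it is not a description of how ManifoldCF actually computes document versions.

```python
import hashlib

def version_string(text):
    """Fingerprint of the text handed to the indexer (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(previous_version, text):
    """An incremental crawler only re-sends a document whose fingerprint changed."""
    return previous_version != version_string(text)

# Invented sample data: the same article wrapped in different "chrome" on each fetch.
article = "Listener Survey: please take a few minutes to respond."
page_run1 = "<div class='ad'>banner 17</div>" + article
page_run2 = "<div class='ad'>banner 42</div>" + article

seen_page = version_string(page_run1)        # remembered from the first run
seen_description = version_string(article)   # feed description only, no chrome

page_changed = needs_reindex(seen_page, page_run2)
description_changed = needs_reindex(seen_description, article)
print(page_changed, description_changed)
```

With the whole page as input the second run looks like a change even though the article is the same; with only the description as input it does not.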

Karl


On Tue, Aug 16, 2011 at 5:07 PM, K McGonigal <km...@gmail.com> wrote:

> Hmm. I will keep this in mind, but I'm confused again. I just ran this job
> twice in a row and pretty much the same thing was sent to Solr.  The same
> number of items (7) were "add"ed. I think they were the same items, just in
> a different order. The second run also deleted an item from Solr that was
> not in the RSS document.  I'm pretty sure the RSS feed document or the
> linked documents did not change.
>
> A snippet from the first run:
>
> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
>> 16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract params={literal.source=
>> http://www.one
>>
>> mansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=New
>>
>> s+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the
>>
>> +time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluat
>>
>> ion+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+ar
>>
>> isen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is
>>
>> +a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Clic
>> k+here+(
>> http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://w<http://www.surveymonkey.com/s/C3DZ3JK%29++for+Part+One,+and+here+%28http://w>
>>
>> ww.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&li<http://ww.surveymonkey.com/s/C38FVH8%29++for+Part+Two.+++Thanks+again+for+your+input.+&li>
>> teral.id=
>> http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+S
>> urvey&literal.pubdate=1310475289000} status=0 QTime=16
>> 16-Aug-2011 3:18:13 PM org.apache.solr.update.processor.LogUpdateProcessor
>> finis
>> h
>>
>
> A snippet from the second run:
>
> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
>> 16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract params={literal.source=
>> http://www.one
>>
>> mansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=New
>>
>> s+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the
>>
>> +time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluat
>>
>> ion+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+ar
>>
>> isen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is
>>
>> +a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Clic
>> k+here+(
>> http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://w<http://www.surveymonkey.com/s/C3DZ3JK%29++for+Part+One,+and+here+%28http://w>
>>
>> ww.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&li<http://ww.surveymonkey.com/s/C38FVH8%29++for+Part+Two.+++Thanks+again+for+your+input.+&li>
>> teral.id=
>> http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+S
>> urvey&literal.pubdate=1310475289000} status=0 QTime=15
>> 16-Aug-2011 3:28:00 PM org.apache.solr.update.processor.LogUpdateProcessor
>> finis
>> h
>>
>
> I think they are identical.
>
>
> View a Job
>>  ------------------------------
>>  Name:OMJ
>> ------------------------------
>>  Output connection: Solr Repository connection: RSS
>> ------------------------------
>>  Priority:5 Start method:Don't automatically start
>> ------------------------------
>>  Schedule type:Scan every document once Minimum recrawl interval:Not
>> applicable  Expiration interval:Not applicable Reseed interval:Not
>> applicable
>> ------------------------------
>>  No scheduled run times
>> ------------------------------
>>    Field mappings:  Metadata field name Solr field name No field mapping
>> specified
>> ------------------------------
>>    RSS urls:
>> http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
>>  ------------------------------
>> No url canonicalization specified; will reorder all urls and remove all
>> sessions
>> ------------------------------
>> No mappings specified; will accept all urls
>> ------------------------------
>>  Feed connection timeout (seconds): 60  Default feed rescan interval
>> (minutes): 60  Minimum feed rescan interval (minutes): 15  Bad feed
>> rescan interval (minutes): (Default feed rescan value)
>> ------------------------------
>>  Dechromed content source: none  Chromed content: none
>> ------------------------------
>> No access tokens specified
>> ------------------------------
>> No metadata specified
>
>
>
> View Repository Connection Status
>  ------------------------------
>  Name:RSS Description:
>  ------------------------------
>  Connection type:RSS Max connections:10  Authority:None (global authority)
> ------------------------------
>  Throttling:  Bin regular expression Description Max avg fetches/min No
> throttles
> ------------------------------
>    Parameters: Proxy port=
> Proxy authentication password=********
> Max server connections=2
> Proxy host=
> KB per second=64
> Robots usage=none
> Proxy authentication user name=
> Max fetches per minute=12
> Email address=kmcgoniga@gmail.com
> Proxy authentication domain=
> Throttle group=
>    ------------------------------
>  Connection status:Connection working
>
>

Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

Hmm. I will keep this in mind, but I'm confused again. I just ran this job
twice in a row and pretty much the same thing was sent to Solr.  The same
number of items (7) were "add"ed. I think they were the same items, just in
a different order. The second run also deleted an item from Solr that was
not in the RSS document.  I'm pretty sure neither the RSS feed document nor
the linked documents changed.

A snippet from the first run:

> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
> 16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=16
> 16-Aug-2011 3:18:13 PM org.apache.solr.update.processor.LogUpdateProcessor finish

A snippet from the second run:

> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
> 16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=15
> 16-Aug-2011 3:28:00 PM org.apache.solr.update.processor.LogUpdateProcessor finish

I think they are identical.


View a Job
>  ------------------------------
>  Name:OMJ
> ------------------------------
>  Output connection: Solr Repository connection: RSS
> ------------------------------
>  Priority:5 Start method:Don't automatically start
> ------------------------------
>  Schedule type:Scan every document once Minimum recrawl interval:Not
> applicable  Expiration interval:Not applicable Reseed interval:Not
> applicable
> ------------------------------
>  No scheduled run times
> ------------------------------
>    Field mappings:  Metadata field name Solr field name No field mapping
> specified
> ------------------------------
>    RSS urls:
> http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
>  ------------------------------
> No url canonicalization specified; will reorder all urls and remove all
> sessions
> ------------------------------
> No mappings specified; will accept all urls
> ------------------------------
>  Feed connection timeout (seconds): 60  Default feed rescan interval
> (minutes): 60  Minimum feed rescan interval (minutes): 15  Bad feed rescan
> interval (minutes): (Default feed rescan value)
> ------------------------------
>  Dechromed content source: none  Chromed content: none
> ------------------------------
> No access tokens specified
> ------------------------------
> No metadata specified



View Repository Connection Status
 ------------------------------
 Name:RSS Description:
 ------------------------------
 Connection type:RSS Max connections:10  Authority:None (global authority)
------------------------------
 Throttling:  Bin regular expression Description Max avg fetches/min No
throttles
------------------------------
   Parameters: Proxy port=
Proxy authentication password=********
Max server connections=2
Proxy host=
KB per second=64
Robots usage=none
Proxy authentication user name=
Max fetches per minute=12
Email address=kmcgoniga@gmail.com
Proxy authentication domain=
Throttle group=
   ------------------------------
 Connection status:Connection working

Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

Yes, this agrees. Thank you for all your help and patience.

Kate

On Tue, Aug 16, 2011 at 4:44 AM, Karl Wright <da...@gmail.com> wrote:

> Using your twitter RSS feed, dechromed mode="description", and chromed
> mode="skip", and turning off robots exclusion, I get a number of
> indexing operations. The following Solr log output corresponds to one
> such:
>
> INFO: {add=[http://twitter.com/DraRositaperez/statuses/103103998965456896]}
> 0 2
> Aug 16, 2011 5:28:52 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract
> params={literal.source=
> http://search.twitter.com/search.rss?q%3DCampylobacter&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416883000
> }
> status=0 QTime=2
>
> The document's source, title, and pubdate seem to all be set.  The
> feed's "description" field is the actual content that is being indexed
> into Solr, so that is not present in the Solr url but should be
> present in the post data.  So the only question, then, is the
> "summary" field.  Looking at the feed itself, I see <title> fields and
> <description> fields, but no <content> fields, so it makes sense that
> there would be no summary metadata.
>
> Hope this helps.  Does this agree with what you are seeing?
> Karl
>
> >
> > For the rest, I suspect that you have been running the same job over
> > and over again to get the results you describe.  However, you should
> > be aware that ManifoldCF is an incremental crawler.  It will NOT
> > reindex content that has not changed between job runs.
> >
> > So the only result that is definitely weird is:
> >
> >> case 4)  "Dechromed content, if present, in 'description' field" and
> "Never
> >> use chromed content"
> >>                      --> Ingests but both "description" and "summary"
> fields
> >> ARE EMPTY in Solr
> >
>

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

Hi Kate,

Another point you should be aware of - this site has a robots
exclusion for crawlers, so unless you override that you will not be
able to fetch either feeds or content.  There are two ways to do the
override - you can set it to just allow the feed itself, or you can
set it to allow both feed and content.  If you select the former, then
any secondary (document) fetches will be disallowed.

Should you crawl repeatedly when the site owner says "no robots", you
can also wind up being blocked by the site owner.  In that case your
crawls will all cease to work suddenly and without warning.
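
As a rough illustration of the two override choices described above, the sketch below checks URLs against a robots.txt policy using Python's standard `urllib.robotparser`. The robots.txt text, the user agent name, and the mode names ("obey", "feeds_exempt", "ignore") are all hypothetical; they are not the connector's actual settings, and the real robots.txt of the site may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids every crawler (stand-in for the real file).
SAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

def robots_allows(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def may_fetch(url, is_feed, mode, robots_txt=SAMPLE_ROBOTS_TXT):
    """Apply an override mode (hypothetical names) on top of robots.txt."""
    if mode == "ignore":                    # override for feeds and documents
        return True
    if mode == "feeds_exempt" and is_feed:  # override for the feed itself only
        return True
    return robots_allows(robots_txt, "example-crawler", url)

FEED = "http://search.twitter.com/search.rss?q=Campylobacter"
DOC = "http://twitter.com/VashinkaInuiel/statuses/101493222852923393"

for mode in ("obey", "feeds_exempt", "ignore"):
    print(mode, may_fetch(FEED, True, mode), may_fetch(DOC, False, mode))
```

Under the "feeds_exempt" mode the feed is fetched but every secondary (document) fetch is still refused, which matches the first of the two choices.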

Thanks,
Karl


On Tue, Aug 16, 2011 at 5:44 AM, Karl Wright <da...@gmail.com> wrote:
> Using your twitter RSS feed, dechromed mode="description", and chromed
> mode="skip", and turning off robots exclusion, I get a number of
> indexing operations. The following Solr log output corresponds to one
> such:
>
> INFO: {add=[http://twitter.com/DraRositaperez/statuses/103103998965456896]} 0 2
> Aug 16, 2011 5:28:52 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract
> params={literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416883000}
> status=0 QTime=2
>
> The document's source, title, and pubdate seem to all be set.  The
> feed's "description" field is the actual content that is being indexed
> into Solr, so that is not present in the Solr url but should be
> present in the post data.  So the only question, then, is the
> "summary" field.  Looking at the feed itself, I see <title> fields and
> <description> fields, but no <content> fields, so it makes sense that
> there would be no summary metadata.
>
> Hope this helps.  Does this agree with what you are seeing?
> Karl
>
>>
>> For the rest, I suspect that you have been running the same job over
>> and over again to get the results you describe.  However, you should
>> be aware that ManifoldCF is an incremental crawler.  It will NOT
>> reindex content that has not changed between job runs.
>>
>> So the only result that is definitely weird is:
>>
>>> case 4)  "Dechromed content, if present, in 'description' field" and "Never
>>> use chromed content"
>>>                      --> Ingests but both "description" and "summary" fields
>>> ARE EMPTY in Solr
>>
>

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

Using your twitter RSS feed, dechromed mode="description", and chromed
mode="skip", and turning off robots exclusion, I get a number of
indexing operations. The following Solr log output corresponds to one
such:

INFO: {add=[http://twitter.com/DraRositaperez/statuses/103103998965456896]} 0 2
Aug 16, 2011 5:28:52 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416883000} status=0 QTime=2

The document's source, title, and pubdate seem to all be set.  The
feed's "description" field is the actual content that is being indexed
into Solr, so that is not present in the Solr url but should be
present in the post data.  So the only question, then, is the
"summary" field.  Looking at the feed itself, I see <title> fields and
<description> fields, but no <content> fields, so it makes sense that
there would be no summary metadata.
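
For anyone who wants to check which metadata fields a given request carried, here is a small sketch that decodes the params string from the log line above with Python's standard library. Treating `literal.pubdate` as milliseconds since the Unix epoch is an assumption on my part, based only on the size of the number.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# The params={...} portion of the Solr log line quoted above.
params = (
    "literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter"
    "&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896"
    "&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter"
    "+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal"
    "+illness...+http://t.co/0Bk8mTm"
    "&literal.pubdate=1313416883000"
)

# parse_qs splits on '&', then on the first '=', and decodes '+' and %XX escapes.
fields = {name: values[0] for name, values in parse_qs(params).items()}

# Assumption: pubdate is milliseconds since the epoch.
published = datetime.fromtimestamp(int(fields["literal.pubdate"]) / 1000,
                                   tz=timezone.utc)

for name in sorted(fields):
    print(name, "=", fields[name])
print("has summary:", "literal.summary" in fields)
print("published:", published.isoformat())
```

The decoded field names are source, id, title, and pubdate, with no summary, which is consistent with the reading of the log given above.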

Hope this helps.  Does this agree with what you are seeing?
Karl

>
> For the rest, I suspect that you have been running the same job over
> and over again to get the results you describe.  However, you should
> be aware that ManifoldCF is an incremental crawler.  It will NOT
> reindex content that has not changed between job runs.
>
> So the only result that is definitely weird is:
>
>> case 4)  "Dechromed content, if present, in 'description' field" and "Never
>> use chromed content"
>>                      --> Ingests but both "description" and "summary" fields
>> ARE EMPTY in Solr
>

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

> Also, there appears to be a little bug in that if "Use chromed content if no
> dechromed content found" is selected, when you go back to edit that job, it
> is not selected (i.e. neither of the bottom two radio buttons are active).
> Should I open a JIRA ticket for that?

Yes please.

For the rest, I suspect that you have been running the same job over
and over again to get the results you describe.  However, you should
be aware that ManifoldCF is an incremental crawler.  It will NOT
reindex content that has not changed between job runs.

So the only result that is definitely weird is:

> case 4)  "Dechromed content, if present, in 'description' field" and "Never
> use chromed content"
>                      --> Ingests but both "description" and "summary" fields
> ARE EMPTY in Solr

I'd like to play with this one here, if you can give me the URL in
question that you are using.

Karl

On Mon, Aug 15, 2011 at 4:07 PM, K McGonigal <km...@gmail.com> wrote:
> That makes sense, but my  RSS feed DOES have a "description" field within
> the "item" field.
>
> Upon further experimentation with the two sets of dechromed radio buttons, I
> found the following.
>
> case 1)  "No dechromed content" and "Use chromed content if no dechromed
> content found"
>                      --> Ingests to both "description" and "summary" fields
> in Solr
> e.g.
>>
>> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]}
>> 0 0
>> 15-Aug-2011 2:51:26 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract
>> params={literal.service=Twitter&liter
>>
>> al.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.summary
>>
>> =<em>Campylobacter</em>+bacteria:+<em>Campylobacter</em>+bacteria+are+the+number
>>
>> -one+cause+of+food-related+gastrointestinal+illness...+<a+href%3D"http://t.co/0B
>>
>> k8mTm">http://t.co/0Bk8mTm</a>&literal.id=http://twitter.com/MicrobeWorld/status
>>
>> es/103102842524545025&literal.title=Campylobacter+bacteria:+Campylobacter+bacter
>>
>> ia+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t
>> .co/0Bk8mTm&literal.pubdate=1313416607000} status=0 QTime=0
>> 15-Aug-2011 2:51:31 PM org.apache.solr.update.processor.LogUpdateProcessor
>> finis
>> h
>
>
> case 2)  "No dechromed content" and "Never use chromed content"
>                      --> didn't ingest
>
> case 3)  "Dechromed content, if present, in 'description' field" and "Use
> chromed content if no dechromed content found"
>                      --> didn't ingest
>
> case 4)  "Dechromed content, if present, in 'description' field" and "Never
> use chromed content"
>                      --> Ingests but both "description" and "summary" fields
> ARE EMPTY in Solr
> e.g.
>>
>> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]}
>> 0 0
>> 15-Aug-2011 3:04:02 PM org.apache.solr.update.processor.LogUpdateProcessor
>> finis
>> h
>
>
> I hope that is all to be expected.
>
> Also, there appears to be a little bug in that if "Use chromed content if no
> dechromed content found" is selected, when you go back to edit that job, it
> is not selected (i.e. neither of the bottom two radio buttons are active).
> Should I open a JIRA ticket for that?
>
>
> On Mon, Aug 15, 2011 at 11:49 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> The behavior depends on the setting of the other pair of radio buttons
>> on that tab.  You can select "Use chromed content if not found" or
>> "Never use chromed content".  So, if the feed has no "description"
>> field for the document, and the dechromed content setting is
>> "description field", and the other setting is "Never use chromed
>> content", no document will be indexed.
>>
>> Karl
>>
>>
>> On Mon, Aug 15, 2011 at 12:44 PM, K McGonigal <km...@gmail.com> wrote:
>> > I deleted my twitter RSS job and created another one and now it works!
>> >
>> > Doing some experimentation, I see that when Dechromed Content is set to
>> > "No
>> > dechromed content" it ingests fine, but when set to "if present, in
>> > 'description' field" it doesn't do the ingestion (nothing is added to
>> > Solr).  Is that to be expected?
>> >
>> >
>> > Kate
>> >
>> >
>> > On Mon, Aug 15, 2011 at 10:48 AM, Karl Wright <da...@gmail.com>
>> > wrote:
>> >>
>> >> Regardless of the twitter sign-in issue, I'd still expect the RSS
>> >> connector to index whatever it finds at the redirected page, even if
>> >> it's not very useful stuff.  Could you send me a screen shot of the
>> >> view page for the RSS connection and for the RSS job?  Also, if you
>> >> could delete the job that contains the twitter RSS feed and recreated
>> >> it, then crawl, I'd like to see the simple history for that crawl.
>> >>
>> >> Thanks,
>> >> Karl
>> >>
>> >> On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <km...@gmail.com>
>> >> wrote:
>> >> > Hmm, that's odd the URLs didn't work for you.  I've asked other
>> >> > people
>> >> > here
>> >> > to try them and they had no problems.
>> >> >
>> >> > After your suggestion I tried the web connector (but still with no
>> >> > access
>> >> > credentials) and it did pretty well ingesting the RSS feed, so I
>> >> > might
>> >> > be
>> >> > able to just use that.
>> >> >
>> >> > I'm still mystified as to why the RSS connector couldn't handle it
>> >> > though. I
>> >> > turned on DEBUG logging in Manifold, but that did not show anything
>> >> > unusual.
>> >> >
>> >> > Thanks,
>> >> > Kate
>> >> >
>> >> > On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> When I drop any of these URLs into my browser, I get redirected to a
>> >> >> login screen.  Therefore it looks to me like Twitter does some kind
>> >> >> of
>> >> >> session-based login, tracked with cookies.  That would require
>> >> >> maintenance of session cookies which the RSS connector simply does
>> >> >> not
>> >> >> do, and the coding of a login sequence as well.
>> >> >>
>> >> >> This is not a straightforward feature to add to the RSS connector,
>> >> >> by
>> >> >> any
>> >> >> means.
>> >> >>
>> >> >> The web connector does have support for login sequencing and cookie
>> >> >> session maintenance, and it does know how to chase RSS feeds, so
>> >> >> that
>> >> >> might be an option for you to try.  The problem is that most login
>> >> >> sequences are non-trivial to set up and you will need a lot of
>> >> >> patience and web spelunking skills to get it right.  The
>> >> >> documentation
>> >> >> is of some help but really could use a good example.
>> >> >>
>> >> >>
>> >> >> Hope this helps.
>> >> >> Karl
>> >> >>
>> >> >> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com>
>> >> >> wrote:
>> >> >> > Sorry to bother everyone again but I'm having trouble with an RSS
>> >> >> > connector
>> >> >> > job on a Twitter search. When I try to run a job on
>> >> >> > http://search.twitter.com/search.rss?q=Campylobacter the fetch
>> >> >> > appears
>> >> >> > to
>> >> >> > work OK, but the document ingestion does not occur.
>> >> >> >
>> >> >> > I was wondering if it is just my setup, or could it be the
>> >> >> > redirection
>> >> >> > that
>> >> >> > Twitter does on the links. For instance, a link shown in the RSS
>> >> >> > feed
>> >> >> > as
>> >> >> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393
>> >> >> > redirects
>> >> >> > to
>> >> >> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393
>> >> >> > when
>> >> >> > it
>> >> >> > is
>> >> >> > followed.
>> >> >> >
>> >> >> > Any help is very appreciated.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

That makes sense, but my RSS feed DOES have a "description" field within
the "item" field.

Upon further experimentation with the two sets of dechromed radio buttons, I
found the following.

case 1)  "No dechromed content" and "Use chromed content if no dechromed
content found"
                     --> Ingests to both "description" and "summary" fields
in Solr
e.g.

> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]} 0 0
> 15-Aug-2011 2:51:26 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract params={literal.service=Twitter&literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.summary=<em>Campylobacter</em>+bacteria:+<em>Campylobacter</em>+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+<a+href%3D"http://t.co/0Bk8mTm">http://t.co/0Bk8mTm</a>&literal.id=http://twitter.com/MicrobeWorld/statuses/103102842524545025&literal.title=Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416607000} status=0 QTime=0
> 15-Aug-2011 2:51:31 PM org.apache.solr.update.processor.LogUpdateProcessor finish


case 2)  "No dechromed content" and "Never use chromed content"
                     --> didn't ingest

case 3)  "Dechromed content, if present, in 'description' field" and "Use
chromed content if no dechromed content found"
                     --> didn't ingest

case 4)  "Dechromed content, if present, in 'description' field" and "Never
use chromed content"
                     --> Ingests but both "description" and "summary" fields
ARE EMPTY in Solr
e.g.

> INFO: {add=[http://twitter.com/MicrobeWorld/statuses/103102842524545025]} 0 0
> 15-Aug-2011 3:04:02 PM org.apache.solr.update.processor.LogUpdateProcessor finish


I hope that is all to be expected.
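
One way to read the four cases is as a two-step choice: try the dechromed source first, then fall back to the fetched page only if the chromed setting allows it. The sketch below encodes that reading. It is a guess based on the option labels and on Karl's explanation of the "description" plus "Never use chromed content" combination, not the connector's actual code; the mode strings "none", "description", and "skip" echo values seen in this thread, while "use" and the function name are hypothetical.

```python
def content_to_index(dechromed_mode, chromed_mode, description, page_body):
    """Pick what would be sent to the index for one feed item (assumed logic).

    dechromed_mode: "none" or "description"
    chromed_mode:   "use" (fall back to the fetched page) or "skip" (never)
    Returns the text to index, or None when nothing should be indexed.
    """
    if dechromed_mode == "description" and description:
        return description       # dechromed content found in the feed
    if chromed_mode == "use":
        return page_body         # fall back to the full (chromed) page
    return None                  # nothing acceptable to index

# Invented sample values for one feed item.
description = "Campylobacter bacteria are the number-one cause of ..."
page_body = "<html>full page with chrome</html>"

for dechromed_mode in ("none", "description"):
    for chromed_mode in ("use", "skip"):
        result = content_to_index(dechromed_mode, chromed_mode,
                                  description, page_body)
        print(dechromed_mode, chromed_mode, "->", result)
```

Under this reading case 2 indexes nothing, which matches what I saw. Case 3 would be expected to index the description, so the "didn't ingest" result there may have another cause, such as the document being unchanged since an earlier run.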

Also, there appears to be a little bug in that if "Use chromed content if no
dechromed content found" is selected, when you go back to edit that job, it
is not selected (i.e. neither of the bottom two radio buttons is active).
Should I open a JIRA ticket for that?


On Mon, Aug 15, 2011 at 11:49 AM, Karl Wright <da...@gmail.com> wrote:

> The behavior depends on the setting of the other pair of radio buttons
> on that tab.  You can select "Use chromed content if not found" or
> "Never use chromed content".  So, if the feed has no "description"
> field for the document, and the dechromed content setting is
> "description field", and the other setting is "Never use chromed
> content", no document will be indexed.
>
> Karl
>
>
> On Mon, Aug 15, 2011 at 12:44 PM, K McGonigal <km...@gmail.com> wrote:
> > I deleted my twitter RSS job and created another one and now it works!
> >
> > Doing some experimentation, I see that when Dechromed Content is set to
> "No
> > dechromed content" it ingests fine, but when set to "if present, in
> > 'description' field" it doesn't do the ingestion (nothing is added to
> > Solr).  Is that to be expected?
> >
> >
> > Kate
> >
> >
> > On Mon, Aug 15, 2011 at 10:48 AM, Karl Wright <da...@gmail.com>
> wrote:
> >>
> >> Regardless of the twitter sign-in issue, I'd still expect the RSS
> >> connector to index whatever it finds at the redirected page, even if
> >> it's not very useful stuff.  Could you send me a screen shot of the
> >> view page for the RSS connection and for the RSS job?  Also, if you
> >> could delete the job that contains the twitter RSS feed and recreated
> >> it, then crawl, I'd like to see the simple history for that crawl.
> >>
> >> Thanks,
> >> Karl
> >>
> >> On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <km...@gmail.com>
> wrote:
> >> > Hmm, that's odd the URLs didn't work for you.  I've asked other people
> >> > here
> >> > to try them and they had no problems.
> >> >
> >> > After your suggestion I tried the web connector (but still with no
> >> > access
> >> > credentials) and it did pretty well ingesting the RSS feed, so I might
> >> > be
> >> > able to just use that.
> >> >
> >> > I'm still mystified as to why the RSS connector couldn't handle it
> >> > though. I
> >> > turned on DEBUG logging in Manifold, but that did not show anything
> >> > unusual.
> >> >
> >> > Thanks,
> >> > Kate
> >> >
> >> > On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com>
> wrote:
> >> >>
> >> >> When I drop any of these URLs into my browser, I get redirected to a
> >> >> login screen.  Therefore it looks to me like Twitter does some kind
> of
> >> >> session-based login, tracked with cookies.  That would require
> >> >> maintenance of session cookies which the RSS connector simply does
> not
> >> >> do, and the coding of a login sequence as well.
> >> >>
> >> >> This is not a straightforward feature to add to the RSS connector, by
> >> >> any
> >> >> means.
> >> >>
> >> >> The web connector does have support for login sequencing and cookie
> >> >> session maintenance, and it does know how to chase RSS feeds, so that
> >> >> might be an option for you to try.  The problem is that most login
> >> >> sequences are non-trivial to set up and you will need a lot of
> >> >> patience and web spelunking skills to get it right.  The
> documentation
> >> >> is of some help but really could use a good example.
> >> >>
> >> >>
> >> >> Hope this helps.
> >> >> Karl
> >> >>
> >> >> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com>
> >> >> wrote:
> >> >> > Sorry to bother everyone again but I'm having trouble with an RSS
> >> >> > connector
> >> >> > job on a Twitter search. When I try to run a job on
> >> >> > http://search.twitter.com/search.rss?q=Campylobacter the fetch
> >> >> > appears
> >> >> > to
> >> >> > work OK, but the document ingestion does not occur.
> >> >> >
> >> >> > I was wondering if it is just my setup, or could it be the
> >> >> > redirection
> >> >> > that
> >> >> > Twitter does on the links. For instance, a link shown in the RSS
> feed
> >> >> > as
> >> >> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393
> >> >> > redirects
> >> >> > to
> >> >> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when
> >> >> > it
> >> >> > is
> >> >> > followed.
> >> >> >
> >> >> > Any help is very appreciated.
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

The behavior depends on the setting of the other pair of radio buttons
on that tab.  You can select "Use chromed content if not found" or
"Never use chromed content".  So, if the feed has no "description"
field for the document, and the dechromed content setting is
"description field", and the other setting is "Never use chromed
content", no document will be indexed.
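
A minimal sketch of the decision described above, for the case where the
dechromed content setting is "description field". This is not the RSS
connector's actual code; the function and parameter names are hypothetical,
and treating an empty description the same as a missing one is an assumption.

```python
from typing import Optional


def content_to_index(description: Optional[str],
                     page_body: str,
                     use_chromed_if_missing: bool) -> Optional[str]:
    """Pick the text to index for one feed item (hypothetical helper).

    description            -- the item's "description" field, or None if absent
    page_body              -- the fetched ("chromed") page content
    use_chromed_if_missing -- True for "Use chromed content if not found",
                              False for "Never use chromed content"
    """
    if description:
        # Dechromed content is present, so it is used as-is.
        return description
    if use_chromed_if_missing:
        # No description: fall back to the full fetched page.
        return page_body
    # No description and no fallback allowed: nothing is indexed.
    return None
```

With no description and use_chromed_if_missing=False the function returns
None, which corresponds to the "no document will be indexed" case.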

Karl


On Mon, Aug 15, 2011 at 12:44 PM, K McGonigal <km...@gmail.com> wrote:
> I deleted my twitter RSS job and created another one and now it works!
>
> Doing some experimentation, I see that when Dechromed Content is set to "No
> dechromed content" it ingests fine, but when set to "if present, in
> 'description' field" it doesn't do the ingestion (nothing is added to
> Solr).  Is that to be expected?
>
>
> Kate
>
>
> On Mon, Aug 15, 2011 at 10:48 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> Regardless of the twitter sign-in issue, I'd still expect the RSS
>> connector to index whatever it finds at the redirected page, even if
>> it's not very useful stuff.  Could you send me a screen shot of the
>> view page for the RSS connection and for the RSS job?  Also, if you
>> could delete the job that contains the twitter RSS feed and recreate
>> it, then crawl, I'd like to see the simple history for that crawl.
>>
>> Thanks,
>> Karl
>>
>> On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <km...@gmail.com> wrote:
>> > Hmm, that's odd the URLs didn't work for you.  I've asked other people
>> > here
>> > to try them and they had no problems.
>> >
>> > After your suggestion I tried the web connector (but still with no
>> > access
>> > credentials) and it did pretty well ingesting the RSS feed, so I might
>> > be
>> > able to just use that.
>> >
>> > I'm still mystified as to why the RSS connector couldn't handle it
>> > though. I
>> > turned on DEBUG logging in Manifold, but that did not show anything
>> > unusual.
>> >
>> > Thanks,
>> > Kate
>> >
>> > On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >> When I drop any of these URLs into my browser, I get redirected to a
>> >> login screen.  Therefore it looks to me like Twitter does some kind of
>> >> session-based login, tracked with cookies.  That would require
>> >> maintenance of session cookies which the RSS connector simply does not
>> >> do, and the coding of a login sequence as well.
>> >>
>> >> This is not a straightforward feature to add to the RSS connector, by
>> >> any
>> >> means.
>> >>
>> >> The web connector does have support for login sequencing and cookie
>> >> session maintenance, and it does know how to chase RSS feeds, so that
>> >> might be an option for you to try.  The problem is that most login
>> >> sequences are non-trivial to set up and you will need a lot of
>> >> patience and web spelunking skills to get it right.  The documentation
>> >> is of some help but really could use a good example.
>> >>
>> >>
>> >> Hope this helps.
>> >> Karl
>> >>
>> >> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com>
>> >> wrote:
>> >> > Sorry to bother everyone again but I'm having trouble with an RSS
>> >> > connector
>> >> > job on a Twitter search. When I try to run a job on
>> >> > http://search.twitter.com/search.rss?q=Campylobacter the fetch
>> >> > appears
>> >> > to
>> >> > work OK, but the document ingestion does not occur.
>> >> >
>> >> > I was wondering if it is just my setup, or could it be the
>> >> > redirection
>> >> > that
>> >> > Twitter does on the links. For instance, a link shown in the RSS feed
>> >> > as
>> >> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393
>> >> > redirects
>> >> > to
>> >> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when
>> >> > it
>> >> > is
>> >> > followed.
>> >> >
>> >> > Any help is very appreciated.
>> >> >
>> >> >
>> >> >
>> >
>> >
>
>

Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

I deleted my twitter RSS job and created another one and now it works!

Doing some experimentation, I see that when Dechromed Content is set to "No
dechromed content" it ingests fine, but when set to "if present, in
'description' field" it doesn't do the ingestion (nothing is added to
Solr).  Is that to be expected?


Kate


On Mon, Aug 15, 2011 at 10:48 AM, Karl Wright <da...@gmail.com> wrote:

> Regardless of the twitter sign-in issue, I'd still expect the RSS
> connector to index whatever it finds at the redirected page, even if
> it's not very useful stuff.  Could you send me a screen shot of the
> view page for the RSS connection and for the RSS job?  Also, if you
> could delete the job that contains the twitter RSS feed and recreate
> it, then crawl, I'd like to see the simple history for that crawl.
>
> Thanks,
> Karl
>
> On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <km...@gmail.com> wrote:
> > Hmm, that's odd the URLs didn't work for you.  I've asked other people
> here
> > to try them and they had no problems.
> >
> > After your suggestion I tried the web connector (but still with no access
> > credentials) and it did pretty well ingesting the RSS feed, so I might be
> > able to just use that.
> >
> > I'm still mystified as to why the RSS connector couldn't handle it
> though. I
> > turned on DEBUG logging in Manifold, but that did not show anything
> unusual.
> >
> > Thanks,
> > Kate
> >
> > On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com> wrote:
> >>
> >> When I drop any of these URLs into my browser, I get redirected to a
> >> login screen.  Therefore it looks to me like Twitter does some kind of
> >> session-based login, tracked with cookies.  That would require
> >> maintenance of session cookies which the RSS connector simply does not
> >> do, and the coding of a login sequence as well.
> >>
> >> This is not a straightforward feature to add to the RSS connector, by
> any
> >> means.
> >>
> >> The web connector does have support for login sequencing and cookie
> >> session maintenance, and it does know how to chase RSS feeds, so that
> >> might be an option for you to try.  The problem is that most login
> >> sequences are non-trivial to set up and you will need a lot of
> >> patience and web spelunking skills to get it right.  The documentation
> >> is of some help but really could use a good example.
> >>
> >>
> >> Hope this helps.
> >> Karl
> >>
> >> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com>
> wrote:
> >> > Sorry to bother everyone again but I'm having trouble with an RSS
> >> > connector
> >> > job on a Twitter search. When I try to run a job on
> >> > http://search.twitter.com/search.rss?q=Campylobacter the fetch
> appears
> >> > to
> >> > work OK, but the document ingestion does not occur.
> >> >
> >> > I was wondering if it is just my setup, or could it be the redirection
> >> > that
> >> > Twitter does on the links. For instance, a link shown in the RSS feed
> as
> >> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393 redirects
> >> > to
> >> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when
> it
> >> > is
> >> > followed.
> >> >
> >> > Any help is very appreciated.
> >> >
> >> >
> >> >
> >
> >
>

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

Regardless of the twitter sign-in issue, I'd still expect the RSS
connector to index whatever it finds at the redirected page, even if
it's not very useful stuff.  Could you send me a screen shot of the
view page for the RSS connection and for the RSS job?  Also, if you
could delete the job that contains the twitter RSS feed and recreate
it, then crawl, I'd like to see the simple history for that crawl.
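
One possible reason the redirected page would not hold very useful stuff:
everything after the "#" in the redirect target is a URL fragment, and HTTP
clients do not send the fragment to the server. The sketch below uses only
the Python standard library to split the redirect target from this thread.
Whether the RSS connector handles the redirect this way is an assumption,
not something confirmed here.

```python
from urllib.parse import urldefrag

# Redirect target reported earlier in this thread.
redirect_target = (
    "http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393"
)

# urldefrag separates the part that is requested from the server
# from the fragment, which stays on the client side.
url_sent, fragment = urldefrag(redirect_target)

print(url_sent)   # http://twitter.com/
print(fragment)   # !/VashinkaInuiel/statuses/101493222852923393
```

So a client that does not run the page's JavaScript would only ever request
the site's front page, whatever status ID follows the "#!".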

Thanks,
Karl

On Mon, Aug 15, 2011 at 11:38 AM, K McGonigal <km...@gmail.com> wrote:
> Hmm, that's odd the URLs didn't work for you.  I've asked other people here
> to try them and they had no problems.
>
> After your suggestion I tried the web connector (but still with no access
> credentials) and it did pretty well ingesting the RSS feed, so I might be
> able to just use that.
>
> I'm still mystified as to why the RSS connector couldn't handle it though. I
> turned on DEBUG logging in Manifold, but that did not show anything unusual.
>
> Thanks,
> Kate
>
> On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> When I drop any of these URLs into my browser, I get redirected to a
>> login screen.  Therefore it looks to me like Twitter does some kind of
>> session-based login, tracked with cookies.  That would require
>> maintenance of session cookies which the RSS connector simply does not
>> do, and the coding of a login sequence as well.
>>
>> This is not a straightforward feature to add to the RSS connector, by any
>> means.
>>
>> The web connector does have support for login sequencing and cookie
>> session maintenance, and it does know how to chase RSS feeds, so that
>> might be an option for you to try.  The problem is that most login
>> sequences are non-trivial to set up and you will need a lot of
>> patience and web spelunking skills to get it right.  The documentation
>> is of some help but really could use a good example.
>>
>>
>> Hope this helps.
>> Karl
>>
>> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com> wrote:
>> > Sorry to bother everyone again but I'm having trouble with an RSS
>> > connector
>> > job on a Twitter search. When I try to run a job on
>> > http://search.twitter.com/search.rss?q=Campylobacter the fetch appears
>> > to
>> > work OK, but the document ingestion does not occur.
>> >
>> > I was wondering if it is just my setup, or could it be the redirection
>> > that
>> > Twitter does on the links. For instance, a link shown in the RSS feed as
>> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393 redirects
>> > to
>> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when it
>> > is
>> > followed.
>> >
>> > Any help is very appreciated.
>> >
>> >
>> >
>
>

Re: Trouble indexing a Twitter search in RSS format

Posted by K McGonigal <km...@gmail.com>.

Hmm, that's odd the URLs didn't work for you.  I've asked other people here
to try them and they had no problems.

After your suggestion I tried the web connector (but still with no access
credentials) and it did pretty well ingesting the RSS feed, so I might be
able to just use that.

I'm still mystified as to why the RSS connector couldn't handle it though. I
turned on DEBUG logging in Manifold, but that did not show anything unusual.

Thanks,
Kate

On Fri, Aug 12, 2011 at 3:58 PM, Karl Wright <da...@gmail.com> wrote:

> When I drop any of these URLs into my browser, I get redirected to a
> login screen.  Therefore it looks to me like Twitter does some kind of
> session-based login, tracked with cookies.  That would require
> maintenance of session cookies which the RSS connector simply does not
> do, and the coding of a login sequence as well.
>
> This is not a straightforward feature to add to the RSS connector, by any
> means.
>
> The web connector does have support for login sequencing and cookie
> session maintenance, and it does know how to chase RSS feeds, so that
> might be an option for you to try.  The problem is that most login
> sequences are non-trivial to set up and you will need a lot of
> patience and web spelunking skills to get it right.  The documentation
> is of some help but really could use a good example.
>
>
> Hope this helps.
> Karl
>
> On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com> wrote:
> > Sorry to bother everyone again but I'm having trouble with an RSS
> connector
> > job on a Twitter search. When I try to run a job on
> > http://search.twitter.com/search.rss?q=Campylobacter the fetch appears
> to
> > work OK, but the document ingestion does not occur.
> >
> > I was wondering if it is just my setup, or could it be the redirection
> that
> > Twitter does on the links. For instance, a link shown in the RSS feed as
> > http://twitter.com/VashinkaInuiel/statuses/101493222852923393 redirects
> to
> > http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when it
> is
> > followed.
> >
> > Any help is very appreciated.
> >
> >
> >
>

Re: Trouble indexing a Twitter search in RSS format

Posted by Karl Wright <da...@gmail.com>.

When I drop any of these URLs into my browser, I get redirected to a
login screen.  Therefore it looks to me like Twitter does some kind of
session-based login, tracked with cookies.  That would require
maintenance of session cookies which the RSS connector simply does not
do, and the coding of a login sequence as well.
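
To illustrate what "maintenance of session cookies" means in general, here
is a small sketch using the Python standard library. It is not ManifoldCF
code, and the helper name is hypothetical. It only shows a client that keeps
the cookies a server sets and sends them back on later requests; the login
sequence itself is not covered.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Build a URL opener that remembers cookies between requests.

    Hypothetical helper for illustration: any cookie set by a response
    is stored in the jar and sent back on later requests made through
    the same opener.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


opener, jar = make_session_opener()
# The jar starts empty; it fills up as responses set cookies, e.g. after
# opener.open(...) is called against a site that issues a session cookie.
```

A client without such a jar starts every fetch with no session at all.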

This is not a straightforward feature to add to the RSS connector, by any means.

The web connector does have support for login sequencing and cookie
session maintenance, and it does know how to chase RSS feeds, so that
might be an option for you to try.  The problem is that most login
sequences are non-trivial to set up and you will need a lot of
patience and web spelunking skills to get it right.  The documentation
is of some help but really could use a good example.

Hope this helps.
Karl

On Fri, Aug 12, 2011 at 4:42 PM, K McGonigal <km...@gmail.com> wrote:
> Sorry to bother everyone again but I'm having trouble with an RSS connector
> job on a Twitter search. When I try to run a job on
> http://search.twitter.com/search.rss?q=Campylobacter the fetch appears to
> work OK, but the document ingestion does not occur.
>
> I was wondering if it is just my setup, or could it be the redirection that
> Twitter does on the links. For instance, a link shown in the RSS feed as
> http://twitter.com/VashinkaInuiel/statuses/101493222852923393 redirects to
> http://twitter.com/#!/VashinkaInuiel/statuses/101493222852923393 when it is
> followed.
>
> Any help is very appreciated.
>
>
>