Posted to user@manifoldcf.apache.org by Stephane Gamard <st...@gamard.net> on 2013/06/03 16:13:24 UTC

RSS Connector

Hi all, 

I'm trying to use the RSS connector for the following feed: http://blog.mikemccandless.com/feeds/posts/default

After setting the job up and ingesting documents I have 2 pending questions: 
- why is the connector using the URL as ID instead of the atom ID tag?
- I have no creation and/or modified date in my Solr document; why is that?

Overall I am a bit confused about where the crawler gets its information (chromed vs dechromed). I've downloaded the feed and tried to find its entries in my index but could not (I could only find the pages that are linked from the RSS entries). 

Sorry for the hassle, I'm reading over and over trying to piece it all together.

Cheers, 

_Stephane

Re: RSS Connector

Posted by Karl Wright <da...@gmail.com>.

CONNECTORS-700 has now been resolved.

Karl


On Mon, Jun 3, 2013 at 11:12 AM, Karl Wright <da...@gmail.com> wrote:

> I've created CONNECTORS-700 for the date parsing issue.
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 11:04 AM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Stephane,
>>
>>
>> First, you would not want to select dechromed content from the feed
>> description field if the feed has no description field.  (In that case,
>> the connector by default falls back to using the actual content from the
>> document link.)
>>
>> Second, for this kind of feed, the connector looks for either "published"
>> or "updated" and takes the latter of the two if both are found.  However,
>> the ISO8601 date parser we are using is not happy with any timezone other
>> than Z (zulu) at this time, but your dates have -0400 instead, and that is
>> the problem.  I'll create a ticket to deal with that issue.
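
A minimal sketch of the date handling described above, in Python rather than the connector's own code. It is not the CONNECTORS-700 fix. It only shows one way to accept both "Z" and numeric offsets such as -04:00, and then to prefer "updated" over "published". Reading "the latter of the two" as "updated" is an assumption, and the function names parse_atom_date and pick_entry_date are hypothetical.

```python
from datetime import datetime, timezone


def parse_atom_date(text):
    """Parse an ISO 8601 timestamp carrying either 'Z' or a numeric UTC offset."""
    text = text.strip()
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11,
    # so rewrite it as the equivalent explicit offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def pick_entry_date(published, updated):
    """Return the parsed 'updated' value if present, else 'published', else None.

    Assumption: 'updated' wins when both are present.
    """
    for raw in (updated, published):
        if raw:
            return parse_atom_date(raw)
    return None


# Values copied from the feed excerpt quoted in this thread.
published = "2013-05-21T18:23:00.000-04:00"
updated = "2013-05-21T18:23:06.451-04:00"

chosen = pick_entry_date(published, updated)
# Normalise to UTC so the value can be compared with 'Z'-style dates.
print(chosen.astimezone(timezone.utc).isoformat())
# prints 2013-05-21T22:23:06.451000+00:00
```

Converting to UTC before indexing keeps dates from feeds in different timezones comparable.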
>>
>> Karl
>>
>>
>>
>> On Mon, Jun 3, 2013 at 10:48 AM, Stephane Gamard <st...@gamard.net> wrote:
>>
>>> Hi Karl,
>>>
>>>
>>> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
>>> well used :). As for #2, I am still puzzled about the following. Here's an
>>> excerpt from  the feed xml:
>>>
>>>
>>> <entry>
>>>   <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
>>>   <published>2013-05-21T18:23:00.000-04:00</published>
>>>   <updated>2013-05-21T18:23:06.451-04:00</updated>
>>>   <category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
>>>   <title type="text">Dynamic faceting with Lucene</title>
>>>   <content type="html">Lucene's [...] Happy faceting!</content>
>>>   <link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default" title="Post Comments"/>
>>>   <link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form" title="0 Comments"/>
>>>   <link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>>   <link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>>   <link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html" title="Dynamic faceting with Lucene"/>
>>>   <author>
>>>     <name>Michael McCandless</name>
>>>     <uri>https://plus.google.com/112759599082866346694</uri>
>>>     <email>noreply@blogger.com</email>
>>>     <gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
>>>   </author>
>>>   <thr:total>0</thr:total>
>>> </entry>
>>>
>>>
>>> Below is the document once ingested in Solr (searched with query:
>>> http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
>>> Note that I use a catch all field (<dynamicField name="*"  type="string"
>>>  indexed="true"  multiValued="true" stored="true" omitNorms="true"/>) to
>>> save all submitted fields.
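
For reference, the lookup URL above can be rebuilt from the document id by percent-encoding it. This is only a sketch in Python. The helper name solr_id_query_url is hypothetical, and the host, port and "lucene" core path are simply the ones from the query quoted above.

```python
from urllib.parse import quote


def solr_id_query_url(select_url, doc_id):
    """Build a Solr select URL that looks up a single document by its id."""
    # safe="" forces ':' and '/' in the id to be percent-encoded as well.
    encoded_id = quote(doc_id, safe="")
    return f"{select_url}?q=id:{encoded_id}&fl=*"


url = solr_id_query_url(
    "http://localhost:8983/lucene/select",
    "http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html",
)
print(url)
```

The printed URL should be the same as the query quoted above.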
>>>
>>>
>>> There are two things that I don't understand:
>>>
>>> - I've selected the option "Dechromed content, if present, in
>>> 'description' field" and yet I have no description field
>>>
>>> - I have no pubDate or publication date field available
>>>
>>>
>>> Here's the attached Solr output:
>>>
>>>
>>> <response>
>>> <lst name="responseHeader">
>>> <int name="status">0</int>
>>> <int name="QTime">1</int>
>>> <lst name="params">
>>> <str name="fl">*</str>
>>> <str name="q">
>>> id:
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> </lst>
>>> </lst>
>>> <result name="response" numFound="1" start="0">
>>> <doc>
>>> <arr name="link">
>>> <str>http://blog.mikemccandless.com/favicon.ico</str>
>>> <str>icon</str>
>>> <str>image/x-icon</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <str>canonical</str>
>>> <str>alternate</str>
>>> <str>application/atom+xml</str>
>>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>>> <str>alternate</str>
>>> <str>application/rss+xml</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/posts/default?alt=rss
>>> </str>
>>> <str>service.post</str>
>>> <str>application/atom+xml</str>
>>> <str>
>>> http://www.blogger.com/feeds/8623074010562846957/posts/default
>>> </str>
>>> <str>EditURI</str>
>>> <str>application/rsd+xml</str>
>>> <str>
>>> http://www.blogger.com/rsd.g?blogID=8623074010562846957
>>> </str>
>>> <str>alternate</str>
>>> <str>application/atom+xml</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>>> </str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>publisher</str>
>>> <str>text/css</str>
>>> <str>stylesheet</str>
>>> <str>
>>> //www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
>>> </str>
>>> <str>text/css</str>
>>> <str>stylesheet</str>
>>> <str>
>>> //
>>> www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
>>> </str>
>>> </arr>
>>> <arr name="meta">
>>> <str>viewport</str>
>>> <str>width=1100</str>
>>> <str>stream_source_info</str>
>>> <str>docname</str>
>>> <str>stream_content_type</str>
>>> <str>text/html; charset=UTF-8</str>
>>> <str>stream_size</str>
>>> <str>80779</str>
>>> <str>Content-Encoding</str>
>>> <str>UTF-8</str>
>>> <str>stream_name</str>
>>> <str>docname</str>
>>> <str>generator</str>
>>> <str>blogger</str>
>>> <str>MSSmartTagsPreventParsing</str>
>>> <str>true</str>
>>> <str>Content-Type</str>
>>> <str>text/html; charset=UTF-8</str>
>>> <str>resourceName</str>
>>> <str>docname</str>
>>> <str>dc:title</str>
>>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="false">
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/</str>
>>> <str>rect</str>
>>> <str>6579597884362535238</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>http://jirasearch.mikemccandless.com</str>
>>> <str>rect</str>
>>> <str>
>>> http://www.elasticsearch.org/guide/reference/api/search/facets/
>>> </str>
>>> <str>rect</str>
>>> <str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
>>> <str>rect</str>
>>> <str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
>>> <str>rect</str>
>>> <str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
>>> <str>rect</str>
>>> <str>http://en.wikipedia.org/wiki/Interval_tree</str>
>>> <str>rect</str>
>>> <str>http://jirasearch.mikemccandless.com</str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>author</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <str>bookmark</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
>>> </str>
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/search/label/Lucene</str>
>>> <str>tag</str>
>>> <str>rect</str>
>>> <str>comments</str>
>>> <str>rect</str>
>>> <str>comment-form</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
>>> </str>
>>> <str>rect</str>
>>> <str>links</str>
>>> <str>rect</str>
>>> <str/>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>>> </str>
>>> <str>application/atom+xml</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>author</str>
>>> <str>rect</str>
>>> <str>https://plus.google.com/112759599082866346694</str>
>>> <str>author</str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://affiliate.manning.com/idevaffiliate.php?id=1171_147
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013_02_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2013_01_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_12_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_08_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_07_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_04_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_03_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2012_01_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_10_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_06_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_04_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_03_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_02_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2011_01_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_12_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_10_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_08_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_07_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_06_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_05_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_04_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_03_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2010_02_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>>
>>> http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_12_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_11_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_10_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_09_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_08_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>javascript:void(0)</str>
>>> <str>rect</str>
>>> <str>
>>> http://blog.mikemccandless.com/2009_07_01_archive.html
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
>>> </str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
>>> </str>
>>> <str>rect</str>
>>> <str>http://www.blogger.com</str>
>>> <str>rect</str>
>>> <str>
>>> //
>>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
>>> </str>
>>> </arr>
>>> <arr name="img">
>>> <str/>
>>> <str>13</str>
>>> <str>http://img1.blogblog.com/img/icon18_email.gif</str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img2.blogblog.com/img/icon18_edit_allbkg.gif
>>> </str>
>>> <str>18</str>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>>> </str>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str/>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>>> </str>
>>> <str/>
>>> <str>
>>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>>> </str>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str/>
>>> <str>
>>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>>> </str>
>>> <str/>
>>> <str/>
>>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str>My Photo</str>
>>> <str>80</str>
>>> <str>
>>> //
>>> lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
>>> </str>
>>> <str>80</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>187</str>
>>> <str>
>>>
>>> http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
>>> </str>
>>> <str>150</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> <str/>
>>> <str>18</str>
>>> <str>
>>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>>> </str>
>>> <str>18</str>
>>> </arr>
>>> <arr name="iframe">
>>> <str>0</str>
>>> <str>auto</str>
>>> <str>410</str>
>>> <str>comment-editor</str>
>>> <str/>
>>> <str>100%</str>
>>> </arr>
>>> <str name="filename">docname</str>
>>> <str name="mimetype">text/html; charset=UTF-8</str>
>>> <arr name="source">
>>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>>> </arr>
>>> <arr name="category">
>>> <str>Lucene</str>
>>> </arr>
>>> <str name="id">
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> <arr name="source_type">
>>> <str>rss</str>
>>> </arr>
>>> <arr name="title">
>>> <str>Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="title_search">
>>> <str>Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="viewport">
>>> <str>width=1100</str>
>>> </arr>
>>> <arr name="stream_source_info">
>>> <str>docname</str>
>>> </arr>
>>> <arr name="stream_content_type">
>>> <str>text/html; charset=UTF-8</str>
>>> </arr>
>>> <arr name="stream_size">
>>> <str>80779</str>
>>> </arr>
>>> <arr name="content_encoding">
>>> <str>UTF-8</str>
>>> </arr>
>>> <arr name="stream_name">
>>> <str>docname</str>
>>> </arr>
>>> <arr name="generator">
>>> <str>blogger</str>
>>> </arr>
>>> <arr name="mssmarttagspreventparsing">
>>> <str>true</str>
>>> </arr>
>>> <arr name="content_type">
>>> <str>text/html; charset=UTF-8</str>
>>> </arr>
>>> <arr name="resourcename">
>>> <str>docname</str>
>>> </arr>
>>> <arr name="dc_title">
>>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>>> </arr>
>>> <arr name="content">
>>> <str>
>>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>>> great improvements recently: sizable (nearly 4X) speedups and new features
>>> like DrillSideways . The Jira issues search example showcases a number of
>>> facet features. Here I'll describe two recently committed facet features:
>>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>>> faceting, coming in the next (4.4) release. To understand these features,
>>> and why they are important, we first need a little background. Lucene's
>>> facet module does most of its work at indexing time: for each indexed
>>> document, it examines every facet label, each of which may be hierarchical,
>>> and maps each unique label in the hierarchy to an integer id, and then
>>> encodes all ids into a binary doc values field. A separate taxonomy index
>>> stores this mapping, and ensures that, even across segments, the same label
>>> gets the same id. At search time, faceting cost is minimal: for each
>>> matched document, we visit all integer ids and aggregate counts in an
>>> array, summarizing the results in the end, for example as top N facet
>>> labels by count. This is in contrast to purely dynamic faceting
>>> implementations like ElasticSearch 's and Solr 's, which do all work at
>>> search time. Such approaches are more flexible: you need not do anything
>>> special during indexing, and for every query you can pick and choose
>>> exactly which facets to compute. However, the price for that flexibility is
>>> slower searching, as each search must do more work for every matched
>>> document. Furthermore, the impact on near-real-time reopen latency can be
>>> horribly costly if top-level data-structures, such as Solr's
>>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>>> by the facet module means no extra work needs to be done on each
>>> near-real-time reopen. Enough background, now on to our two new features!
>>> Sorted-set doc-values faceting These features bring two dynamic
>>> alternatives to the facet module, both computing facet counts from
>>> previously indexed doc-values fields. The first feature, sorted-set
>>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>>> normal sorted-set doc-values field, for example: doc.add(new
>>> SortedSetDocValuesField("foo")); doc.add(new
>>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>>> This feature does not use the taxonomy index, since all state is stored in
>>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>>> top-level data-structure is recomputed to map per-segment integer ordinals
>>> to global ordinals. The good news is this should be relatively low cost
>>> since it's just merge-sorting already sorted terms, and it doesn't need to
>>> visit the documents (unlike UnInvertedField). At search time there is also
>>> a small performance hit (~25%, depending on the query) since each
>>> per-segment ord must be re-mapped to the global ord space. Likely this
>>> could be improved (no time was spend optimizing). Furthermore, this feature
>>> currently only works with non-hierarchical facet fields, though this should
>>> be fixable (patches welcome!). Dynamic range faceting The second new
>>> feature, dynamic range faceting, works on top of a numeric doc-values field
>>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>>> You create a RangeFacetRequest, providing custom ranges with their labels.
>>> Each matched document is checked against all ranges and the count is
>>> incremented when there is a match. The range-test is a naive simple linear
>>> search, which is probably OK since there are usually only a few ranges, but
>>> we could eventually upgrade this to an interval tree to get better
>>> performance (patches welcome!). Likewise, this new feature does not use the
>>> taxonomy index, only a numeric doc-values field. This feature is especially
>>> useful with time-based fields. You can see it in action in the Jira issues
>>> search example in the Updated field. Happy faceting! Posted by Michael
>>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>>> Atom Comments About Me Michael McCandless Michael loves building software;
>>> he's been building search engines for more than a decade. In 1999 he
>>> co-founded iPhrase Technologies, a startup providing a user-centric
>>> enterprise search application, written primarily in Python and C. After IBM
>>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>>> committer in 2006 and PMC member in 2008. Michael has remained an active
>>> committer, helping to push Lucene to new places in recent years. He's
>>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>>> enjoys building his own computers, writing software to control his house
>>> (mostly in Python), encoding videos and tinkering with all sorts of other
>>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>>> template. Powered by Blogger .
>>> </str>
>>> </arr>
>>> <arr name="content_search">
>>> <str>
>>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>>> great improvements recently: sizable (nearly 4X) speedups and new features
>>> like DrillSideways . The Jira issues search example showcases a number of
>>> facet features. Here I'll describe two recently committed facet features:
>>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>>> faceting, coming in the next (4.4) release. To understand these features,
>>> and why they are important, we first need a little background. Lucene's
>>> facet module does most of its work at indexing time: for each indexed
>>> document, it examines every facet label, each of which may be hierarchical,
>>> and maps each unique label in the hierarchy to an integer id, and then
>>> encodes all ids into a binary doc values field. A separate taxonomy index
>>> stores this mapping, and ensures that, even across segments, the same label
>>> gets the same id. At search time, faceting cost is minimal: for each
>>> matched document, we visit all integer ids and aggregate counts in an
>>> array, summarizing the results in the end, for example as top N facet
>>> labels by count. This is in contrast to purely dynamic faceting
>>> implementations like ElasticSearch 's and Solr 's, which do all work at
>>> search time. Such approaches are more flexible: you need not do anything
>>> special during indexing, and for every query you can pick and choose
>>> exactly which facets to compute. However, the price for that flexibility is
>>> slower searching, as each search must do more work for every matched
>>> document. Furthermore, the impact on near-real-time reopen latency can be
>>> horribly costly if top-level data-structures, such as Solr's
>>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>>> by the facet module means no extra work needs to be done on each
>>> near-real-time reopen. Enough background, now on to our two new features!
>>> Sorted-set doc-values faceting These features bring two dynamic
>>> alternatives to the facet module, both computing facet counts from
>>> previously indexed doc-values fields. The first feature, sorted-set
>>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>>> normal sorted-set doc-values field, for example: doc.add(new
>>> SortedSetDocValuesField("foo")); doc.add(new
>>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>>> This feature does not use the taxonomy index, since all state is stored in
>>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>>> top-level data-structure is recomputed to map per-segment integer ordinals
>>> to global ordinals. The good news is this should be relatively low cost
>>> since it's just merge-sorting already sorted terms, and it doesn't need to
>>> visit the documents (unlike UnInvertedField). At search time there is also
>>> a small performance hit (~25%, depending on the query) since each
>>> per-segment ord must be re-mapped to the global ord space. Likely this
>>> could be improved (no time was spend optimizing). Furthermore, this feature
>>> currently only works with non-hierarchical facet fields, though this should
>>> be fixable (patches welcome!). Dynamic range faceting The second new
>>> feature, dynamic range faceting, works on top of a numeric doc-values field
>>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>>> You create a RangeFacetRequest, providing custom ranges with their labels.
>>> Each matched document is checked against all ranges and the count is
>>> incremented when there is a match. The range-test is a naive simple linear
>>> search, which is probably OK since there are usually only a few ranges, but
>>> we could eventually upgrade this to an interval tree to get better
>>> performance (patches welcome!). Likewise, this new feature does not use the
>>> taxonomy index, only a numeric doc-values field. This feature is especially
>>> useful with time-based fields. You can see it in action in the Jira issues
>>> search example in the Updated field. Happy faceting! Posted by Michael
>>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>>> Atom Comments About Me Michael McCandless Michael loves building software;
>>> he's been building search engines for more than a decade. In 1999 he
>>> co-founded iPhrase Technologies, a startup providing a user-centric
>>> enterprise search application, written primarily in Python and C. After IBM
>>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>>> committer in 2006 and PMC member in 2008. Michael has remained an active
>>> committer, helping to push Lucene to new places in recent years. He's
>>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>>> enjoys building his own computers, writing software to control his house
>>> (mostly in Python), encoding videos and tinkering with all sorts of other
>>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>>> template. Powered by Blogger .
>>> </str>
>>> </arr>
>>> <arr name="language">
>>> <str>en</str>
>>> </arr>
>>> <arr name="url">
>>> <str>
>>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>>> </str>
>>> </arr>
>>> <arr name="snippet">
>>> <str>
>>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>>> great improvements recently: sizable (nearly 4X) speedups and new features
>>> like DrillSideways ....At search time, faceting cost is minimal: for each
>>> matched document, we visit all integer ids and aggregate counts in an
>>> array, summarizing the results in the end, for example as top N facet
>>> labels by count....The range-test is a naive simple linear search, which is
>>> probably OK since there are usually only a few ranges, but we could
>>> eventually upgrade this to an interval tree to get better performance
>>> (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No
>>> comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom)
>>> Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
>>> McCandless Michael loves building software; he's been building search
>>> engines for more than a decade....View my complete profile Blog Archive ▼
>>> 2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with
>>> Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►
>>> November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April
>>> (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3)
>>> ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►
>>> February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►
>>> October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May
>>> (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1)
>>> ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5)
>>> Followers Follow by Email Simple template.
>>> </str>
>>> </arr>
>>> <arr name="host">
>>> <str>blog.mikemccandless.com</str>
>>> </arr>
>>> <arr name="path">
>>> <str>/2013/05/dynamic-faceting-with-lucene.html</str>
>>> </arr>
>>> <long name="_version_">1436832383182569472</long>
>>> </doc>
>>> </result>
>>> </response>
>>>
>>>
>>>
>>> I can see there are published and updated elements in the feed, and yet
>>> neither of the corresponding fields (pubDate or publications) is present
>>> in the Solr document.
>>>
>>>
>>> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
>>> well used :). As for #2, I am still puzzled about the following. Here's an
>>> excerpt from  the feed xml:
>>>
>>> On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
>>>
>>> Hi Stephane,
>>>
>>> (1) ManifoldCF always uses the URL of a document as the primary ID when
>>> it indexes it.  This is the standard treatment and has been since Day 1.
>>>
>>> (2) For the "creation date" attribute, the RSS connector uses the date
>>> in the feed, if there is one.  This is a date in ISO format, and comes out
>>> as the metadata value "pubdateiso".  There is also an attribute called
>>> "pubdate", in milliseconds since epoch, which is either the date in the
>>> feed (if present) or, if not, the date the document was fetched.
>>>
>>> As for your other question, "chromed" data comes from the URLs
>>> referenced by the items in the feed, and "dechromed" data comes from either
>>> the content or description field that's actually in the feed, whichever you
>>> specify.
>>>
>>> All of this is described in the end-user-documentation, although I do
>>> notice that "pubdateiso" is missing from the metadata listed.
>>>
>>>
>>> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>>>
>>> Karl
>>>
>>>
>>>
>>> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <st...@gamard.net> wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>>
>>>> I'm trying to use the RSS connector for the following feed:
>>>> http://blog.mikemccandless.com/feeds/posts/default
>>>>
>>>> After setting the job up and ingesting documents I have 2 pending
>>>> questions:
>>>> - why is the connector using the URL as ID instead of the atom ID tag?
>>>> - I have no creation and/or modified date in my Solr document; why is
>>>> that?
>>>>
>>>> Overall I am a bit confused as to where the crawler gets its
>>>> information (chromed vs. dechromed). I've downloaded the feed and tried to
>>>> find its entries in my index but could not do so (I could only find
>>>> pages which are linked from the RSS entry).
>>>>
>>>> Sorry for the hassle, I'm reading over and over trying to piece it all
>>>> together.
>>>>
>>>> Cheers,
>>>>
>>>> _Stephane
>>>>
>>>
>>>
>>
>

Re: RSS Connector

Posted by Karl Wright <da...@gmail.com>.

I've created CONNECTORS-700 for the date parsing issue.

Karl
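
Note: the date parsing issue described in the quoted message below can be illustrated with a short Java sketch. It is not the ManifoldCF connector code and makes no claim about how CONNECTORS-700 was fixed. It only assumes the standard java.time API and shows that timestamps with a numeric offset, as in the feed's "published" and "updated" elements, can be parsed alongside "Z" timestamps. The class and method names (AtomDateSketch, parseIso, laterOf) are hypothetical, and the sketch reads "the latter of the two" as "the later timestamp", which is an assumption. The two printed values correspond loosely to the "pubdate" (milliseconds since epoch) and "pubdateiso" forms mentioned in the thread.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

// Hypothetical illustration only; not taken from the ManifoldCF source.
final class AtomDateSketch {

    // Parse an ISO 8601 timestamp carrying either "Z" or a numeric
    // offset such as "-04:00", and normalize it to a UTC instant.
    static Instant parseIso(String text) {
        return OffsetDateTime.parse(text.trim()).toInstant();
    }

    // Return the later of the two timestamps. Either argument may be
    // null when the corresponding feed element is absent.
    static Instant laterOf(String published, String updated) {
        Instant p = (published == null) ? null : parseIso(published);
        Instant u = (updated == null) ? null : parseIso(updated);
        if (p == null) {
            return u;
        }
        if (u == null) {
            return p;
        }
        return u.isAfter(p) ? u : p;
    }

    public static void main(String[] args) {
        // Values copied from the feed excerpt quoted in this thread.
        Instant chosen = laterOf("2013-05-21T18:23:00.000-04:00",
                                 "2013-05-21T18:23:06.451-04:00");
        System.out.println(chosen.toEpochMilli()); // milliseconds since epoch
        System.out.println(chosen);                // ISO 8601 in UTC
    }
}
```

Normalizing to an Instant means the original offset is discarded; a caller that needs to keep the feed's local offset would hold on to the OffsetDateTime instead.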



On Mon, Jun 3, 2013 at 11:04 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Stephane,
>
>
> First, you would not want to select dechromed content from the feed
> description field if there is no feed description field.  (In that case, by
> default the connector falls back to using the actual content from the
> document link.)
>
> Second, for this kind of feed, the connector looks for either "published"
> or "updated" and takes the latter of the two if both are found.  However,
> the ISO8601 date parser we are using is not happy with any timezone other
> than Z (zulu) at this time, but your dates have -0400 instead, and that is
> the problem.  I'll create a ticket to deal with that issue.
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 10:48 AM, Stephane Gamard <st...@gamard.net> wrote:
>
>> Hi Karl,
>>
>>
>> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
>> well used :). As for #2, I am still puzzled about the following. Here's an
>> excerpt from  the feed xml:
>>
>>
>> <entry>
>>   <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
>>   <published>2013-05-21T18:23:00.000-04:00</published>
>>   <updated>2013-05-21T18:23:06.451-04:00</updated>
>>   <category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
>>   <title type="text">Dynamic faceting with Lucene</title>
>>   <content type="html">Lucene's [...] Happy faceting!</content>
>>   <link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default" title="Post Comments"/>
>>   <link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form" title="0 Comments"/>
>>   <link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>   <link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
>>   <link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html" title="Dynamic faceting with Lucene"/>
>>   <author>
>>     <name>Michael McCandless</name>
>>     <uri>https://plus.google.com/112759599082866346694</uri>
>>     <email>noreply@blogger.com</email>
>>     <gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
>>   </author>
>>   <thr:total>0</thr:total>
>> </entry>
>>
>>
>> Below is the document once ingested in Solr (searched with query:
>> http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
>> Note that I use a catch all field (<dynamicField name="*"  type="string"
>>  indexed="true"  multiValued="true" stored="true" omitNorms="true"/>) to
>> save all submitted fields.
>>
>>
>> There are two things that I don't understand:
>>
>> - I've selected the option "Dechromed content, if present, in
>> 'description' field" and yet I have no description field
>>
>> - I have no pubDate or publications field available
>>
>>
>> Here's the attached Solr output:
>>
>>
>> <response>
>> <lst name="responseHeader">
>> <int name="status">0</int>
>> <int name="QTime">1</int>
>> <lst name="params">
>> <str name="fl">*</str>
>> <str name="q">
>> id:
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> </lst>
>> </lst>
>> <result name="response" numFound="1" start="0">
>> <doc>
>> <arr name="link">
>> <str>http://blog.mikemccandless.com/favicon.ico</str>
>> <str>icon</str>
>> <str>image/x-icon</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <str>canonical</str>
>> <str>alternate</str>
>> <str>application/atom+xml</str>
>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>> <str>alternate</str>
>> <str>application/rss+xml</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/posts/default?alt=rss
>> </str>
>> <str>service.post</str>
>> <str>application/atom+xml</str>
>> <str>
>> http://www.blogger.com/feeds/8623074010562846957/posts/default
>> </str>
>> <str>EditURI</str>
>> <str>application/rsd+xml</str>
>> <str>
>> http://www.blogger.com/rsd.g?blogID=8623074010562846957
>> </str>
>> <str>alternate</str>
>> <str>application/atom+xml</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>> </str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>publisher</str>
>> <str>text/css</str>
>> <str>stylesheet</str>
>> <str>
>> //www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
>> </str>
>> <str>text/css</str>
>> <str>stylesheet</str>
>> <str>
>> //
>> www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
>> </str>
>> </arr>
>> <arr name="meta">
>> <str>viewport</str>
>> <str>width=1100</str>
>> <str>stream_source_info</str>
>> <str>docname</str>
>> <str>stream_content_type</str>
>> <str>text/html; charset=UTF-8</str>
>> <str>stream_size</str>
>> <str>80779</str>
>> <str>Content-Encoding</str>
>> <str>UTF-8</str>
>> <str>stream_name</str>
>> <str>docname</str>
>> <str>generator</str>
>> <str>blogger</str>
>> <str>MSSmartTagsPreventParsing</str>
>> <str>true</str>
>> <str>Content-Type</str>
>> <str>text/html; charset=UTF-8</str>
>> <str>resourceName</str>
>> <str>docname</str>
>> <str>dc:title</str>
>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="false">
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/</str>
>> <str>rect</str>
>> <str>6579597884362535238</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>http://jirasearch.mikemccandless.com</str>
>> <str>rect</str>
>> <str>
>> http://www.elasticsearch.org/guide/reference/api/search/facets/
>> </str>
>> <str>rect</str>
>> <str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
>> <str>rect</str>
>> <str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
>> <str>rect</str>
>> <str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
>> <str>rect</str>
>> <str>http://en.wikipedia.org/wiki/Interval_tree</str>
>> <str>rect</str>
>> <str>http://jirasearch.mikemccandless.com</str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>author</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <str>bookmark</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
>> </str>
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/search/label/Lucene</str>
>> <str>tag</str>
>> <str>rect</str>
>> <str>comments</str>
>> <str>rect</str>
>> <str>comment-form</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
>> </str>
>> <str>rect</str>
>> <str>links</str>
>> <str>rect</str>
>> <str/>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>> </str>
>> <str>application/atom+xml</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>> <str>rect</str>
>> <str>
>>
>> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>>
>> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
>> </str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>author</str>
>> <str>rect</str>
>> <str>https://plus.google.com/112759599082866346694</str>
>> <str>author</str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>
>> http://affiliate.manning.com/idevaffiliate.php?id=1171_147
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013_02_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2013_01_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_12_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_08_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_07_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_04_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_03_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2012_01_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_10_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_06_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_04_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_03_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_02_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2011_01_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_12_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_10_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_08_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_07_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_06_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_05_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_04_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_03_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2010_02_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>>
>> http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_12_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_11_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_10_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_09_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_08_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>javascript:void(0)</str>
>> <str>rect</str>
>> <str>
>> http://blog.mikemccandless.com/2009_07_01_archive.html
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
>> </str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
>> </str>
>> <str>rect</str>
>> <str>http://www.blogger.com</str>
>> <str>rect</str>
>> <str>
>> //
>> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
>> </str>
>> </arr>
>> <arr name="img">
>> <str/>
>> <str>13</str>
>> <str>http://img1.blogblog.com/img/icon18_email.gif</str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img2.blogblog.com/img/icon18_edit_allbkg.gif
>> </str>
>> <str>18</str>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>> </str>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str/>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
>> </str>
>> <str/>
>> <str>
>> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
>> </str>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str/>
>> <str>
>> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
>> </str>
>> <str/>
>> <str/>
>> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str>My Photo</str>
>> <str>80</str>
>> <str>
>> //
>> lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
>> </str>
>> <str>80</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>187</str>
>> <str>
>>
>> http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
>> </str>
>> <str>150</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> <str/>
>> <str>18</str>
>> <str>
>> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
>> </str>
>> <str>18</str>
>> </arr>
>> <arr name="iframe">
>> <str>0</str>
>> <str>auto</str>
>> <str>410</str>
>> <str>comment-editor</str>
>> <str/>
>> <str>100%</str>
>> </arr>
>> <str name="filename">docname</str>
>> <str name="mimetype">text/html; charset=UTF-8</str>
>> <arr name="source">
>> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
>> </arr>
>> <arr name="category">
>> <str>Lucene</str>
>> </arr>
>> <str name="id">
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> <arr name="source_type">
>> <str>rss</str>
>> </arr>
>> <arr name="title">
>> <str>Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="title_search">
>> <str>Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="viewport">
>> <str>width=1100</str>
>> </arr>
>> <arr name="stream_source_info">
>> <str>docname</str>
>> </arr>
>> <arr name="stream_content_type">
>> <str>text/html; charset=UTF-8</str>
>> </arr>
>> <arr name="stream_size">
>> <str>80779</str>
>> </arr>
>> <arr name="content_encoding">
>> <str>UTF-8</str>
>> </arr>
>> <arr name="stream_name">
>> <str>docname</str>
>> </arr>
>> <arr name="generator">
>> <str>blogger</str>
>> </arr>
>> <arr name="mssmarttagspreventparsing">
>> <str>true</str>
>> </arr>
>> <arr name="content_type">
>> <str>text/html; charset=UTF-8</str>
>> </arr>
>> <arr name="resourcename">
>> <str>docname</str>
>> </arr>
>> <arr name="dc_title">
>> <str>Changing Bits: Dynamic faceting with Lucene</str>
>> </arr>
>> <arr name="content">
>> <str>
>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>> great improvements recently: sizable (nearly 4X) speedups and new features
>> like DrillSideways . The Jira issues search example showcases a number of
>> facet features. Here I'll describe two recently committed facet features:
>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>> faceting, coming in the next (4.4) release. To understand these features,
>> and why they are important, we first need a little background. Lucene's
>> facet module does most of its work at indexing time: for each indexed
>> document, it examines every facet label, each of which may be hierarchical,
>> and maps each unique label in the hierarchy to an integer id, and then
>> encodes all ids into a binary doc values field. A separate taxonomy index
>> stores this mapping, and ensures that, even across segments, the same label
>> gets the same id. At search time, faceting cost is minimal: for each
>> matched document, we visit all integer ids and aggregate counts in an
>> array, summarizing the results in the end, for example as top N facet
>> labels by count. This is in contrast to purely dynamic faceting
>> implementations like ElasticSearch 's and Solr 's, which do all work at
>> search time. Such approaches are more flexible: you need not do anything
>> special during indexing, and for every query you can pick and choose
>> exactly which facets to compute. However, the price for that flexibility is
>> slower searching, as each search must do more work for every matched
>> document. Furthermore, the impact on near-real-time reopen latency can be
>> horribly costly if top-level data-structures, such as Solr's
>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>> by the facet module means no extra work needs to be done on each
>> near-real-time reopen. Enough background, now on to our two new features!
>> Sorted-set doc-values faceting These features bring two dynamic
>> alternatives to the facet module, both computing facet counts from
>> previously indexed doc-values fields. The first feature, sorted-set
>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>> normal sorted-set doc-values field, for example: doc.add(new
>> SortedSetDocValuesField("foo")); doc.add(new
>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>> This feature does not use the taxonomy index, since all state is stored in
>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>> top-level data-structure is recomputed to map per-segment integer ordinals
>> to global ordinals. The good news is this should be relatively low cost
>> since it's just merge-sorting already sorted terms, and it doesn't need to
>> visit the documents (unlike UnInvertedField). At search time there is also
>> a small performance hit (~25%, depending on the query) since each
>> per-segment ord must be re-mapped to the global ord space. Likely this
>> could be improved (no time was spend optimizing). Furthermore, this feature
>> currently only works with non-hierarchical facet fields, though this should
>> be fixable (patches welcome!). Dynamic range faceting The second new
>> feature, dynamic range faceting, works on top of a numeric doc-values field
>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>> You create a RangeFacetRequest, providing custom ranges with their labels.
>> Each matched document is checked against all ranges and the count is
>> incremented when there is a match. The range-test is a naive simple linear
>> search, which is probably OK since there are usually only a few ranges, but
>> we could eventually upgrade this to an interval tree to get better
>> performance (patches welcome!). Likewise, this new feature does not use the
>> taxonomy index, only a numeric doc-values field. This feature is especially
>> useful with time-based fields. You can see it in action in the Jira issues
>> search example in the Updated field. Happy faceting! Posted by Michael
>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>> Atom Comments About Me Michael McCandless Michael loves building software;
>> he's been building search engines for more than a decade. In 1999 he
>> co-founded iPhrase Technologies, a startup providing a user-centric
>> enterprise search application, written primarily in Python and C. After IBM
>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>> committer in 2006 and PMC member in 2008. Michael has remained an active
>> committer, helping to push Lucene to new places in recent years. He's
>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>> enjoys building his own computers, writing software to control his house
>> (mostly in Python), encoding videos and tinkering with all sorts of other
>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>> template. Powered by Blogger .
>> </str>
>> </arr>
>> <arr name="content_search">
>> <str>
>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>> great improvements recently: sizable (nearly 4X) speedups and new features
>> like DrillSideways . The Jira issues search example showcases a number of
>> facet features. Here I'll describe two recently committed facet features:
>> sorted-set doc-values faceting, already available in 4.3, and dynamic range
>> faceting, coming in the next (4.4) release. To understand these features,
>> and why they are important, we first need a little background. Lucene's
>> facet module does most of its work at indexing time: for each indexed
>> document, it examines every facet label, each of which may be hierarchical,
>> and maps each unique label in the hierarchy to an integer id, and then
>> encodes all ids into a binary doc values field. A separate taxonomy index
>> stores this mapping, and ensures that, even across segments, the same label
>> gets the same id. At search time, faceting cost is minimal: for each
>> matched document, we visit all integer ids and aggregate counts in an
>> array, summarizing the results in the end, for example as top N facet
>> labels by count. This is in contrast to purely dynamic faceting
>> implementations like ElasticSearch 's and Solr 's, which do all work at
>> search time. Such approaches are more flexible: you need not do anything
>> special during indexing, and for every query you can pick and choose
>> exactly which facets to compute. However, the price for that flexibility is
>> slower searching, as each search must do more work for every matched
>> document. Furthermore, the impact on near-real-time reopen latency can be
>> horribly costly if top-level data-structures, such as Solr's
>> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
>> by the facet module means no extra work needs to be done on each
>> near-real-time reopen. Enough background, now on to our two new features!
>> Sorted-set doc-values faceting These features bring two dynamic
>> alternatives to the facet module, both computing facet counts from
>> previously indexed doc-values fields. The first feature, sorted-set
>> doc-values faceting (see LUCENE-4795 ), allows the application to index a
>> normal sorted-set doc-values field, for example: doc.add(new
>> SortedSetDocValuesField("foo")); doc.add(new
>> SortedSetDocValuesField("bar")); and then to compute facet counts at search
>> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
>> This feature does not use the taxonomy index, since all state is stored in
>> the doc-values, but the tradeoff is that on each near-real-time reopen, a
>> top-level data-structure is recomputed to map per-segment integer ordinals
>> to global ordinals. The good news is this should be relatively low cost
>> since it's just merge-sorting already sorted terms, and it doesn't need to
>> visit the documents (unlike UnInvertedField). At search time there is also
>> a small performance hit (~25%, depending on the query) since each
>> per-segment ord must be re-mapped to the global ord space. Likely this
>> could be improved (no time was spend optimizing). Furthermore, this feature
>> currently only works with non-hierarchical facet fields, though this should
>> be fixable (patches welcome!). Dynamic range faceting The second new
>> feature, dynamic range faceting, works on top of a numeric doc-values field
>> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
>> You create a RangeFacetRequest, providing custom ranges with their labels.
>> Each matched document is checked against all ranges and the count is
>> incremented when there is a match. The range-test is a naive simple linear
>> search, which is probably OK since there are usually only a few ranges, but
>> we could eventually upgrade this to an interval tree to get better
>> performance (patches welcome!). Likewise, this new feature does not use the
>> taxonomy index, only a numeric doc-values field. This feature is especially
>> useful with time-based fields. You can see it in action in the Jira issues
>> search example in the Updated field. Happy faceting! Posted by Michael
>> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
>> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
>> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
>> Atom Comments About Me Michael McCandless Michael loves building software;
>> he's been building search engines for more than a decade. In 1999 he
>> co-founded iPhrase Technologies, a startup providing a user-centric
>> enterprise search application, written primarily in Python and C. After IBM
>> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
>> committer in 2006 and PMC member in 2008. Michael has remained an active
>> committer, helping to push Lucene to new places in recent years. He's
>> co-author of Lucene in Action, 2nd edition. In his spare time Michael
>> enjoys building his own computers, writing software to control his house
>> (mostly in Python), encoding videos and tinkering with all sorts of other
>> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
>> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
>> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
>> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
>> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
>> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
>> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
>> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
>> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
>> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
>> template. Powered by Blogger .
>> </str>
>> </arr>
>> <arr name="language">
>> <str>en</str>
>> </arr>
>> <arr name="url">
>> <str>
>> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
>> </str>
>> </arr>
>> <arr name="snippet">
>> <str>
>> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May
>> 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some
>> great improvements recently: sizable (nearly 4X) speedups and new features
>> like DrillSideways ....At search time, faceting cost is minimal: for each
>> matched document, we visit all integer ids and aggregate counts in an
>> array, summarizing the results in the end, for example as top N facet
>> labels by count....The range-test is a naive simple linear search, which is
>> probably OK since there are usually only a few ranges, but we could
>> eventually upgrade this to an interval tree to get better performance
>> (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No
>> comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom)
>> Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
>> McCandless Michael loves building software; he's been building search
>> engines for more than a decade....View my complete profile Blog Archive ▼
>> 2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with
>> Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►
>> November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April
>> (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3)
>> ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►
>> February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►
>> October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May
>> (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1)
>> ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5)
>> Followers Follow by Email Simple template.
>> </str>
>> </arr>
>> <arr name="host">
>> <str>blog.mikemccandless.com</str>
>> </arr>
>> <arr name="path">
>> <str>/2013/05/dynamic-faceting-with-lucene.html</str>
>> </arr>
>> <long name="_version_">1436832383182569472</long>
>> </doc>
>> </result>
>> </response>
>>
>>
>>
>> I can see there are published and updated markup, and yet none of those
>> fields (pubDate or publications) are present in the solr document.
>>
>>
>> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
>> well used :). As for #2, I am still puzzled about the following. Here's an
>> excerpt from  the feed xml:
>>
>> On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
>>
>> Hi Stephane,
>>
>> (1) ManifoldCF always uses the URL of a document as the primary ID when
>> it indexes it.  This is the standard treatment and has been since Day 1.
>>
>> (2) For the "creation date" attribute, the RSS connector uses the date in
>> the feed, if there is one.  This is a date in ISO format, and comes out as
>> the metadata value "pubdateiso".  There is also an attribute called
>> "pubdate", which is in milliseconds since epoch, which is EITHER the date
>> in the feed (if present), or if not it's the date the document is fetched.
>>
>> As for your other question, "chromed" data comes from the URLs referenced
>> by the items in the feed, and "dechromed" data comes from either the
>> content or description field that's actually in the feed, whichever you
>> specify.
>>
>> All of this is described in the end-user-documentation, although I do
>> notice that "pubdateiso" is missing from the metadata listed.
>>
>>
>> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>>
>> Karl
>>
>>
>>
>> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <st...@gamard.net> wrote:
>>
>>>
>>> Hi all,
>>>
>>>
>>> I'm trying to use the RSS connector for the following feed:
>>> http://blog.mikemccandless.com/feeds/posts/default
>>>
>>> After setting the job up and ingesting documents I have 2 pending
>>> questions:
>>> - why is the connector using the URL as ID instead of the atom ID tag?
>>> - I have no creation and/or modified date in my Solr document, how is it
>>> so?
>>>
>>> Overall I am a bit confused as to where the crawler gets its
>>> information (chromed vs. dechromed). I've downloaded the feed and tried to
>>> find the entries in my index but could not (I could only find
>>> pages which are linked from the RSS entry).
>>>
>>> Sorry for the hassle, I'm reading over and over trying to piece it all
>>> together.
>>>
>>> Cheers,
>>>
>>> _Stephane
>>>
>>
>>
>

Re: RSS Connector

Posted by Karl Wright <da...@gmail.com>.

Hi Stephane,


First, you would not want to select dechromed content from the feed
description field if the feed has no description field.  (In that case, by
default the connector falls back to using the actual content from the
document link.)
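
The fallback described above can be sketched roughly as follows. This is an
illustration only, not the connector's actual code: the function name
choose_content_source, the trimmed entry, and the choice of the
rel="alternate" link as "the document link" are assumptions made for the
example.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by the Blogger feed.
ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the entry quoted in this thread (no <description> element).
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Dynamic faceting with Lucene</title>
  <content type="html">Lucene's [...] Happy faceting!</content>
  <link rel="replies" type="text/html"
        href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form"/>
  <link rel="alternate" type="text/html"
        href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html"/>
</entry>"""


def choose_content_source(entry_xml, dechromed_field):
    """Return ("feed", text) if the requested field exists, else ("link", url)."""
    entry = ET.fromstring(entry_xml)
    field = entry.find(ATOM + dechromed_field)
    if field is not None and (field.text or "").strip():
        # Dechromed content is available directly in the feed.
        return ("feed", field.text.strip())
    # Hypothetical fallback: use the page behind the rel="alternate" link,
    # i.e. the "chromed" document.
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "alternate":
            return ("link", link.get("href"))
    return ("none", None)


if __name__ == "__main__":
    print(choose_content_source(ENTRY, "description"))
    print(choose_content_source(ENTRY, "content"))
```

With this entry, asking for "description" ends up at the linked page, which
would explain why the indexed document contains the whole blog page rather
than just the post text; asking for "content" returns the text from the feed.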

Second, for this kind of feed, the connector looks for either "published"
or "updated" and takes the latter of the two if both are found.  However,
the ISO 8601 date parser we are using currently rejects any timezone other
than Z (Zulu), and your dates carry a -04:00 offset instead; that is the
problem.  I'll create a ticket to deal with that issue.
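
For illustration, here is a minimal sketch of offset-aware parsing of the
dates in this feed. It is not the connector's parser: the helper names
(parse_iso8601, to_epoch_millis, pick_date) are made up for the example,
treating offset-less dates as UTC is an assumption, and pick_date assumes
that "the latter of the two" means the more recent timestamp.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso8601(value):
    """Parse an ISO 8601 timestamp with either a Z suffix or a +/-hh:mm offset."""
    # Older Python versions reject a trailing "Z", so normalize it first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        # Assumption: a timestamp without any offset is treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def to_epoch_millis(dt):
    """Milliseconds since the epoch, using integer arithmetic."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def pick_date(published, updated):
    """Return the more recent of the two timestamps, or None if both are missing."""
    candidates = [parse_iso8601(v) for v in (published, updated) if v]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    chosen = pick_date("2013-05-21T18:23:00.000-04:00",
                       "2013-05-21T18:23:06.451-04:00")
    print(chosen.isoformat(), to_epoch_millis(chosen))
```

The point of the sketch is that a -04:00 offset and the equivalent Z
timestamp should map to the same instant, so the millisecond value does not
depend on how the feed writes its timezone.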

Karl



On Mon, Jun 3, 2013 at 10:48 AM, Stephane Gamard <st...@gamard.net> wrote:

> Hi Karl,
>
>
> Thank you for the prompt reply. Agreed on #1, url is perfectly fine and
> well used :). As for #2, I am still puzzled about the following. Here's an
> excerpt from  the feed xml:
>
>
> <entry>
>
> <id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
>
> <published>2013-05-21T18:23:00.000-04:00</published>
>
> <updated>2013-05-21T18:23:06.451-04:00</updated>
>
> <category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
>
> <title type="text">Dynamic faceting with Lucene</title>
>
> <content type="html">Lucene's [...] Happy faceting!</content>
>
> <link rel="replies" type="application/atom+xml" href="
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default"
> title="Post Comments"/>
>
> <link rel="replies" type="text/html" href="
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form"
> title="0 Comments"/>
>
> <link rel="edit" type="application/atom+xml" href="
> http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238
> "/>
>
> <link rel="self" type="application/atom+xml" href="
> http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238
> "/>
>
> <link rel="alternate" type="text/html" href="
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html"
> title="Dynamic faceting with Lucene"/>
>
> <author>
>
> <name>Michael McCandless</name>
>
> <uri>https://plus.google.com/112759599082866346694</uri>
>
> <email>noreply@blogger.com</email>
>
> <gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32"
> height="32" src="//
> lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
> "/>
>
> </author>
>
> <thr:total>0</thr:total>
>
> </entry>
>
>
> Below is the document once ingested in Solr (searched with query:
> http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*).
> Note that I use a catch all field (<dynamicField name="*"  type="string"
>  indexed="true"  multiValued="true" stored="true" omitNorms="true"/>) to
> save all submitted fields.
>
>
> There are two things that I don't understand:
>
> - I've selected the option "Dechromed content, if present, in
> 'description' field"  and yet I have no description field
>
> - I have no pubDate or publications field available
>
>
> Here's the attached Solr output:
>
>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> <lst name="params">
> <str name="fl">*</str>
> <str name="q">
> id:
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> </lst>
> </lst>
> <result name="response" numFound="1" start="0">
> <doc>
> <arr name="link">
> <str>http://blog.mikemccandless.com/favicon.ico</str>
> <str>icon</str>
> <str>image/x-icon</str>
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <str>canonical</str>
> <str>alternate</str>
> <str>application/atom+xml</str>
> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
> <str>alternate</str>
> <str>application/rss+xml</str>
> <str>
> http://blog.mikemccandless.com/feeds/posts/default?alt=rss
> </str>
> <str>service.post</str>
> <str>application/atom+xml</str>
> <str>
> http://www.blogger.com/feeds/8623074010562846957/posts/default
> </str>
> <str>EditURI</str>
> <str>application/rsd+xml</str>
> <str>
> http://www.blogger.com/rsd.g?blogID=8623074010562846957
> </str>
> <str>alternate</str>
> <str>application/atom+xml</str>
> <str>
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
> </str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>publisher</str>
> <str>text/css</str>
> <str>stylesheet</str>
> <str>
> //www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
> </str>
> <str>text/css</str>
> <str>stylesheet</str>
> <str>
> //
> www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
> </str>
> </arr>
> <arr name="meta">
> <str>viewport</str>
> <str>width=1100</str>
> <str>stream_source_info</str>
> <str>docname</str>
> <str>stream_content_type</str>
> <str>text/html; charset=UTF-8</str>
> <str>stream_size</str>
> <str>80779</str>
> <str>Content-Encoding</str>
> <str>UTF-8</str>
> <str>stream_name</str>
> <str>docname</str>
> <str>generator</str>
> <str>blogger</str>
> <str>MSSmartTagsPreventParsing</str>
> <str>true</str>
> <str>Content-Type</str>
> <str>text/html; charset=UTF-8</str>
> <str>resourceName</str>
> <str>docname</str>
> <str>dc:title</str>
> <str>Changing Bits: Dynamic faceting with Lucene</str>
> </arr>
> <arr name="false">
> <str>rect</str>
> <str>http://blog.mikemccandless.com/</str>
> <str>rect</str>
> <str>6579597884362535238</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
> </str>
> <str>rect</str>
> <str>http://jirasearch.mikemccandless.com</str>
> <str>rect</str>
> <str>
> http://www.elasticsearch.org/guide/reference/api/search/facets/
> </str>
> <str>rect</str>
> <str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
> <str>rect</str>
> <str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
> <str>rect</str>
> <str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
> <str>rect</str>
> <str>http://en.wikipedia.org/wiki/Interval_tree</str>
> <str>rect</str>
> <str>http://jirasearch.mikemccandless.com</str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>author</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <str>bookmark</str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
> </str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
> </str>
> <str>rect</str>
> <str>http://blog.mikemccandless.com/search/label/Lucene</str>
> <str>tag</str>
> <str>rect</str>
> <str>comments</str>
> <str>rect</str>
> <str>comment-form</str>
> <str>rect</str>
> <str>
>
> http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
> </str>
> <str>rect</str>
> <str>links</str>
> <str>rect</str>
> <str/>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
> </str>
> <str>rect</str>
> <str>http://blog.mikemccandless.com/</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
> </str>
> <str>application/atom+xml</str>
> <str>rect</str>
> <str>
>
> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
> </str>
> <str>rect</str>
> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
> <str>rect</str>
> <str>
>
> http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
> </str>
> <str>rect</str>
> <str>
>
> http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
> </str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
> </str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>author</str>
> <str>rect</str>
> <str>https://plus.google.com/112759599082866346694</str>
> <str>author</str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>
> http://affiliate.manning.com/idevaffiliate.php?id=1171_147
> </str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013_05_01_archive.html
> </str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013_02_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2013_01_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_12_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_08_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_07_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_05_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_04_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_03_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2012_01_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_10_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_06_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_05_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_04_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_03_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_02_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2011_01_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_12_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_10_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_08_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_07_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_06_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_05_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_04_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_03_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2010_02_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
>
> http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_12_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_11_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_10_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_09_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_08_01_archive.html
> </str>
> <str>rect</str>
> <str>javascript:void(0)</str>
> <str>rect</str>
> <str>
> http://blog.mikemccandless.com/2009_07_01_archive.html
> </str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
> </str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
> </str>
> <str>rect</str>
> <str>http://www.blogger.com</str>
> <str>rect</str>
> <str>
> //
> www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
> </str>
> </arr>
> <arr name="img">
> <str/>
> <str>13</str>
> <str>http://img1.blogblog.com/img/icon18_email.gif</str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img2.blogblog.com/img/icon18_edit_allbkg.gif
> </str>
> <str>18</str>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
> </str>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str/>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
> </str>
> <str/>
> <str>
> http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
> </str>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str/>
> <str>
> http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
> </str>
> <str/>
> <str/>
> <str>http://img1.blogblog.com/img/icon_feed12.png</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str>My Photo</str>
> <str>80</str>
> <str>
> //
> lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
> </str>
> <str>80</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>187</str>
> <str>
>
> http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
> </str>
> <str>150</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> <str/>
> <str>18</str>
> <str>
> http://img1.blogblog.com/img/icon18_wrench_allbkg.png
> </str>
> <str>18</str>
> </arr>
> <arr name="iframe">
> <str>0</str>
> <str>auto</str>
> <str>410</str>
> <str>comment-editor</str>
> <str/>
> <str>100%</str>
> </arr>
> <str name="filename">docname</str>
> <str name="mimetype">text/html; charset=UTF-8</str>
> <arr name="source">
> <str>http://blog.mikemccandless.com/feeds/posts/default</str>
> </arr>
> <arr name="category">
> <str>Lucene</str>
> </arr>
> <str name="id">
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> <arr name="source_type">
> <str>rss</str>
> </arr>
> <arr name="title">
> <str>Dynamic faceting with Lucene</str>
> </arr>
> <arr name="title_search">
> <str>Dynamic faceting with Lucene</str>
> </arr>
> <arr name="viewport">
> <str>width=1100</str>
> </arr>
> <arr name="stream_source_info">
> <str>docname</str>
> </arr>
> <arr name="stream_content_type">
> <str>text/html; charset=UTF-8</str>
> </arr>
> <arr name="stream_size">
> <str>80779</str>
> </arr>
> <arr name="content_encoding">
> <str>UTF-8</str>
> </arr>
> <arr name="stream_name">
> <str>docname</str>
> </arr>
> <arr name="generator">
> <str>blogger</str>
> </arr>
> <arr name="mssmarttagspreventparsing">
> <str>true</str>
> </arr>
> <arr name="content_type">
> <str>text/html; charset=UTF-8</str>
> </arr>
> <arr name="resourcename">
> <str>docname</str>
> </arr>
> <arr name="dc_title">
> <str>Changing Bits: Dynamic faceting with Lucene</str>
> </arr>
> <arr name="content">
> <str>
> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21,
> 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great
> improvements recently: sizable (nearly 4X) speedups and new features like
> DrillSideways . The Jira issues search example showcases a number of facet
> features. Here I'll describe two recently committed facet features:
> sorted-set doc-values faceting, already available in 4.3, and dynamic range
> faceting, coming in the next (4.4) release. To understand these features,
> and why they are important, we first need a little background. Lucene's
> facet module does most of its work at indexing time: for each indexed
> document, it examines every facet label, each of which may be hierarchical,
> and maps each unique label in the hierarchy to an integer id, and then
> encodes all ids into a binary doc values field. A separate taxonomy index
> stores this mapping, and ensures that, even across segments, the same label
> gets the same id. At search time, faceting cost is minimal: for each
> matched document, we visit all integer ids and aggregate counts in an
> array, summarizing the results in the end, for example as top N facet
> labels by count. This is in contrast to purely dynamic faceting
> implementations like ElasticSearch 's and Solr 's, which do all work at
> search time. Such approaches are more flexible: you need not do anything
> special during indexing, and for every query you can pick and choose
> exactly which facets to compute. However, the price for that flexibility is
> slower searching, as each search must do more work for every matched
> document. Furthermore, the impact on near-real-time reopen latency can be
> horribly costly if top-level data-structures, such as Solr's
> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
> by the facet module means no extra work needs to be done on each
> near-real-time reopen. Enough background, now on to our two new features!
> Sorted-set doc-values faceting These features bring two dynamic
> alternatives to the facet module, both computing facet counts from
> previously indexed doc-values fields. The first feature, sorted-set
> doc-values faceting (see LUCENE-4795 ), allows the application to index a
> normal sorted-set doc-values field, for example: doc.add(new
> SortedSetDocValuesField("foo")); doc.add(new
> SortedSetDocValuesField("bar")); and then to compute facet counts at search
> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
> This feature does not use the taxonomy index, since all state is stored in
> the doc-values, but the tradeoff is that on each near-real-time reopen, a
> top-level data-structure is recomputed to map per-segment integer ordinals
> to global ordinals. The good news is this should be relatively low cost
> since it's just merge-sorting already sorted terms, and it doesn't need to
> visit the documents (unlike UnInvertedField). At search time there is also
> a small performance hit (~25%, depending on the query) since each
> per-segment ord must be re-mapped to the global ord space. Likely this
> could be improved (no time was spend optimizing). Furthermore, this feature
> currently only works with non-hierarchical facet fields, though this should
> be fixable (patches welcome!). Dynamic range faceting The second new
> feature, dynamic range faceting, works on top of a numeric doc-values field
> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
> You create a RangeFacetRequest, providing custom ranges with their labels.
> Each matched document is checked against all ranges and the count is
> incremented when there is a match. The range-test is a naive simple linear
> search, which is probably OK since there are usually only a few ranges, but
> we could eventually upgrade this to an interval tree to get better
> performance (patches welcome!). Likewise, this new feature does not use the
> taxonomy index, only a numeric doc-values field. This feature is especially
> useful with time-based fields. You can see it in action in the Jira issues
> search example in the Updated field. Happy faceting! Posted by Michael
> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
> Atom Comments About Me Michael McCandless Michael loves building software;
> he's been building search engines for more than a decade. In 1999 he
> co-founded iPhrase Technologies, a startup providing a user-centric
> enterprise search application, written primarily in Python and C. After IBM
> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
> committer in 2006 and PMC member in 2008. Michael has remained an active
> committer, helping to push Lucene to new places in recent years. He's
> co-author of Lucene in Action, 2nd edition. In his spare time Michael
> enjoys building his own computers, writing software to control his house
> (mostly in Python), encoding videos and tinkering with all sorts of other
> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
> template. Powered by Blogger .
> </str>
> </arr>
> <arr name="content_search">
> <str>
> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21,
> 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great
> improvements recently: sizable (nearly 4X) speedups and new features like
> DrillSideways . The Jira issues search example showcases a number of facet
> features. Here I'll describe two recently committed facet features:
> sorted-set doc-values faceting, already available in 4.3, and dynamic range
> faceting, coming in the next (4.4) release. To understand these features,
> and why they are important, we first need a little background. Lucene's
> facet module does most of its work at indexing time: for each indexed
> document, it examines every facet label, each of which may be hierarchical,
> and maps each unique label in the hierarchy to an integer id, and then
> encodes all ids into a binary doc values field. A separate taxonomy index
> stores this mapping, and ensures that, even across segments, the same label
> gets the same id. At search time, faceting cost is minimal: for each
> matched document, we visit all integer ids and aggregate counts in an
> array, summarizing the results in the end, for example as top N facet
> labels by count. This is in contrast to purely dynamic faceting
> implementations like ElasticSearch 's and Solr 's, which do all work at
> search time. Such approaches are more flexible: you need not do anything
> special during indexing, and for every query you can pick and choose
> exactly which facets to compute. However, the price for that flexibility is
> slower searching, as each search must do more work for every matched
> document. Furthermore, the impact on near-real-time reopen latency can be
> horribly costly if top-level data-structures, such as Solr's
> UnInvertedField, must be rebuilt on every reopen. The taxonomy index used
> by the facet module means no extra work needs to be done on each
> near-real-time reopen. Enough background, now on to our two new features!
> Sorted-set doc-values faceting These features bring two dynamic
> alternatives to the facet module, both computing facet counts from
> previously indexed doc-values fields. The first feature, sorted-set
> doc-values faceting (see LUCENE-4795 ), allows the application to index a
> normal sorted-set doc-values field, for example: doc.add(new
> SortedSetDocValuesField("foo")); doc.add(new
> SortedSetDocValuesField("bar")); and then to compute facet counts at search
> time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
> This feature does not use the taxonomy index, since all state is stored in
> the doc-values, but the tradeoff is that on each near-real-time reopen, a
> top-level data-structure is recomputed to map per-segment integer ordinals
> to global ordinals. The good news is this should be relatively low cost
> since it's just merge-sorting already sorted terms, and it doesn't need to
> visit the documents (unlike UnInvertedField). At search time there is also
> a small performance hit (~25%, depending on the query) since each
> per-segment ord must be re-mapped to the global ord space. Likely this
> could be improved (no time was spend optimizing). Furthermore, this feature
> currently only works with non-hierarchical facet fields, though this should
> be fixable (patches welcome!). Dynamic range faceting The second new
> feature, dynamic range faceting, works on top of a numeric doc-values field
> (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges.
> You create a RangeFacetRequest, providing custom ranges with their labels.
> Each matched document is checked against all ranges and the count is
> incremented when there is a match. The range-test is a naive simple linear
> search, which is probably OK since there are usually only a few ranges, but
> we could eventually upgrade this to an interval tree to get better
> performance (patches welcome!). Likewise, this new feature does not use the
> taxonomy index, only a numeric doc-values field. This feature is especially
> useful with time-based fields. You can see it in action in the Jira issues
> search example in the Updated field. Happy faceting! Posted by Michael
> McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to
> Facebook Labels: Lucene No comments: Post a Comment Older Post Home
> Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments
> Atom Comments About Me Michael McCandless Michael loves building software;
> he's been building search engines for more than a decade. In 1999 he
> co-founded iPhrase Technologies, a startup providing a user-centric
> enterprise search application, written primarily in Python and C. After IBM
> acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a
> committer in 2006 and PMC member in 2008. Michael has remained an active
> committer, helping to push Lucene to new places in recent years. He's
> co-author of Lucene in Action, 2nd edition. In his spare time Michael
> enjoys building his own computers, writing software to control his house
> (mostly in Python), encoding videos and tinkering with all sorts of other
> things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2)
> Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►
> January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1)
> ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January
> (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June
> (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►
> 2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4)
> ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1)
> ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1)
> ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple
> template. Powered by Blogger .
> </str>
> </arr>
> <arr name="language">
> <str>en</str>
> </arr>
> <arr name="url">
> <str>
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
> </str>
> </arr>
> <arr name="snippet">
> <str>
> Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21,
> 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great
> improvements recently: sizable (nearly 4X) speedups and new features like
> DrillSideways ....At search time, faceting cost is minimal: for each
> matched document, we visit all integer ids and aggregate counts in an
> array, summarizing the results in the end, for example as top N facet
> labels by count....The range-test is a naive simple linear search, which is
> probably OK since there are usually only a few ranges, but we could
> eventually upgrade this to an interval tree to get better performance
> (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No
> comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom)
> Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael
> McCandless Michael loves building software; he's been building search
> engines for more than a decade....View my complete profile Blog Archive ▼
> 2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with
> Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►
> November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April
> (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3)
> ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►
> February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►
> October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May
> (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1)
> ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5)
> Followers Follow by Email Simple template.
> </str>
> </arr>
> <arr name="host">
> <str>blog.mikemccandless.com</str>
> </arr>
> <arr name="path">
> <str>/2013/05/dynamic-faceting-with-lucene.html</str>
> </arr>
> <long name="_version_">1436832383182569472</long>
> </doc>
> </result>
> </response>
>
>
>
> I can see published and updated elements in the feed, and yet neither of
> the fields (pubDate or publications) is present in the Solr document.
>
>
> Thank you for the prompt reply. Agreed on #1: the URL is perfectly fine
> and well used :). As for #2, I am still puzzled. Here's an excerpt from
> the feed XML:
>
> On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
>
> Hi Stephane,
>
> (1) ManifoldCF always uses the URL of a document as the primary ID when it
> indexes it.  This is the standard treatment and has been since Day 1.
>
> (2) For the "creation date" attribute, the RSS connector uses the date in
> the feed, if there is one.  This is a date in ISO format, and comes out as
> the metadata value "pubdateiso".  There is also an attribute called
> "pubdate", in milliseconds since epoch, which is EITHER the date in the
> feed (if present) or, if not, the date the document was fetched.
>
> As for your other question, "chromed" data comes from the URLs referenced
> by the items in the feed, and "dechromed" data comes from either the
> content or description field that's actually in the feed, whichever you
> specify.
>
> All of this is described in the end-user-documentation, although I do
> notice that "pubdateiso" is missing from the metadata listed.
>
>
> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <st...@gamard.net> wrote:
>
>>
>> Hi all,
>>
>>
>> I'm trying to use the RSS connector for the following feed:
>> http://blog.mikemccandless.com/feeds/posts/default
>>
>> After setting the job up and ingesting documents I have 2 pending
>> questions:
>> - why is the connector using the URL as ID instead of the atom ID tag?
>> - I have no creation and/or modified date in my Solr document; why is
>> that?
>>
>> Overall I am a bit confused as to where the crawler gets its
>> information (chromed vs dechromed). I've downloaded the feed and tried to
>> find its entries in my index but could not do so (I could only find
>> pages which are linked from the RSS entries).
>>
>> Sorry for the hassle, I'm reading over and over trying to piece it all
>> together.
>>
>> Cheers,
>>
>> _Stephane
>>
>
>

Re: RSS Connector

Posted by Stephane Gamard <st...@gamard.net>.

Hi Karl, 

Thank you for the prompt reply. Agreed on #1: the URL is perfectly fine and well used :). As for #2, I am still puzzled. Here's an excerpt from the feed XML:



	<entry>
		<id>tag:blogger.com,1999:blog-8623074010562846957.post-6579597884362535238</id>
		<published>2013-05-21T18:23:00.000-04:00</published>
		<updated>2013-05-21T18:23:06.451-04:00</updated>
		<category scheme="http://www.blogger.com/atom/ns#" term="Lucene"/>
		<title type="text">Dynamic faceting with Lucene</title>
		<content type="html">Lucene's [...] Happy faceting!</content>
		<link rel="replies" type="application/atom+xml" href="http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default" title="Post Comments"/>
		<link rel="replies" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html#comment-form" title="0 Comments"/>
		<link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
		<link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8623074010562846957/posts/default/6579597884362535238"/>
		<link rel="alternate" type="text/html" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html" title="Dynamic faceting with Lucene"/>
		<author>
			<name>Michael McCandless</name>
			<uri>https://plus.google.com/112759599082866346694</uri>
			<email>noreply@blogger.com</email>
			<gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg"/>
		</author>
		<thr:total>0</thr:total>
	</entry>
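
For reference, here is a minimal sketch of how the published and updated timestamps in an entry like the one above could be reduced to a single date, expressed both as milliseconds since epoch and as a UTC ISO string (the two forms this thread calls "pubdate" and "pubdateiso"). It uses Python's standard library and is not ManifoldCF's own Java parser; the function names and the trimmed-down SAMPLE_ENTRY are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Trimmed-down copy of the entry above, with the Atom namespace declared.
SAMPLE_ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom">
  <published>2013-05-21T18:23:00.000-04:00</published>
  <updated>2013-05-21T18:23:06.451-04:00</updated>
  <title type="text">Dynamic faceting with Lucene</title>
</entry>
"""


def parse_atom_date(text):
    """Parse an ISO 8601 timestamp that ends in a numeric offset or in Z."""
    text = text.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing Z in fromisoformat().
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def entry_date(entry_xml):
    """Return the later of <published> and <updated>, or None if both are absent."""
    entry = ET.fromstring(entry_xml)
    dates = []
    for tag in ("published", "updated"):
        node = entry.find("atom:" + tag, ATOM_NS)
        if node is not None and node.text:
            dates.append(parse_atom_date(node.text))
    return max(dates) if dates else None


def to_epoch_millis(dt):
    """Milliseconds since 1970-01-01T00:00:00Z (dt must carry a UTC offset)."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def to_iso_utc(dt):
    """Render dt in UTC with millisecond precision and a trailing Z."""
    utc = dt.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % (utc.microsecond // 1000)


if __name__ == "__main__":
    when = entry_date(SAMPLE_ENTRY)
    print(to_epoch_millis(when), to_iso_utc(when))
```

The -04:00 offset is normalised to UTC before either value is produced, so the same instant written with a Z suffix yields identical output.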


Below is the document once ingested in Solr (searched with the query http://localhost:8983/lucene/select?q=id:http%3A%2F%2Fblog.mikemccandless.com%2F2013%2F05%2Fdynamic-faceting-with-lucene.html&fl=*). Note that I use a catch-all field (<dynamicField name="*"  type="string"  indexed="true"  multiValued="true" stored="true" omitNorms="true"/>) to save all submitted fields.
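
A lookup like the one above can also be built programmatically. The following is a minimal sketch, assuming Python's standard library; solr_select_url and query_params are hypothetical helper names, and wrapping the id in double quotes is a precaution I am assuming here so that the ':' and '/' characters in the URL are read as literal text by the query parser, not something this thread requires.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base_url, doc_id):
    """Build a select URL that looks up one document by its id field."""
    # Escape backslashes and quotes, then wrap the id in double quotes.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": 'id:"%s"' % escaped, "fl": "*"}
    # urlencode percent-encodes ':', '/', '"' and '*' in the parameter values.
    return base_url + "?" + urlencode(params)


def query_params(url):
    """Decode the query string again, to check what Solr would receive."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    url = solr_select_url(
        "http://localhost:8983/lucene/select",
        "http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html",
    )
    print(url)
    print(query_params(url))
```

Decoding the URL again with query_params is a quick way to confirm that the id survived the encoding unchanged.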

There are two things I don't understand: 
- I've selected the option "Dechromed content, if present, in 'description' field", and yet I have no description field
- I have no pubDate or publications field available

Here's the attached Solr output:


<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*</str>
<str name="q">
id:http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="link">
<str>http://blog.mikemccandless.com/favicon.ico</str>
<str>icon</str>
<str>image/x-icon</str>
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<str>canonical</str>
<str>alternate</str>
<str>application/atom+xml</str>
<str>http://blog.mikemccandless.com/feeds/posts/default</str>
<str>alternate</str>
<str>application/rss+xml</str>
<str>
http://blog.mikemccandless.com/feeds/posts/default?alt=rss
</str>
<str>service.post</str>
<str>application/atom+xml</str>
<str>
http://www.blogger.com/feeds/8623074010562846957/posts/default
</str>
<str>EditURI</str>
<str>application/rsd+xml</str>
<str>
http://www.blogger.com/rsd.g?blogID=8623074010562846957
</str>
<str>alternate</str>
<str>application/atom+xml</str>
<str>
http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>publisher</str>
<str>text/css</str>
<str>stylesheet</str>
<str>
//www.blogger.com/static/v1/widgets/2159474849-widget_css_2_bundle.css
</str>
<str>text/css</str>
<str>stylesheet</str>
<str>
//www.blogger.com/dyn-css/authorization.css?targetBlogID=8623074010562846957&zx=93c35911-ffbb-4abb-ba82-d88c30b4b1b8
</str>
</arr>
<arr name="meta">
<str>viewport</str>
<str>width=1100</str>
<str>stream_source_info</str>
<str>docname</str>
<str>stream_content_type</str>
<str>text/html; charset=UTF-8</str>
<str>stream_size</str>
<str>80779</str>
<str>Content-Encoding</str>
<str>UTF-8</str>
<str>stream_name</str>
<str>docname</str>
<str>generator</str>
<str>blogger</str>
<str>MSSmartTagsPreventParsing</str>
<str>true</str>
<str>Content-Type</str>
<str>text/html; charset=UTF-8</str>
<str>resourceName</str>
<str>docname</str>
<str>dc:title</str>
<str>Changing Bits: Dynamic faceting with Lucene</str>
</arr>
<arr name="false">
<str>rect</str>
<str>http://blog.mikemccandless.com/</str>
<str>rect</str>
<str>6579597884362535238</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
</str>
<str>rect</str>
<str>http://jirasearch.mikemccandless.com</str>
<str>rect</str>
<str>
http://www.elasticsearch.org/guide/reference/api/search/facets/
</str>
<str>rect</str>
<str>http://wiki.apache.org/solr/SolrFacetingOverview</str>
<str>rect</str>
<str>https://issues.apache.org/jira/browse/LUCENE-4795</str>
<str>rect</str>
<str>https://issues.apache.org/jira/browse/LUCENE-4965</str>
<str>rect</str>
<str>http://en.wikipedia.org/wiki/Interval_tree</str>
<str>rect</str>
<str>http://jirasearch.mikemccandless.com</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>author</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<str>bookmark</str>
<str>rect</str>
<str>
http://www.blogger.com/email-post.g?blogID=8623074010562846957&postID=6579597884362535238
</str>
<str>rect</str>
<str>
http://www.blogger.com/post-edit.g?blogID=8623074010562846957&postID=6579597884362535238&from=pencil
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=email
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=blog
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=twitter
</str>
<str>rect</str>
<str>
http://www.blogger.com/share-post.g?blogID=8623074010562846957&postID=6579597884362535238&target=facebook
</str>
<str>rect</str>
<str>http://blog.mikemccandless.com/search/label/Lucene</str>
<str>tag</str>
<str>rect</str>
<str>comments</str>
<str>rect</str>
<str>comment-form</str>
<str>rect</str>
<str>
http://www.blogger.com/comment-iframe.g?blogID=8623074010562846957&postID=6579597884362535238
</str>
<str>rect</str>
<str>links</str>
<str>rect</str>
<str/>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
</str>
<str>rect</str>
<str>http://blog.mikemccandless.com/</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
</str>
<str>application/atom+xml</str>
<str>rect</str>
<str>
http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
</str>
<str>rect</str>
<str>
http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
</str>
<str>rect</str>
<str>
http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2Fposts%2Fdefault
</str>
<str>rect</str>
<str>http://blog.mikemccandless.com/feeds/posts/default</str>
<str>rect</str>
<str>
http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
</str>
<str>rect</str>
<str>
http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
</str>
<str>rect</str>
<str>
http://add.my.yahoo.com/content?url=http%3A%2F%2Fblog.mikemccandless.com%2Ffeeds%2F6579597884362535238%2Fcomments%2Fdefault
</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/feeds/6579597884362535238/comments/default
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Subscribe&widgetId=Subscribe1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>author</str>
<str>rect</str>
<str>https://plus.google.com/112759599082866346694</str>
<str>author</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Profile&widgetId=Profile1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>
http://affiliate.manning.com/idevaffiliate.php?id=1171_147
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Image&widgetId=Image1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2013-01-01T00:00:00-05:00&updated-max=2014-01-01T00:00:00-05:00&max-results=5
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013_05_01_archive.html
</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013/05/eating-dog-food-with-lucene.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013_02_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2013_01_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2012-01-01T00:00:00-05:00&updated-max=2013-01-01T00:00:00-05:00&max-results=16
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_12_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_08_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_07_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_05_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_04_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_03_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2012_01_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2011-01-01T00:00:00-05:00&updated-max=2012-01-01T00:00:00-05:00&max-results=20
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_10_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_06_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_05_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_04_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_03_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_02_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2011_01_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2010-01-01T00:00:00-05:00&updated-max=2011-01-01T00:00:00-05:00&max-results=43
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_12_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_10_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_08_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_07_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_06_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_05_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_04_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_03_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2010_02_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/search?updated-min=2009-01-01T00:00:00-05:00&updated-max=2010-01-01T00:00:00-05:00&max-results=18
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_12_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_11_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_10_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_09_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_08_01_archive.html
</str>
<str>rect</str>
<str>javascript:void(0)</str>
<str>rect</str>
<str>
http://blog.mikemccandless.com/2009_07_01_archive.html
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1
</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=FollowByEmail&widgetId=FollowByEmail1&action=editWidget&sectionId=sidebar-right-3
</str>
<str>rect</str>
<str>http://www.blogger.com</str>
<str>rect</str>
<str>
//www.blogger.com/rearrange?blogID=8623074010562846957&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3
</str>
</arr>
<arr name="img">
<str/>
<str>13</str>
<str>http://img1.blogblog.com/img/icon18_email.gif</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img2.blogblog.com/img/icon18_edit_allbkg.gif
</str>
<str>18</str>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str>
http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
</str>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str/>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str>
http://img1.blogblog.com/img/widgets/subscribe-netvibes.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-newsgator.png
</str>
<str/>
<str>
http://img1.blogblog.com/img/widgets/subscribe-yahoo.png
</str>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str/>
<str>
http://img2.blogblog.com/img/widgets/arrow_dropdown.gif
</str>
<str/>
<str/>
<str>http://img1.blogblog.com/img/icon_feed12.png</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str>My Photo</str>
<str>80</str>
<str>
//lh5.googleusercontent.com/-uZl5chgeDsM/AAAAAAAAAAI/AAAAAAAAAO4/Go4SFcNl-jY/s512-c/photo.jpg
</str>
<str>80</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>187</str>
<str>
http://1.bp.blogspot.com/-QWxIn-kN_Yg/TZH0g4Vm66I/AAAAAAAAAG0/2jsjFLP9voQ/s250/LuceneInAction2.jpg
</str>
<str>150</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
<str/>
<str>18</str>
<str>
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
</str>
<str>18</str>
</arr>
<arr name="iframe">
<str>0</str>
<str>auto</str>
<str>410</str>
<str>comment-editor</str>
<str/>
<str>100%</str>
</arr>
<str name="filename">docname</str>
<str name="mimetype">text/html; charset=UTF-8</str>
<arr name="source">
<str>http://blog.mikemccandless.com/feeds/posts/default</str>
</arr>
<arr name="category">
<str>Lucene</str>
</arr>
<str name="id">
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
<arr name="source_type">
<str>rss</str>
</arr>
<arr name="title">
<str>Dynamic faceting with Lucene</str>
</arr>
<arr name="title_search">
<str>Dynamic faceting with Lucene</str>
</arr>
<arr name="viewport">
<str>width=1100</str>
</arr>
<arr name="stream_source_info">
<str>docname</str>
</arr>
<arr name="stream_content_type">
<str>text/html; charset=UTF-8</str>
</arr>
<arr name="stream_size">
<str>80779</str>
</arr>
<arr name="content_encoding">
<str>UTF-8</str>
</arr>
<arr name="stream_name">
<str>docname</str>
</arr>
<arr name="generator">
<str>blogger</str>
</arr>
<arr name="mssmarttagspreventparsing">
<str>true</str>
</arr>
<arr name="content_type">
<str>text/html; charset=UTF-8</str>
</arr>
<arr name="resourcename">
<str>docname</str>
</arr>
<arr name="dc_title">
<str>Changing Bits: Dynamic faceting with Lucene</str>
</arr>
<arr name="content">
<str>
Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways . The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release. To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id. At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count. This is in contrast to purely dynamic faceting implementations like ElasticSearch 's and Solr 's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute. However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen. Enough background, now on to our two new features! 
Sorted-set doc-values faceting These features bring two dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first feature, sorted-set doc-values faceting (see LUCENE-4795 ), allows the application to index a normal sorted-set doc-values field, for example: doc.add(new SortedSetDocValuesField("foo")); doc.add(new SortedSetDocValuesField("bar")); and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState. This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField). At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. Likely this could be improved (no time was spend optimizing). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!). Dynamic range faceting The second new feature, dynamic range faceting, works on top of a numeric doc-values field (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against all ranges and the count is incremented when there is a match. The range-test is a naive simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!). Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. 
This feature is especially useful with time-based fields. You can see it in action in the Jira issues search example in the Updated field. Happy faceting! Posted by Michael McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to Facebook Labels: Lucene No comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael McCandless Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple template. Powered by Blogger .
</str>
</arr>
<arr name="content_search">
<str>
Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways . The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release. To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id. At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count. This is in contrast to purely dynamic faceting implementations like ElasticSearch 's and Solr 's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute. However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen. Enough background, now on to our two new features! 
Sorted-set doc-values faceting These features bring two dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first feature, sorted-set doc-values faceting (see LUCENE-4795 ), allows the application to index a normal sorted-set doc-values field, for example: doc.add(new SortedSetDocValuesField("foo")); doc.add(new SortedSetDocValuesField("bar")); and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState. This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField). At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. Likely this could be improved (no time was spend optimizing). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!). Dynamic range faceting The second new feature, dynamic range faceting, works on top of a numeric doc-values field (see LUCENE-4965 ), and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against all ranges and the count is incremented when there is a match. The range-test is a naive simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!). Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. 
This feature is especially useful with time-based fields. You can see it in action in the Jira issues search example in the Updated field. Happy faceting! Posted by Michael McCandless on 5/21/2013 Email This BlogThis! Share to Twitter Share to Facebook Labels: Lucene No comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael McCandless Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. View my complete profile Blog Archive ▼  2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple template. Powered by Blogger .
</str>
</arr>
<arr name="language">
<str>en</str>
</arr>
<arr name="url">
<str>
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html
</str>
</arr>
<arr name="snippet">
<str>
Changing Bits: Dynamic faceting with Lucene Changing Bits Tuesday, May 21, 2013 Dynamic faceting with Lucene Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways ....At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count....The range-test is a naive simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!)....Share to Twitter Share to Facebook Labels: Lucene No comments: Post a Comment Older Post Home Subscribe to: Post Comments (Atom) Subscribe To Posts Atom Posts Comments Atom Comments About Me Michael McCandless Michael loves building software; he's been building search engines for more than a decade....View my complete profile Blog Archive ▼  2013 (5) ▼  May (2) Dynamic faceting with Lucene Eating dog food with Lucene ►  February (1) ►  January (2) ►  2012 (16) ►  December (2) ►  November (1) ►  September (1) ►  August (1) ►  July (3) ►  May (1) ►  April (2) ►  March (3) ►  January (2) ►  2011 (20) ►  November (2) ►  October (3) ►  September (1) ►  June (3) ►  May (2) ►  April (2) ►  March (4) ►  February (2) ►  January (1) ►  2010 (43) ►  December (1) ►  November (1) ►  October (4) ►  September (4) ►  August (4) ►  July (11) ►  June (7) ►  May (6) ►  April (1) ►  March (1) ►  February (3) ►  2009 (18) ►  December (1) ►  November (1) ►  October (1) ►  September (4) ►  August (6) ►  July (5) Followers Follow by Email Simple template.
</str>
</arr>
<arr name="host">
<str>blog.mikemccandless.com</str>
</arr>
<arr name="path">
<str>/2013/05/dynamic-faceting-with-lucene.html</str>
</arr>
<long name="_version_">1436832383182569472</long>
</doc>
</result>
</response>




I can see that the feed contains published and updated markup, and yet neither of those fields (pubdate or pubdateiso) is present in the Solr document.
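
One way to confirm which fields actually reached the index is to list the field names of each returned document and compare them with the ones expected. The following is only a minimal sketch: `SAMPLE_RESPONSE` is a trimmed-down, hypothetical stand-in for the Solr XML response quoted in this message, and the helper names `field_names` and `missing_fields` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical stand-in for the Solr XML response quoted in this thread.
SAMPLE_RESPONSE = """
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html</str>
      <arr name="source_type"><str>rss</str></arr>
      <arr name="title"><str>Dynamic faceting with Lucene</str></arr>
      <long name="_version_">1436832383182569472</long>
    </doc>
  </result>
</response>
"""


def field_names(solr_xml):
    """Return one set of field names per <doc> element in a Solr XML response."""
    root = ET.fromstring(solr_xml)
    return [
        {child.get("name") for child in doc if child.get("name")}
        for doc in root.iter("doc")
    ]


def missing_fields(solr_xml, expected=("pubdate", "pubdateiso")):
    """For each document, list the expected field names that are absent."""
    return [
        sorted(name for name in expected if name not in names)
        for names in field_names(solr_xml)
    ]


if __name__ == "__main__":
    for index, missing in enumerate(missing_fields(SAMPLE_RESPONSE)):
        print(f"doc {index}: missing {missing or 'nothing'}")
```

Running the same check against the full response would show whether the date metadata was dropped before indexing or was never sent at all.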

Thank you for the prompt reply. Agreed on #1: the URL is perfectly fine and well used :). As for #2, I am still puzzled about the following. Here's an excerpt from the feed XML:


On June 3, 2013 at 4:25:51 PM, Karl Wright (daddywri@gmail.com) wrote:
> Hi Stephane,
>
> (1) ManifoldCF always uses the URL of a document as the primary ID when it indexes it.  This is the standard treatment and has been since Day 1.
>
> (2) For the "creation date" attribute, the RSS connector uses the date in the feed, if there is one.  This is a date in ISO format, and comes out as the metadata value "pubdateiso".  There is also an attribute called "pubdate", which is in milliseconds since epoch, which is EITHER the date in the feed (if present), or if not it's the date the document is fetched.
>
> As for your other question, "chromed" data comes from the URLs referenced by the items in the feed, and "dechromed" data comes from either the content or description field that's actually in the feed, whichever you specify.
>
> All of this is described in the end-user-documentation, although I do notice that "pubdateiso" is missing from the metadata listed.
>
> http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository
>
> Karl
>
>
>
> On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <st...@gamard.net> wrote:
>
> > Hi all,
> >
> > I'm trying to use the RSS connector for the following feed: http://blog.mikemccandless.com/feeds/posts/default
> >
> > After setting the job up and ingesting documents I have 2 pending questions:
> > - why is the connector using the URL as ID instead of the atom ID tag?
> > - I have no creation and/or modified date in my Solr document, how is it so?
> >
> > Overall I am a bit confused as to where does the crawler gets it's information (chrome vs dechromed). I've downloaded the feed and tried to find the entries back into my index but could not do so (could only find pages which are linked from the rss entry).
> >
> > Sorry for the hassle, I'm reading over and over trying to piece it all together.
> >
> > Cheers,
> >
> > _Stephane

Re: RSS Connector

Posted by Karl Wright <da...@gmail.com>.

Hi Stephane,

(1) ManifoldCF always uses the URL of a document as the primary ID when it
indexes it.  This is the standard treatment and has been since Day 1.

(2) For the "creation date" attribute, the RSS connector uses the date in
the feed, if there is one.  This is a date in ISO format, and comes out as
the metadata value "pubdateiso".  There is also an attribute called
"pubdate", which is in milliseconds since epoch, which is EITHER the date
in the feed (if present), or if not it's the date the document is fetched.
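
The rule in (2) can be sketched roughly as follows. This is not the connector's code, just an illustration under stated assumptions: the function names are hypothetical, the feed is assumed to carry ISO 8601 dates in "published" and/or "updated" with the later of the two winning, and the sample timestamps are invented.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso8601(text):
    """Parse an ISO 8601 timestamp with either a 'Z' suffix or a +HH:MM offset."""
    # Older Python versions reject a trailing 'Z' in fromisoformat, so normalise it.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    # Assumption: a timestamp without any offset is treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def publication_fields(published, updated, fetched_at):
    """Derive 'pubdate' (ms since epoch) and 'pubdateiso' from the feed dates.

    Falls back to the fetch time for 'pubdate' when the feed has no date;
    'pubdateiso' is only set when the feed supplied one.
    """
    candidates = [parse_iso8601(value) for value in (published, updated) if value]
    if candidates:
        when = max(candidates)
        iso = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    else:
        when = fetched_at
        iso = None
    millis = (when - EPOCH) // timedelta(milliseconds=1)
    return {"pubdate": millis, "pubdateiso": iso}


if __name__ == "__main__":
    fetched = datetime(2013, 6, 3, 14, 13, 24, tzinfo=timezone.utc)
    # Hypothetical feed dates carrying a -04:00 offset rather than Z.
    print(publication_fields("2013-05-21T10:30:00-04:00", "2013-05-21T09:00:00-04:00", fetched))
    print(publication_fields(None, None, fetched))
```

Normalising to UTC before formatting keeps the two values consistent with each other regardless of the offset used in the feed.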

As for your other question, "chromed" data comes from the URLs referenced
by the items in the feed, and "dechromed" data comes from either the
content or description field that's actually in the feed, whichever you
specify.
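
To make the chromed/dechromed distinction concrete, here is a small sketch of the choice for a single Atom entry. It is an illustration only: `SAMPLE_ENTRY` is a made-up entry, `content_source` is a hypothetical helper, and falling back to the linked page when the chosen feed field is absent is an assumption rather than a statement about the connector's internals.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up Atom entry used purely for illustration.
SAMPLE_ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Dynamic faceting with Lucene</title>
  <link rel="alternate" href="http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html"/>
  <content type="html">Lucene's facet module has seen some great improvements recently.</content>
</entry>
"""


def content_source(entry_xml, dechromed_field=None):
    """Decide where the indexed text for one feed entry would come from.

    Returns ("dechromed", text) when the requested feed field exists,
    otherwise ("chromed", url), meaning the linked page must be fetched.
    """
    entry = ET.fromstring(entry_xml)

    # The entry's main link; a missing rel attribute counts as "alternate".
    link = None
    for element in entry.findall(ATOM + "link"):
        if element.get("rel", "alternate") == "alternate":
            link = element.get("href")
            break

    if dechromed_field is not None:
        field = entry.find(ATOM + dechromed_field)
        if field is not None and field.text:
            return ("dechromed", field.text)

    # No usable feed field: the content has to come from the linked page.
    return ("chromed", link)


if __name__ == "__main__":
    print(content_source(SAMPLE_ENTRY))
    print(content_source(SAMPLE_ENTRY, dechromed_field="content"))
```

With no field selected, only the linked pages end up in the index, which would explain finding the blog pages but not the feed entries themselves.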

All of this is described in the end-user-documentation, although I do
notice that "pubdateiso" is missing from the metadata listed.

http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#rssrepository

Karl

On Mon, Jun 3, 2013 at 10:13 AM, Stephane Gamard <st...@gamard.net>wrote:

>
> Hi all,
>
>
> I'm trying to use the RSS connector for the following feed:
> http://blog.mikemccandless.com/feeds/posts/default
>
> After setting the job up and ingesting documents I have 2 pending
> questions:
> - why is the connector using the URL as ID instead of the atom ID tag?
> - I have no creation and/or modified date in my Solr document, how is it
> so?
>
> Overall I am a bit confused as to where does the crawler gets it's
> information (chrome vs dechromed). I've downloaded the feed and tried to
> find the entries back into my index but could not do so (could only find
> pages which are linked from the rss entry).
>
> Sorry for the hassle, I'm reading over and over trying to piece it all
> together.
>
> Cheers,
>
> _Stephane
>