Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2010/12/10 17:15:59 UTC

[Multiple] RSS Feeds at a time...

All,

Right now I am using the default DIH config that comes with the Solr
examples. I update my index using the dataimport handler here

http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport

This works fine, but I want to be able to index more than one feed at a
time. More importantly, I want to be able to index both Atom and RSS feeds,
which means that the schema will definitely be different.

There is a good example of how to index all of the example docs in the
SolrNet example application, but it expects XML files that already contain
the properly formatted XML tags.

    foreach (var file in Directory.GetFiles(Server.MapPath("/exampledocs"), "*.xml"))
    {
        connection.Post("/update", File.ReadAllText(file, Encoding.UTF8));
    }
    solr.Commit();

Example XML:

<add>
  <doc>
    <field name="id">F8V7067-APL-KIT</field>
    <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
    <field name="manu">Belkin</field>
    <field name="cat">electronics</field>
    <field name="cat">connector</field>
    <field name="features">car power adapter, white</field>
    <field name="weight">4</field>
    <field name="price">19.95</field>
    <field name="popularity">1</field>
    <field name="inStock">false</field>
    <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>
  </doc>
</add>

This obviously won't help me when trying to grab random RSS feeds so my
question is, how can I ingest several feeds at a time? Can I do this
programmatically or is there a configuration option I am missing?

Thanks,
Adam

Re: [Multiple] RSS Feeds at a time...

Posted by Ahmet Arslan <io...@yahoo.com>.
> What else am I missing here because the reload-config
> command does not seem
> to be working. Any ideas would be great!

solr/dataimport?command=reload-config should return the message 
<str name="importResponse">Configuration Re-loaded sucessfully</str>
if everything went well. Maybe you can check for that after each reload. Maybe the file is not valid XML?
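
A minimal sketch of that check, in Python, for anyone who wants to automate it. It only parses a response body that has already been fetched. The <response> wrapper element in the sample is an assumption about the surrounding XML, and the function name is hypothetical; the message text, including its spelling, is copied from the line above.

```python
import xml.etree.ElementTree as ET

# Message text as quoted in this thread; the spelling is kept exactly as posted.
RELOAD_OK = "Configuration Re-loaded sucessfully"


def reload_succeeded(response_xml: str) -> bool:
    """Return True if a reload-config response body reports success."""
    try:
        root = ET.fromstring(response_xml)
    except ET.ParseError:
        # The body was not well-formed XML at all.
        return False
    # Look for <str name="importResponse">...</str> anywhere in the response.
    for node in root.iter("str"):
        if node.get("name") == "importResponse":
            return (node.text or "").strip() == RELOAD_OK
    return False


# Assumed shape of the response; only the <str> element comes from the thread.
sample = (
    "<response>"
    '<str name="importResponse">Configuration Re-loaded sucessfully</str>'
    "</response>"
)
print(reload_succeeded(sample))
```

If this returns False after a reload, the data-config file itself is the first thing to look at.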

By the way, can't you use the variable resolver in your case?

http://wiki.apache.org/solr/DataImportHandler#A_VariableResolver

You can pass different RSS URLs using a custom request parameter,
like ${dataimporter.request.myrssurl}:

/dataimport?command=full-import&clean=false&myrssurl=http://rss.cnn.com/rss/cnn_topstories.rss

Similar discussion: http://search-lucene.com/m/xILqvbY6h91/
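
A rough sketch, in Python, of building one such import request per feed. It assumes the entity in the data-config reads its url from ${dataimporter.request.myrssurl} and that Solr is at http://localhost:8983/solr; the function name is hypothetical. The feed URL is percent-encoded so that a feed with its own query string (like the CBS one) is not split into separate request parameters.

```python
from urllib.parse import urlencode


def full_import_url(solr_base: str, feed_url: str) -> str:
    """Build a full-import request that passes the feed URL as 'myrssurl'."""
    params = {
        "command": "full-import",
        "clean": "false",       # keep documents indexed from earlier feeds
        "myrssurl": feed_url,   # urlencode() percent-encodes this value
    }
    return f"{solr_base}/dataimport?{urlencode(params)}"


# Feed URLs taken from this thread.
feeds = [
    "http://rss.cnn.com/rss/cnn_topstories.rss",
    "http://feeds.cbsnews.com/CBSNewsMain?format=xml",
]

for feed in feeds:
    print(full_import_url("http://localhost:8983/solr", feed))
```

Each printed URL can then be requested in turn, one feed per import.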


      

Re: [Multiple] RSS Feeds at a time...

Posted by Adam Estrada <es...@gmail.com>.
Hi Ahmet,

This is a great idea but still does not appear to be working correctly. The
idea is that I want to be able to add an RSS feed and then index that feed
on a schedule. My C# method looks something like this.

        public ActionResult Index()
        {
            try {
                HTTPGet req = new HTTPGet();
                string solrStr = System.Configuration.ConfigurationManager.AppSettings["solrUrl"].ToString();
                req.Request(solrStr + "/select?clean=true&commit=true&qt=/dataimport&command=reload-config");
                req.Request(solrStr + "/select?clean=false&commit=true&qt=/dataimport&command=full-import");
                Response.Write(req.StatusLine);
                Response.Write(req.ResponseTime);
                Response.Write(req.StatusCode);
                return RedirectToAction("../Import/Feeds");
                //return View();
            } catch (SolrConnectionException) {
                throw new Exception(string.Format("Couldn't Import RSS Feeds"));
            }
        }

My XML configuration file looks something like this...

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="filedatasource"
            processor="FileListEntityProcessor"
            baseDir="./solr/conf/dataimporthandler"
            fileName="^.*xml$"
            recursive="true"
            rootEntity="false"
            dataSource="null">

      <entity name="cnn"
              pk="link"
              datasource="filedatasource"
              url="http://rss.cnn.com/rss/cnn_topstories.rss"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="DateFormatTransformer,HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title" commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link" commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      </entity>

      <entity name="newsweek"
              pk="link"
              datasource="filedatasource"
              url="http://feeds.newsweek.com/newsweek/nation"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="DateFormatTransformer,HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title" commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link" commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      </entity>
    </entity>
  </document>
</dataConfig>

As you can see, it appears that I can add as many sub-entities as I want.
The idea was to reload the XML file after each entity is added.
What else am I missing here because the reload-config command does not seem
to be working. Any ideas would be great!

Thanks,
Adam Estrada

On Sat, Dec 11, 2010 at 4:48 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> > I found that you can have a single config file that can
> > have several
> > entities in it. My question now is how can I add entities
> > without restarting
> > the Solr service?
>
> You mean changing and re-loading xml config file?
>
> dataimport?command=reload-config
> http://wiki.apache.org/solr/DataImportHandler#Commands
>
>
>
>

Re: [Multiple] RSS Feeds at a time...

Posted by Adam Estrada <es...@gmail.com>.
You are da man! w00t!

adam


Re: [Multiple] RSS Feeds at a time...

Posted by Ahmet Arslan <io...@yahoo.com>.
> I found that you can have a single config file that can
> have several
> entities in it. My question now is how can I add entities
> without restarting
> the Solr service? 

You mean changing and re-loading xml config file?

dataimport?command=reload-config
http://wiki.apache.org/solr/DataImportHandler#Commands


      

Re: [Multiple] RSS Feeds at a time...

Posted by Adam Estrada <es...@gmail.com>.
Lance,

I found that you can have a single config file that can have several
entities in it. My question now is how can I add entities without restarting
the Solr service? It doesn't really work otherwise, but it looks like it
should because we call the /dataimport handler after the entire application
has been started and loaded. How can I make the app load the /dataimport
handler at runtime?

Example config...

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="f"
            processor="FileListEntityProcessor"
            baseDir="C:/Users/aestrada//SolrNET/solr-1.4.1/lucidworks/solr/conf/dataimporthandler"
            fileName=".*xml"
            newerThan="'NOW-3DAYS'"
            recursive="true"
            rootEntity="false"
            dataSource="null">

      <entity name="cnn"
              pk="link"
              url="http://rss.cnn.com/rss/cnn_topstories.rss"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title" commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link" commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" />
      </entity>

      <entity name="ABC"
              pk="link"
              url="http://feeds.abcnews.com/abcnews/topstories"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title" commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link" commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" />
      </entity>

      <entity name="CBS"
              pk="link"
              url="http://feeds.cbsnews.com/CBSNewsMain?format=xml"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title" commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link" commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" />
      </entity>

      <entity name="whitehouse"
              pk="link"
              url="http://www.whitehouse.gov/feed/blog/white-house"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title" commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link" commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" />
      </entity>
    </entity>
  </document>
</dataConfig>


On Fri, Dec 10, 2010 at 10:38 PM, Lance Norskog <go...@gmail.com> wrote:

> There is I believe no way to do this without separate copies of your
> script. Each 'handler=/dataimport' has to refer to a separate config
> file.
>
> You can make several copies and name them config1.xml, config2.xml
> etc. You'll have to call each one manually, so you have to manage your
> own thread pool.

Re: [Multiple] RSS Feeds at a time...

Posted by Lance Norskog <go...@gmail.com>.
There is I believe no way to do this without separate copies of your
script. Each 'handler=/dataimport' has to refer to a separate config
file.

You can make several copies and name them config1.xml, config2.xml
etc. You'll have to call each one manually, so you have to manage your
own thread pool.
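
A minimal sketch, in Python, of what "manage your own thread pool" could look like. It assumes each config file has its own handler registered under its own path; the handler paths /dataimport1 and /dataimport2 and the function names are hypothetical. The demo passes a stand-in fetch function so that it runs without a Solr instance.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text (real network call)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def import_all(handler_urls, fetch=http_get, max_workers=4):
    """Request full-import on every handler concurrently.

    Returns a dict mapping each handler URL to the response body, or to the
    exception raised while calling it.
    """
    urls = list(handler_urls)

    def call(url):
        try:
            return fetch(url + "?command=full-import&clean=false")
        except Exception as exc:  # one failing handler should not stop the rest
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map() yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(call, urls)))


# Hypothetical handler paths, one per config file (config1.xml, config2.xml).
handlers = [
    "http://localhost:8983/solr/dataimport1",
    "http://localhost:8983/solr/dataimport2",
]

# Stand-in for http_get so the sketch runs without a Solr server.
results = import_all(handlers, fetch=lambda url: "requested " + url)
for url, result in results.items():
    print(url, "->", "failed" if isinstance(result, Exception) else result)
```

Dropping the fetch argument makes import_all use http_get and call a real server.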




-- 
Lance Norskog
goksron@gmail.com