Posted to solr-dev@lucene.apache.org by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com> on 2009/04/30 05:52:22 UTC

Re: [Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Hi Fergus, it is not a good idea to change the documentation now.
Most users are still on Solr1.3, and if they follow the wiki as it
now reads they will get an error. We should just add documentation
for URLDataSource (without removing the HttpDataSource documentation)
and explicitly say that it is a Solr1.4 feature.

After the 1.4 release we can change it.
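
As a rough sketch of how the wiki sample could show both side by side
(the attributes are copied from the existing sample on the page; the
comments are only suggested wording, not final text, and a real
data-config would use one element or the other):

{{{
<!-- Solr1.3: the name current users must keep using -->
<dataSource type="HttpDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>

<!-- Solr1.4 feature: new name, assumed to take the same attributes -->
<dataSource type="URLDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
}}}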

On Wed, Apr 29, 2009 at 10:31 PM, Apache Wiki <wi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
>
> The following page has been changed by FergusMcMenemie:
> http://wiki.apache.org/solr/DataImportHandler
>
> The comment on the change is:
> Adjusting page for new URLDataSource and the deprecation of HTTPdataSource
>
> ------------------------------------------------------------------------------
>   * The datasource configuration can be done in solr config xml [#solrconfigdatasource also]
>   * The attribute 'type' specifies the implementation class. It is optional. The default value is `'JdbcDataSource'`
>   * The attribute 'name' can be used if there are [#multipleds multiple datasources] used by multiple entities
> -  * All other attributes in the <dataSource> tag are arbitrary. It is decided by the !DataSource implementation. [#jdbcdatasource See here] for attributes used by !JdbcDataSource and [#httpds see here] for !HttpDataSource
> +  * All other attributes in the <dataSource> tag are arbitrary. It is decided by the !DataSource implementation. [#jdbcdatasource See here] for attributes used by !JdbcDataSource and [#httpds see here] for !URLDataSource
>   * [#datasource See here] for plugging in your own
>  [[Anchor(multipleds)]]
>  === Multiple DataSources ===
> @@ -316, +316 @@
>
>
>  = Usage with XML/HTTP Datasource =
>  DataImportHandler can be used to index data from HTTP based data sources. This includes using indexing from REST/XML APIs as well as from RSS/ATOM Feeds.
> +
>  [[Anchor(httpds)]]
> - == Configuration of HttpDataSource ==
> + == Configuration of !URLDataSource ==
>
> - A sample configuration in for !HttpdataSource in data config xml looks like this
> + A sample configuration in for !URLDataSource in data config xml looks like this
>  {{{
> - <dataSource type="HttpDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
> + <dataSource type="URLDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
>  }}}
>  ''' The attributes are '''
>
> @@ -358, +359 @@
>
>  }}}
>
>
> - == HttpDataSource Example ==
> + == URLDataSource Example ==
>
>  Download the full import example given in the DB section to try this out. We'll try indexing the [http://rss.slashdot.org/Slashdot/slashdot Slashdot RSS feed] for this example.
>
> @@ -366, +367 @@
>
>  The data-config for this example looks like this:
>  {{{
>  <dataConfig>
> -         <dataSource type="HttpDataSource" />
> +         <dataSource type="URLDataSource" />
>        <document>
>                <entity name="slashdot"
>                                pk="link"
> @@ -714, +715 @@
>
>  == EntityProcessor ==
>  Each entity is handled by a default Entity processor called !SqlEntityProcessor. This works well for systems which use RDBMS as a datasource. For other kind of datasources like  REST or Non Sql datasources you can choose to extend this abstract class `org.apache.solr.handler.dataimport.Entityprocessor`. This is designed to Stream rows one by one from an entity. The simplest way to implement your own !EntityProcessor is to extend !EntityProcessorBase and override the `public Map<String,Object> nextRow()` method.
>  '!EntityProcessor' rely on the !DataSource for fetching data. The return type of the !DataSource is important for an !EntityProcessor. The built-in ones are,
> +
>  === SqlEntityProcessor ===
>  This is the defaut. The !DataSource must be of type `DataSource<Iterator<Map<String, Object>>>` . !JdbcDataSource can be used with this.
> +
>  === XPathEntityProcessor ===
> - Used for XML type datasource. The !DataSource must be of type `DataSourec<Reader>` . !HttpDataSource or !FileDataSource can be used with this.
> + Used when indexing XML type data. The !DataSource must be of type `DataSourec<Reader>` . !URLDataSource or !FileDataSource is commonly used with !XPathEntityProcessor.
> +
>  === FileListEntityProcessor ===
>  A simple one which can be used to enumerate the list of files from a File System based on some criteria. It does not use a !DataSource. The entity attributes are:
>   *'''`fileName`''' :(required) A regex pattern to identify files
> @@ -801, +805 @@
>
>  {{{
>  public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
>  }}}
> -
>  It is designed to iterate rows in DB one by one. A row is represented as a Map.
> +
> - === HttpDataSource ===
> + === URLDataSource ===
> - This is used by X!PathEntityProcessor to fetch content from HttpDataSources. See the documentation [#httpds here] . The signature is as follows
> + This datasource is often used with X!PathEntityProcessor to fetch content from an underlying file:// or http:// location. See the documentation [#httpds here] . The signature is as follows
>  {{{
> - public class HttpDataSource extends DataSource<Reader>
> + public class URLDataSource extends DataSource<Reader>
>  }}}
> +
> + === HTTPDataSource ===
> + This datasource now deprecated in favor of !URLDataSource. There is no change in functionality between !URLDataSource and !HTTPDataSource, only a name change.
> +
>  === FileDataSource ===
> - This can be used like an !HttpDataSource but used to fetch content from files on disk. The signature is as follows
> + This can be used like an !URLDataSource but used to fetch content from files on disk. The only difference from !URLDataSource, when accessing disk files, is how a pathname is specified. The signature is as follows
>  {{{
>  public class FileDataSource extends DataSource<Reader>
>  }}}
> @@ -821, +829 @@
>
>  === FieldReaderDataSource ===
>  <!> ["Solr1.4"]
>
> - This can be used like an !HttpDataSource . The signature is as follows
> + This can be used like an !URLDataSource . The signature is as follows
>  {{{
>  public class FieldReaderDataSource extends DataSource<Reader>
>  }}}
> - This can be useful for users who has a DB field containing xml and wish to use a nested X!PathEntityProcessor
> + This can be useful for users who have a DB field containing XML and wish to use a nested X!PathEntityProcessor to process the fields contents.
>  The datasouce may be configured as follows
>  {{{
>    <datasource name="f" type="FieldReaderDataSource" />
> @@ -888, +896 @@
>
>  There are 3 datasources two RDBMS (jdbc1,jdbc2) and one xml/http (B)
>
>   * `jdbc1` and `jdbc2` are instances of  type `JdbcDataSource` which are configured in the solrconfig.xml.
> -  * `http` is an instance of type `HttpDataSource`
> +  * `http` is an instance of type `URLDataSource`
>   * The root entity starts with a table called 'A' and uses 'jdbc1' as the datasource . The entity is conveniently named as the table itself
>   * Entity 'A' has 2 sub-entities 'B' and 'C' . 'B' uses the datasource instance  'http' and 'C' uses the datasource instance 'jdbc2'
>   * On doing a `command=full-import` The root-entity (A) is executed first
>



-- 
--Noble Paul