Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/10/28 11:35:26 UTC

DataImportHandler reverted to revision 218 on Solr Wiki

Dear wiki user,

You have subscribed to a wiki page "Solr Wiki" for change notification.

The page DataImportHandler has been reverted to revision 218 by ShalinMangar.
The comment on this change is: MoinMoin ate some content.
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=219&rev2=220

--------------------------------------------------

   * '''`newerThan`''' : A date param . Use the format (`yyyy-MM-dd HH:mm:ss`) . It can also be a datemath string eg: ('NOW-3DAYS'). The single quote is necessary . Or it can be a valid variableresolver format like (${var.name})
   * '''`olderThan`''' : A date param . Same rules as above
   * '''`rootEntity`''' :It must be false for this (Unless you wish to just index filenames) An entity directly under the <document> is a root entity. That means that for each row emitted by the root entity one document is created in Solr/Lucene. But as in this case we do not wish to make one document per file. We wish to make one document per row emitted by the following entity 'x'. Because the entity 'f' has rootEntity=false the entity directly under it becomes a root entity automatically and each row emitted by that becomes a document.
-  3E>
+  * '''`dataSource`''' :If you use Solr1.3 It must be set to "null" because this does not use any DataSource. No need to specify that in Solr1.4 .It just means that we won't create a DataSource instance. (In most of the cases there is only one !DataSource (A !JdbcDataSource) and all entities just use them. In case of !FileListEntityProcessor a !DataSource is not necessary.)
+ 
+ example:
+ {{{
+ <dataConfig>
+     <dataSource type="FileDataSource" />
+     <document>
+         <entity name="f" processor="FileListEntityProcessor" baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="null">
+             <entity name="x" processor="XPathEntityProcessor" forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
+                 <field column="full_name" xpath="/field/xpath"/>
+             </entity>
+         </entity>
+     </document>
+ </dataConfig>
+ }}}
+ Do not miss the `rootEntity` attribute. The implicit fields generated by the !FileListEntityProcessor are `fileAbsolutePath, fileSize, fileLastModified, fileName` and these are available for use within the entity X as shown above. It should be noted that !FileListEntityProcessor returns a list of pathnames and that the subsequent entity must use the !FileDataSource to fetch the files content.
+ 
+ === CachedSqlEntityProcessor ===
+ <<Anchor(cached)>>
+ 
+ This is an extension of the !SqlEntityProcessor.  This !EntityProcessor helps reduce the no: of DB queries executed by caching the rows. It does not help to use it in the root most entity because only one sql is run for the entity.
+ 
+ Example 1.
+ {{{
+ <entity name="x" query="select * from x">
+     <entity name="y" query="select * from y where xid=${x.id}" processor="CachedSqlEntityProcessor">
+     </entity>
+ <entity>
+ }}}
+ 
+ The usage is exactly same as the other one. When a query is run the results are stored and if the same query is run again it is fetched from the cache and returned
+ 
+ Example 2:
+ {{{
+ <entity name="x" query="select * from x">
+     <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor"  where="xid=x.id">
+     </entity>
+ <entity>
+ }}}
+ 
+ The difference with the previous one is the 'where' attribute. In this case the query fetches all the rows from the table and stores all the rows in the cache. The magic is in the 'where' value. The cache stores the values with the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every time the entity has to be run and the value is looked up in the cache an the rows are returned.
+ 
+ In the where the lhs (the part before '=') is the column in y and the rhs (the part after '=') is the value to be computed for looking up the cache.
+ 
+ === PlainTextEntityProcessor ===
+ <<Anchor(plaintext)>>
+ <!> [[Solr1.4]]
+ 
+ This !EntityProcessor reads all content from the data source into an single implicit field called 'plainText'. The content is not parsed in any way, however you may add transformers to manipulate the data within 'plainText' as needed or to create other additional fields.
+ 
+ example:
+ {{{
+ <entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt" dataSource="data-source-name">
+    <!-- copies the text to a field called 'text' in Solr-->
+   <field column="plainText" name="text"/>
+ </entity>
+ }}}
+ 
+ Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, URL!DataSource)
+ 
+ === LineEntityProcessor ===
+ <<Anchor(LineEntityProcessor)>>
  <!> [[Solr1.4]]
  
  This !EntityProcessor reads all content from the data source on a line by line basis, a field called 'rawLine' is returned for each line read. The content is not parsed in any way, however you may add transformers to manipulate the data within 'rawLine' or to create other additional fields.
@@ -823, +884 @@

   * '''`encoding`''': (optional) If the files are to be read in an encoding that is not same as the platform encoding
  
  === FieldReaderDataSource ===
- <!> [[Solr1.4]]>>
- <!> [[Solr1.4]]
- 
- This !EntityProcessor reads all content from the data source on a line by line basis, a field called 'rawLine' is returned for each line read. The content is not parsed in any way, however you may add transformers to manipulate the data within 'rawLine' or to create other additional fields.
- 
- The lines read can be filtered by two regular expressions '''acceptLineRegex''' and '''omitLineRegex'''.
- This entities additional attributes are:
-  * '''`url`''' : a required attribute that specifies the location of the input file in a way that is compatible with the configured datasource. If this value is relative and you are using !FileDataSource or URL!DataSource, it assumed to be relative to '''baseLoc'''.
-  * '''`acceptLineRegex`''' :an optional attribute that if present discards any line which does not match the regExp.
-  * '''`omitLineRegex`''' : an optional attribute that is applied after any acceptLineRegex and discards any line which matches this regExp.
- example:
- {{{
- <entity name="jc"
-         processor="LineEntityProcessor"
-         acceptLineRegex="^.*\.xml$"
-         omitLineRegex="/obsolete"
-         url="file:///Volumes/ts/files.lis"
-         rootEntity="false"
-         dataSource="myURIreader1"
-         transformer="RegexTransformer,DateFormatTransformer"
-         >
-    ...
- }}}
- While there are use cases where you might need to create a solr document per line read from a file, it is expected that in most cases that the lines read will consist of a pathname which is in turn consumed by another !EntityProcessor
- such as X!PathEntityProcessor.
- 
- == DataSource ==
- <<Anchor(datasource)>>
- A class can extend `org.apache.solr.handler.dataimport.DataSource` . [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See source]]
- 
- and can be used as a !DataSource. It must be configured in the dataSource definition
- {{{
- <dataSource type="com.foo.FooDataSource" prop1="hello"/>
- }}}
- and it can be used in the entities like a standard one
- 
- === JdbcdataSource ===
- This is the default. See the  [[#jdbcdatasource|example]] . The signature is as follows
- {{{
- public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
- }}}
- It is designed to iterate rows in DB one by one. A row is represented as a Map.
- 
- === URLDataSource ===
- <!> [[Solr1.4]]
- This datasource is often used with X!PathEntityProcessor to fetch content from an underlying file:// or http:// location. See the documentation [[#httpds|here]] . The signature is as follows
- {{{
- public class URLDataSource extends DataSource<Reader>
- }}}
- 
- === HttpDataSource ===
- <!> Http!DataSource is being deprecated in favour of URL!DataSource in [[Solr1.4]]. There is no change in functionality between URL!DataSource and !Http!DataSource, only a name change.
- 
- === FileDataSource ===
- This can be used like an URL!DataSource but used to fetch content from files on disk. The only difference from URL!DataSource, when accessing disk files, is how a pathname is specified. The signature is as follows
- {{{
- public class FileDataSource extends DataSource<Reader>
- }}}
- 
- The attributes are:
-  * '''`basePath`''': (optional) The base path relative to which the value is evaluated if it is not absolute
-  * '''`encoding`''': (optional) If the files are to be read in an encoding that is not same as the platform encoding
- 
- === FieldReaderDataSource ===
  <!> [[Solr1.4]]
  
  This can be used like an URL!DataSource . The signature is as follows
@@ -927, +924 @@

   * '''`$docBoost`''' : Boost the current doc. The value can be a number or the toString of a number
   * '''`$deleteDocById`''' : Delete a doc from Solr with this id. The value hast to be the unniqueKey value of the document <!> [[Solr1.4]]
   * '''`$deleteDocByQuery`''' :Delete docs from Solr by this query. The value must be a Solr Query <!> [[Solr1.4]]
-  * '''`$stopTransform`''' : Stops further transformation of the row. No other Transformers in the chain will be executed if this boolean flag is added to the row  <!> [[Solr1.4]]
  
  
  == Adding datasource in solrconfig.xml ==

Re: DataImportHandler reverted to revision 218 on Solr Wiki

Posted by Chris Hostetter <ho...@fucit.org>.
: Hoss, I don't have permissions to re-open the issue but I've added a comment
: describing what happened and links to the edit email as well as this one.

stupid ACLs ... i think Noble is the only one with permission to reopen it 
since he filed it in the first place -- Noble can you please reopen, no 
one is ever going to look at Shalin's updates as long as it's marked 
resolved.

: Yeah, I haven't tried to add it back. I'm afraid it may break the page
: again. I think we should split that page into many smaller ones.

that may make sense from a documentation standpoint, but we shouldn't 
have to constantly live in fear that other pages will grow longer than 
some magical "too big" threshold.




-Hoss


Re: DataImportHandler reverted to revision 218 on Solr Wiki

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Tue, Nov 3, 2009 at 1:15 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
> Shalin: I just noticed this revert, you should reopen the INFRA bug about
> MoinMoin eating that update, describe as best as possible exactly what
> your *intended* change was, and link to the edit email in the archives.
>
>
Hoss, I don't have permissions to re-open the issue but I've added a comment
describing what happened and links to the edit email as well as this one.


> All the previous examples were when email notifications weren't working,
> so we didn't have any good examples showing the actual markup involved ...
> looking at the diff there is a suspicious " 3E>" at the start of the
> deleted chunk.  there's also a large new chunk (which doesn't seem to
> correspond to your comment, so i'm not sure if it was intended, or if
> MoinMoin was getting confused and pulling it out of an older version, or a
> different section of the doc) that has an odd looking ">>" at the end of
> the first line.
>
> Lastly: once you reverted the MoinMoin butcher, i don't see any
> notifications of you trying to re-add the content that you added in
> version 218 ... presumably that's worth adding back as well.
>
>
Yeah, I haven't tried to add it back. I'm afraid it may break the page
again. I think we should split that page into many smaller ones.

-- 
Regards,
Shalin Shekhar Mangar.

Re: DataImportHandler reverted to revision 218 on Solr Wiki

Posted by Chris Hostetter <ho...@fucit.org>.
Shalin: I just noticed this revert, you should reopen the INFRA bug about 
MoinMoin eating that update, describe as best as possible exactly what 
your *intended* change was, and link to the edit email in the archives.

All the previous examples were when email notifications weren't working, 
so we didn't have any good examples showing the actual markup involved ... 
looking at the diff there is a suspicious " 3E>" at the start of the 
deleted chunk.  there's also a large new chunk (which doesn't seem to 
correspond to your comment, so i'm not sure if it was intended, or if 
MoinMoin was getting confused and pulling it out of an older version, or a 
different section of the doc) that has an odd looking ">>" at the end of 
the first line.

Lastly: once you reverted the MoinMoin butcher, i don't see any 
notifications of you trying to re-add the content that you added in 
version 218 ... presumably that's worth adding back as well.

: Date: Wed, 28 Oct 2009 10:35:26 -0000
: From: Apache Wiki <wi...@apache.org>
: Reply-To: solr-dev@lucene.apache.org
: To: Apache Wiki <wi...@apache.org>
: Subject: DataImportHandler reverted to revision 218 on Solr Wiki
: 



-Hoss