Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/10/06 10:24:19 UTC

[Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DataImportHandler" page has been changed by FergusMcMenemie:
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=212&rev2=213

     xpath="/a/b/subject[@qualifier='fullTitle']"
     xpath="/a/b/subject/@qualifier"
     xpath="/a/b/c"
+ }}}
+ <!> new for [[Solr1.4]]
+ {{{
+    xpath="//a/..."
+    xpath="/a//b..."
  }}}
  
  
@@ -768, +773 @@

      <document>
          <entity name="f" processor="FileListEntityProcessor" baseDir="/some/path/to/files" fileName=".*xml" newerThan="'NOW-3DAYS'" recursive="true" rootEntity="false" dataSource="null">
              <entity name="x" processor="XPathEntityProcessor" forEach="/the/record/xpath" url="${f.fileAbsolutePath}">
+                 <field column="full_name" xpat0Aand can be used as a !DataSource. It must be3A//abc.com/a.txt" dataSource="data-source-name">
+    <!-- copies the text to a field called 'text' in Solr-->
+   <field column="plainText" name="text"/>
-                 <field column="full_name" xpath="/field/xpath"/>
-             </entity>
-         </entity>
-     </document>
- </dataConfig>
- }}}
- Do not miss the `rootEntity` attribute. The implicit fields generated by the !FileListEntityProcessor are `fileAbsolutePath, fileSize, fileLastModified, fileName` and these are available for use within the entity X as shown above. It should be noted that !FileListEntityProcessor returns a list of pathnames and that the subsequent entity must use the !FileDataSource to fetch the files content.
- 
- === CachedSqlEntityProcessor ===
- <<Anchor(cached)>>
- 
- This is an extension of the !SqlEntityProcessor.  This !EntityProcessor helps reduce the no: of DB queries executed by caching the rows. It does not help to use it in the root most entity because only one sql is run for the entity.
- 
- Example 1.
- {{{
- <entity name="x" query="select * from x">
-     <entity name="y" query="select * from y where xid=${x.id}" processor="CachedSqlEntityProcessor">
-     </entity>
- <entity>
+ </entity>
  }}}
  
- The usage is exactly same as the other one. When a query is run the results are stored and if the same query is run again it is fetched from the cache and returned
+ Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, URL!DataSource)
  
- Example 2:
- {{{
- <entity name="x" query="select * from x">
-     <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor"  where="xid=x.id">
-     </entity>
- <entity>
- }}}
- 
- The difference with the previous one is the 'where' attribute. In this case the query fetches all the rows from the table and stores all the rows in the cache. The magic is in the 'where' value. The cache stores the values with the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every time the entity has to be run and the value is looked up in the cache an the rows are returned.
- 
- In the where the lhs (the part before '=') is the column in y and the rhs (the part after '=') is the value to be computed for looking up the cache.
- 
- === PlainTextEntityProcessor ===
+ === LineEntityProcessor ===
- <<Anchor(plaintext)>>
+ <<Anchor(LineEntityProcessor)>>
  <!> [[Solr1.4]]
  
- This !EntityProcessor reads all content from the data source into an single implicit field called 'plainText'. The content is not parsed in any way, however you may add transformers to manipulate the data within 'plainText' as needed or to create other additional fields.
+ This !EntityProcessor reads the content of the data source line by line and returns a field called 'rawLine' for each line read. The content is not parsed in any way; however, you may add transformers to manipulate the data within 'rawLine' or to create additional fields.
  
+ The lines read can be filtered by two regular expressions, '''acceptLineRegex''' and '''omitLineRegex'''.
+ This entity's additional attributes are:
+  * '''`url`''' : a required attribute that specifies the location of the input file in a way that is compatible with the configured data source. If this value is relative and you are using !FileDataSource or URL!DataSource, it is assumed to be relative to '''baseLoc'''.
+  * '''`acceptLineRegex`''' : an optional attribute that, if present, discards any line that does not match the regExp.
+  * '''`omitLineRegex`''' : an optional attribute that is applied after any acceptLineRegex and discards any line that matches this regExp.
  example:
  {{{
- <entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt" dataSource="data-source-name">
+ <entity name="jc"
+         processor="LineEntityProcessor"
+         acceptLineRegex="^.*\.xml$"
+         omitLineRegex="/obsolete"
+         url="file:///Volumes/ts/files.lis"
+         rootEntity="false"
+         dataSource="myURIreader1"
+         transformer="RegexTransformer,DateFormatTransformer"
+         >
+    ...
+ }}}
+ While there are use cases where you might need to create a Solr document per line read from a file, it is expected that in most cases the lines read will consist of pathnames which are in turn consumed by another !EntityProcessor
+ such as X!PathEntityProcessor.
+ 
+ == DataSource ==
+ <<Anchor(datasource)>>
+ A class can extend `org.apache.solr.handler.dataimport.DataSource` . [[http://svn.apache.org/viewvc/lucene/solr/trunk/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See source]]
+ 
+ and can be used as a !DataSource. It must be3A//abc.com/a.txt" dataSource="data-source-name">
     <!-- copies the text to a field called 'text' in Solr-->
    <field column="plainText" name="text"/>
  </entity>
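
The filtering order described for LineEntityProcessor above (acceptLineRegex first, then omitLineRegex) can be sketched in plain Java. This is only an illustration and not the Solr implementation: the class name `LineFilterSketch`, the `filter` helper and the sample pathnames are hypothetical, and the use of `Matcher.find()` (a match anywhere in the line) rather than a whole-line match is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; it does not use any Solr classes.
class LineFilterSketch {

    // Returns the lines that survive both filters. Either regex may be null,
    // mirroring the fact that both attributes are optional.
    static List<String> filter(List<String> lines, String acceptLineRegex, String omitLineRegex) {
        Pattern accept = acceptLineRegex == null ? null : Pattern.compile(acceptLineRegex);
        Pattern omit = omitLineRegex == null ? null : Pattern.compile(omitLineRegex);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Step 1: discard lines that do not match acceptLineRegex.
            // Assumption: find() semantics, i.e. a match anywhere in the line.
            if (accept != null && !accept.matcher(line).find()) {
                continue;
            }
            // Step 2: applied after the accept filter, discard lines matching omitLineRegex.
            if (omit != null && omit.matcher(line).find()) {
                continue;
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up pathnames standing in for the lines of a file listing.
        List<String> lines = Arrays.asList(
                "/data/a.xml",
                "/data/obsolete/b.xml",
                "/data/readme.txt");
        // Same regular expressions as in the wiki example.
        for (String line : filter(lines, "^.*\\.xml$", "/obsolete")) {
            System.out.println(line); // prints /data/a.xml
        }
    }
}
```

With the two expressions from the wiki example, `/data/readme.txt` is dropped by the accept filter and `/data/obsolete/b.xml` by the omit filter, so only `/data/a.xml` would be handed on to a nested entity.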

Re: [Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
The wiki has eaten up a lot of documentation

On Tue, Oct 6, 2009 at 1:54 PM, Apache Wiki <wi...@apache.org> wrote:



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com