Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/09/13 23:15:29 UTC
Re: DIH load only selected documents with XPathEntityProcessor
This solution doesn't seem to be working for me.
I am using Solr trunk and I have the same question as Bernd, with a small
twist: the field that should NOT be empty happens to be a derived field
called price. See the config below:
<entity ...
transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,
script:skipRow">
<field column="description"
xpath="/rss/channel/item/description"
/>
<field column="price"
regex=".*\$(\d*.\d*)"
sourceColName="description"
/>
...
</entity>
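As a quick sanity check on what the price regex captures, the pattern can be tried against a few sample descriptions. This is only a sketch: it uses JavaScript's regex engine as a stand-in for the Java regex engine that RegexTransformer relies on (the two treat this particular pattern the same way), and the sample descriptions and the extractPrice helper are made up for illustration.

```javascript
// Same pattern as the regex attribute on the price field.
var PRICE_PATTERN = /.*\$(\d*.\d*)/;

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function extractPrice(description) {
  var m = PRICE_PATTERN.exec(description);
  return m === null ? null : m[1];
}

// Made-up sample descriptions, not taken from the real feed.
var withCents = extractPrice('Only $19.99 today');
var withoutCents = extractPrice('Price $5 only');
var noDollarSign = extractPrice('Call for pricing');

console.log(JSON.stringify(withCents));     // "19.99"
console.log(JSON.stringify(withoutCents));  // "5 " (trailing space captured)
console.log(JSON.stringify(noDollarSign));  // null
```

The unescaped `.` matches any character, so a price without cents picks up whatever character follows it; `\.` would match only a literal dot. What the price column holds when the pattern does not match at all is worth checking, since that is the case the skipRow script depends on.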
I have also changed the sample script to check the price field instead of
the link field that was used as the example earlier in this thread:
<script>
<![CDATA[
  function skipRow(row) {
    var price = row.get( 'price' );
    if ( price == null || price == '' ) {
      row.put( '$skipRow', 'true' );
    }
    return row;
  }
]]>
</script>
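To rule out the script logic itself, the function can be exercised outside DIH. The sketch below assumes a hypothetical makeRow stand-in that only mimics the get/put calls used here; in DIH the row is a Java map and its values may be Java objects, so this checks the plain JavaScript logic and nothing more.

```javascript
// Hypothetical stand-in for the DIH row: supports only get() and put().
function makeRow(fields) {
  var data = {};
  for (var key in fields) {
    data[key] = fields[key];
  }
  return {
    get: function (name) {
      return Object.prototype.hasOwnProperty.call(data, name) ? data[name] : null;
    },
    put: function (name, value) {
      data[name] = value;
    }
  };
}

// The transformer function from the config above, unchanged in behavior.
function skipRow(row) {
  var price = row.get('price');
  if (price == null || price == '') {
    row.put('$skipRow', 'true');
  }
  return row;
}

// Made-up rows: one with a price, one without, one with an empty price.
var priced = skipRow(makeRow({ description: 'Only $19.99', price: '19.99' }));
var missing = skipRow(makeRow({ description: 'Call for pricing' }));
var empty = skipRow(makeRow({ description: 'Free', price: '' }));

console.log(priced.get('$skipRow'));   // null: row is kept
console.log(missing.get('$skipRow'));  // 'true': row is flagged
console.log(empty.get('$skipRow'));    // 'true': row is flagged
```

If the flags come out as expected here, the next things to look at would be what value the price column actually holds when the script runs, and whether the script runs at all for each row.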
Does anyone have any thoughts on what I'm missing?
Thanks!
- Pulkit
On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <bernd.fehling@uni-bielefeld.de> wrote:
> Hi Gora,
>
> thanks a lot, very nice solution, works perfectly.
> I will dig more into ScriptTransformer, seems to be very powerful.
>
> Regards,
> Bernd
>
> On 08.01.2011 14:38, Gora Mohanty wrote:
> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
> > <be...@uni-bielefeld.de> wrote:
> >> Hello list,
> >>
> >> is it possible to load only selected documents with XPathEntityProcessor?
> >> While loading docs I want to drop/skip/ignore documents with missing URL.
> >>
> >> Example:
> >> <documents>
> >>   <document>
> >>     <title>first title</title>
> >>     <id>identifier_01</id>
> >>     <link>http://www.foo.com/path/bar.html</link>
> >>   </document>
> >>   <document>
> >>     <title>second title</title>
> >>     <id>identifier_02</id>
> >>     <link></link>
> >>   </document>
> >> </documents>
> >>
> >> The first document should be loaded, the second document should be ignored
> >> because it has an empty link (should also work for missing link field).
> > [...]
> >
> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
> > E.g., something like this for your data import configuration file:
> >
> > <dataConfig>
> >   <script><![CDATA[
> >     function skipRow(row) {
> >       var link = row.get( 'link' );
> >       if( link == null || link == '' ) {
> >         row.put( '$skipRow', 'true' );
> >       }
> >       return row;
> >     }
> >   ]]></script>
> >   <dataSource type="FileDataSource" />
> >   <document>
> >     <entity name="f" processor="FileListEntityProcessor"
> >             baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
> >             recursive="true" rootEntity="false" dataSource="null">
> >       <entity name="top" processor="XPathEntityProcessor"
> >               forEach="/documents/document" url="${f.fileAbsolutePath}"
> >               transformer="script:skipRow">
> >         <field column="link" xpath="/documents/document/link"/>
> >         <field column="title" xpath="/documents/document/title"/>
> >         <field column="id" xpath="/documents/document/id"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > Regards,
> > Gora
>
Re: DIH load only selected documents with XPathEntityProcessor
Posted by Pulkit Singhal <pu...@gmail.com>.
Oh, and I'm sure that I'm using Java 6, because the properties on the Solr
web page show:
java.runtime.version = 1.6.0_26-b03-384-10M3425