Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/09/13 23:15:29 UTC

Re: DIH load only selected documents with XPathEntityProcessor

This solution doesn't seem to be working for me.

I am using Solr trunk and I have the same question as Bernd, with a small
twist: the field that should NOT be empty happens to be a derived field
called price. See the config below:

<entity ...
  transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">

<field column="description"
          xpath="/rss/channel/item/description"
          />

<field column="price"
         regex=".*\$(\d*.\d*)"
         sourceColName="description"
         />
...
</entity>
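
One way to see what the price regex captures is to run the same pattern outside Solr. The following is a minimal sketch in plain JavaScript, not DIH code: the sample descriptions and the extractPrice helper are hypothetical, and it assumes RegexTransformer searches the source column for the pattern roughly the way exec does here. Java and JavaScript regex engines agree on this simple pattern, but that is an assumption worth checking against the actual feed.

```javascript
// Hypothetical sample descriptions; these are not taken from the actual feed.
const samples = [
  'Great widget, now only $19.99 while stocks last',
  'Great widget, price on request',
  'Special offer: $ TBD',
];

// Same pattern as the regex attribute of the price field, as a JavaScript literal.
// Note that the '.' between the two digit groups is unescaped, so it matches
// any single character, not only a decimal point.
const pricePattern = /.*\$(\d*.\d*)/;

// Hypothetical helper: returns the first capture group, or null when the
// pattern does not match (standing in for "the price column is not set").
function extractPrice(description) {
  const match = pricePattern.exec(description);
  return match === null ? null : match[1];
}

console.log(JSON.stringify(extractPrice(samples[0]))); // "19.99"
console.log(JSON.stringify(extractPrice(samples[1]))); // null
console.log(JSON.stringify(extractPrice(samples[2]))); // " "
```

In the third sample the capture is a single space, which is neither null nor an empty string. If the real descriptions contain a dollar sign that is not followed by a number, a price value like that would pass a null-or-empty check.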

I have also changed the sample script to check the price field instead of
the link field that was used as the example earlier in this thread:

    <script>
        <![CDATA[
        function skipRow(row) {
            var price = row.get( 'price' );
            if ( price == null || price == '' ) {
                row.put( '$skipRow', 'true' );
            }
            return row;
        }
        ]]>
    </script>
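
The function can also be exercised outside Solr to see which price values it treats as missing. This is only a sketch under stated assumptions: makeRow and isSkipped are hypothetical test helpers, and makeRow is a plain JavaScript stand-in for the java.util.Map that DIH passes to the script. Inside Solr the values are Java objects, so comparisons such as price == '' may not behave exactly as they do here.

```javascript
// Hypothetical stand-in for the java.util.Map row that DIH hands to the script.
// It offers get/put like the Java map and returns null for missing keys.
function makeRow(fields) {
  const data = new Map(Object.entries(fields));
  return {
    get: (key) => (data.has(key) ? data.get(key) : null),
    put: (key, value) => { data.set(key, value); },
  };
}

// Same logic as the script in the post.
function skipRow(row) {
  var price = row.get('price');
  if (price == null || price == '') {
    row.put('$skipRow', 'true');
  }
  return row;
}

// Hypothetical helper: true when skipRow flagged the row.
function isSkipped(fields) {
  return skipRow(makeRow(fields)).get('$skipRow') === 'true';
}

console.log(isSkipped({ description: 'no price here' })); // true  (price missing)
console.log(isSkipped({ price: '' }));                    // true  (price empty)
console.log(isSkipped({ price: '19.99' }));               // false (price present)
console.log(isSkipped({ price: ' ' }));                   // false (whitespace only)
```

The last case shows that a whitespace-only price is not skipped by this check, so it may be worth logging the actual value of price for rows that are expected to be dropped.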

Does anyone have any thoughts on what I'm missing?
Thanks!
- Pulkit

On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <bernd.fehling@uni-bielefeld.de> wrote:

> Hi Gora,
>
> thanks a lot, very nice solution, works perfectly.
> I will dig more into ScriptTransformer, seems to be very powerful.
>
> Regards,
> Bernd
>
> On 08.01.2011 14:38, Gora Mohanty wrote:
> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
> > <be...@uni-bielefeld.de> wrote:
> >> Hello list,
> >>
> >> is it possible to load only selected documents with XPathEntityProcessor?
> >> While loading docs I want to drop/skip/ignore documents with missing URL.
> >>
> >> Example:
> >> <documents>
> >>    <document>
> >>        <title>first title</title>
> >>        <id>identifier_01</id>
> >>        <link>http://www.foo.com/path/bar.html</link>
> >>    </document>
> >>    <document>
> >>        <title>second title</title>
> >>        <id>identifier_02</id>
> >>        <link></link>
> >>    </document>
> >> </documents>
> >>
> >> The first document should be loaded, the second document should be ignored
> >> because it has an empty link (should also work for missing link field).
> > [...]
> >
> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
> > E.g., something like this for your data import configuration file:
> >
> > <dataConfig>
> >     <script><![CDATA[
> >       function skipRow(row) {
> >         var link = row.get( 'link' );
> >         if( link == null || link == '' ) {
> >           row.put( '$skipRow', 'true' );
> >         }
> >         return row;
> >       }
> >     ]]></script>
> >     <dataSource type="FileDataSource" />
> >     <document>
> >         <entity name="f" processor="FileListEntityProcessor"
> >                 baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
> >                 recursive="true" rootEntity="false" dataSource="null">
> >             <entity name="top" processor="XPathEntityProcessor"
> >                     forEach="/documents/document" url="${f.fileAbsolutePath}"
> >                     transformer="script:skipRow">
> >                <field column="link" xpath="/documents/document/link"/>
> >                <field column="title" xpath="/documents/document/title"/>
> >                <field column="id" xpath="/documents/document/id"/>
> >             </entity>
> >         </entity>
> >     </document>
> > </dataConfig>
> >
> > Regards,
> > Gora
>

Re: DIH load only selected documents with XPathEntityProcessor

Posted by Pulkit Singhal <pu...@gmail.com>.
Oh, and I'm sure that I'm using Java 6, because the properties on the Solr
web page show:

java.runtime.version = 1.6.0_26-b03-384-10M3425

