Posted to user@nutch.apache.org by Eyeris RodrIguez Rueda <er...@uci.cu> on 2015/04/29 21:50:29 UTC

how to skip documents with empty field that are required in schema.xml

Hi all.
I'm using Nutch 1.9 and Solr 4.10 in my environment.
I want the indexing process to skip every document whose title field (or some other field) is empty, so that it never reaches Solr.

My first solution was to delete all documents with an empty title from Solr. This is not a good idea for me because I would need to run the cleanup query after every indexing run.

The second solution I thought of was to mark the fields as required in schema.xml:

<field name="title" type="text" stored="true" indexed="true" multiValued="true" required="true"/>

After doing that, I found that when Nutch sends a batch of 250 documents and one of them has an empty title, Solr rejects the batch and Nutch throws a Job Failed exception. Solr does not allow a document without a title value to be indexed, so it indexes nothing.

Is there any way for Nutch to honor the required option in schema.xml and remove such documents from the collection before indexing to Solr?

Can anybody give me advice or comments on this, or tell me the best way to exclude documents with empty fields before indexing?

Eyeris.

Re: [MASSMAIL] Re: how to skip documents with empty field that are required in schema.xml

Posted by Eyeris RodrIguez Rueda <er...@uci.cu>.

Hi all.
Thanks, Jeff, for your answer; it was very useful to me.
Sorry for taking so long to answer; I was testing your suggestions in my environment.
My solution works as follows.

I have defined a custom index plugin as you suggested. It takes the document and, if a field that is important to me is empty, returns null, and the document is skipped correctly.
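
The check described above can be sketched roughly as follows. This is only an illustration of the return-null idea, not the actual plugin: the class name EmptyFieldFilterSketch, the Map used as a stand-in for a Nutch document, and the mandatory list are all hypothetical. In Nutch 1.x the same logic would presumably live in the filter method of an IndexingFilter implementation and operate on a NutchDocument.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a document is modeled as field name -> list of values.
// A real plugin would receive a NutchDocument instead of this Map.
class EmptyFieldFilterSketch {

    /** Returns the document unchanged, or null when a mandatory field is missing or blank. */
    static Map<String, List<String>> filter(Map<String, List<String>> doc, List<String> mandatory) {
        for (String name : mandatory) {
            List<String> values = doc.get(name);
            if (values == null || values.isEmpty()) {
                return null; // field absent: discard the document
            }
            boolean hasText = false;
            for (String value : values) {
                if (value != null && !value.trim().isEmpty()) {
                    hasText = true;
                    break;
                }
            }
            if (!hasText) {
                return null; // only blank values: discard the document
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<String> mandatory = Arrays.asList("title");

        Map<String, List<String>> good = new HashMap<>();
        good.put("title", Arrays.asList("Apache Nutch"));

        Map<String, List<String>> blank = new HashMap<>();
        blank.put("title", Arrays.asList("   "));

        System.out.println(filter(good, mandatory) != null);  // document is kept
        System.out.println(filter(blank, mandatory) != null); // document is skipped
    }
}
```
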
Now I need to read schema.xml and collect all the fields marked as required.
For example:
<field name="title" type="text" stored="true" indexed="true" multiValued="true" required="true"/>

I was looking at the Configuration class, but it is meant for nutch-default.xml or nutch-site.xml.

Can you or anybody else tell me how I can read the schema.xml file?
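
One possible approach, sketched here with only the JDK's DOM parser, is to treat schema.xml as ordinary XML and collect the names of the field elements that carry required="true". The class name RequiredFieldsSketch and the inline sample schema are made up for illustration. How the plugin obtains the InputStream for the real file (classpath, file path, or some other route) is left open, since that depends on the deployment.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RequiredFieldsSketch {

    /** Returns the names of all <field> elements declared with required="true". */
    static List<String> requiredFields(InputStream schema) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(schema);
            NodeList fields = dom.getElementsByTagName("field");
            List<String> required = new ArrayList<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                // getAttribute returns "" when the attribute is absent
                if ("true".equals(field.getAttribute("required"))) {
                    required.add(field.getAttribute("name"));
                }
            }
            return required;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse schema", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample schema; a real one would be read from schema.xml.
        String schema =
              "<schema name=\"sample\">"
            + "<fields>"
            + "<field name=\"id\" type=\"string\" stored=\"true\" indexed=\"true\" required=\"true\"/>"
            + "<field name=\"title\" type=\"text\" stored=\"true\" indexed=\"true\" multiValued=\"true\" required=\"true\"/>"
            + "<field name=\"content\" type=\"text\" stored=\"false\" indexed=\"true\"/>"
            + "</fields>"
            + "</schema>";
        InputStream in = new ByteArrayInputStream(schema.getBytes(StandardCharsets.UTF_8));
        System.out.println(requiredFields(in)); // prints [id, title]
    }
}
```

The resulting list could then serve as the set of mandatory fields checked by the custom index plugin, so the rule stays in one place.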

----- Original message -----
From: "Jeff Cocking" <je...@gmail.com>
To: user@nutch.apache.org
Sent: Thursday, April 30, 2015 8:11:55
Subject: [MASSMAIL] Re: how to skip documents with empty field that are required in schema.xml

I would suggest you define a custom index plugin.  The index plugin could
evaluate the Nutch document based on your parameters. You can add, modify,
or remove fields as needed.  You should be able to set the Nutch document
to null if it does not meet your criteria.  This would prevent the
document from being sent to Solr.

Nutch has the ability to handle documents set to null by the indexer.

I am not aware of a method to call to set a Nutch document to null. Is
anyone aware of this option?

If no one responds, here are a couple of options to pursue in testing:
1. Test what happens when you set NutchDocument doc = null;
2. In NutchIndexAction.java there is a reference to a variable called byte
DELETE = 1. You can try changing the variable byte action = DELETE;
3. Test what would happen in your filter if you called NutchDocument doc =
new NutchDocument(); This should reset the doc to null as it is an empty
class.

The index-basic plugin is a good one to copy and make a custom indexer from.

Let us know the results of what you find.

jeff

Re: how to skip documents with empty field that are required in schema.xml

Posted by Jeff Cocking <je...@gmail.com>.

I would suggest you define a custom index plugin.  The index plugin could
evaluate the Nutch document based on your parameters. You can add, modify,
or remove fields as needed.  You should be able to set the Nutch document
to null if it does not meet your criteria.  This would prevent the
document from being sent to Solr.

Nutch has the ability to handle documents set to null by the indexer.

I am not aware of a method to call to set a Nutch document to null. Is
anyone aware of this option?

If no one responds, here are a couple of options to pursue in testing:
1. Test what happens when you set NutchDocument doc = null;
2. In NutchIndexAction.java there is a reference to a variable called byte
DELETE = 1. You can try changing the variable byte action = DELETE;
3. Test what would happen in your filter if you called NutchDocument doc =
new NutchDocument(); This should reset the doc to null as it is an empty
class.
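
When comparing options 1 and 3, it may help to keep in mind that in Java the new operator always yields a non-null reference, so a freshly constructed document is empty rather than null. The sketch below shows the difference using a made-up Doc class as a stand-in for NutchDocument. Whether Nutch would send, reject, or ignore such an empty document is not something this sketch can show and would still need testing.

```java
import java.util.HashMap;
import java.util.Map;

class NullVersusEmptySketch {

    // Hypothetical stand-in for NutchDocument: just a bag of fields.
    static class Doc {
        final Map<String, String> fields = new HashMap<>();
    }

    // Option 1: hand back null so that the caller can discard the document.
    static Doc option1() {
        return null;
    }

    // Option 3: hand back a new, empty document. The reference is not null.
    static Doc option3() {
        return new Doc();
    }

    public static void main(String[] args) {
        System.out.println(option1() == null);          // true
        System.out.println(option3() == null);          // false
        System.out.println(option3().fields.isEmpty()); // true
    }
}
```
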

The index-basic plugin is a good one to copy and make a custom indexer from.

Let us know the results of what you find.

jeff
