Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/07/05 12:28:21 UTC

Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Hello,

I'm trying to better understand the Nutch and Solr integration. My
understanding is that documents are added to the Solr index by SolrWriter's
write(NutchDocument doc) method. But does that method make any use of the
WhitespaceTokenizerFactory?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

The answer to 2) is new IndexSchema(solrConf, schema).getAnalyzer();


On Tue, Jul 5, 2011 at 2:48 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:

> Not yet an answer to 2) but this is where and how Solr initializes the
> Analyzer defined in the schema.xml into :
> [...]


-- 
Regards,
K. Gabriele


Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

Not yet an answer to 2), but this is where and how Solr turns the analyzer
defined in schema.xml into a TokenizerChain:

// org.apache.solr.schema.IndexSchema
    // Load the Tokenizer
    // Although an analyzer only allows a single Tokenizer, we load a list to make sure
    // the configuration is ok
    // --------------------------------------------------------------------------------
    final ArrayList<TokenizerFactory> tokenizers = new ArrayList<TokenizerFactory>(1);
    AbstractPluginLoader<TokenizerFactory> tokenizerLoader =
      new AbstractPluginLoader<TokenizerFactory>( "[schema.xml] analyzer/tokenizer", false, false )
    {
      @Override
      protected void init(TokenizerFactory plugin, Node node) throws Exception {
        if( !tokenizers.isEmpty() ) {
          throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
              "The schema defines multiple tokenizers for: "+node );
        }
        final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");
        // copy the luceneMatchVersion from config, if not set
        if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
          params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
        plugin.init( params );
        tokenizers.add( plugin );
      }

      @Override
      protected TokenizerFactory register(String name, TokenizerFactory plugin) throws Exception {
        return null; // used for map registration
      }
    };
    tokenizerLoader.load( loader,
        (NodeList)xpath.evaluate("./tokenizer", node, XPathConstants.NODESET) );

    // Make sure something was loaded
    if( tokenizers.isEmpty() ) {
      throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
          "analyzer without class or tokenizer & filter list");
    }


    // Load the Filters
    // --------------------------------------------------------------------------------
    final ArrayList<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();
    AbstractPluginLoader<TokenFilterFactory> filterLoader =
      new AbstractPluginLoader<TokenFilterFactory>( "[schema.xml] analyzer/filter", false, false )
    {
      @Override
      protected void init(TokenFilterFactory plugin, Node node) throws Exception {
        if( plugin != null ) {
          final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");
          // copy the luceneMatchVersion from config, if not set
          if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
            params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
          plugin.init( params );
          filters.add( plugin );
        }
      }

      @Override
      protected TokenFilterFactory register(String name, TokenFilterFactory plugin) throws Exception {
        return null; // used for map registration
      }
    };
    filterLoader.load( loader,
        (NodeList)xpath.evaluate("./filter", node, XPathConstants.NODESET) );

    return new TokenizerChain(
        charFilters.toArray(new CharFilterFactory[charFilters.size()]),
        tokenizers.get(0),
        filters.toArray(new TokenFilterFactory[filters.size()]));
  };
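
For anyone who wants to try the same idea outside Solr, here is a minimal
sketch in plain Java (JDK classes only). It mirrors the two rules visible in
the excerpt above: exactly one tokenizer per analyzer, then any number of
filters in document order. The class AnalyzerConfigSketch and the method
readChain are made up for illustration; the sketch only collects class names
and does not instantiate any factories.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Solr: reads class names from one <analyzer> element.
class AnalyzerConfigSketch {

    /** Returns the tokenizer class name first, then the filter class names in order. */
    static List<String> readChain(String analyzerXml) {
        try {
            Node analyzer = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(analyzerXml)))
                    .getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Same XPath expressions as in the excerpt, relative to the analyzer node.
            NodeList tokenizers =
                    (NodeList) xpath.evaluate("./tokenizer", analyzer, XPathConstants.NODESET);
            if (tokenizers.getLength() == 0) {
                throw new IllegalStateException("analyzer without tokenizer");
            }
            if (tokenizers.getLength() > 1) {
                throw new IllegalStateException("the schema defines multiple tokenizers");
            }

            List<String> chain = new ArrayList<>();
            chain.add(((Element) tokenizers.item(0)).getAttribute("class"));

            NodeList filters =
                    (NodeList) xpath.evaluate("./filter", analyzer, XPathConstants.NODESET);
            for (int i = 0; i < filters.getLength(); i++) {
                chain.add(((Element) filters.item(i)).getAttribute("class"));
            }
            return chain;
        } catch (RuntimeException e) {
            throw e; // keep the configuration errors thrown above as they are
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read analyzer XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<analyzer>"
                + "<tokenizer class=\"solr.StandardTokenizerFactory\"/>"
                + "<filter class=\"solr.LowerCaseFilterFactory\"/>"
                + "<filter class=\"solr.WordDelimiterFilterFactory\""
                + " generateWordParts=\"1\" generateNumberParts=\"1\"/>"
                + "</analyzer>";
        System.out.println(readChain(xml));
    }
}
```

Solr does more than this (char filters, plugin loading, luceneMatchVersion),
so treat it only as a reading aid for the control flow.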


On Tue, Jul 5, 2011 at 2:26 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:

> I suspect the following should do (1). I'm just not sure about file
> references as in  stopInit.put("words", "stopwords.txt") . (2) should
> clarify.
> [...]


-- 
Regards,
K. Gabriele


Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

I suspect the following should do for (1). I'm just not sure how file
references such as stopInit.put("words", "stopwords.txt") get resolved;
(2) should clarify that.

1)
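// Imports assume the Lucene/Solr 3.x package layout.
import java.io.Reader;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.EnglishPorterFilterFactory;
import org.apache.solr.analysis.RemoveDuplicatesTokenFilter;
import org.apache.solr.analysis.StopFilterFactory;
import org.apache.solr.analysis.WordDelimiterFilterFactory;

class SchemaAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Stop words. "words" and "protected" below only name files; presumably a
        // ResourceLoader has to be handed to each factory (ResourceLoaderAware.inform)
        // before create() is usable. That step is missing here and is the open question.
        HashMap<String, String> stopInit = new HashMap<String, String>();
        stopInit.put("words", "stopwords.txt");
        stopInit.put("ignoreCase", Boolean.TRUE.toString());
        StopFilterFactory stopFilterFactory = new StopFilterFactory();
        stopFilterFactory.init(stopInit);

        // Word delimiter options (the duplicated catenateWords entry is dropped).
        HashMap<String, String> wordDelimInit = new HashMap<String, String>();
        wordDelimInit.put("generateWordParts", "1");
        wordDelimInit.put("generateNumberParts", "1");
        wordDelimInit.put("catenateWords", "1");
        wordDelimInit.put("catenateNumbers", "1");
        wordDelimInit.put("catenateAll", "0");
        wordDelimInit.put("splitOnCaseChange", "1");
        WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
        wordDelimiterFilterFactory.init(wordDelimInit);

        // Porter stemmer with a list of protected words.
        HashMap<String, String> porterInit = new HashMap<String, String>();
        porterInit.put("protected", "protwords.txt");
        EnglishPorterFilterFactory englishPorterFilterFactory = new EnglishPorterFilterFactory();
        englishPorterFilterFactory.init(porterInit);

        // Build the chain innermost first: tokenizer, stop words, word delimiter,
        // lower case, stemmer, duplicate removal.
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = stopFilterFactory.create(stream);
        stream = wordDelimiterFilterFactory.create(stream);
        stream = new LowerCaseFilter(stream);
        stream = englishPorterFilterFactory.create(stream);
        return new RemoveDuplicatesTokenFilter(stream);
    }
}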
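
A side note on why ignoreCase is set for the stop filter: in the chain above
the stop filter wraps the tokenizer directly, so it sees tokens before they
are lowercased. A toy sketch in plain Java shows the effect of that ordering.
The class ChainOrderSketch and its methods are stand-ins written for this
illustration; they are not the Lucene/Solr classes and do not reproduce their
exact behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-ins for illustration only; these are not the Lucene/Solr classes.
class ChainOrderSketch {

    /** Splits on runs of whitespace. */
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Drops stop words; with ignoreCase the lookup uses the lowercased token. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords,
                                        boolean ignoreCase) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            String key = ignoreCase ? t.toLowerCase(Locale.ROOT) : t;
            if (!stopWords.contains(key)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Lowercases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Same order as in the chain above: tokenize, remove stop words, then lowercase. */
    static List<String> analyze(String text, Set<String> stopWords, boolean ignoreCase) {
        return lowerCase(removeStopWords(whitespaceTokenize(text), stopWords, ignoreCase));
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(analyze("The quick fox", stopWords, true));
        System.out.println(analyze("The quick fox", stopWords, false));
    }
}
```

With ignoreCase the capitalised "The" is dropped; without it, "The" passes
the stop filter and is only lowercased afterwards.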

On Tue, Jul 5, 2011 at 1:00 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:

> nice...where?
>
> I'm trying to figure out 2 things:
> 1) How to create an analyzer that corresponds to the one in the schema.xml.
> [...]
> 2) I'd like to see the code that creates it reading it from schema.xml .
> [...]


-- 
Regards,
K. Gabriele


Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

nice...where?

I'm trying to figure out 2 things:
1) How to create an analyzer that corresponds to the one in the schema.xml.

    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>

2) I'd like to see the code that creates it by reading schema.xml.

On Tue, Jul 5, 2011 at 12:33 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> No. SolrJ only builds input docs from NutchDocument objects. Solr will do
> analysis. The integration is analogous to XML post of Solr documents.

-- 
Regards,
K. Gabriele


Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Posted by Markus Jelsma <ma...@openindex.io>.

No. SolrJ only builds input documents from NutchDocument objects; Solr itself
does the analysis. The integration is analogous to posting Solr documents as
XML.
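
To make the XML analogy concrete, here is a rough sketch in plain Java of
what such a post carries: whole, untokenized field values wrapped in Solr's
<add><doc><field name="..."> update format. The class SolrAddXmlSketch, the
field names and the URL are made up for illustration; Nutch goes through
SolrJ rather than building XML by hand. Nothing on the client side tokenizes
the values, which is why the analyzers from schema.xml only come into play
inside Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; Nutch itself uses SolrJ, not hand-built XML.
class SolrAddXmlSketch {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an <add><doc>...</doc></add> update message from raw field values. */
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Field names and values are examples only.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.org/");
        doc.put("title", "Nutch & Solr");
        System.out.println(toAddXml(doc));
    }
}
```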

On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
> Hello,
> 
> I'm trying to understand better Nutch and Solr integration. My
> understanding is that Documents are added to Solr index from SolrWriter's
> write(NutchDocument doc) method. But does it make any use of the
> WhitespaceTokenizerFactory?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350