Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/08/25 14:18:47 UTC

Random Patch

Hi,

I've had a random patch lying around one of my desktops for some time.

1) schema.xml is straightforward enough
2) MoreIndexingFilter.java seems to be an issue of reliability
(possibly). Maybe the HTTP header content information can be
unreliable at times? Does anyone have an opinion on this? At the
moment I am none the wiser but keen to gather views and/or
experiences.
3) Again in SolrWriter.java this may be an issue of reliability
(accuracy?) regarding the proposed explicit equals() check instead
of the existing reference comparison (!=). Any thoughts?
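
To illustrate point 3, here is a minimal, self-contained sketch of how
a reference comparison and an equals() comparison can disagree for two
strings with the same contents. The class CopyKeyCheck and its
mapCopyKey helper are hypothetical stand-ins written for this example;
they are not the Nutch SolrMappingReader, and the assumption that the
copy key can come back as a distinct String object is mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CopyKeyCheck {

    // Hypothetical stand-in for solrMapping.mapCopyKey(): returns the mapped
    // copy-field name, or an equal-but-distinct String when no mapping exists.
    static String mapCopyKey(Map<String, String> copyMap, String key) {
        String mapped = copyMap.get(key);
        // ASSUMPTION: new String(...) forces a distinct object, as could
        // happen when a key is read from a config file or built at runtime.
        return mapped != null ? mapped : new String(key);
    }

    // Old check: compares object references.
    static boolean differsByReference(String copy, String key) {
        return copy != key;
    }

    // Patched check: compares string contents.
    static boolean differsByValue(String copy, String key) {
        return !copy.equals(key);
    }

    public static void main(String[] args) {
        Map<String, String> copyMap = new LinkedHashMap<>();
        copyMap.put("title", "title_copy");

        String key = "content"; // no copy mapping configured in this sketch
        String copy = mapCopyKey(copyMap, key);

        // Reference check reports a difference although the contents match,
        // so the value would be added to the same field a second time.
        System.out.println(differsByReference(copy, key)); // true
        // Contents check sees the same name and skips the duplicate add.
        System.out.println(differsByValue(copy, key));     // false
    }
}
```

Under these assumptions the equals() form only adds the copy field
when its name really differs from the original key, which looks like
the intent of the patched line.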

I did not produce this patch and can't remember how or why it ended up
on my desktop! So apologies for the randomness of this one.

Thanks

Lewis


Index: conf/schema.xml
===================================================================
--- conf/schema.xml     (revision 1145734)
+++ conf/schema.xml     (working copy)
@@ -113,6 +113,8 @@
         <!-- fields for creativecommons plugin -->
         <field name="cc" type="string" stored="true" indexed="true"
             multiValued="true"/>
+
+        <field name="tld" type="string" stored="false" indexed="false"/>
     </fields>
     <uniqueKey>id</uniqueKey>
     <defaultSearchField>content</defaultSearchField>

Index: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
===================================================================
--- src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java       (revision 1053817)
+++ src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java       (working copy)
@@ -172,7 +172,7 @@
    */
   private NutchDocument addType(NutchDocument doc, WebPage page, String url) {
     MimeType mimeType = null;
-    Utf8 contentType = page.getFromHeaders(new Utf8(HttpHeaders.CONTENT_TYPE));
+    Utf8 contentType = page.getContentType();
     if (contentType == null) {
       // Note by Jerome Charron on 20050415:
       // Content Type not solved by a previous plugin
Index: src/java/org/apache/nutch/indexer/solr/SolrWriter.java
===================================================================
--- src/java/org/apache/nutch/indexer/solr/SolrWriter.java      (revision 1053817)
+++ src/java/org/apache/nutch/indexer/solr/SolrWriter.java      (working copy)
@@ -56,7 +56,7 @@
       for (final String val : e.getValue()) {
         inputDoc.addField(solrMapping.mapKey(e.getKey()), val);
         String sCopy = solrMapping.mapCopyKey(e.getKey());
-        if (sCopy != e.getKey()) {
+        if (! sCopy.equals(e.getKey())) {
                inputDoc.addField(sCopy, val);
         }
       }

-- 
Lewis

Re: Random Patch

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Came across this issue :0)

https://issues.apache.org/jira/browse/NUTCH-956

which seems to clear up the mystery with this one.

It also reminded me of this recent conversation [0].

I will test and get a JUnit case written before attaching a new patch
to the issue.

[0] http://www.mail-archive.com/user%40nutch.apache.org/msg07272.html

-- 
Lewis