Posted to dev@nutch.apache.org by "Jianyun He (JIRA)" <ji...@apache.org> on 2012/07/01 10:10:49 UTC

[jira] [Created] (NUTCH-1416) Can not update the index

Jianyun He created NUTCH-1416:
---------------------------------

             Summary: Can not update the index
                 Key: NUTCH-1416
                 URL: https://issues.apache.org/jira/browse/NUTCH-1416
             Project: Nutch
          Issue Type: Bug
          Components: indexer
            Reporter: Jianyun He


When the index is updated, there is no guarantee that the indexed content is the latest. The reduce() method of the IndexerMapReduce class contains the following code:
public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchDocument> output, Reporter reporter) throws IOException {
   ……
   } else if (value instanceof ParseData) {  
        parseData = (ParseData)value;
   } else if (value instanceof ParseText) { 
        parseText = (ParseText)value;
   }
   ……
}
For example, I fetched web page A 30 days ago and have now fetched it again. The key A then corresponds to two ParseData objects (located in different segments). This code does not compare the fetch times and simply overwrites the previous value, so the final value may be the old one.
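
One possible way to avoid the overwrite is sketched below. It is only an illustration, not the fix applied in Nutch: it assumes each value can be paired with the time its page was fetched, and how that timestamp would be obtained from ParseData or its segment is not covered in this report. TimedParse, LatestParseSelector and pickLatest are hypothetical names introduced for the example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a parse value tagged with its fetch time.
// It is NOT a Nutch class; it only models "a value plus a timestamp".
final class TimedParse {
    final long fetchTime;   // milliseconds since the epoch (assumed to be available)
    final String payload;   // stands in for the ParseData / ParseText content

    TimedParse(long fetchTime, String payload) {
        this.fetchTime = fetchTime;
        this.payload = payload;
    }
}

final class LatestParseSelector {
    private LatestParseSelector() {}

    // Returns the value with the greatest fetch time, or null if there are none.
    // Unlike plain overwriting, the result does not depend on iteration order.
    static TimedParse pickLatest(Iterator<TimedParse> values) {
        TimedParse latest = null;
        while (values.hasNext()) {
            TimedParse candidate = values.next();
            if (latest == null || candidate.fetchTime > latest.fetchTime) {
                latest = candidate;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // The newer copy arrives first; plain overwriting would keep the old one.
        List<TimedParse> newestFirst = Arrays.asList(
                new TimedParse(2_000L, "new copy of page A"),
                new TimedParse(1_000L, "old copy of page A"));
        System.out.println(pickLatest(newestFirst.iterator()).payload);
    }
}
```

With the comparison in place, the same value is kept whichever segment's record reaches the reducer first.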

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (NUTCH-1416) Can not update the index

Posted by "Jianyun He (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jianyun He updated NUTCH-1416:
------------------------------

    Description: 
When the index is updated, there is no guarantee that the indexed content is the latest. The reduce() method of the IndexerMapReduce class contains the following code:
public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchDocument> output, Reporter reporter) throws IOException {
   ……
   } else if (value instanceof ParseData) {  
      parseData = (ParseData)value;
   } else if (value instanceof ParseText) { 
      parseText = (ParseText)value;
   }
   ……
}
For example, I fetched web page A 30 days ago and have now fetched it again. The key A then corresponds to two ParseData objects (located in different segments). This code does not compare the fetch times and simply overwrites the previous value, so the final value may be the old one.

  was:
When the index is updated, there is no guarantee that the indexed content is the latest. The reduce() method of the IndexerMapReduce class contains the following code:
public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchDocument> output, Reporter reporter) throws IOException {
   ……
   } else if (value instanceof ParseData) {  
        parseData = (ParseData)value;
   } else if (value instanceof ParseText) { 
        parseText = (ParseText)value;
   }
   ……
}
For example, I fetched web page A 30 days ago and have now fetched it again. The key A then corresponds to two ParseData objects (located in different segments). This code does not compare the fetch times and simply overwrites the previous value, so the final value may be the old one.

    


       

[jira] [Commented] (NUTCH-1416) Can not update the index

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424549#comment-13424549 ] 

Hudson commented on NUTCH-1416:
-------------------------------

Integrated in nutch-trunk-maven #373 (See [https://builds.apache.org/job/nutch-trunk-maven/373/])
    NUTCH-1416 Remove o.a.n.metadata.Office (Revision 1366847)

     Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/metadata/Metadata.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Office.java

                


       

[jira] [Commented] (NUTCH-1416) Can not update the index

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424685#comment-13424685 ] 

Hudson commented on NUTCH-1416:
-------------------------------

Integrated in Nutch-trunk #1912 (See [https://builds.apache.org/job/Nutch-trunk/1912/])
    NUTCH-1416 Remove o.a.n.metadata.Office (Revision 1366847)

     Result = SUCCESS
lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1366847
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/metadata/Metadata.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Office.java

                


       

[jira] [Updated] (NUTCH-1416) Can not update the index

Posted by "Jianyun He (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jianyun He updated NUTCH-1416:
------------------------------

    Priority: Critical  (was: Major)
    
