Posted to user@nutch.apache.org by Stefano Cherchi <st...@yahoo.it> on 2011/06/30 13:18:06 UTC

Nutch + Hadoop + Solr: custom plugins cause EOFException while indexing

Hi everybody,

I have a 4-node Nutch + Hadoop + Solr stack that indexes a bunch of external websites of, say, house sale ads.

Everything worked fine as long as I used only the default Nutch IndexingFilter, but then I needed some customization to improve the quality of the search results.

So I developed a set of plugins (one for each site I need to index) that add some custom fields to the index (say house price, location, name of the seller and so on) and extract those specific data from the HTML of the parsed page.

Again, everything ran smoothly as long as the structure of the parsed pages remained unchanged. Unfortunately, some of the sites I want to index have recently been restyled, and my troubles began: all the crawling, fetching, merging, etc. seem to complete without errors, but when Nutch invokes LinkDb (just before the solrindexer) to prepare the data to be put into Solr, it returns a lot of EOFExceptions, the indexing job fails, and no document is added to Solr, even if just one of the plugins fails.

My questions are: where could the problem be, and how can I avoid the complete failure of the indexing job? The plugin that parses the modified site should manage to fail "cleanly" without affecting the whole process.
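
For instance (just a sketch of what I have in mind, not tested against my real plugins), I would like each indexing extension to swallow its own errors and hand the document back untouched instead of killing the whole job:

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        try {
            // ... site-specific field extraction and doc.add(...) calls ...
            return doc;
        } catch (Exception e) {
            // Fail "cleanly": log the problem and return the document
            // without the custom fields instead of aborting the job.
            LOGGER.warn("Custom indexing failed for URL " + url + ", skipping custom fields", e);
            return doc;
        }
    }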

This is the code of the indexing part of the plugin:




package it.company.searchengine.nutch.plugin.indexer.html.company;

import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
import org.apache.nutch.parse.Parse;

public class SiteURL1Indexer implements IndexingFilter {

    private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);
    public static final String POSITION_KEY = "position";
    public static final String LOCATION_KEY = "location";
    public static final String COMPANY_KEY = "company";
    public static final String DESCRIPTION_KEY = "description";
    private Configuration conf;

    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
    }

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        String position = null;
        String where = null;
        String company = null;
        String description = null;

        position = parse.getData().getParseMeta().get(POSITION_KEY);
        where = parse.getData().getParseMeta().get(LOCATION_KEY);
        company = parse.getData().getParseMeta().get(COMPANY_KEY);
        description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);

        if (SiteURL1Parser.validateField(position)
                && SiteURL1Parser.validateField(where)
                && SiteURL1Parser.validateField(company)
                && SiteURL1Parser.validateField(description)) {

            LOGGER.debug("Adding position: [" + position + "] for URL: " + url.toString());
            doc.add(POSITION_KEY, position);

            LOGGER.debug("Adding location: [" + position + "] for URL: " + url.toString());
            doc.add(LOCATION_KEY, where);

            LOGGER.debug("Adding company: [" + position + "] for URL: " + url.toString());
            doc.add(COMPANY_KEY, company);

            LOGGER.debug("Adding description: [" + position + "] for URL: " + url.toString());
            doc.add(DESCRIPTION_KEY, description);

            return doc;

        } else {
            return doc;
        }
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}




I'm running Nutch 1.0. Yes, I know it's an old version, but I cannot afford to migrate to a newer one at the moment.


Thanks a lot for any hint.

S
 
---------------------------------- 
"Anyone proposing to run Windows on servers should be prepared to explain 
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham


"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)

Re: Nutch + Hadoop + Solr: custom plugins cause EOFException while indexing

Posted by Stefano Cherchi <st...@yahoo.it>.


>________________________________
>Da: lewis john mcgibbney <le...@gmail.com>
>A: user@nutch.apache.org; Stefano Cherchi <st...@yahoo.it>
>Inviato: Venerdì 8 Luglio 2011 22:20
>Oggetto: Re: Nutch + Hadoop + Solr: custom plugin cause EOFException while indexing
>
>
>Hi Stefano,
>
>Any further on with this?
>
>I have not looked too much into your code, and sorry to state the obvious but it sounds like a definite link between the number of plugins to the duplication of the data.
>


Hi Lewis,

yes, of course the amount of duplicated data depends on the number of activated plugins.


>Are you sure that every one of your plugins is only handling the page it is supposed to? It sounds more like a case that each of your plugins is probably being activiated for all of your pages, doing the  parsing stage on every page and also indexing. This would explain the 17X duplication. Are you using any URLfilters or anything of this nature within your plugins? 
>

This is actually the weirdest part of the issue: at first I thought that each plugin was indexing the data of all the sites, even the ones it wasn't intended for. So I randomly checked the indexed data by URL, expecting to find information unrelated to that URL in the other fields. Instead, I found that each plugin is managing (and indexing) only the data it is supposed to. E.g., we crawl an ad for a house on sale in Milan for 125.000 EUR (yes, this is just pure fantasy). The URL is http://mycheaphouses.it/milan/?id=1236654. If I check the data indexed into Solr for that URL, I find that the field "location" contains the string "Milan" 17 times, the field "price" the float "125000.00" 17 times, and so on. So it looks like the plugin mycheaphousesIT (and its parsing + indexing extensions) is managing only its own site.
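
If, as I suspect, the problem is that all 17 IndexingFilters run on every document and they all read the same parseMeta keys ("position", "location", "company", "description"), then each filter re-adds the same values and I get 17 identical copies. A possible guard (untested sketch; I am assuming NutchDocument exposes a getField method in this version, it may be named getFieldValue depending on the release) would be to bail out when another plugin has already populated the field:

    // Hypothetical guard at the top of filter(): skip documents that
    // already carry the shared custom fields.
    if (doc.getField(POSITION_KEY) != null) {
        return doc;
    }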

S

Re: Nutch + Hadoop + Solr: custom plugins cause EOFException while indexing

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Stefano,

Any further on with this?

I have not looked too much into your code, and sorry to state the obvious,
but it sounds like there is a definite link between the number of plugins
and the duplication of the data.

Are you sure that every one of your plugins is only handling the page it is
supposed to? It sounds more like each of your plugins is being activated
for all of your pages, performing the parsing stage on every page and also
indexing. This would explain the 17x duplication. Are you using any URL
filters or anything of this nature within your plugins?
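
If not, one (untested) option would be to gate each indexing filter on its
own site inside filter(), mirroring the URL check your parse filter already
performs, e.g.:

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Only act on the one site this plugin is written for.
        if (!url.toString().contains("SiteURL1.com")) {
            return doc;
        }
        // ... existing field extraction and doc.add(...) calls ...
        return doc;
    }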




-- 
Lewis

Re: Nutch + Hadoop + Solr: custom plugins cause EOFException while indexing

Posted by Stefano Cherchi <st...@yahoo.it>.
Hello Markus,

sorry for my late reply. I have finally solved the issue. Actually, it was my fault: I wasn't running Nutch 1.0 (as I said) but 1.2; presumably the data written by one version could not be read back by the other. I have now rolled back to 1.0 and everything is working fine.

But another strange behavior showed up: as I said in my first mail, I have a plugin for each site I want to index. Each plugin creates 4 custom fields in the index. At the moment 17 of these plugins are activated. Now, when Nutch puts data into Solr, each custom field is filled with 17 identical strings. The data saved in the custom fields are right, so each plugin is correctly extracting data from the site it is intended for, but at indexing time each datum is duplicated 17 times.

Quite weird.

The indexing extension is the one I already pasted in my first message; here is the code of the parsing extension of the same plugin:


################PARSING EXTENSION##################
package it.company.searchengine.nutch.plugin.parser.html.company;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class SiteURL1Parser implements HtmlParseFilter {

    public static final String POSITION_KEY = "position";
    public static final String LOCATION_KEY = "location";
    public static final String COMPANY_KEY = "company";
    public static final String DESCRIPTION_KEY = "description";
    private static final Logger logger = Logger.getLogger(SiteURL1Parser.class);
    private static final String HTML_TAG_PATTERN = "<[^><]{0,}>";
    private Configuration conf = null;

    public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {

        String currentURL = null;
        String urlPattern = null;
        Pattern pattern = null;
        Matcher matcher = null;

        currentURL = content.getUrl();

        //  SiteURL1.COM
        if (currentURL.contains("SiteURL1.com")) {
            urlPattern = "^http://www.SiteURL1.com/offer[-\\w]{3,}[?]id[=][0-9]{5,10}$";
            pattern = Pattern.compile(urlPattern);
            matcher = pattern.matcher(currentURL);

            if (matcher.find()) {
                return filterSiteURL1(content, parseResult);
            }
        }

        return parseResult;
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public static boolean validateField(String field) {

        if (field == null)
            return false;

        if (field.equalsIgnoreCase(""))
            return false;

        if (field.equalsIgnoreCase("NULL"))
            return false;

        return true;
    }

    private void printExtractedFields(String position, String company, String location, String description) {
        System.out.println("");
        System.out.println("- POSITION:    " + position);
        System.out.println("- COMPANY:     " + company);
        System.out.println("- LOCATION:    " + location);
        System.out.println("- DESCRIPTION: " + description);
    }

    private ParseResult filterSiteURL1(Content content, ParseResult parseResult) {

        logger.debug("Parsing URL: " + content.getUrl());

        BufferedReader reader = null;
        String currentURL = null;
        String line = null;
        Parse parse = null;
        Metadata metadata = null;

        String company = null;
        String position = null;
        String location = null;
        String description = null;

        boolean intoLocation = false;
        boolean intoDescription = false;

        Pattern pattern = null;
        Matcher matcher = null;

        try {

            currentURL = content.getUrl();
            description = "";

            reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(content.getContent())));
            pattern = Pattern.compile(HTML_TAG_PATTERN);

            while ((line = reader.readLine()) != null) {

                if (line.contains("<tr><td valign=top><a href='/join/check_session.jsp?idfonte=")) {
                    line = line.trim();
                    matcher = pattern.matcher(line);
                    company = matcher.replaceAll("").trim();
                    continue;
                }

                if (line.contains("<tr><td><a href='/join/check_session.jsp?id=")) {
                    line = line.trim();
                    matcher = pattern.matcher(line);
                    position = matcher.replaceAll("").trim();
                    continue;
                }

                if (line.contains("<tr><td class=\"txt-black-regular-10\"></br><strong>Place</strong>:")) {
                    intoLocation = true;
                    continue;

                } else if (intoLocation) {
                    line = line.trim();

                    if (validateField(line)) {
                        location = line;
                        location = location.replaceAll("&nbsp;&nbsp;-&nbsp;&nbsp;", " - ");
                        intoLocation = false;
                    }

                    continue;
                }

                if (line.contains("<span class=\"txt-black-regular-10\"><strong>Requirements</strong></span>:<br/><a href='/join/check_session.jsp?id=")) {

                    intoDescription = true;
                    line = line.trim();
                    matcher = pattern.matcher(line);
                    description = matcher.replaceAll("").trim();

                } else if (intoDescription) {

                    line = line.trim();

                    if (validateField(line)) {

                        String tmpDescription = null;
                        matcher = pattern.matcher(line);
                        tmpDescription = matcher.replaceAll("").trim();

                        if (validateField(tmpDescription)) {

                            if (validateField(description)) {
                                description = description + " " + tmpDescription;

                            } else {
                                description = tmpDescription;
                            }
                        }
                    }
                }

                if (line.contains("</a></span><br/><br/>")) {

                    description = description.replaceAll("[\\s]{1,}", " ").trim();

                    while (description.startsWith("Requirements")) {

                        description = description.replaceFirst("Requirements", "").trim();

                        if (description.startsWith(":")) {
                            description = description.substring(1).trim();
                        }
                    }

                    intoDescription = false;
                    break;
                }

                continue;
            }


            if (validateField(position)) {

                parse = parseResult.get(currentURL);
                metadata = parse.getData().getParseMeta();
                metadata.add(POSITION_KEY, position);

                if (validateField(company)) {
                    metadata.add(COMPANY_KEY, company);

                } else {
                    metadata.add(COMPANY_KEY, "Unknow");
                }

                if (validateField(location)) {
                    metadata.add(LOCATION_KEY, location);

                } else {
                    metadata.add(LOCATION_KEY, "Unknow");
                }

                if (validateField(description)) {
                    metadata.add(DESCRIPTION_KEY, description);

                } else {
                    metadata.add(DESCRIPTION_KEY, "");
                }
            }

        } catch (IOException e) {
            logger.warn("IOException encountered parsing file:", e);
        } finally {
            // Close the reader even if parsing throws, so it is not leaked.
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    logger.warn("Could not close reader:", e);
                }
            }
        }

        return parseResult;
    }

   
}


---------------------------------- 
"Anyone proposing to run Windows on servers should be prepared to explain 
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham


"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)


>________________________________
>From: Markus Jelsma <ma...@openindex.io>
>To: user@nutch.apache.org
>Cc: Stefano Cherchi <st...@yahoo.it>
>Sent: Thursday, 30 June 2011 13:29
>Subject: Re: Nutch + Hadoop + Solr: custom plugin cause EOFException while indexing
>
>I'm not sure, but you could provide your stacktrace. It would at least make it
>easier.
>
>On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
>> [...]
>
>-- 
>Markus Jelsma - CTO - Openindex
>http://www.linkedin.com/in/markus17
>050-8536620 / 06-50258350
>
>
> 

Re: Nutch + Hadoop + Solr: custom plugin cause EOFException while indexing

Posted by Matthias Naber <na...@informatik.hu-berlin.de>.

Hey,

I recently developed a custom ParseFilter and an IndexingFilter with
quite similar symptoms (the DOM did not get entirely parsed for some
unknown reason, with no exceptions at all). The solution was to disable
some of the optional plugins I had copied from an example page, and
suddenly everything worked fine.

So maybe try removing the plugins one by one until everything works
again, to narrow your problem down a bit.
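
In Nutch the active plugin set is controlled by the plugin.includes
property (a regular expression over plugin ids) in conf/nutch-site.xml,
so you can do exactly that by dropping one custom plugin id at a time
and rerunning the job. A minimal sketch, with a hypothetical value and
a hypothetical "indexer-siteurl1" id for one of the custom plugins:

<!-- conf/nutch-site.xml: hypothetical plugin.includes value. Remove one
     custom plugin id at a time to find the one that breaks the job. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|indexer-siteurl1</value>
</property>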

On 30.06.11 13:29, Markus Jelsma wrote:
> I'm not sure, but you could provide your stacktrace. It would at least make
> it easier.
>
> On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
>> [...]
>



Re: Nutch + Hadoop + Solr: custom plugin cause EOFException while indexing

Posted by Markus Jelsma <ma...@openindex.io>.
I'm not sure, but you could provide your stacktrace. It would at least make it
easier.
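
(In a default Nutch 1.0 setup the full trace usually ends up in
logs/hadoop.log when running in local mode, or in the Hadoop
tasktracker logs when running distributed.)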

On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
> [...]

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350