Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/06/06 15:08:25 UTC

[jira] Issue Comment Edited: (NUTCH-466) Flexible segment format

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501921 ] 

Doğacan Güney edited comment on NUTCH-466 at 6/6/07 6:08 AM:
-------------------------------------------------------------

I haven't tested it yet, but the code looks solid. I have a few comments, though:

* There is no way to define the order of execution for ParseFilter-s. We always seem to need ordering for filters in one way or another, so it may be best to just add it and be done with it.

* The ParseFilters.filter method throws IOException. I think it would be better if it threw a ParseFilterException or something similar, in keeping with IndexingFilters -> IndexingException and ScoringFilters -> ScoringFilterException.

* There are a few places that iterate over Map.keySet() and then fetch each value with Map.get(key). FindBugs suggests iterating over Map.entrySet() in these cases.

* When someone requests data for more than one part, we start a couple of threads, receive the data, and join the threads. Nutch also does this for summaries. Is starting and joining threads again and again a problem? In particular, if you are clustering, you may end up starting and joining _100_ threads for each query. Perhaps a thread pool would help? This is not directly related to this patch; it is just something that bugs me.

* I just realized that there is no ParseFilter class either :)
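
The entrySet() and thread-pool remarks above could be sketched together roughly as follows. This is only an illustration, not code from the patch: the class PartFetchSketch, the methods fetchPart and fetchParts, the pool size, and the "thumbnails" part name are all hypothetical, and fetchPart merely stands in for whatever remote call actually retrieves a part's data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from the Nutch patch.
class PartFetchSketch {

    // Stand-in for the remote call that returns the data of one segment part.
    static String fetchPart(String partName) {
        return "data-for-" + partName;
    }

    // Submit one task per part to a shared pool, instead of starting and
    // joining a fresh thread for every part of every query.
    static Map<String, String> fetchParts(ExecutorService pool, List<String> partNames)
            throws InterruptedException, ExecutionException {
        Map<String, Future<String>> pending = new LinkedHashMap<>();
        for (String name : partNames) {
            pending.put(name, pool.submit(() -> fetchPart(name)));
        }
        Map<String, String> results = new LinkedHashMap<>();
        // entrySet() hands back key and value together, so no extra
        // Map.get(key) lookup is needed for each key.
        for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
            results.put(entry.getKey(), entry.getValue().get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // The pool is created once and reused across queries.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<String> parts = new ArrayList<>();
            parts.add("parse_text");
            parts.add("thumbnails"); // hypothetical custom part name
            for (Map.Entry<String, String> entry : fetchParts(pool, parts).entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the design is that the pool outlives any single query, so the cost of creating threads is paid once rather than on every request. Whether that cost matters in practice for Nutch would still need to be measured.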



> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than is possible with the current segment format. Quite often this is binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData; the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.
