You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/04/01 22:44:32 UTC

[jira] Created: (NUTCH-466) Flexible segment format

Flexible segment format
-----------------------

Key: NUTCH-466
URL: https://issues.apache.org/jira/browse/NUTCH-466
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki
Assigned To: Andrzej Bialecki

In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.

Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.

Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.

Example applications:

* storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
* storing pre-tokenized version of plain text for faster snippet generation
* storing linguistically tagged text for sophisticated data mining
* storing image thumbnails

etc, etc ...

I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Work started: (NUTCH-466) Flexible segment format

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-466 started by Andrzej Bialecki .

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-466) Flexible segment format

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500459 ] 

Doğacan Güney commented on NUTCH-466:
-------------------------------------

I skimmed through it and it looks awesome. I will try to test it better later, but it seems patch is missing ParseFilters class.


> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (NUTCH-466) Flexible segment format

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501921 ] 

Doğacan Güney edited comment on NUTCH-466 at 6/6/07 6:08 AM:
-------------------------------------------------------------

I still haven't tested it yet, but the code looks solid. I have a couple of comments, though:

* One can't define order of execution for ParseFilter-s. It seems we always need it in one way or another in filters so it may be good to just add ordering and be done with it.

* ParseFilters.filter method throws IOException. I think it will be better if it throws a ParseFilterException or whatever, keeping in spirit with IndexingFilters -> IndexingException and ScoringFilters -> ScoringFilterException.

* There are few uses of iterating over Map.keySet() then getting the value with Map.get(key). FindBugs suggests that it is better to iterate over Map.entrySet() in these cases.

* When someone requests more than 1 part-data, we start a couple of threads, receive data and join threads. Nutch also does this for summary. Is starting and joining threads again and again a problem? Especially, if you are clustering you may end up starting and joining _100_ threads for each query. Perhaps a thread pool? This is not completely related to this patch, it is just something that bugs me.

* I just realized that there is no ParseFilter class either :)


 was:
I still haven't tested it yet, but the code looks solid. I have a couple of comments, though:

* One can't define order of execution for ParseFilter-s. It seems we always need it in one way or another in filters so it may be good to just add ordering and be done with it.

* ParseResult.filter method throws IOException. I think it will be better if it throws a ParseFilterException or whatever, keeping in spirit with IndexingFilters -> IndexingException and ScoringFilters -> ScoringFilterException.

* There are few uses of iterating over Map.keySet() then getting the value with Map.get(key). FindBugs suggests that it is better to iterate over Map.entrySet() in these cases.

* When someone requests more than 1 part-data, we start a couple of threads, receive data and join threads. Nutch also does this for summary. Is starting and joining threads again and again a problem? Especially, if you are clustering you may end up starting and joining _100_ threads for each query. Perhaps a thread pool? This is not completely related to this patch, it is just something that bugs me.

* I just realized that there is no ParseFilter class either :)

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Work stopped: (NUTCH-466) Flexible segment format

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-466 stopped by Andrzej Bialecki .

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-466) Flexible segment format

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-466:
------------------------------------

    Attachment: segmentparts.patch

This patch contains the following modifications and additions:

* a new extension point, ParseFilter, to post-process results of parsing just before they are written out.

* related chained filter facade, ParseFilters.

* Parse / ParseImpl changes to support passing of arbitrary data records "tagged" by Text keys.

* ParseOutputFormat changes to store such arbitrary data in MapFile-s named after the Text keys.

* NutchBean and DistributedSearch protocol changes to support retrieving records from named segment parts.

* PdfParser changes to support storing of HTML preview for PDF files.

Please review and comment.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-466) Flexible segment format

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-466:
------------------------------------

    Attachment: ParseFilters.java

Add missing file.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-466) Flexible segment format

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501921 ] 

Doğacan Güney commented on NUTCH-466:
-------------------------------------

I still haven't tested it yet, but the code looks solid. I have a couple of comments, though:

* One can't define order of execution for ParseFilter-s. It seems we always need it in one way or another in filters so it may be good to just add ordering and be done with it.

* ParseResult.filter method throws IOException. I think it will be better if it throws a ParseFilterException or whatever, keeping in spirit with IndexingFilters -> IndexingException and ScoringFilters -> ScoringFilterException.

* There are few uses of iterating over Map.keySet() then getting the value with Map.get(key). FindBugs suggests that it is better to iterate over Map.entrySet() in these cases.

* When someone requests more than 1 part-data, we start a couple of threads, receive data and join threads. Nutch also does this for summary. Is starting and joining threads again and again a problem? Especially, if you are clustering you may end up starting and joining _100_ threads for each query. Perhaps a thread pool? This is not completely related to this patch, it is just something that bugs me.

* I just realized that there is no ParseFilter class either :)

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-466) Flexible segment format

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485977 ] 

Enis Soztutar commented on NUTCH-466:
-------------------------------------

This patch will indeed resolve many issues related to storing extra information about the crawl. IMO MapFiles will do the job. 
Searcher API can be extended with an interface with a method like 

      <E extends Writable> getInfo(<T extends Writable>); 

The implementing class should have a map of Class to MapFiles. 

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-466) Flexible segment format

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486003 ] 

Andrzej Bialecki  commented on NUTCH-466:
-----------------------------------------

> I thought that the map will be from class names to directory names.

Well, then you would have to pass the whole class name in an RPC call - I think we should come up with a way that uses at most one byte to select the right part.

> Do you think that we sould also move HitDetailer, HitSummarizer, HitContent and Searcher to this plugin system

Yes, that was my plan - the same way we did it with indexing plugins - although I intend to create a separate issue regarding the use of separate index / page / summary servers, to avoid complicating this patch too much..

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-466) Flexible segment format

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485996 ] 

Enis Soztutar commented on NUTCH-466:
-------------------------------------

>> There may be many parts that use the same key/value classes in MapFiles.

Yes indeed you are right. I haven't thought about several parts having the same classes. 

>> I think the API should select the part by name (String) or some other ID, with a map of byte ID-s to directory names

I thought that the map will be from class names to directory names. 

>>I think we should use the plugin model, with a registry of segment parts that are active for the current configuration

Do you think that we sould also move HitDetailer, HitSummarizer, HitContent and Searcher to this plugin system. And should we break the multiple functionality in NutchBean and DistributedSearch$Client, and allow for separate index, segment servers? 

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-466) Flexible segment format

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485986 ] 

Andrzej Bialecki  commented on NUTCH-466:
-----------------------------------------

Minor nit: MapFile requires that the key is a WritableComparable.

I'm not sure I understand the last part of your comment .. There may be many parts that use the same key/value classes in MapFiles. I think the API should select the part by name (String) or some other ID, with a map of byte ID-s to directory names (this is to avoid excessive overhead during RPC). Regarding the implementing classes - I think we should use the plugin model, with a registry of segment parts that are active for the current configuration.

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>
> In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment "parts", with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.