Posted to dev@nutch.apache.org by "Alan Tanaman (JIRA)" <ji...@apache.org> on 2006/12/28 20:23:22 UTC

[jira] Created: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

index-extra plugin creates additional fields in the index, based on configurable logic
--------------------------------------------------------------------------------------

                 Key: NUTCH-422
                 URL: http://issues.apache.org/jira/browse/NUTCH-422
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 0.8.1
         Environment: All environments
            Reporter: Alan Tanaman


Extract from the Readme file:

A.  Introduction

    The index-extra plugin allows you to configure additional fields that you wish to be added to the index, based on one of the following sources:
      - The parsed text
      - Meta data fields
      - Previously created document-to-be-indexed fields
      - Plain constant string
      - Java expression combining one or more of the above, and resolving to a string
    A regex can also be applied to any of the above, allowing fields to be created based on patterns extracted from the source.
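
    To illustrate the regex option, here is a minimal Java sketch of pulling a field value out of parsed text with a capture group.  The class RegexFieldSketch, the method extract, the sample text and the pattern are all hypothetical and are not taken from the plugin; the real plugin is driven by index-extra-conf.xml, whose format is not shown here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the index-extra plugin. */
final class RegexFieldSketch {
    private RegexFieldSketch() {}

    /**
     * Returns the first capture group of the first match of regex in source,
     * or an empty Optional when nothing matches or the regex has no group.
     */
    static Optional<String> extract(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (m.find() && m.groupCount() >= 1) {
            return Optional.ofNullable(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample of parsed text; any source listed above could be used instead.
        String parsedText = "Report ref: INV-2006-0042, issued by Accounts.";
        // The extracted value could then be stored in an additional index field.
        System.out.println(extract(parsedText, "ref:\\s*([A-Z]+-\\d{4}-\\d+)"));
    }
}
```

    Returning an Optional keeps the "no match" case explicit, so a caller can decide whether to skip the field or fall back to a constant.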

B.  Installation

    1)  Binaries only:  Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
                        Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
                        Enable the plugin by updating the nutch-site.xml file
    2)  Source code:    Always refer to the Nutch wiki for detailed instructions on building Nutch.  In short:
                        Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
                        Update the build.xml in NUTCHDIR/src/plugin to include the plugin
                        Update the NUTCHDIR/default.properties file to include the plugin
                        Run ant to build
                        Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
                        Enable the plugin by updating the nutch-site.xml file

C.  Known Issues

    1)  For this plugin to work correctly on any document field, the other index filters must run before it,
    so that all basic document fields have already been generated.  To do this, configure the indexingfilter.order
    property.  (Please see patch NUTCH-421, which enables the indexingfilter.order property.  If this patch is not applied,
    the plugin will still work, but will not be able to use document fields created by other index filter plugins.)

    2)  At this stage, field boost cannot be used, because Nutch scoring overrides the field boost with its own
    document-level boost calculation.  This occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
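
    As an illustration of known issue 1, the following Java sketch shows one way an ordering property could place the extra filter after the others.  It assumes the property value is a whitespace-separated list of filter class names, and that filters not named in it are appended in their original order; both are assumptions for this sketch and may differ from what NUTCH-421 actually does.  FilterOrderSketch and applyOrder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of ordering index filters by a property value. */
final class FilterOrderSketch {
    private FilterOrderSketch() {}

    /**
     * Returns the available filter names, with those named in orderProperty
     * first (in the given order) and all remaining filters afterwards.
     */
    static List<String> applyOrder(List<String> available, String orderProperty) {
        List<String> ordered = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            for (String name : orderProperty.trim().split("\\s+")) {
                // Ignore names that are not installed, and ignore repeats.
                if (available.contains(name) && !ordered.contains(name)) {
                    ordered.add(name);
                }
            }
        }
        for (String name : available) {
            if (!ordered.contains(name)) {
                ordered.add(name);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList(
                "org.apache.nutch.indexer.extra.ExtraIndexingFilter",
                "org.apache.nutch.indexer.basic.BasicIndexingFilter",
                "org.apache.nutch.indexer.more.MoreIndexingFilter");
        // Listing the extra filter last lets it see fields created by the others.
        String order = "org.apache.nutch.indexer.basic.BasicIndexingFilter "
                + "org.apache.nutch.indexer.more.MoreIndexingFilter "
                + "org.apache.nutch.indexer.extra.ExtraIndexingFilter";
        System.out.println(applyOrder(available, order));
    }
}
```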


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by w00_008 <ta...@sina.com>.
Hi, I ran into the same problem you had earlier and couldn't solve it by
myself, so could you explain in detail how you solved it?
I would be grateful for your reply!

JIRA jira@apache.org wrote:
> 
> 
>     [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478688 ]
> 
> Nathan ter Bogt commented on NUTCH-422:
> ---------------------------------------
> 
> Sorry all,
> 
> I managed to get this working. Just had some issues with the jdom library
> (or lack thereof).
> I must have just misread the error earlier.
> 
> Fantastic plugin idea too, thanks!
> 
>> index-extra plugin creates additional fields in the index, based on
>> configurable logic
>> --------------------------------------------------------------------------------------
>>
>>                 Key: NUTCH-422
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>>             Project: Nutch
>>          Issue Type: New Feature
>>          Components: indexer
>>    Affects Versions: 0.8.1
>>         Environment: All environments
>>            Reporter: Alan Tanaman
>>         Assigned To: Sami Siren
>>         Attachments: index-extra-v1.0-bin-java1.5.zip,
>> index-extra-v1.0-source.zip
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/-jira--Created%3A-%28NUTCH-422%29-index-extra-plugin-creates-additional-fields-in-the-index%2C-based-on-configurable-logic-tf2891798.html#a13753106
Sent from the Nutch - Dev mailing list archive at Nabble.com.


RE: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by Alan Tanaman <al...@idna-solutions.com>.
Nathan,

Sorry I didn't get back to you sooner.  There are a few messy things that we need to clear up in this plugin, as previously commented by Sami Siren.  As for the jdom, we need to change the plugin configuration so that it points to the existing jdom library.  Glad you got it to work though!

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com
-----Original Message-----
From: Nathan ter Bogt (JIRA) [mailto:jira@apache.org] 
Sent: 07 March 2007 05:16
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic


    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478688 ] 

Nathan ter Bogt commented on NUTCH-422:
---------------------------------------

Sorry all,

I managed to get this working. Just had some issues with the jdom library (or lack thereof).
I must have just misread the error earlier.

Fantastic plugin idea too, thanks!

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Assigned To: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>



[jira] Updated: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "garpinc (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

garpinc updated NUTCH-422:
--------------------------

    Attachment: ExtraIndexingFilter.java

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>            Assignee: Sami Siren
>         Attachments: ExtraIndexingFilter.java, index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>


[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Peter Boot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554034 ] 

Peter Boot commented on NUTCH-422:
----------------------------------

I am getting errors when trying to compile this plugin against trunk.
Has anyone managed to update it?
Is there a better way to get Nutch to create termVectors?

[echo] Compiling plugin: index-extra
[javac] Compiling 3 source files to /opt/nutch-trunk/build/index-extra/classes
[javac] /opt/nutch-trunk/src/plugin/index-extra/src/java/org/apache/nutch/indexer/extra/ExtraIndexingFilter.java:61: org.apache.nutch.indexer.extra.ExtraIndexingFilter is not abstract and does not override abstract method filter(org.apache.lucene.document.Document,org.apache.nutch.parse.Parse,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.nutch.crawl.Inlinks) in org.apache.nutch.indexer.IndexingFilter
[javac] public class ExtraIndexingFilter implements IndexingFilter {
[javac]   ^
[javac] Note: /opt/nutch-trunk/src/plugin/index-extra/src/java/org/apache/nutch/indexer/extra/ExtraIndexingFilter.java uses or overrides a deprecated API.
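
The compiler message suggests that the IndexingFilter interface in trunk declares a five-argument filter method which the plugin class does not implement. Below is a minimal Java sketch of the shape such a fix might take. It uses stand-in types nested inside a hypothetical SignatureSketch class so that it compiles on its own; only the parameter list is taken from the error message, while the return type, the field name and the method body are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, self-contained illustration; the nested types are stand-ins. */
final class SignatureSketch {
    private SignatureSketch() {}

    // Stand-ins for the Lucene, Nutch and Hadoop classes named in the error.
    static final class Document {
        final Map<String, String> fields = new LinkedHashMap<>();
    }
    static final class Parse {
        final String text;
        Parse(String text) { this.text = text; }
    }
    static final class Text {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }
    static final class CrawlDatum {}
    static final class Inlinks {}

    /** Parameter list as reported by javac; the return type is assumed. */
    interface IndexingFilter {
        Document filter(Document doc, Parse parse, Text url,
                        CrawlDatum datum, Inlinks inlinks);
    }

    /** Implements the five-argument method, so it is no longer "abstract". */
    static final class UpdatedFilter implements IndexingFilter {
        @Override
        public Document filter(Document doc, Parse parse, Text url,
                               CrawlDatum datum, Inlinks inlinks) {
            // Example of adding an extra field; "sourceUrl" is a made-up name.
            doc.fields.put("sourceUrl", url.toString());
            return doc;
        }
    }
}
```

If the plugin's existing filter method takes a different parameter list, one option would be to keep its logic and have the new five-argument method delegate to it.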

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>            Assignee: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>


[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Alex McLintock (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728717#action_12728717 ] 

Alex McLintock commented on NUTCH-422:
--------------------------------------

May I ask if this code still works with Nutch 1.0?

Thanks

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>            Assignee: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>


RE: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by Alan Tanaman <al...@idna-solutions.com>.
Many thanks for your feedback.

Do you have any specifics in mind regarding examples?  I will try and
include any additional ones that we implement.  I know there are a lot of
options, but it is a little hard to see what is unclear from my end -- as I
am so involved in the development, another point-of-view on this is welcome.
;)

Regarding query-extra, we are not currently using the Nutch bean, so the
need has not arisen for us at this point in time, but I can see how that
would be useful.  I guess you could adapt one of the existing query-xxxx
plugins fairly easily by having them read the xml configuration file to see
what fields are potentially available in the index.

As for the boost, I included that as it seems like a useful thing to be able
to control the boost of a single field, although we don't need that at this
very moment.  The line of code in the org.apache.nutch.indexer.Indexer's
reduce method could be overridden, but I'm not yet sure how that would
affect the overall scoring (scoring is one of my really weak points).
Perhaps one of the scoring experts could give some guidance on this?

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: nutch.newbie (JIRA) [mailto:jira@apache.org] 
Sent: 02 January 2007 11:03
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-422) index-extra plugin creates additional
fields in the index, based on configurable logic


    [ http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461710 ]

nutch.newbie commented on NUTCH-422:
------------------------------------

I have got it to work, but it took me a while to index fields properly. It's
a rather complex plugin and definitely requires more documentation and
examples from a newbie perspective. I can see my indexed field using Luke.
However, I don't have the necessary query plugin to do a search (find 'xyz'
in field 'author', meta data, etc.). Any plans for a query-extra plugin,
where you define query items via query-extra-conf.xml or something similar?

Also, the boost feature is important. Do you have any patch to solve known
issue 2?

Good work for getting a complex plugin to work not so complexly :-0)

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: http://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>


[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "nutch.newbie (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461710 ] 

nutch.newbie commented on NUTCH-422:
------------------------------------

I have got it to work, but it took me a while to index fields properly. It's a rather complex plugin and definitely requires more documentation and examples from a newbie perspective. I can see my indexed field using Luke. However, I don't have the necessary query plugin to do a search (find 'xyz' in field 'author', meta data, etc.). Any plans for a query-extra plugin, where you define query items via query-extra-conf.xml or something similar?

Also, the boost feature is important. Do you have any patch to solve known issue 2?

Good work for getting a complex plugin to work not so complexly :-0)

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: http://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>

RE: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by Alan Tanaman <al...@idna-solutions.com>.
Thomas,
I can't say whether it is or isn't possible, as I'm afraid I don't know what
types of fields you would need for yacy.  BTW regarding terminology, in
Lucene (the Nutch engine) you have an index, documents and fields, not
database, records and columns, although in a way they are similar.
But as a rule, the plugin is freely configurable, so if you know what logic
is needed to extract the field value from the meta data, the parsed text of
the source document, or another indexed field (for example, one created by
index-basic or index-more), you should be able to configure it using this
plugin. 
Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
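
To make the "logic to extract the field value" idea a little more concrete, here is a minimal sketch of deriving a value from a source string with a regex. It is not taken from the plugin; the class FieldExtractor, the method extract and the field name invoiceId are hypothetical, and the real plugin is driven by its XML configuration rather than by code like this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of index-extra: derives a field value by
// applying a regex to a source string (parsed text, a metadata value, or
// the value of an already-created document field).
class FieldExtractor {

    // Returns the first capture group of the first match, or the whole match
    // if the pattern has no groups. Returns null when nothing matches.
    static String extract(String source, String regex) {
        if (source == null) {
            return null;
        }
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find()) {
            return null;
        }
        return m.groupCount() >= 1 ? m.group(1) : m.group();
    }

    public static void main(String[] args) {
        String parsedText = "Order ref: INV-20070103 shipped to London";
        // "invoiceId" is an assumed field name used only for illustration.
        String invoiceId = extract(parsedText, "(INV-\\d+)");
        System.out.println("invoiceId = " + invoiceId);
    }
}
```

The same helper could be pointed at a metadata value or at a field created by index-basic or index-more, provided those filters have already run.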

-----Original Message-----
From: "Thomas Müller" [mailto:thomasasta@gmx.net] 
Sent: 03 January 2007 06:35
To: nutch-dev@lucene.apache.org
Subject: Re: [jira] Commented: (NUTCH-422) index-extra plugin creates
additional fields in the index, based on configurable logic

Alan, would it be possible to use this plugin to create columns in the Nutch
database that correspond to those of the www.yacy.net search engine (also in
Java), so that Nutch can run as a hybrid with the YaCy p2p system?
This would mean that the database of each Nutch node can be distributed over
this p2p system to other YaCy AND Nutch nodes.
Then we would only need a YaCy plugin as well, and each website would be
crawled twice in each central Nutch search engine, once for Nutch and once for
YaCy, but both would rely on the same database.

Thanks




Re: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by Thomas Müller <th...@gmx.net>.
Alan, would it be possible to use this plugin to create columns in the Nutch database that correspond to those of the www.yacy.net search engine (also in Java), so that Nutch can run as a hybrid with the YaCy p2p system?
This would mean that the database of each Nutch node can be distributed over this p2p system to other YaCy AND Nutch nodes.
Then we would only need a YaCy plugin as well, and each website would be crawled twice in each central Nutch search engine, once for Nutch and once for YaCy, but both would rely on the same database.

Thanks


-------- Original Message --------
Date: Tue, 2 Jan 2007 14:57:27 -0800 (PST)
From: "Alan Tanaman (JIRA)" <ji...@apache.org>
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

> 
>     [
> http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461863 ] 
> 
> Alan Tanaman commented on NUTCH-422:
> ------------------------------------
> 
> Many thanks for your feedback.
> 
> Do you have any specifics in mind regarding examples?  I will try and
> include any additional ones that we implement.  I know there are a lot of
> options, but it is a little hard to see what is unclear from my end -- as I am
> so involved in the development, another point-of-view on this is welcome.
> ;)
> 
> Regarding query-extra, we are not currently using the Nutch bean, so the
> need has not arisen for us at this point in time, but I can see how that
> would be useful.  I guess you could adapt one of the existing query-xxxx
> plugins fairly easily by having them read the xml configuration file to see what
> fields are potentially available in the index.
> 
> As for the boost, I included that as it seems like a useful thing to be
> able to control the boost of a single field, although we don't need that at
> this very moment.  The line of code in the
> org.apache.nutch.indexer.Indexer's
> reduce method could be overridden, but I'm not yet sure how that would
> affect the overall scoring (scoring is one of my really weak points).
> Perhaps one of the scoring experts could give some guidance on this?
> 

[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Alan Tanaman (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461863 ] 

Alan Tanaman commented on NUTCH-422:
------------------------------------

Many thanks for your feedback.

Do you have any specifics in mind regarding examples?  I will try and include any additional ones that we implement.  I know there are a lot of options, but it is a little hard to see what is unclear from my end -- as I am so involved in the development, another point-of-view on this is welcome.
;)

Regarding query-extra, we are not currently using the Nutch bean, so the need has not arisen for us at this point in time, but I can see how that would be useful.  I guess you could adapt one of the existing query-xxxx plugins fairly easily by having them read the xml configuration file to see what fields are potentially available in the index.

As for the boost, I included that as it seems like a useful thing to be able to control the boost of a single field, although we don't need that at this very moment.  The line of code in org.apache.nutch.indexer.Indexer's reduce method could be overridden, but I'm not yet sure how that would affect the overall scoring (scoring is one of my really weak points).  Perhaps one of the scoring experts could give some guidance on this?

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: http://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip

[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464351 ] 

Sami Siren commented on NUTCH-422:
----------------------------------

A couple more points:
- source files use tabs for indentation
- file headers are not consistent and should be updated
- the module bundles jdom, which is already part of Nutch; it should use the existing one instead
- no JUnit tests; not strictly a requirement, but having some would be a big plus!

> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Assigned To: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip

[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Alan Tanaman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464738 ] 

Alan Tanaman commented on NUTCH-422:
------------------------------------

Sami,

About your questions - thank you for looking at this plugin.  I will be
seeing to all of them and will respond over the next week, as I currently
have a couple of stressed clients...

Best regards,
Alan



[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Nathan ter Bogt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478683 ] 

Nathan ter Bogt commented on NUTCH-422:
---------------------------------------

Has anyone got the binary version of this module to work? I get to the indexing part and get the following error:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

And this is what I get in my hadoop log:

2007-03-07 15:26:33,272 INFO  indexer.Indexer - Optimizing index.
2007-03-07 15:26:33,275 WARN  mapred.LocalJobRunner - job_qq3l2z
java.lang.NoClassDefFoundError: org/jdom/JDOMException
        at org.apache.nutch.indexer.extra.ExtraIndexingFilter.filter(ExtraIndexingFilter.java:68)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:72)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:235)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

Any help would be greatly appreciated. Lastly, I'm all for the query-extra plugin also.
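
The NoClassDefFoundError in the log says that org.jdom.JDOMException could not be loaded when ExtraIndexingFilter ran. One plausible cause, though not confirmed anywhere in this thread, is that the jdom jar is not visible to the class loader in use. As a rough diagnostic, a check along the following lines can show whether a class is loadable. ClasspathCheck and isClassAvailable are hypothetical names, and since Nutch plugins are loaded through their own class loaders, a result from a standalone run is only a hint.

```java
// Hypothetical diagnostic, not part of Nutch or index-extra: reports whether
// a class can be located by the class loader that loaded this helper.
class ClasspathCheck {

    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // Covers NoClassDefFoundError raised while resolving the class.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace above.
        String name = args.length > 0 ? args[0] : "org.jdom.JDOMException";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```

Running it with the same classpath as the indexing job would be the most informative setup.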


[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Morille Jerome (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789515#action_12789515 ] 

Morille Jerome commented on NUTCH-422:
--------------------------------------

No, it doesn't work with Nutch version 1.0.
It still uses the Lucene Document and not the NutchDocument of the new APIs.
That is easy to correct.

If you want to use it, take care with this code; from a quick read you can see:
 - an InputStream is opened and never closed
 - exceptions are caught and ignored

The idea is good:
the plugins in the Nutch distribution don't make it easy to customize index data.

There is still work to do!
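
For the two code issues noted above, a sketch of the usual remedy follows. It is not a patch against the plugin; ConfigReader and readAll are hypothetical names. It uses try-with-resources, which needs Java 7 or later; on the Java 1.5 that the attached binaries target, a finally block calling close() would do the same job.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from index-extra.
class ConfigReader {

    // Reads the whole stream as UTF-8 text. try-with-resources closes the
    // reader (and the wrapped stream) even if reading fails, and any
    // IOException is propagated to the caller instead of being swallowed.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "<fields/>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

Letting the exception reach the caller means a broken configuration file fails the indexing job visibly rather than silently producing documents without the extra fields.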



> index-extra plugin creates additional fields in the index, based on configurable logic
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>            Assignee: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>
> Extract from the Readme file:
> A.  Introduction
>     The index-extra plugin allows you to configure additional fields that you wish to be added to the index, based on one of the following sources:
>       - The parsed text
>       - Meta data fields
>       - Previously created document-to-be-indexed fields
>       - Plain constant string
>       - Java expression combining one or more of the above, and resolving to a string
>     A regex can also be applied to any of the above, allowing fields to be created based on patterns extracted from the source.
> B.  Installation
>     1)  Binaries only:  Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
>                         Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>                         Enable the plugin by updating the nutch-site.xml file
>     2)  Source code:    Always refer to the Nutch wiki for detailed instructions on building Nutch.  In short:
>                         Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
>                         Update the build.xml in NUTCHDIR/src/plugin to include plugin
>                         Update the NUTCHDIR/default.properties file to include plugin
>                         run ant to build
>                         Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>                         Enable the plugin by updating the nutch-site.xml file
> C.  Known Issues
>     1)  For this plugin to work correctly on any document field, the other index filters must run
>     before it, so that all basic document fields have already been generated.  To do this, configure the
>     indexingfilter.order property.  (Please see patch NUTCH-421 to enable the indexingfilter.order property.
>     If this patch is not applied, the plugin will still work, but will not be able to use document fields
>     created by other index filter plugins.)
>     2)  At this stage, field boost cannot be used, as Nutch scoring overrides the field boost with its own
>     document-level boost calculation.  This occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
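
The README extract describes two ideas that a small example may make more concrete: deriving a new field by applying a regex to a source, and Known Issue 1, where a filter can only use fields that earlier filters have already created. The following is a minimal, self-contained sketch of both. It does not use the Nutch API or the plugin's own code, every name in it (ExtraFieldSketch, basicFilter, extraFilter, the "title" and "version" fields) is hypothetical, and it uses java.util.regex purely for illustration rather than whichever regex library the plugin uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from Nutch or the index-extra plugin.
class ExtraFieldSketch {

    // Returns the first capture group of the first match, or null when the
    // source is missing or the pattern does not match.
    static String extract(String source, String regex) {
        if (source == null) {
            return null;
        }
        Matcher m = Pattern.compile(regex).matcher(source);
        return m.find() ? m.group(1) : null;
    }

    // Stand-in for a "basic" index filter that creates the title field.
    static void basicFilter(Map<String, String> doc) {
        doc.put("title", "Release notes for Nutch 0.8.1");
    }

    // Stand-in for an "extra" filter: derives a version field from the
    // previously created title field by applying a regex to it.
    static void extraFilter(Map<String, String> doc) {
        String version = extract(doc.get("title"), "(\\d+\\.\\d+(?:\\.\\d+)?)");
        if (version != null) {
            doc.put("version", version);
        }
    }

    // Applies the filters to a fresh document in the order given.
    static Map<String, String> run(List<Consumer<Map<String, String>>> filters) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Consumer<Map<String, String>> filter : filters) {
            filter.accept(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Consumer<Map<String, String>>> basicFirst =
                Arrays.asList(ExtraFieldSketch::basicFilter, ExtraFieldSketch::extraFilter);
        List<Consumer<Map<String, String>>> extraFirst =
                Arrays.asList(ExtraFieldSketch::extraFilter, ExtraFieldSketch::basicFilter);

        // With the basic filter first, the extra filter can see "title".
        System.out.println(run(basicFirst));
        // With the extra filter first, "title" does not exist yet, so no
        // "version" field is created.
        System.out.println(run(extraFirst));
    }
}
```

In this sketch the second ordering still produces a document, just without the derived field. That matches the README's note that, without an enforced filter order, the plugin still works but cannot use fields created by other index filter plugins.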

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "garpinc (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845572#action_12845572 ] 

garpinc commented on NUTCH-422:
-------------------------------

I don't see the meta tags in the Parse object. What might I be doing wrong?

I've attached Nutch 1.0 code




[jira] Assigned: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren reassigned NUTCH-422:
--------------------------------

    Assignee: Sami Siren



[jira] Updated: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Alan Tanaman (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-422?page=all ]

Alan Tanaman updated NUTCH-422:
-------------------------------

    Attachment: index-extra-v1.0-bin-java1.5.zip
                index-extra-v1.0-source.zip



[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464347 ] 

Sami Siren commented on NUTCH-422:
----------------------------------

Is there a reason for the two jakarta-regexp jars (v1.2 and v1.3) in the source package?



[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

Posted by "Nathan ter Bogt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478688 ] 

Nathan ter Bogt commented on NUTCH-422:
---------------------------------------

Sorry all,

I managed to get this working. I just had some issues with the JDOM library (or lack thereof).
I must have misread the error earlier.

Fantastic plugin idea too, thanks!

