Posted to user@nutch.apache.org by Chip Calhoun <cc...@aip.org> on 2011/09/15 22:50:36 UTC

Machine readable vs. human readable URLs.

Hi everyone,

We'd like to use Nutch and Solr to replace an existing Verity search that's become a bit long in the tooth. In our Verity search, we have a hack which allows each document to have a machine-readable URL which is indexed (generally an xml document), and a human-readable URL which we actually send users to. Has anyone done the same with Nutch and Solr?

Thanks,
Chip

Re: Machine readable vs. human readable URLs.

Posted by lewis john mcgibbney <le...@gmail.com>.

Thanks for clarifying this for us, Julien :0)

On Mon, Sep 19, 2011 at 10:23 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> > In addition, it looks like you are misinterpreting how the urlmeta plugin
> > works, Chip. It is designed to pick up additional meta tags with name and
> > content values respectively, e.g.
> >
> > <meta name="humanURL" content="blahblahblah">
> >
>
> Sorry Lewis, but it does not do that at all. See the link I gave earlier for a
> description of urlmeta. I agree that the name is misleading; it does not
> extract the content from the page but simply uses the crawldb metadata.
>
>
> >
> > The plugin then gets this data, as well as any additional values added in
> > the urlmeta.tags property within nutch-site.xml, and adds this to the index,
> > which can then be queried.
> >
> > Does this make sense?
> >
> > On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche <
> > lists.digitalpebble@gmail.com> wrote:
> >
> > > Hi
> > >
> > > Since the info is available thanks to the injection you can use the
> > > url-meta plugin as-is and won't need to have a custom version. See
> > > https://issues.apache.org/jira/browse/NUTCH-855
> > >
> > > Apart from that, do not modify the content of \runtime\local\conf\
> > > before re-compiling with ANT as this will be overwritten. Either modify
> > > $NUTCH/conf/nutch-site.xml or recompile THEN modify.
> > >
> > > As Lewis suggested, check the logs and see if the plugin is activated,
> > > etc.
> > >
> > > J.
> > >
> > >
> > > On 19 September 2011 21:03, Chip Calhoun <cc...@aip.org> wrote:
> > >
> > > > Hi Lewis,
> > > >
> > > > > My probably wrong understanding was that I'm supposed to add the tags
> > > > > for my new field to my list of seed URLs. So if I have a seed URL
> > > > > followed by
> > > > > "       \t humanURL=http://www.aip.org/history/ead/20110369.html",
> > > > > I get a new field called "humanURL" which is populated with the string
> > > > > I've specified for that specific URL. I may just be greatly
> > > > > misunderstanding how this plugin works.
> > > >
> > > > > I've checked my Nutch logs now and it looks like nothing happened. The
> > > > > new field does at least show up in the Solr admin UI's schema, but
> > > > > clearly my problem is on the Nutch end of things.
> > > >
> > > > -----Original Message-----
> > > > From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> > > > Sent: Monday, September 19, 2011 3:34 PM
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Machine readable vs. human readable URLs.
> > > >
> > > > Hi Chip,
> > > >
> > > > > There is no need to run ant war; there is no war target in the
> > > > > build.xml file for Nutch >= 1.3.
> > > > >
> > > > > Can you explain more about adding 'the tags to %NUTCH_HOME%' etc.? Do
> > > > > you mean you've added your seed URLs?
> > > > >
> > > > > Have you had a look at any of your log output as to whether the urlmeta
> > > > > plugin is loaded and used when fetching?
> > > > >
> > > > > You should be able to get info on your schema, fields etc. within the
> > > > > Solr admin UI.
> > > >
> > > > > On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <cc...@aip.org> wrote:
> > > >
> > > > > Hi Julien,
> > > > >
> > > > > Thanks, that's encouraging. I'm trying to make this work, and I'm
> > > > > definitely missing something. I hope I'm not too far off the mark.
> > > > > I've started with the instructions at
> > > > > > http://wiki.apache.org/nutch/WritingPluginExample . If I understand
> > > > > > this properly, the changes I needed to make were the following:
> > > > >
> > > > > In Nutch:
> > > > > Paste the prescribed block of code into
> > > > > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to
> > > > > look for and run the urlmeta plugin.
> > > > > In %NUTCH_HOME%, run "ant war".
> > > > > > Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this
> > > > > > file now looks like:
> > > > > > "http://www.aip.org/history/ead/20110369.xml \t humanURL=http://www.aip.org/history/ead/20110369.html"
> > > > >
> > > > > In Solr:
> > > > > > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new
> > > > > > line consists of:
> > > > > > " <field name="humanURL" type="string" stored="true" indexed="false"/>"
> > > > >
> > > > > I've redone the indexing, and my new field still doesn't show up in
> > > > > the search results. Can you tell where I'm going wrong?
> > > > >
> > > > > Thanks,
> > > > > Chip
> > > > >
> > > > > -----Original Message-----
> > > > > From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> > > > > Sent: Friday, September 16, 2011 4:37 AM
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Machine readable vs. human readable URLs.
> > > > >
> > > > > Hi Chip,
> > > > >
> > > > > Should simply be a matter of creating a custom field with an
> > > > > IndexingFilter, you can then use it in any way you want on the SOLR
> > > > > side
> > > > >
> > > > > Julien
> > > > >
> > > > > On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > > We'd like to use Nutch and Solr to replace an existing Verity search
> > > > > > > that's become a bit long in the tooth. In our Verity search, we have
> > > > > > > a hack which allows each document to have a machine-readable URL
> > > > > > > which is indexed (generally an xml document), and a human-readable
> > > > > > > URL which we actually send users to. Has anyone done the same with
> > > > > > > Nutch and Solr?
> > > > > >
> > > > > > Thanks,
> > > > > > Chip
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > *
> > > > > *Open Source Solutions for Text Engineering
> > > > >
> > > > > http://digitalpebble.blogspot.com/
> > > > > http://www.digitalpebble.com
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Lewis*
> > > >
> > >
> > >
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
*Lewis*

Re: Machine readable vs. human readable URLs.

Posted by lewis john mcgibbney <le...@gmail.com>.

Thanks Chip, we'll get this added in due course.

On Wed, Sep 21, 2011 at 3:00 PM, Chip Calhoun <cc...@aip.org> wrote:

> For my own sake I wish I could think of a way in which it was unclear, but
> no; I just screwed up. I could maybe see reinforcing that the urls document
> has to be saved as a tab-delimited file, so a newbie like me won't look at
> the examples and think this is meant to be a text file. Otherwise, both the
> plugin and the documentation work great!
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> Sent: Wednesday, September 21, 2011 3:05 AM
> To: user@nutch.apache.org
> Subject: Re: Machine readable vs. human readable URLs.
>
> Hi Chip,
>
> Was there anything in particular you found misleading about the plugin
> example on the wiki? I am keen to make it as clear as possible.
>
> Thank you
>
> Lewis
>
> On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun <cc...@aip.org> wrote:
>
> > Hi Julien,
> >
> > Thanks for clarifying this! I've got it working now. Instead of
> > seeding with a proper tab-delimited file created in Excel, I had been
> > wrong-headedly seeding it with a text file that just had tabs in it.
> > They look the same, but it makes a difference. Thanks!
> >
> > Chip
> >



-- 
*Lewis*

RE: Machine readable vs. human readable URLs.

Posted by Chip Calhoun <cc...@aip.org>.

For my own sake I wish I could think of a way in which it was unclear, but no; I just screwed up. I could maybe see reinforcing that the urls document has to be saved as a tab-delimited file, so a newbie like me won't look at the examples and think this is meant to be a text file. Otherwise, both the plugin and the documentation work great!
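For anyone checking their own seed list, here is a minimal Python sketch. It assumes each seed line is split on real tab characters and that every field after the URL is read as a key=value pair. The helper name parse_seed_line is hypothetical and not part of Nutch.

```python
# Hypothetical helper for checking a seed list; not part of Nutch itself.

def parse_seed_line(line):
    """Split one seed line into (url, metadata) on real tab characters."""
    fields = line.rstrip("\r\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        key, sep, value = field.strip().partition("=")
        if sep and key:
            metadata[key] = value
    return url, metadata


# A line with a real tab character between the URL and the metadata.
good_line = (
    "http://www.aip.org/history/ead/20110369.xml"
    "\thumanURL=http://www.aip.org/history/ead/20110369.html"
)
# A line where the two characters backslash and "t" were typed instead.
bad_line = (
    "http://www.aip.org/history/ead/20110369.xml"
    " \\t humanURL=http://www.aip.org/history/ead/20110369.html"
)

for line in (good_line, bad_line):
    url, metadata = parse_seed_line(line)
    status = "ok" if metadata else "no tab-separated metadata found"
    print(status, metadata)
```

A line that reports no metadata here has no real tab between the URL and the key=value pair, which is worth ruling out before looking at the plugin configuration.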

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com] 
Sent: Wednesday, September 21, 2011 3:05 AM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

Was there anything in particular you found misleading about the plugin example on the wiki? I am keen to make it as clear as possible.

Thank you

Lewis

On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun <cc...@aip.org> wrote:

> Hi Julien,
>
> Thanks for clarifying this! I've got it working now. Instead of 
> seeding with a proper tab-delimited file created in Excel, I had been 
> wrong-headedly seeding it with a text file that just had tabs in it. 
> They look the same, but it makes a difference. Thanks!
>
> Chip
>



--
*Lewis*

Re: Machine readable vs. human readable URLs.

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Chip,

Was there anything in particular you found misleading about the plugin
example on the wiki? I am keen to make it as clear as possible.

Thank you

Lewis

On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun <cc...@aip.org> wrote:

> Hi Julien,
>
> Thanks for clarifying this! I've got it working now. Instead of seeding
> with a proper tab-delimited file created in Excel, I had been wrong-headedly
> seeding it with a text file that just had tabs in it. They look the same,
> but it makes a difference. Thanks!
>
> Chip
>



-- 
*Lewis*

RE: Machine readable vs. human readable URLs.

Posted by Chip Calhoun <cc...@aip.org>.

Hi Julien,

Thanks for clarifying this! I've got it working now. Instead of seeding with a proper tab-delimited file created in Excel, I had been wrong-headedly seeding it with a text file that just had tabs in it. They look the same, but it makes a difference. Thanks!

Chip
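One way to avoid the problem described above is to generate the seed file from a script, so the separator is guaranteed to be a real tab character. The following is only a sketch in Python: the function name write_seed_file and the file name seed.txt are hypothetical, and it assumes the seed format is a URL followed by tab-separated key=value pairs.

```python
import os
import tempfile


def write_seed_file(path, entries):
    """Write (url, metadata_dict) pairs as tab-delimited seed lines."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for url, metadata in entries:
            fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
            handle.write("\t".join(fields) + "\n")


# Example entry taken from this thread; the output location is a temp directory.
seed_path = os.path.join(tempfile.mkdtemp(), "seed.txt")
write_seed_file(
    seed_path,
    [
        (
            "http://www.aip.org/history/ead/20110369.xml",
            {"humanURL": "http://www.aip.org/history/ead/20110369.html"},
        )
    ],
)

with open(seed_path, encoding="utf-8") as handle:
    first_line = handle.readline()
print(repr(first_line))
```

Printing the line with repr() makes the separator visible as \t, which is easier to verify than looking at the file in an editor.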

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Monday, September 19, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

> In addition, it looks like you are misinterpreting how the urlmeta
> plugin works, Chip. It is designed to pick up additional meta tags with
> name and content values respectively, e.g.
>
> <meta name="humanURL" content="blahblahblah">
>

Sorry Lewis, but it does not do that at all. See the link I gave earlier for a description of urlmeta. I agree that the name is misleading; it does not extract the content from the page but simply uses the crawldb metadata.


>
> The plugin then gets this data, as well as any additional values added
> in the urlmeta.tags property within nutch-site.xml, and adds this to the
> index, which can then be queried.
>
> Does this make sense?
>
> On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche < 
> lists.digitalpebble@gmail.com> wrote:
>
> > Hi
> >
> > Since the info is available thanks to the injection you can use the 
> > url-meta plugin as-is and won't need to have a custom version.  See
> > https://issues.apache.org/jira/browse/NUTCH-855
> >
> > Apart from that do not modify the content of  \runtime\local\conf\ 
> > before re-compiling with ANT as this will be overwritten. Either 
> > modify $NUTCH/conf/nutch-site.xml or recompile THEN modify.
> >
> > As Lewis suggested, check the logs and see if the plugin is activated,
> > etc.
> >
> > J.
> >
> >
> > On 19 September 2011 21:03, Chip Calhoun <cc...@aip.org> wrote:
> >
> > > Hi Lewis,
> > >
> > > > My probably wrong understanding was that I'm supposed to add the tags
> > > > for my new field to my list of seed URLs. So if I have a seed URL
> > > > followed by
> > > > "       \t humanURL=http://www.aip.org/history/ead/20110369.html",
> > > > I get a new field called "humanURL" which is populated with the string
> > > > I've specified for that specific URL. I may just be greatly
> > > > misunderstanding how this plugin works.
> > > >
> > > > I've checked my Nutch logs now and it looks like nothing happened. The
> > > > new field does at least show up in the Solr admin UI's schema, but
> > > > clearly my problem is on the Nutch end of things.
> > >
> > > -----Original Message-----
> > > From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> > > Sent: Monday, September 19, 2011 3:34 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Machine readable vs. human readable URLs.
> > >
> > > Hi Chip,
> > >
> > > There is no need to run ant war, there is no war target in the >= 
> > > Nutch
> > 1.3
> > > build.xml file.
> > >
> > > Can you explian more about adding 'the tags to %NUTCH_HOME% etc 
> > > etc. Do
> > you
> > > mean you've added your seed URLs?
> > >
> > > Have you had a look at any of your log output as to whether the 
> > > urlmeta plugin is loaded and used when fetching?
> > >
> > > You should be able to get info on your schema, fields etc within 
> > > the
> Solr
> > > admin UI
> > >
> > > On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <cc...@aip.org>
> wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > Thanks, that's encouraging. I'm trying to make this work, and 
> > > > I'm definitely missing something. I hope I'm not too far off the mark.
> > > > I've started with the instructions at 
> > > > http://wiki.apache.org/nutch/WritingPluginExample . If I 
> > > > understand this properly, the changes I needed to make were the following:
> > > >
> > > > In Nutch:
> > > > Paste the prescribed block of code into 
> > > > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch 
> > > > to look for and run the urlmeta plugin.
> > > > In %NUTCH_HOME%, run "ant war".
> > > > Add the tags to %NUTCH_HOME% \runtime\local\urls\nutch. A line 
> > > > in
> this
> > > file
> > > > now looks like: "http://www.aip.org/history/ead/20110369.xml
>  \t
> > > > humanURL=http://www.aip.org/history/ead/20110369.html"
> > > >
> > > > In Solr:
> > > > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . 
> > > > The
> new
> > > > line consists of: " <field name="humanURL" type="string"
> stored="true"
> > > > indexed="false"/>"
> > > >
> > > > I've redone the indexing, and my new field still doesn't show up 
> > > > in the search results. Can you tell where I'm going wrong?
> > > >
> > > > Thanks,
> > > > Chip
> > > >
> > > > -----Original Message-----
> > > > From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> > > > Sent: Friday, September 16, 2011 4:37 AM
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Machine readable vs. human readable URLs.
> > > >
> > > > Hi Chip,
> > > >
> > > > Should simply be a matter of creating a custom field with an 
> > > > IndexingFilter, you can then use it in any way you want on the 
> > > > SOLR side
> > > >
> > > > Julien
> > > >
> > > > On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We'd like to use Nutch and Solr to replace an existing Verity
> search
> > > > > that's become a bit long in the tooth. In our Verity search, 
> > > > > we
> have
> > > > > a hack which allows each document to have a machine-readable 
> > > > > URL which is indexed (generally an xml document), and a 
> > > > > human-readable URL which we actually send users to. Has anyone 
> > > > > done the same with
> > > Nutch and Solr?
> > > > >
> > > > > Thanks,
> > > > > Chip
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *
> > > > *Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>
>
>
> --
> *Lewis*
>



--
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Machine readable vs. human readable URLs.

Posted by Julien Nioche <li...@gmail.com>.

> In addition, it looks like you are misinterpreting how the urlmeta plugin
> works, Chip. It is designed to pick up additional meta tags with name and
> content values, e.g.
>
> <meta name="humanURL" content="blahblahblah">
>

Sorry Lewis, but it does not do that at all. See the link I gave earlier for a
description of urlmeta. I agree that the name is misleading: it does not
extract content from the page but simply uses the crawldb metadata.
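
To illustrate the distinction, here is a toy model in Python. It is not Nutch code, and the function names and dict layout are made up for illustration. Metadata attached to a seed URL at injection time is kept with the crawldb entry, and at indexing time the tags listed in urlmeta.tags are copied into the index document. The page's own HTML is never consulted.

```python
def inject(crawldb, url, **metadata):
    """Store a seed URL and its injected metadata in a dict standing in for the crawldb."""
    crawldb[url] = dict(metadata)


def index_document(crawldb, url, page_html, urlmeta_tags):
    """Build an index document; copy only the configured tags from the crawldb entry."""
    doc = {"url": url, "content": page_html}
    entry = crawldb.get(url, {})
    for tag in urlmeta_tags:
        if tag in entry:
            doc[tag] = entry[tag]
    return doc


crawldb = {}
inject(
    crawldb,
    "http://www.aip.org/history/ead/20110369.xml",
    humanURL="http://www.aip.org/history/ead/20110369.html",
)

# The <meta> tag in the page is ignored; the value comes from the crawldb entry.
html = '<meta name="humanURL" content="blahblahblah">'
doc = index_document(
    crawldb,
    "http://www.aip.org/history/ead/20110369.xml",
    html,
    urlmeta_tags=["humanURL"],
)
print(doc["humanURL"])
```

In this model a tag that was injected but is not listed in urlmeta_tags never reaches the document, which is why both the seed file and the configuration have to agree on the tag name.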


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Machine readable vs. human readable URLs.

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Chip,

If you have not yet looked at the JIRA link Julien provided, please do; it
explains this better.
As you correctly identified, in many cases the pages are not owned by 'us',
so it is unlikely a web administrator will add meta tags willy-nilly
at our request. As far as I am aware, however, the urlmeta plugin caters
for this by letting us supply these tags ourselves, as if they had
originally been included in the page source.

Someone please correct me if I am wrong.

On Mon, Sep 19, 2011 at 10:22 PM, Chip Calhoun <cc...@aip.org> wrote:

> I thought it seemed too good to be true. I understood the part about this
> picking up metadata from tags within the actual documents; that seems like a
> feature a lot of people would need. But I thought the whole point of the
> tab-delimited tags in my URLs file was that I could also inject tags that
> aren't in the source documents. That doesn't seem like it would be a
> standard feature, but it's what I need. Most of the pages I need to index
> aren't owned by us, and I won't always be able to get other sites to add an
> extra meta tag to their pages.
>
> It looks like I might need to write my own plugin, which is a little
> daunting for me. Can anyone think of an existing plugin that injects
> metadata into indexed documents after the fact? It would be nice to have
> some existing code I could examine and learn from.
>
> Thanks,
> Chip

-- 
*Lewis*

RE: Machine readable vs. human readable URLs.

Posted by Chip Calhoun <cc...@aip.org>.

I thought it seemed too good to be true. I understood the part about this picking up metadata from tags within the actual documents; that seems like a feature a lot of people would need. But I thought the whole point of the tab-delimited tags in my URLs file was that I could also inject tags that aren't in the source documents. That doesn't seem like it would be a standard feature, but it's what I need. Most of the pages I need to index aren't owned by us, and I won't always be able to get other sites to add an extra meta tag to their pages.

It looks like I might need to write my own plugin, which is a little daunting for me. Can anyone think of an existing plugin that injects metadata into indexed documents after the fact? It would be nice to have some existing code I could examine and learn from.

Thanks,
Chip

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com] 
Sent: Monday, September 19, 2011 4:56 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

In addition, it looks like you are misinterpreting how the urlmeta plugin works, Chip. It is designed to pick up additional meta tags with name and content values, e.g.

<meta name="humanURL" content="blahblahblah">

The plugin then gets this data, as well as any additional values added in the urlmeta.tags property within nutch-site.xml, and adds it to the index, which can then be queried.

Does this make sense?

Re: Machine readable vs. human readable URLs.

Posted by lewis john mcgibbney <le...@gmail.com>.

In addition, it looks like you are misinterpreting how the urlmeta plugin
works, Chip. It is designed to pick up additional meta tags with name and
content values, e.g.

<meta name="humanURL" content="blahblahblah">

The plugin then gets this data, as well as any additional values added in the
urlmeta.tags property within nutch-site.xml, and adds it to the index, which
can then be queried.

Does this make sense?

-- 
*Lewis*

Re: Machine readable vs. human readable URLs.

Posted by Julien Nioche <li...@gmail.com>.

Hi

Since the info is available thanks to the injection, you can use the url-meta
plugin as-is and won't need a custom version. See
https://issues.apache.org/jira/browse/NUTCH-855

Apart from that, do not modify the contents of \runtime\local\conf\ before
re-compiling with Ant, as they will be overwritten. Either modify
$NUTCH/conf/nutch-site.xml or recompile THEN modify.
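
One way to notice a runtime copy that has drifted from the source copy is to compare the two files after each build. A small Python sketch, assuming the layout mentioned above (conf/nutch-site.xml under the Nutch root and a copy under runtime/local/conf/); conf_in_sync is a hypothetical helper name:

```python
import filecmp
from pathlib import Path


def conf_in_sync(nutch_home, name="nutch-site.xml"):
    """Return True if the source and runtime copies of a config file have identical bytes."""
    source = Path(nutch_home) / "conf" / name
    runtime = Path(nutch_home) / "runtime" / "local" / "conf" / name
    # shallow=False forces a byte-by-byte comparison instead of trusting file metadata.
    return filecmp.cmp(source, runtime, shallow=False)


# Example usage (the path is a placeholder; adjust to your install):
# print(conf_in_sync("/opt/nutch"))
```

A False result is not necessarily a problem if the runtime copy was edited on purpose after the build; it only shows that the two copies differ.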

As Lewis suggested, check the logs to see whether the plugin is activated, etc.
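
A quick way to do that check, sketched in Python. The log location in the usage comment is an assumption (on a local 1.x install the log is typically logs/hadoop.log under the runtime directory), and the search is a plain substring match on "urlmeta" because the exact wording of the log lines may differ between versions.

```python
def lines_mentioning(log_lines, needle="urlmeta"):
    """Return the log lines that mention the plugin id, case-insensitively."""
    needle = needle.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


# Example usage (the path is an assumption; adjust to your install):
# with open("runtime/local/logs/hadoop.log", encoding="utf-8", errors="replace") as log:
#     for hit in lines_mentioning(log):
#         print(hit)
```

If nothing is printed, the plugin was probably never loaded, and the plugin.includes setting in nutch-site.xml is the first thing to look at.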

J.


On 19 September 2011 21:03, Chip Calhoun <cc...@aip.org> wrote:

> Hi Lewis,
>
> My probably wrong understanding was that I'm supposed to add the tags for
> my new field to my list of seed URLs. So if I have a seed URL followed by "
>        \t humanURL=http://www.aip.org/history/ead/20110369.html", I get a
> new field called "humanURL" which is populated with the string I've
> specified for that specific URL. I may just be greatly misunderstanding how
> this plugin works.
>
> I've checked my Nutch logs now and it looks like nothing happened. The new
> field does at least show up in the Solr admin UI's schema, but clearly my
> problem is on the Nutch end of things.
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> Sent: Monday, September 19, 2011 3:34 PM
> To: user@nutch.apache.org
> Subject: Re: Machine readable vs. human readable URLs.
>
> Hi Chip,
>
> There is no need to run ant war; there is no war target in the build.xml
> file for Nutch >= 1.3.
>
> Can you explain more about adding 'the tags to %NUTCH_HOME%' etc.? Do you
> mean you've added your seed URLs?
>
> Have you had a look at any of your log output as to whether the urlmeta
> plugin is loaded and used when fetching?
>
> You should be able to get info on your schema, fields etc within the Solr
> admin UI
>
> On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <cc...@aip.org> wrote:
>
> > Hi Julien,
> >
> > Thanks, that's encouraging. I'm trying to make this work, and I'm
> > definitely missing something. I hope I'm not too far off the mark.
> > I've started with the instructions at
> > http://wiki.apache.org/nutch/WritingPluginExample . If I understand
> > this properly, the changes I needed to make were the following:
> >
> > In Nutch:
> > Paste the prescribed block of code into
> > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to
> > look for and run the urlmeta plugin.
> > In %NUTCH_HOME%, run "ant war".
> > Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this
> > file now looks like: "http://www.aip.org/history/ead/20110369.xml        \t
> > humanURL=http://www.aip.org/history/ead/20110369.html"
> >
> > In Solr:
> > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new
> > line consists of: " <field name="humanURL" type="string" stored="true"
> > indexed="false"/>"
> >
> > I've redone the indexing, and my new field still doesn't show up in
> > the search results. Can you tell where I'm going wrong?
> >
> > Thanks,
> > Chip
> >
> > -----Original Message-----
> > From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> > Sent: Friday, September 16, 2011 4:37 AM
> > To: user@nutch.apache.org
> > Subject: Re: Machine readable vs. human readable URLs.
> >
> > Hi Chip,
> >
> > It should simply be a matter of creating a custom field with an
> > IndexingFilter; you can then use it in any way you want on the Solr
> > side.
> >
> > Julien
> >
> > On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:
> >
> > > Hi everyone,
> > >
> > > We'd like to use Nutch and Solr to replace an existing Verity search
> > > that's become a bit long in the tooth. In our Verity search, we have
> > > a hack which allows each document to have a machine-readable URL
> > > which is indexed (generally an xml document), and a human-readable
> > > URL which we actually send users to. Has anyone done the same with
> > > Nutch and Solr?
> > >
> > > Thanks,
> > > Chip
> > >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>
>
>
> --
> *Lewis*
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

RE: Machine readable vs. human readable URLs.

Posted by Chip Calhoun <cc...@aip.org>.

Hi Lewis,

My understanding, which is probably wrong, was that I'm supposed to add the tags for my new field to my list of seed URLs. So if I have a seed URL followed by "        \t humanURL=http://www.aip.org/history/ead/20110369.html", I get a new field called "humanURL" which is populated with the string I've specified for that specific URL. I may just be greatly misunderstanding how this plugin works.

I've checked my Nutch logs now and it looks like nothing happened. The new field does at least show up in the Solr admin UI's schema, but clearly my problem is on the Nutch end of things.
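
For readers following along, here is a minimal sketch of the seed-line layout described above. It assumes the injector expects the URL and each key=value pair to be separated by a real tab character, not the two literal characters "\t". The helper names seed_line and write_seed_file are hypothetical and are not part of Nutch.

```python
from pathlib import Path


def seed_line(url: str, metadata: dict[str, str]) -> str:
    """Build one seed entry: the URL, then tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    # "\t" in Python source is a single tab character, not backslash + t.
    return "\t".join(fields)


def write_seed_file(path: Path, entries: dict[str, dict[str, str]]) -> None:
    """Write one seed entry per line (the file location is up to you)."""
    lines = [seed_line(url, meta) for url, meta in entries.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


example = seed_line(
    "http://www.aip.org/history/ead/20110369.xml",
    {"humanURL": "http://www.aip.org/history/ead/20110369.html"},
)
print(repr(example))
```

If the seed file on disk contains a backslash followed by "t" instead of a tab, that would be one possible reason the metadata never reaches the crawldb. This is a guess and should be checked against the injector's behaviour.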

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com] 
Sent: Monday, September 19, 2011 3:34 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

There is no need to run "ant war"; the build.xml file in Nutch 1.3 and later has no war target.

Can you explain more about adding "the tags to %NUTCH_HOME%", etc.? Do you mean you've added your seed URLs?

Have you looked at your log output to see whether the urlmeta plugin is loaded and used when fetching?

You should be able to get info on your schema, fields, etc. within the Solr admin UI.

On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <cc...@aip.org> wrote:

> Hi Julien,
>
> Thanks, that's encouraging. I'm trying to make this work, and I'm 
> definitely missing something. I hope I'm not too far off the mark. 
> I've started with the instructions at 
> http://wiki.apache.org/nutch/WritingPluginExample . If I understand 
> this properly, the changes I needed to make were the following:
>
> In Nutch:
> Paste the prescribed block of code into 
> %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to 
> look for and run the urlmeta plugin.
> In %NUTCH_HOME%, run "ant war".
> Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this file
> now looks like: "http://www.aip.org/history/ead/20110369.xml        \t
> humanURL=http://www.aip.org/history/ead/20110369.html"
>
> In Solr:
> Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new 
> line consists of: " <field name="humanURL" type="string" stored="true"
> indexed="false"/>"
>
> I've redone the indexing, and my new field still doesn't show up in 
> the search results. Can you tell where I'm going wrong?
>
> Thanks,
> Chip
>
> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Friday, September 16, 2011 4:37 AM
> To: user@nutch.apache.org
> Subject: Re: Machine readable vs. human readable URLs.
>
> Hi Chip,
>
> It should simply be a matter of creating a custom field with an
> IndexingFilter; you can then use it in any way you want on the Solr
> side.
>
> Julien
>
> On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:
>
> > Hi everyone,
> >
> > We'd like to use Nutch and Solr to replace an existing Verity search 
> > that's become a bit long in the tooth. In our Verity search, we have 
> > a hack which allows each document to have a machine-readable URL 
> > which is indexed (generally an xml document), and a human-readable 
> > URL which we actually send users to. Has anyone done the same with Nutch and Solr?
> >
> > Thanks,
> > Chip
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



--
*Lewis*

Re: Machine readable vs. human readable URLs.

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Chip,

There is no need to run "ant war"; the build.xml file in Nutch 1.3 and later
has no war target.

Can you explain more about adding "the tags to %NUTCH_HOME%", etc.? Do you
mean you've added your seed URLs?

Have you looked at your log output to see whether the urlmeta plugin is
loaded and used when fetching?

You should be able to get info on your schema, fields, etc. within the Solr
admin UI.

On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <cc...@aip.org> wrote:

> Hi Julien,
>
> Thanks, that's encouraging. I'm trying to make this work, and I'm
> definitely missing something. I hope I'm not too far off the mark. I've
> started with the instructions at
> http://wiki.apache.org/nutch/WritingPluginExample . If I understand this
> properly, the changes I needed to make were the following:
>
> In Nutch:
> Paste the prescribed block of code into
> %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to look for
> and run the urlmeta plugin.
> In %NUTCH_HOME%, run "ant war".
> Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this file
> now looks like: "http://www.aip.org/history/ead/20110369.xml        \t
> humanURL=http://www.aip.org/history/ead/20110369.html"
>
> In Solr:
> Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new line
> consists of: " <field name="humanURL" type="string" stored="true"
> indexed="false"/>"
>
> I've redone the indexing, and my new field still doesn't show up in the
> search results. Can you tell where I'm going wrong?
>
> Thanks,
> Chip
>
> -----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> Sent: Friday, September 16, 2011 4:37 AM
> To: user@nutch.apache.org
> Subject: Re: Machine readable vs. human readable URLs.
>
> Hi Chip,
>
> It should simply be a matter of creating a custom field with an
> IndexingFilter; you can then use it in any way you want on the Solr side.
>
> Julien
>
> On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:
>
> > Hi everyone,
> >
> > We'd like to use Nutch and Solr to replace an existing Verity search
> > that's become a bit long in the tooth. In our Verity search, we have a
> > hack which allows each document to have a machine-readable URL which
> > is indexed (generally an xml document), and a human-readable URL which
> > we actually send users to. Has anyone done the same with Nutch and Solr?
> >
> > Thanks,
> > Chip
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
*Lewis*

RE: Machine readable vs. human readable URLs.

Posted by Chip Calhoun <cc...@aip.org>.

Hi Julien,

Thanks, that's encouraging. I'm trying to make this work, and I'm definitely missing something. I hope I'm not too far off the mark. I've started with the instructions at http://wiki.apache.org/nutch/WritingPluginExample . If I understand this properly, the changes I needed to make were the following:

In Nutch:
Paste the prescribed block of code into %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to look for and run the urlmeta plugin.
In %NUTCH_HOME%, run "ant war".
Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this file now looks like: "http://www.aip.org/history/ead/20110369.xml	\t humanURL=http://www.aip.org/history/ead/20110369.html"

In Solr:
Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new line consists of: " <field name="humanURL" type="string" stored="true" indexed="false"/>"

I've redone the indexing, and my new field still doesn't show up in the search results. Can you tell where I'm going wrong?

Thanks,
Chip
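
One way to check whether the new field reached the index is to ask Solr to return it explicitly. The sketch below only builds the query URL. The host, port and path in SOLR_SELECT are assumptions about a default local install, and "url" is assumed to be the name of the field holding the crawled URL.

```python
from urllib.parse import urlencode

# Assumption: a default local Solr; adjust host, port and path for your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def field_check_url(field: str, rows: int = 5) -> str:
    """Build a query URL that returns the url field plus the given field."""
    params = {"q": "*:*", "fl": f"url,{field}", "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT}?{urlencode(params)}"


print(field_check_url("humanURL"))
```

Because the field is declared with stored="true" and indexed="false", it can be returned in results but not searched on. If it is missing from every returned document, the value probably never left Nutch.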

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Friday, September 16, 2011 4:37 AM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

It should simply be a matter of creating a custom field with an IndexingFilter; you can then use it in any way you want on the Solr side.

Julien

On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:

> Hi everyone,
>
> We'd like to use Nutch and Solr to replace an existing Verity search 
> that's become a bit long in the tooth. In our Verity search, we have a 
> hack which allows each document to have a machine-readable URL which 
> is indexed (generally an xml document), and a human-readable URL which 
> we actually send users to. Has anyone done the same with Nutch and Solr?
>
> Thanks,
> Chip
>

--
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Machine readable vs. human readable URLs.

Posted by Julien Nioche <li...@gmail.com>.

Hi Chip,

It should simply be a matter of creating a custom field with an IndexingFilter;
you can then use it in any way you want on the Solr side.

Julien
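
To illustrate the idea only: the sketch below models the logic of such a filter in Python. It is not the Nutch API (a real IndexingFilter is a Java plugin), and the function name add_metadata_fields and the dict-based document are hypothetical stand-ins.

```python
def add_metadata_fields(doc: dict, crawl_metadata: dict, tags: list[str]) -> dict:
    """Copy selected crawl-metadata entries into a copy of the index document."""
    enriched = dict(doc)
    for tag in tags:
        if tag in crawl_metadata:
            enriched[tag] = crawl_metadata[tag]
    return enriched


doc = {"url": "http://www.aip.org/history/ead/20110369.xml"}
meta = {"humanURL": "http://www.aip.org/history/ead/20110369.html"}
indexed = add_metadata_fields(doc, meta, ["humanURL"])

# On the search side, link users to the human-readable URL when present,
# falling back to the indexed (machine-readable) URL otherwise.
link = indexed.get("humanURL", indexed["url"])
print(link)
```

The fallback in the last step is a design choice for the front end: documents injected without a humanURL value still get a working link.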

On 15 September 2011 21:50, Chip Calhoun <cc...@aip.org> wrote:

> Hi everyone,
>
> We'd like to use Nutch and Solr to replace an existing Verity search that's
> become a bit long in the tooth. In our Verity search, we have a hack which
> allows each document to have a machine-readable URL which is indexed
> (generally an xml document), and a human-readable URL which we actually send
> users to. Has anyone done the same with Nutch and Solr?
>
> Thanks,
> Chip
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com