Posted to user@nutch.apache.org by Jeff Cocking <je...@gmail.com> on 2015/04/02 16:09:42 UTC

Suggested Approaches for Website Groupings

Environment:  Nutch 1.9, Solr 5.0

I am trying to define groups (categories) of websites. Each website will
be assigned one or more groups. The assignment is known before the
seed.txt file is created. All pages within a website should inherit the
assigned group(s). The assigned group(s) need to be passed to Solr for
faceted search.

For example:
www.site1.com group1, group2, group3
All pages within www.site1.com inherit group1, group2, group3

www.site2.com group2, group4, group5
All pages within www.site2.com inherit group2, group4, group5

Thoughts on ways to accomplish this?

Thank you in advance.

jeff

Re: Suggested Approaches for Website Groupings

Posted by Jeff Cocking <je...@gmail.com>.
Jonathan et al.,

URLMeta Plugin Test to Force Updated MetaData from Seed.txt

Background: The URLMeta plugin lets you assign a metadata value to a URL
in the seed.txt file. This metadata value is inherited by all pages crawled
within that domain. The question is what happens when the metadata value is
changed in the seed.txt file. We also tested adding a new metatag name to
see whether it would propagate.

Hypothesis: If the seed.txt file is updated with a new metadata value and
db.injector.overwrite is set to true, all the URLs in the domain will
be updated to the new metadata value when refetched/parsed/indexed.

Test Scenario:

db.fetch.interval.default=60 (seconds) and db.fetch.interval.max=180
(seconds) were set to very small intervals so that the URLs would be
refetched quickly. (Do not do this in production unless you have unlimited
bandwidth and hardware.)

Refetching was confirmed by dumping the crawldb and checking the fetch
times. The segments were also reviewed to confirm that the URLs were being
sent to Solr after the metadata changes.

Expected Results:

- The URLs listed in the seed.txt file were updated with the new metadata
values and the new metatag name/value.

Unexpected Results:
 - The other URLs within the domain were NOT updated with the new metadata
values.
 - New URLs that were discovered, fetched, and parsed did NOT use the new
metadata values.
 - New URLs that were discovered, fetched, and parsed did NOT pick up the
new metatag name/value.


Additional notes and questions:
1. The metatag values appear to be managed in
plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java.
According to the comments there, the metatags/metadata appear to be set at
outlink creation.
2. Has anyone fixed or tackled this issue?
3. Will NUTCH-1872 (enables control over how injected metadata is
propagated) fix this issue?
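
The results above would be consistent with metadata being copied only when
an outlink is first created. The following Python sketch is a toy model of
that suspected mechanism, not Nutch code: the class ToyCrawlDb and its
methods are hypothetical, and whether Nutch really behaves this way is
exactly what question 1 asks.

```python
class ToyCrawlDb:
    """Toy stand-in for a crawl database: maps URL -> metadata dict."""

    def __init__(self):
        self.entries = {}

    def inject(self, url, metadata, overwrite=False):
        # With overwrite enabled, a re-injected seed replaces its own metadata.
        if overwrite or url not in self.entries:
            self.entries[url] = dict(metadata)

    def add_outlink(self, parent_url, outlink_url):
        # ASSUMPTION: metadata is copied from the parent only when the outlink
        # is first created; an entry that already exists is left untouched.
        if outlink_url not in self.entries:
            self.entries[outlink_url] = dict(self.entries[parent_url])


db = ToyCrawlDb()
seed = "http://www.site1.com/"
page = "http://www.site1.com/about"
new_page = "http://www.site1.com/contact"

db.inject(seed, {"group": "old"})
db.add_outlink(seed, page)           # page inherits group=old

db.inject(seed, {"group": "new"}, overwrite=True)
db.add_outlink(seed, page)           # page already known, nothing is copied
db.add_outlink(page, new_page)       # new URL found via a page holding old data

print(db.entries[seed]["group"])     # new
print(db.entries[page]["group"])     # old
print(db.entries[new_page]["group"]) # old
```

In this model only the seed picks up the new value. Existing pages keep
the old one, and new pages discovered through them inherit the old one
too, which matches what the test showed.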

thank you

jeff


On Fri, Apr 3, 2015 at 1:07 PM, Jonathan Cooper-Ellis <
jcooperellis@cloudera.com> wrote:

> I think if you have "db.injector.overwrite" property configured to true,
> the new values will be injected and then when the outlinks are recrawled
> they'll be updated accordingly, but honestly I'm not totally sure.
>
> --
> Jonathan Cooper-Ellis
> Field Enablement Engineer
> <http://www.cloudera.com>
>

Re: Suggested Approaches for Website Groupings

Posted by Jonathan Cooper-Ellis <jc...@cloudera.com>.
I think if you have "db.injector.overwrite" property configured to true,
the new values will be injected and then when the outlinks are recrawled
they'll be updated accordingly, but honestly I'm not totally sure.

-- 
Jonathan Cooper-Ellis
Field Enablement Engineer
<http://www.cloudera.com>

Re: Suggested Approaches for Website Groupings

Posted by Jeff Cocking <je...@gmail.com>.
Yes. If a value is changed in the seed.txt file, will the new values be
used when the page is re-crawled/fetched?

Sorry for being vague.

jeff

On Fri, Apr 3, 2015 at 12:19 PM, Jonathan Cooper-Ellis <
jcooperellis@cloudera.com> wrote:

> If a value is changed in seed.txt?

Re: Suggested Approaches for Website Groupings

Posted by Jonathan Cooper-Ellis <jc...@cloudera.com>.
If a value is changed in seed.txt?

On Fri, Apr 3, 2015 at 12:44 PM, Jeff Cocking <je...@gmail.com>
wrote:

> I figured i might have to inject a csv blob and manually explode as a
> custom filter.
>
> As a second question, If a value is changed, will the new value propagate
> with normal fetching cycle?
>
> thank you.
>
> jeff




Re: Suggested Approaches for Website Groupings

Posted by Jeff Cocking <je...@gmail.com>.
I figured I might have to inject a CSV blob and split it manually in a
custom filter.

As a second question: if a value is changed, will the new value propagate
with the normal fetching cycle?

thank you.

jeff

On Thu, Apr 2, 2015 at 5:06 PM, Jonathan Cooper-Ellis <
jcooperellis@cloudera.com> wrote:

> Hi Jeff,
>
> Off the top of my head, the best way to do that might be to inject the
> filters as a CSV blob, and write (or modify) an indexing filter to split up
> the blob and index them as separate values to a "multiValued" field in
> Solr.

Re: Suggested Approaches for Website Groupings

Posted by Jonathan Cooper-Ellis <jc...@cloudera.com>.
Hi Jeff,

Off the top of my head, the best way to do that might be to inject the
filters as a CSV blob, and write (or modify) an indexing filter that splits
the blob and indexes the pieces as separate values in a "multiValued" field
in Solr.
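
The splitting step could look roughly like the sketch below. It is written
in Python only to show the logic; in Nutch it would have to live in a Java
indexing filter, and the Solr field would need to be declared multiValued
in the schema. The function names and the "group" key are assumptions made
for illustration.

```python
def split_groups(blob):
    """Split a comma-separated metadata value into clean, de-duplicated values."""
    values = []
    for part in blob.split(","):
        # Drop surrounding whitespace and any stray double quotes.
        value = part.strip().strip('"')
        if value and value not in values:
            values.append(value)
    return values


def to_solr_doc(url, metadata, key="group"):
    """Build a document dict whose `key` field holds one list entry per group."""
    doc = {"id": url}
    if key in metadata:
        doc[key] = split_groups(metadata[key])
    return doc


doc = to_solr_doc("http://www.site1.com/", {"group": "group1,group2,group3"})
print(doc["group"])  # ['group1', 'group2', 'group3']
```

Each list entry then becomes its own value in the multiValued field, so a
filter query can match any single group.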

On Thu, Apr 2, 2015 at 3:07 PM, Jeff Cocking <je...@gmail.com> wrote:

> Jonathan et al
>
> Thank you for the reply.  I used this approach and it is working with one
> minor issue. It is the "one to many" requirement for each group. The intent
> is to use a filter query within solr on the group data element.  I have
> tried the following:
>
> group="filter1,filter2"
> group="filter1","filter2"
> group=filter1,filter2
> group=filter1 filter2
> group=filter1   group=filter2
>
> Each of these choices create a single variable assigned to group. Do you
> have any suggestions on how to format the seed.txt file to support the "one
> to many" option? i.e. that each filter value can be used as a filter query
> element within solr?
>




Re: Suggested Approaches for Website Groupings

Posted by Jeff Cocking <je...@gmail.com>.
Jonathan et al

Thank you for the reply.  I used this approach and it is working, with one
minor issue: the "one to many" requirement for each group. The intent
is to use a filter query within Solr on the group data element.  I have
tried the following:

group="filter1,filter2"
group="filter1","filter2"
group=filter1,filter2
group=filter1 filter2
group=filter1   group=filter2

Each of these choices creates a single value assigned to group. Do you
have any suggestions on how to format the seed.txt file to support the "one
to many" option, i.e. so that each filter value can be used as a filter
query element within Solr?
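
For reference, the kind of filter query this is aiming at can be sketched
as follows. This is a minimal Python sketch that only builds the request
URL; the host, the core name "nutch", and the field name "group" are
assumptions, and it only returns useful results once each group is indexed
as a separate value.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, groups, field="group"):
    """Build a Solr select URL that filters and facets on `field`."""
    params = [("q", "*:*")]
    # One fq parameter per group; Solr intersects multiple fq clauses.
    params += [("fq", f'{field}:"{group}"') for group in groups]
    params += [("facet", "true"), ("facet.field", field)]
    return f"{base_url}/{core}/select?{urlencode(params)}"


url = solr_select_url("http://localhost:8983/solr", "nutch", ["group2"])
print(url)
```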


For those who find this thread searching for a similar solution, here is
how to implement urlmeta:

1. Turn on the plugin by adding urlmeta to the plugin.includes property
within nutch-site.xml. urlmeta is a standalone item within the plugin value:
 ....|index-(basic|anchor|metadata)|urlmeta|indexer-solr|....
2. Add the urlmeta.tags property to the nutch-site.xml file. List the
metadata keys (tag names) you want to use as its value.
<property>
  <name>urlmeta.tags</name>
  <value>group1,group2</value>
</property>
3. In your seed.txt file, add the tag values for the URLs as needed. Make
sure they are tab delimited (\t below stands for a literal tab character).
   http://www.domain1.com\tgroup1=foo\tgroup2=bar
   http://www.domain2.com\tgroup1=faa\tgroup2=bur
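
The seed lines in step 3 can also be generated rather than typed by hand,
which avoids spaces sneaking in where tabs are required. This is a minimal
Python sketch, not part of Nutch; the function name seed_line and the
output path in the comment are assumptions.

```python
def seed_line(url, metadata):
    """Build one seed.txt line: the URL, then tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


sites = {
    "http://www.domain1.com": {"group1": "foo", "group2": "bar"},
    "http://www.domain2.com": {"group1": "faa", "group2": "bur"},
}

seed_text = "\n".join(seed_line(url, meta) for url, meta in sites.items()) + "\n"

# Write the result where the injector will read it (path is an assumption):
# with open("urls/seed.txt", "w", encoding="utf-8") as handle:
#     handle.write(seed_text)
```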



On Thu, Apr 2, 2015 at 9:36 AM, Jonathan Cooper-Ellis <
jcooperellis@cloudera.com> wrote:

> Hey Jeff,
>
> Check out the urlmeta plugin. You can inject metadata in with your seed
> list and propagate it to outlinks.

Re: Suggested Approaches for Website Groupings

Posted by Jonathan Cooper-Ellis <jc...@cloudera.com>.
Hey Jeff,

Check out the urlmeta plugin. You can inject metadata in with your seed
list and propagate it to outlinks.

On Thu, Apr 2, 2015 at 10:09 AM, Jeff Cocking <je...@gmail.com>
wrote:

> Environment:  Nutch 1.9, Solr 5.0
>
> I am trying to define a group (category) of websites. Each website will
> have assigned group (1 to many). The assignment is known before the
> creation of seed.txt file.  All pages within the website should inherit the
> assigned group(s). The assigned group(s) need to be passed to Solr for
> faceted search.
>
> For example:
> www.site1.com group1, group2 group3
> All pages within www.site1.com inherit group1, group2, group3
>
> www.site2.com group2, group4, group5
> All pages within www.site2.com inherit group2, group4, group5
>
> Thoughts on ways to accomplish this?
>
> Thank you in advance.
>
> jeff
>


