Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/08/09 23:36:27 UTC
cache field in index-basic in 2.X
Hi,
Can someone please explain to me exactly what the cashing field is
actually cashing in index-basic?
I see the various fields in o.a.n.metadata.Nutch e.g.
CACHING_FORBIDDEN_ALL, CACHING_FORBIDDEN_CONTENT, etc. but I am still
not sure how the functionality or indeed the 'what' actually is!!!
Also, say if I wanted to fictitiously create some cache content within
a WebPage I could do
WebPage page = new WebPage();
page.putToMetadata(Utf8, ByteBuffer); // I think
but could someone explain to me what typical values would be for key
and associated value respectively?
Thank you very much in advance.
best
Lewis
--
Lewis
Re: cache field in index-basic in 2.X
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,
On Fri, Aug 10, 2012 at 10:39 AM, Julien Nioche
<li...@gmail.com> wrote:
>
> Makes sense?
>
100 times over
Thanks for the explanation; this is great and makes perfect sense.
Re: cache field in index-basic in 2.X
Posted by Julien Nioche <li...@gmail.com>.
>
>
> Well in o.a.n.metadata.Nutch some brief Javadocs for the caching
> fields mention the following
>
> static String CACHING_FORBIDDEN_ALL
> Don't show either original forbidden content or summaries.
> static String CACHING_FORBIDDEN_CONTENT
> Don't show original forbidden content, but show summaries.
> static String CACHING_FORBIDDEN_KEY
> Sites may request that search engines don't provide access
> to cached documents.
> static org.apache.avro.util.Utf8 CACHING_FORBIDDEN_KEY_UTF8
>
> static String CACHING_FORBIDDEN_NONE
> Show both original forbidden content and summaries (default).
>
> I understand that caching data is held within and concerns metadata
> (in trunk it is parse.getData().getMeta())
It does not concern metadata as such. We store as metadata the caching
policies that are specified in the HTML pages
(http://www.i18nguy.com/markup/metatags.html), then store the policy in
the cache field:
    // add cached content/summary display policy, if available
    String caching = parse.getData().getMeta(Nutch.CACHING_FORBIDDEN_KEY);
    if (caching != null && !caching.equals(Nutch.CACHING_FORBIDDEN_NONE)) {
      doc.add("cache", caching);
    }
I expect that this was then used by our search web app to determine whether
we could display the cached content or not.
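To illustrate how a search front end might consume the indexed policy, here
is a minimal sketch. It is not the code of the old Nutch web app. The class
and method names are hypothetical, and the string values "none", "content"
and "all" are assumed to match the Nutch.CACHING_FORBIDDEN_* constants. A
missing field is treated as "no restriction", since the indexer only adds
the field when the policy is not NONE.

```java
public class CachePolicy {
    // Assumed values of the Nutch.CACHING_FORBIDDEN_* constants.
    static final String NONE = "none";
    static final String CONTENT = "content";
    static final String ALL = "all";

    /** Cached content may be shown only when no restriction applies.
     *  Unknown policy values are treated conservatively (not shown). */
    public static boolean mayShowCachedContent(String cacheField) {
        return cacheField == null || NONE.equals(cacheField);
    }

    /** Summaries are suppressed only by the "all" policy. */
    public static boolean mayShowSummary(String cacheField) {
        return !ALL.equals(cacheField);
    }

    public static void main(String[] args) {
        // null stands for a document that has no "cache" field at all.
        for (String policy : new String[] {null, NONE, CONTENT, ALL}) {
            System.out.println(policy
                + ": content=" + mayShowCachedContent(policy)
                + ", summary=" + mayShowSummary(policy));
        }
    }
}
```

A web app would read the "cache" field from each search hit and call these
two checks before rendering a cached-copy link or a snippet.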
> but I still have no idea the
> characteristics of the cache data, why this would be valuable for an
> index. I personally have never queried for it before in my index.
>
We do not store the cached content as a field, just the policy. Caching can
be useful for an index, e.g. when the target server is down and you want to
have a peek at the content of the page.
Indexing the policy instead of the actual cached content is probably not so
relevant now that we've delegated indexing and search to Solr and
Elasticsearch. We could of course add a binary field with the content so
that web apps querying the search backends could provide the cache if
needed. We'd need to enforce the caching policy at the indexing level and
put some restrictions on length etc.
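A rough sketch of what enforcing the policy and a length limit at indexing
time could look like. Nothing like this exists in index-basic as far as this
thread says. The class name, the method forIndex and the maxBytes parameter
are all hypothetical, and "none" is again the assumed value of
Nutch.CACHING_FORBIDDEN_NONE.

```java
import java.util.Arrays;
import java.util.Optional;

public class CachedContentField {
    // Assumed value of Nutch.CACHING_FORBIDDEN_NONE.
    static final String NONE = "none";

    /**
     * Returns the bytes to index as a binary "cached content" field,
     * or empty if the page's caching policy forbids it.
     * Content longer than maxBytes is truncated.
     */
    public static Optional<byte[]> forIndex(byte[] content, String policy, int maxBytes) {
        // Only an absent policy or an explicit "none" permits caching.
        boolean allowed = policy == null || NONE.equals(policy);
        if (!allowed || content == null || maxBytes <= 0) {
            return Optional.empty();
        }
        int length = Math.min(content.length, maxBytes);
        return Optional.of(Arrays.copyOf(content, length));
    }

    public static void main(String[] args) {
        byte[] page = "<html>hello</html>".getBytes();
        System.out.println(forIndex(page, null, 6).map(b -> b.length));
        System.out.println(forIndex(page, "all", 6).isPresent());
    }
}
```

Truncating raw bytes can cut a multi-byte character in half, so a real
implementation would probably want a smarter limit than this.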
Makes sense?
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: cache field in index-basic in 2.X
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,
Firstly I need to apologise for my failure to differentiate between
cashing and caching (of course the latter being correct). Sorry about
that.
On Fri, Aug 10, 2012 at 8:30 AM, Julien Nioche
<li...@gmail.com> wrote:
> Could this be for the html meta directives?
>
Well in o.a.n.metadata.Nutch some brief Javadocs for the caching
fields mention the following
static String CACHING_FORBIDDEN_ALL
    Don't show either original forbidden content or summaries.
static String CACHING_FORBIDDEN_CONTENT
    Don't show original forbidden content, but show summaries.
static String CACHING_FORBIDDEN_KEY
    Sites may request that search engines don't provide access
    to cached documents.
static org.apache.avro.util.Utf8 CACHING_FORBIDDEN_KEY_UTF8
static String CACHING_FORBIDDEN_NONE
    Show both original forbidden content and summaries (default).
I understand that caching data is held within and concerns metadata
(in trunk it is parse.getData().getMeta()) but I still have no idea the
characteristics of the cache data, why this would be valuable for an
index. I personally have never queried for it before in my index.
Thanks
Re: cache field in index-basic in 2.X
Posted by Julien Nioche <li...@gmail.com>.
Could this be for the html meta directives?