Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/08/09 23:36:27 UTC

cache field in index-basic in 2.X

Hi,

Can someone please explain to me exactly what the cashing field is
actually cashing in index-basic?
I see the various fields in o.a.n.metadata.Nutch e.g.
CACHING_FORBIDDEN_ALL, CACHING_FORBIDDEN_CONTENT, etc. but I am still
not sure how the functionality works, or indeed what the 'what' actually is!

Also, say if I wanted to fictitiously create some cache content within
a WebPage I could do

WebPage page = new WebPage();
page.putToMetadata(Utf8, ByteBuffer); // I think

 but could someone explain to me what typical values would be for key
and associated value respectively?

Thank you very much in advance.

best
Lewis

-- 
Lewis

Re: cache field in index-basic in 2.X

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,

On Fri, Aug 10, 2012 at 10:39 AM, Julien Nioche
<li...@gmail.com> wrote:

>
> Makes sense?
>

100 times over

Thanks for the explanation this is great and makes perfect sense.

Re: cache field in index-basic in 2.X

Posted by Julien Nioche <li...@gmail.com>.
>
>
> Well in o.a.n.metadata.Nutch some brief Javadoc comments for the
> caching fields mention the following
>
> static String CACHING_FORBIDDEN_ALL
>           Don't show either original forbidden content or summaries.
> static String CACHING_FORBIDDEN_CONTENT
>           Don't show original forbidden content, but show summaries.
> static String CACHING_FORBIDDEN_KEY
>           Sites may request that search engines don't provide access
>           to cached documents.
> static org.apache.avro.util.Utf8 CACHING_FORBIDDEN_KEY_UTF8
>
> static String CACHING_FORBIDDEN_NONE
>           Show both original forbidden content and summaries (default).
>
> I understand that caching data is held within and concerns metadata
> (in trunk it is parse.getData().getMeta())


It does not concern metadata. We store as metadata the caching policies
that are specified in the HTML pages (see
http://www.i18nguy.com/markup/metatags.html), then store that policy in the
cache field:

    // add cached content/summary display policy, if available
    String caching = parse.getData().getMeta(Nutch.CACHING_FORBIDDEN_KEY);
    if (caching != null && !caching.equals(Nutch.CACHING_FORBIDDEN_NONE)) {
      doc.add("cache", caching);
    }

I expect that this was then used by our search web app to determine whether
we could display the cached content or not.
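
As a rough illustration of that last point, a front end reading the cache
field could decide what to display along these lines. This is only a sketch:
the CachePolicy class and its method names are hypothetical (they are not
part of Nutch), and the string values "none", "content" and "all" are assumed
to match the Nutch.CACHING_FORBIDDEN_* constants.

```java
// Hypothetical helper, not part of Nutch. It interprets the value of the
// "cache" field that index-basic writes into the index.
final class CachePolicy {
    // Assumed to mirror Nutch.CACHING_FORBIDDEN_NONE / _CONTENT / _ALL.
    static final String NONE = "none";
    static final String CONTENT = "content";
    static final String ALL = "all";

    private CachePolicy() {}

    /** True if the cached copy of the page may be shown. */
    static boolean canShowCachedContent(String cacheField) {
        // The field is only indexed when a restriction applies, so a
        // missing field is treated the same as "none".
        return cacheField == null || NONE.equals(cacheField);
    }

    /** True if a summary of the page may be shown. */
    static boolean canShowSummary(String cacheField) {
        // Unknown values are treated conservatively: no summary.
        return cacheField == null
            || NONE.equals(cacheField)
            || CONTENT.equals(cacheField);
    }

    public static void main(String[] args) {
        for (String policy : new String[] {null, NONE, CONTENT, ALL}) {
            System.out.println(policy
                + ": cached copy=" + canShowCachedContent(policy)
                + ", summary=" + canShowSummary(policy));
        }
    }
}
```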


> but I still have no idea of the
> characteristics of the cache data, or why this would be valuable for an
> index. I personally have never queried for it before in my index.
>

We do not store the cached content as a field, just the policy. Caching can
be useful for an index, e.g. when the target server is down and you want to
have a peek at the content of the page.

Indexing the policy instead of the actual cached content is probably not so
relevant now that we've delegated indexing and search to Solr and ES. We
could of course add a binary field with the content so that web apps
querying the search backends could provide the cache if needed. We'd need
to enforce the caching policy at the indexing level and put some
restrictions on length etc.

Makes sense?

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: cache field in index-basic in 2.X

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,

Firstly I need to apologise for my failure to differentiate between
cashing and caching (of course the latter being correct). Sorry about
that.

On Fri, Aug 10, 2012 at 8:30 AM, Julien Nioche
<li...@gmail.com> wrote:
> Could this be for the html meta directives?
>

Well in o.a.n.metadata.Nutch some brief Javadoc comments for the
caching fields mention the following

static String CACHING_FORBIDDEN_ALL
          Don't show either original forbidden content or summaries.
static String CACHING_FORBIDDEN_CONTENT
          Don't show original forbidden content, but show summaries.
static String CACHING_FORBIDDEN_KEY
          Sites may request that search engines don't provide access
          to cached documents.
static org.apache.avro.util.Utf8 CACHING_FORBIDDEN_KEY_UTF8

static String CACHING_FORBIDDEN_NONE
          Show both original forbidden content and summaries (default).

I understand that caching data is held within and concerns metadata
(in trunk it is parse.getData().getMeta()), but I still have no idea of
the characteristics of the cache data, or why this would be valuable for
an index. I personally have never queried for it before in my index.

Thanks

Re: cache field in index-basic in 2.X

Posted by Julien Nioche <li...@gmail.com>.
Could this be for the html meta directives?

On 9 August 2012 22:36, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi,
>
> Can someone please explain to me exactly what the cashing field is
> actually cashing in index-basic?
> I see the various fields in o.a.n.metadata.Nutch e.g.
> CACHING_FORBIDDEN_ALL, CACHING_FORBIDDEN_CONTENT, etc. but I am still
> not sure how the functionality works, or indeed what the 'what' actually is!
>
> Also, say if I wanted to fictitiously create some cache content within
> a WebPage I could do
>
> WebPage page = new WebPage();
> page.putToMetadata(Utf8, ByteBuffer); // I think
>
>  but could someone explain to me what typical values would be for key
> and associated value respectively?
>
> Thank you very much in advance.
>
> best
> Lewis
>
> --
> Lewis
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble