Posted to user@nutch.apache.org by Azhar Jassal <az...@gmail.com> on 2014/04/17 12:23:30 UTC

Restricting stored Nutch fields in Gora/HBase

Hi

I'm using Nutch 2.2.1 with HBase.

How can I restrict the fields persisted in HBase? For example, I don't need
the "p:c" column (the parser text field). Its content will never be used by
my search implementation (I am not using a default text field). I can see
that the "p:c" mapping is listed in conf/gora-hbase-mapping.xml, but omitting
it from the file causes a Gora writer exception.
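
For reference, a small sketch for checking which WebPage fields a mapping
file sends to which HBase column. The XML excerpt is an assumption about what
conf/gora-hbase-mapping.xml looks like in Nutch 2.x (only the "p:c", "h" and
"mtdt" columns come from this thread), and field_columns is a hypothetical
helper, not part of Nutch or Gora.

```python
import xml.etree.ElementTree as ET

# Assumed excerpt of conf/gora-hbase-mapping.xml; the real file lists
# many more families and fields, and its root element may be named differently.
SAMPLE_MAPPING = """
<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage">
    <field name="text" family="p" qualifier="c"/>
    <field name="headers" family="h"/>
    <field name="metadata" family="mtdt"/>
  </class>
</gora-orm>
"""


def field_columns(mapping_xml):
    """Return {field name: "family:qualifier"} for every <field> element.

    Fields without a qualifier (map-like fields such as headers) are
    reported by family name only.
    """
    root = ET.fromstring(mapping_xml)
    columns = {}
    for field in root.iter("field"):
        family = field.get("family")
        qualifier = field.get("qualifier")
        columns[field.get("name")] = (
            f"{family}:{qualifier}" if qualifier else family
        )
    return columns


for name, column in field_columns(SAMPLE_MAPPING).items():
    print(f"{name} -> {column}")
```

To run it against a real installation, read the file contents and pass them
to field_columns instead of SAMPLE_MAPPING.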

I'm using my own set of plugins to extract the specific content I need and
adding it to the metadata, so it's saved in the mtdt column.

Now I want to restrict the storage of additional data to the minimum
required for Nutch to function (mostly to minimise hard disk usage). For
example, I don't want to store headers (column h). How can I stop them
from making it to HBase?

Also, I have "fetcher.parse" set to true, so I don't need data persisted
for parsing after the fetch.


Thanks

Az

Re: Restricting stored Nutch fields in Gora/HBase

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Azhar,

If I understand correctly, you want to build a minimal crawler. Nutch is
designed for whole-web crawling, so fields that you don't use may still be
used by some plugins. If you want to change the code, you can do that.
But if your goal is for HBase to use as little disk space as possible, you
can enable HBase compression for all fields.
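
As a rough sketch of the compression route: the helper below only builds the
HBase shell commands that would switch on compression per column family; it
does not talk to HBase itself. The table name "webpage", the GZ codec and the
function name are assumptions for illustration; the family names are the
ones mentioned in this thread.

```python
def compression_commands(table, families, codec="GZ"):
    """Build HBase shell commands that enable compression on each family.

    The table is disabled around the schema change and a major compaction
    is requested at the end so that existing store files get rewritten.
    """
    commands = [f"disable '{table}'"]
    for family in families:
        commands.append(
            f"alter '{table}', {{NAME => '{family}', COMPRESSION => '{codec}'}}"
        )
    commands.append(f"enable '{table}'")
    commands.append(f"major_compact '{table}'")
    return commands


# Assumed table name; a crawl started with a crawl ID may use a prefixed name.
print("\n".join(compression_commands("webpage", ["p", "h", "mtdt"])))
```

The printed lines can be pasted into the HBase shell. GZ is used here because
it needs no extra native libraries; other codecs such as SNAPPY or LZO may
compress faster but have to be installed on the region servers first.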

I hope this helps.
Talat