Posted to user@hbase.apache.org by petri koski <mo...@gmail.com> on 2012/04/18 20:08:17 UTC

Basic HBase table question

Hello,

I am quite new to HBase, and here comes my question:

I have a table. What I do with Hadoop is download webpages in the map
phase, extract the URLs found, and save them in the reduce phase. I read from one
table, and I save (Put) the results back to the same table to avoid duplicates, etc.
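
Roughly, a minimal sketch of that kind of read-from-and-write-back-to-the-same-table job, assuming a table named "webtable" with an "info" column family; the table/family names and the fetch/extractLinks helpers are placeholders, not your actual setup:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CrawlJob {

  // Map phase: each input row key is a URL; fetch the page and emit the links found on it.
  static class FetchMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      String url = Bytes.toString(rowKey.get());
      String html = fetch(url);                  // placeholder: HTTP GET with the client of your choice
      for (String link : extractLinks(html)) {   // placeholder: parse hrefs out of the page
        context.write(new Text(link), new Text(url));
      }
    }
    private String fetch(String url) { return ""; }
    private List<String> extractLinks(String html) { return Collections.emptyList(); }
  }

  // Reduce phase: the framework groups by URL, so duplicates discovered in this run
  // collapse here; each URL is written back to the same table as a row to crawl later.
  static class StoreReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text url, Iterable<Text> foundOn, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(Bytes.toBytes(url.toString()));
      put.add(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("pending"));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "crawl");
    job.setJarByClass(CrawlJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // scanner caching suited to MR scans
    scan.setCacheBlocks(false);  // a full scan shouldn't churn the block cache

    TableMapReduceUtil.initTableMapperJob("webtable", scan,
        FetchMapper.class, Text.class, Text.class, job);
    TableMapReduceUtil.initTableReducerJob("webtable", StoreReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}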

I will get millions of rows, all unique. Sometimes, actually quite often,
timestamps get reset because duplicates are found and written again.
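
A note on that: a Put on an existing row key doesn't create a second row, it just writes a new cell version with a newer timestamp. If you'd rather leave already-known URLs untouched, something like checkAndPut can do that; a small sketch (table, family, and qualifier names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutIfAbsent {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webtable");

    byte[] row = Bytes.toBytes("http://example.org/some/page");
    Put put = new Put(row);
    put.add(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("pending"));

    // Passing null as the expected value means "apply the Put only if this cell
    // does not exist yet", so a re-discovered URL leaves the existing row (and
    // its original timestamp) alone.
    boolean inserted = table.checkAndPut(row, Bytes.toBytes("info"),
        Bytes.toBytes("status"), null, put);
    System.out.println(inserted ? "new URL stored" : "already known, left as-is");
    table.close();
  }
}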

The question is:

Which way should I keep running those M/R jobs:

1. Somehow save the last map's row position and pass that info to the next map
run as its starting point (see the sketch after this list). That way I wouldn't
have to re-process rows that were already processed. Of course I still have to
spider sites all over again after they are finished, but this option would give me
some control over when a site is finished.
2. Every time, start from row 0, proceed to the last one, then start all over
again, going a little bit deeper into the site you are "spidering" on each pass.
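
For option 1, the usual trick is to persist the row key where the previous run stopped and start the next job's Scan from there. A rough sketch, reusing the FetchMapper/StoreReducer names from the earlier sketch; how you persist the position is up to you, and loadLastProcessedRow is a hypothetical helper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ResumableCrawlJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "crawl-resume");
    job.setJarByClass(ResumableCrawlJob.class);

    Scan scan = new Scan();
    String lastRow = loadLastProcessedRow();   // hypothetical: read the saved position
    if (lastRow != null) {
      // setStartRow is inclusive, so append a zero byte to start just *after*
      // the last row that was already processed.
      scan.setStartRow(Bytes.add(Bytes.toBytes(lastRow), new byte[] { 0 }));
    }

    TableMapReduceUtil.initTableMapperJob("webtable", scan,
        CrawlJob.FetchMapper.class, Text.class, Text.class, job);
    TableMapReduceUtil.initTableReducerJob("webtable", CrawlJob.StoreReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  // Hypothetical helper: load the last processed row key from wherever the
  // previous run saved it (a marker row, a file on HDFS, ZooKeeper, ...).
  private static String loadLastProcessedRow() { return null; }
}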

Option number 2 is good because many sites put their newest info on the first
pages, so that way I could keep my own data updated from those sites; the flip
side is that I don't know when a site has been fully crawled.

Option 1 seemed wise, but there is something un-HBase and un-Hadoop about that way
of thinking: they are meant to take everything in at once, process it at once, and
if you need more, you chain M/R jobs. So my option 2 is more the Hadoop/HBase way.
And as I said before, I will not just spider a site once and forget it; I will do
it again after I have finished it once, etc.

Which one is better?

Yours,

Peter

Re: Basic HBase table question

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-

Because your topic is web crawling, you might want to read the BigTable
paper; the example in that paper is about web crawling.

You can find that, and other info, in the RefGuide...

http://hbase.apache.org/book.html#other.info.papers
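
For reference, the webtable example in that paper keys rows by the reversed hostname (e.g. com.cnn.www) so that all pages of a domain sort next to each other; a tiny illustration of building that kind of row key (class and method names are made up for the example):

import java.net.URL;

public class RowKeys {
  // BigTable "webtable"-style row key: reverse the hostname so that a domain
  // and its subdomains land in one contiguous key range, then append the path.
  static String rowKeyFor(String url) throws Exception {
    URL u = new URL(url);
    String[] parts = u.getHost().split("\\.");
    StringBuilder reversed = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      reversed.append(parts[i]);
      if (i > 0) reversed.append('.');
    }
    return reversed + u.getPath();
  }

  public static void main(String[] args) throws Exception {
    // prints: com.cnn.www/world/index.html
    System.out.println(rowKeyFor("http://www.cnn.com/world/index.html"));
  }
}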





