Posted to dev@nutch.apache.org by Bin Wang <bi...@gmail.com> on 2014/01/02 23:48:41 UTC

use <Map Reduce + Jsoup> to parse big Nutch/Content file

Hi,

I have a robot that scrapes a website daily and stores the HTML locally
(in Nutch binary format, in the segment/content folder).

The crawl is fairly big: roughly a million pages per day.
One thing about the HTML pages is that they all follow exactly the same
format, so I can write a parser in Java that extracts the info I want (say
unit price, part number, etc.) from one page, and that parser will work for
most of the pages.

I am wondering whether there is an existing MapReduce template where I could
just plug in my customized parser and easily start a Hadoop MapReduce job.
(Actually, there doesn't need to be a reduce step at all; in this case each
page is simply mapped to its parsed result.)

I was looking at the MapReduce example here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
but I am having trouble translating it into my real-world Nutch problem.

I know that running MapReduce against the Nutch binary files will be a bit
different from word count. I looked at the Nutch source code, and it looks
like the files are sequence files of records, where each record is a
key/value pair with a Text key and an org.apache.nutch.protocol.Content
value. How should I configure the map job so that it reads the big raw
content files, computes the InputSplits correctly, and runs the map tasks?
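
To make the question concrete, here is the kind of map-only job I have in
mind. This is just a rough sketch under my own assumptions (the Nutch jar on
the job classpath, a segment's content directory as the input path, and
ContentParseJob / extractFields as placeholder names for my real parsing
logic), not something I have verified:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.nutch.protocol.Content;

public class ContentParseJob {

  // Maps <url, Content> records from segment/content to <url, extracted fields>.
  public static class ParseMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, Text> {

    public void map(Text url, Content content,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // Raw page bytes as fetched; charset handling is simplified here.
      String html = new String(content.getContent(), "UTF-8");
      output.collect(url, new Text(extractFields(html)));
    }

    // Placeholder for the real parser (Jsoup, regexes, ...).
    private String extractFields(String html) {
      return Integer.toString(html.length());
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ContentParseJob.class);
    job.setJobName("parse-nutch-content");

    // The part-* dirs under segment/content are MapFiles of <Text, Content>;
    // SequenceFileInputFormat reads their data files and computes the splits.
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. <segment>/content

    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0);                               // map-only, no reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobClient.runJob(job);
  }
}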

Thanks a lot!

/usr/bin


(Some explanation of why I decided not to write a Java plugin):
I was thinking about writing a Nutch plugin so it would be handy to parse
the scraped data using the Nutch command. The problem is that it is hard to
write a perfect parser in one go, as anyone who deals with parsers a lot
will appreciate. You locate your HTML tags by features you expect to be
general (CSS class, id, etc., possibly combined with regular expressions),
but when you apply that logic to all the pages, it won't hold for every
page. So you need to write several different parsers, run them against the
whole dataset (a million pages) in one go, and see which one works best.
Then you run the winning parser against all your snapshots (days * a
million pages) to get the new dataset.

RE: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, this is much easier. Let Nutch crawl the pages, parse them with parse-html or parse-tika, and add a custom ParseFilter plugin. In the filter you can walk the DOM via the DocumentFragment object that is passed in; it is very easy to look up the HTML elements of interest. One example is Nutch's headings plugin, which does exactly that and can serve as a template for you to work from.
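
As a rough skeleton only (not a drop-in implementation: the class name, the
"unit-price" class attribute and the "unitPrice" metadata key are made-up
placeholders, and you would still need a plugin.xml registering it at the
HtmlParseFilter extension point), such a filter could look roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PriceParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    // "unit-price" is a made-up class attribute; use whatever marks your field.
    String price = findTextByClass(doc, "unit-price");
    if (price != null) {
      parse.getData().getParseMeta().set("unitPrice", price.trim());
    }
    return parseResult;
  }

  // Depth-first walk of the DOM, returning the text of the first element
  // whose class attribute contains the given value.
  private String findTextByClass(Node node, String cssClass) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Node attr = node.getAttributes().getNamedItem("class");
      if (attr != null && attr.getNodeValue().contains(cssClass)) {
        return node.getTextContent();
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findTextByClass(children.item(i), cssClass);
      if (found != null) {
        return found;
      }
    }
    return null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}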

Also, I'd advise moving these discussions to the user list so that more users can benefit from them.

Cheers,
Markus

-----Original message-----
From: Tejas Patil <te...@gmail.com>
Sent: Friday 3rd January 2014 5:53
To: dev@nutch.apache.org
Subject: Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Here is what I would do:

If you are running a crawl, let it run with the default parser. Write a Nutch plugin with your customized parse implementation so you can evaluate your parse logic. Now get some real segments (with a subset of those million pages) and run just the bin/nutch parse command to see how good it is. That command will run your parser over the segment. Repeat until you have a satisfactory parser implementation.

~tejas

On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <binwang.cu@gmail.com> wrote:

Hi,

I have a robot that scrapes a website daily and stores the HTML locally (in Nutch binary format, in the segment/content folder).

The crawl is fairly big: roughly a million pages per day.

One thing about the HTML pages is that they all follow exactly the same format, so I can write a parser in Java that extracts the info I want (say unit price, part number, etc.) from one page, and that parser will work for most of the pages.

I am wondering whether there is an existing MapReduce template where I could just plug in my customized parser and easily start a Hadoop MapReduce job. (Actually, there doesn't need to be a reduce step at all; in this case each page is simply mapped to its parsed result.)

I was looking at the MapReduce example here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

But I am having trouble translating it into my real-world Nutch problem.

I know that running MapReduce against the Nutch binary files will be a bit different from word count. I looked at the Nutch source code, and it looks like the files are sequence files of records, where each record is a key/value pair with a Text key and an org.apache.nutch.protocol.Content value. How should I configure the map job so that it reads the big raw content files, computes the InputSplits correctly, and runs the map tasks?

Thanks a lot!

/usr/bin

(Some explanation of why I decided not to write a Java plugin):

I was thinking about writing a Nutch plugin so it would be handy to parse the scraped data using the Nutch command. The problem is that it is hard to write a perfect parser in one go, as anyone who deals with parsers a lot will appreciate. You locate your HTML tags by features you expect to be general (CSS class, id, etc., possibly combined with regular expressions), but when you apply that logic to all the pages, it won't hold for every page. So you need to write several different parsers, run them against the whole dataset (a million pages) in one go, and see which one works best. Then you run the winning parser against all your snapshots (days * a million pages) to get the new dataset.



Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Posted by Bin Wang <bi...@gmail.com>.
Hi Tejas,

-- Nutch Plugin --

I got a bit confused here. Both of you (you and Markus Jelsma) are
recommending writing a Nutch plugin. Does the "bin/nutch parse" step run in
distributed mode?
Correct me if I am wrong; here is my understanding of the division of labor
in Nutch.
    <Fetching>: both Nutch 1.7 and Nutch 2.X run this on one node, so even
if you have a cluster, only one of the nodes is actually used to make the
HTTP requests.
    <Storing>: Nutch 1.7 stores the content/HTML on local disk by default,
and Nutch 2.X can store the data in Accumulo/HBase, i.e. a "big-data-like"
distributed store (it can also store locally, e.g. in MySQL).
    Will the <Parsing> part actually run in distributed mode if you are
using Nutch 2.X?
    In other words, when you run bin/nutch parse, does it actually kick off
thousands of map-reduce jobs to utilize the whole cluster to parse the data?
    So each node in the cluster would parse a fragment of the complete
dataset when you decide to reparse everything?
    Otherwise, if the parsing runs on only one node, it will take an
extremely long time to reparse all the data, even once you finally have your
perfect parser.

/usr/bin


On Thu, Jan 2, 2014 at 9:52 PM, Tejas Patil <te...@gmail.com> wrote:

> Here is what I would do:
> If you are running a crawl, let it run with the default parser. Write a Nutch
> plugin with your customized parse implementation so you can evaluate your
> parse logic. Now get some real segments (with a subset of those million
> pages) and run just the 'bin/nutch parse' command to see how good it is. That
> command will run your parser over the segment. Repeat until you have a
> satisfactory parser implementation.
>
> ~tejas
>
>
> On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <bi...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a robot that scrapes a website daily and stores the HTML locally
>> (in Nutch binary format, in the segment/content folder).
>>
>> The crawl is fairly big: roughly a million pages per day.
>> One thing about the HTML pages is that they all follow exactly the same
>> format, so I can write a parser in Java that extracts the info I want (say
>> unit price, part number, etc.) from one page, and that parser will work
>> for most of the pages.
>>
>> I am wondering whether there is an existing MapReduce template where I
>> could just plug in my customized parser and easily start a Hadoop
>> MapReduce job. (Actually, there doesn't need to be a reduce step at all;
>> in this case each page is simply mapped to its parsed result.)
>>
>> I was looking at the MapReduce example here:
>> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
>> but I am having trouble translating it into my real-world Nutch problem.
>>
>> I know that running MapReduce against the Nutch binary files will be a bit
>> different from word count. I looked at the Nutch source code, and it looks
>> like the files are sequence files of records, where each record is a
>> key/value pair with a Text key and an org.apache.nutch.protocol.Content
>> value. How should I configure the map job so that it reads the big raw
>> content files, computes the InputSplits correctly, and runs the map tasks?
>>
>> Thanks a lot!
>>
>> /usr/bin
>>
>>
>> (Some explanation of why I decided not to write a Java plugin):
>> I was thinking about writing a Nutch plugin so it would be handy to parse
>> the scraped data using the Nutch command. The problem is that it is hard
>> to write a perfect parser in one go, as anyone who deals with parsers a
>> lot will appreciate. You locate your HTML tags by features you expect to
>> be general (CSS class, id, etc., possibly combined with regular
>> expressions), but when you apply that logic to all the pages, it won't
>> hold for every page. So you need to write several different parsers, run
>> them against the whole dataset (a million pages) in one go, and see which
>> one works best. Then you run the winning parser against all your snapshots
>> (days * a million pages) to get the new dataset.
>>
>
>

Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Posted by Tejas Patil <te...@gmail.com>.
Here is what I would do:
If you are running a crawl, let it run with the default parser. Write a Nutch
plugin with your customized parse implementation so you can evaluate your
parse logic. Now get some real segments (with a subset of those million
pages) and run just the 'bin/nutch parse' command to see how good it is. That
command will run your parser over the segment. Repeat until you have a
satisfactory parser implementation.
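
For example, something along these lines (the segment path is just a
placeholder, and your plugin has to be enabled via plugin.includes):

bin/nutch parse crawl/segments/20140102234841
bin/nutch readseg -dump crawl/segments/20140102234841 parse-check -nofetch -nogenerate -nocontent

The readseg dump lets you eyeball what your parser actually extracted before
you commit to reparsing everything.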

~tejas


On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <bi...@gmail.com> wrote:

> Hi,
>
> I have a robot that scrapes a website daily and stores the HTML locally
> (in Nutch binary format, in the segment/content folder).
>
> The crawl is fairly big: roughly a million pages per day.
> One thing about the HTML pages is that they all follow exactly the same
> format, so I can write a parser in Java that extracts the info I want (say
> unit price, part number, etc.) from one page, and that parser will work for
> most of the pages.
>
> I am wondering whether there is an existing MapReduce template where I
> could just plug in my customized parser and easily start a Hadoop MapReduce
> job. (Actually, there doesn't need to be a reduce step at all; in this case
> each page is simply mapped to its parsed result.)
>
> I was looking at the MapReduce example here:
> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
> but I am having trouble translating it into my real-world Nutch problem.
>
> I know that running MapReduce against the Nutch binary files will be a bit
> different from word count. I looked at the Nutch source code, and it looks
> like the files are sequence files of records, where each record is a
> key/value pair with a Text key and an org.apache.nutch.protocol.Content
> value. How should I configure the map job so that it reads the big raw
> content files, computes the InputSplits correctly, and runs the map tasks?
>
> Thanks a lot!
>
> /usr/bin
>
>
> (Some explanation of why I decided not to write a Java plugin):
> I was thinking about writing a Nutch plugin so it would be handy to parse
> the scraped data using the Nutch command. The problem is that it is hard to
> write a perfect parser in one go, as anyone who deals with parsers a lot
> will appreciate. You locate your HTML tags by features you expect to be
> general (CSS class, id, etc., possibly combined with regular expressions),
> but when you apply that logic to all the pages, it won't hold for every
> page. So you need to write several different parsers, run them against the
> whole dataset (a million pages) in one go, and see which one works best.
> Then you run the winning parser against all your snapshots (days * a
> million pages) to get the new dataset.
>