Posted to user@nutch.apache.org by LoneEagle70 <av...@e-djuster.com> on 2007/10/17 14:53:07 UTC

Extracting html pages from db

Hi,

I was able to install Nutch 0.9, crawl a site, and use the web interface to
do full-text search of my db.

But we need to extract information from all the HTML pages.

So, is there a way to extract HTML pages from the db?
-- 
View this message in context: http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13253122
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Extracting html pages from db

Posted by misc <mi...@robotgenius.net>.
Hello-

    I've done this, I think it is

    nutch readseg -dump <segment_dir> <dumpfile>

to dump all the HTML of everything in a segment.  You can also specify which
URL you are interested in; type nutch readseg with no arguments for details.
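
    To pull out a single page there is also a -get form (the segment path and
URL below are placeholders; I am going from memory here, so check the usage
message your own build prints):

    nutch readseg -get <segment_dir> <url>

    If I remember right, -dump treats its second argument as an output
directory and writes everything into a file named "dump" inside it.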

                        see you
                            -Jim


----- Original Message ----- 
From: "LoneEagle70" <av...@e-djuster.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, October 17, 2007 5:53 AM
Subject: Extracting html pages from db


>
> Hi,
>
> I was able to install Nutch 0.9, crawl a site, and use the web interface
> to do full-text search of my db.
>
> But we need to extract information from all the HTML pages.
>
> So, is there a way to extract HTML pages from the db?
> -- 
> View this message in context: 
> http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13253122
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


Re: Extracting html pages from db

Posted by Dennis Kubes <ku...@apache.org>.
There is currently no way to do that.  You would need to write a map job 
to pull the data from Content within Segments.
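
To give an idea of what that involves, here is a rough, untested sketch.  It
is not the full map job, just a plain sequential reader over a single part
file of a segment's content directory; the class name, paths and output
naming are invented for the example, and the API calls are the 0.9-era
Nutch/Hadoop ones, so treat it as a starting point only:

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical helper: reads one part file of a segment's content
// directory and writes every HTML record out as a local file.
public class DumpHtml {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl/segments/20071017123456/content/part-00000/data
    Path data = new Path(args[0]);
    File outDir = new File(args[1]);
    outDir.mkdirs();

    // Segment content is a SequenceFile of <Text url, Content content>.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    int i = 0;
    while (reader.next(url, content)) {
      String type = content.getContentType();
      if (type != null && type.startsWith("text/html")) {
        // Number the output files instead of deriving names from the URL.
        FileOutputStream out =
            new FileOutputStream(new File(outDir, (i++) + ".html"));
        out.write(content.getContent());   // raw bytes as fetched
        out.close();
      }
    }
    reader.close();
  }
}

A real job would run the same logic as a map over every part file and would
probably derive the output file names from the URLs instead of a counter.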

Dennis Kubes

LoneEagle70 wrote:
> Do you have any idea how to extract, from the command line, all my HTML
> files stored in the db?
> 
> Dennis Kubes-2 wrote:
>> Pulling out specific information for each site could be done through 
>> HtmlParseFilter implementations.  Look at 
>> org.apache.nutch.parse.HtmlParseFilter and its implementations.  The 
>> specific fields you extract can be stored in MetaData in ParseData.  You 
>> can then access that information in other jobs such as indexer.  Hope 
>> this helps.
>>
>> Dennis Kubes
>>
>> LoneEagle70 wrote:
>>> I do not want to do it through the web app.
>>>
>>> Is there a way to extract all the HTML files into a directory from the
>>> command line, the way I can display stats? I tried the dump, but it was
>>> not what I wanted. I really want only the HTML pages so I can take
>>> information from them.
>>>
>>> Here is my problem: we are looking for a program that will do web
>>> crawling, but it must be customized for each site we need, because those
>>> pages are generated based on parameters. Also, we need to extract
>>> information (product, price, manufacturer, ...). So, if you have
>>> experience with Nutch, you could help me out. Can I customize it through
>>> hooks? What can/can't I do?
>>>
>>> Thanks for your help! :)
>>>
>>> Dennis Kubes-2 wrote:
>>>> It depends on what you are trying to do.  Content in segments stores the 
>>>> full content (html, etc.) of each page.  The cached.jsp page displays 
>>>> full content.
>>>>
>>>> Dennis Kubes
>>>>
>>>>
>>>> LoneEagle70 wrote:
>>>>> Hi,
>>>>>
>>>>> I was able to install Nutch 0.9, crawl a site, and use the web
>>>>> interface to do full-text search of my db.
>>>>>
>>>>> But we need to extract information from all the HTML pages.
>>>>>
>>>>> So, is there a way to extract HTML pages from the db?
>>
> 

Re: Extracting html pages from db

Posted by LoneEagle70 <av...@e-djuster.com>.
Do you have any idea how to extract, from the command line, all my HTML files
stored in the db?

Dennis Kubes-2 wrote:
> 
> Pulling out specific information for each site could be done through 
> HtmlParseFilter implementations.  Look at 
> org.apache.nutch.parse.HtmlParseFilter and its implementations.  The 
> specific fields you extract can be stored in MetaData in ParseData.  You 
> can then access that information in other jobs such as indexer.  Hope 
> this helps.
> 
> Dennis Kubes
> 
> LoneEagle70 wrote:
>> I do not want to do it through the web app.
>> 
>> Is there a way to extract all the HTML files into a directory from the
>> command line, the way I can display stats? I tried the dump, but it was
>> not what I wanted. I really want only the HTML pages so I can take
>> information from them.
>> 
>> Here is my problem: we are looking for a program that will do web
>> crawling, but it must be customized for each site we need, because those
>> pages are generated based on parameters. Also, we need to extract
>> information (product, price, manufacturer, ...). So, if you have
>> experience with Nutch, you could help me out. Can I customize it through
>> hooks? What can/can't I do?
>> 
>> Thanks for your help! :)
>> 
>> Dennis Kubes-2 wrote:
>>> It depends on what you are trying to do.  Content in segments stores the 
>>> full content (html, etc.) of each page.  The cached.jsp page displays 
>>> full content.
>>>
>>> Dennis Kubes
>>>
>>>
>>> LoneEagle70 wrote:
>>>> Hi,
>>>>
>>>> I was able to install Nutch 0.9, crawl a site, and use the web
>>>> interface to do full-text search of my db.
>>>>
>>>> But we need to extract information from all the HTML pages.
>>>>
>>>> So, is there a way to extract HTML pages from the db?
>>>
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13258870
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Extracting html pages from db

Posted by Dennis Kubes <ku...@apache.org>.
Pulling out specific information for each site could be done through 
HtmlParseFilter implementations.  Look at 
org.apache.nutch.parse.HtmlParseFilter and its implementations.  The 
specific fields you extract can be stored in MetaData in ParseData.  You 
can then access that information in other jobs such as indexer.  Hope 
this helps.
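
As a rough illustration (untested; the filter() signature below is the
0.9-style one, and later versions return a ParseResult instead, so match it
to your source tree), a site-specific filter could look something like this.
The package, class name, the <span class="price"> rule and the
"product.price" metadata key are all invented for the example:

package com.example.nutch;  // hypothetical plugin package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Pulls one site-specific field out of the parsed DOM and stores it
 *  in the parse metadata so later jobs (e.g. the indexer) can use it. */
public class PriceExtractor implements HtmlParseFilter {

  private Configuration conf;

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Site-specific rule: take the text of the first <span class="price">.
    String price = findByClass(doc, "span", "price");
    if (price != null) {
      // Assumes ParseData.getParseMeta() as in the 0.8/0.9 metadata API.
      parse.getData().getParseMeta().set("product.price", price.trim());
    }
    return parse;
  }

  // Naive recursive DOM walk; real pages may need something more robust.
  private String findByClass(Node node, String tag, String clazz) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && tag.equalsIgnoreCase(node.getNodeName())
        && clazz.equals(((Element) node).getAttribute("class"))) {
      return node.getTextContent();  // needs a DOM Level 3 implementation
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findByClass(children.item(i), tag, clazz);
      if (found != null) return found;
    }
    return null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The plugin still needs the usual plugin.xml entry declaring an extension of
the org.apache.nutch.parse.HtmlParseFilter extension point, and an
IndexingFilter on top of it if the extracted field should end up in the index.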

Dennis Kubes

LoneEagle70 wrote:
> I do not want to do it through the web app.
> 
> Is there a way to extract all the HTML files into a directory from the
> command line, the way I can display stats? I tried the dump, but it was not
> what I wanted. I really want only the HTML pages so I can take information
> from them.
> 
> Here is my problem: we are looking for a program that will do web crawling,
> but it must be customized for each site we need, because those pages are
> generated based on parameters. Also, we need to extract information
> (product, price, manufacturer, ...). So, if you have experience with Nutch,
> you could help me out. Can I customize it through hooks? What can/can't I do?
> 
> Thanks for your help! :)
> 
> Dennis Kubes-2 wrote:
>> It depends on what you are trying to do.  Content in segments stores the 
>> full content (html, etc.) of each page.  The cached.jsp page displays 
>> full content.
>>
>> Dennis Kubes
>>
>>
>> LoneEagle70 wrote:
>>> Hi,
>>>
>>> I was able to install Nutch 0.9, crawl a site, and use the web interface
>>> to do full-text search of my db.
>>>
>>> But we need to extract information from all the HTML pages.
>>>
>>> So, is there a way to extract HTML pages from the db?
>>
> 

Re: Extracting html pages from db

Posted by LoneEagle70 <av...@e-djuster.com>.
I do not want to do it through the web app.

Is there a way to extract all the HTML files into a directory from the
command line, the way I can display stats? I tried the dump, but it was not
what I wanted. I really want only the HTML pages so I can take information
from them.

Here is my problem: we are looking for a program that will do web crawling,
but it must be customized for each site we need, because those pages are
generated based on parameters. Also, we need to extract information
(product, price, manufacturer, ...). So, if you have experience with Nutch,
you could help me out. Can I customize it through hooks? What can/can't I do?

Thanks for your help! :)

Dennis Kubes-2 wrote:
> 
> It depends on what you are trying to do.  Content in segments stores the 
> full content (html, etc.) of each page.  The cached.jsp page displays 
> full content.
> 
> Dennis Kubes
> 
> 
> LoneEagle70 wrote:
>> Hi,
>> 
>> I was able to install Nutch 0.9, crawl a site, and use the web interface
>> to do full-text search of my db.
>> 
>> But we need to extract information from all the HTML pages.
>> 
>> So, is there a way to extract HTML pages from the db?
> 
> 

-- 
View this message in context: http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13258493
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Extracting html pages from db

Posted by Dennis Kubes <ku...@apache.org>.
It depends on what you are trying to do.  Content in segments stores the 
full content (html, etc.) of each page.  The cached.jsp page displays 
full content.
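
For reference, a segment directory from a 0.9 crawl looks roughly like this
(the timestamp is just an example); the raw fetched bytes live under content:

    crawl/segments/20071017123456/
        content/         raw fetched content (protocol Content records)
        crawl_fetch/     fetch status of each url
        crawl_generate/  the fetch list that was generated
        crawl_parse/     outlink/crawldb updates produced by parsing
        parse_data/      parsed metadata, outlinks, titles
        parse_text/      extracted plain text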

Dennis Kubes


LoneEagle70 wrote:
> Hi,
> 
> I was able to install Nutch 0.9, crawl a site, and use the web interface to
> do full-text search of my db.
> 
> But we need to extract information from all the HTML pages.
> 
> So, is there a way to extract HTML pages from the db?