Posted to user@nutch.apache.org by "Joe Reger, Jr." <re...@gmail.com> on 2005/09/20 17:48:53 UTC
Seeking Nutch Consultant(s) for Six Week Project
Hi All!
I hope this email isn't an intrusion. I'm looking for a consultant (or
consultants plural) to work on a Nutch project. I've sent this out to a few
of you but I wanted to make sure I got it out to everybody.
The client is interested in a "proof of concept." What this boils down to is
a prototype system that crawls terrorist sites with Nutch, creating a web
interface for searching and then adding some additional features like saved
searches, active hot spots and export to blog. I've included a rough outline
of functionality below:
- *Spider* – Visit a starting set of seed sites, provided by the
analyst, and follow links to other sites … index and store what it finds.
Most sites will be in Arabic.
- *Geographical Tagging* – When geo.position,
DC.coverage.spatial, and/or ICBM information is available for a
web page, the location is calculated and stored in the index.
- *Link-graph* – Each URL is ranked based on a number of
parameters like inbound links, etc.
- *Query capability* – Query the set of indexed web pages based
on keyword, geo-location, and Boolean triggers.
- *Saved query capability* – A query can be saved for easy
future viewing.
- *Hot spots* – A summary report showing which pages are
actively being linked to.
- *Hot Memes* – A summary report showing phraseology that is
currently active.
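As a rough illustration of the Geographical Tagging item above, here is a minimal Python sketch that pulls a latitude/longitude pair out of geo.position or ICBM meta tags. It is not Nutch code and not part of the original proposal: the names GeoMetaParser, parse_lat_lon and extract_position are hypothetical, and DC.coverage.spatial is left out because its content is free-form and would need its own parsing rules.

```python
from html.parser import HTMLParser


def parse_lat_lon(text):
    """Parse 'lat;lon' (geo.position style) or 'lat, lon' (ICBM style).

    Returns a (lat, lon) tuple of floats, or None if the value is malformed
    or out of range.
    """
    parts = text.replace(";", ",").split(",")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
        return (lat, lon)
    return None


class GeoMetaParser(HTMLParser):
    """Remembers the first valid position found in a geo.position or ICBM meta tag."""

    def __init__(self):
        super().__init__()
        self.position = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.position is not None:
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name in ("geo.position", "icbm"):
            self.position = parse_lat_lon(content)


def extract_position(html):
    """Return (lat, lon) for a page, or None if no usable tag is present."""
    parser = GeoMetaParser()
    parser.feed(html)
    return parser.position
```

In a real Nutch deployment this logic would presumably live in a parse or indexing plugin written in Java; the sketch only shows the tag handling.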
*Deliverable Work Flow for Analysts:*
- Log in to the Spider tool
- View daily summary reports, export a few hot sites to the blog
- Based on memes appearing in other media (television, radio,
print), do some queries to see how those memes are spreading
online, and export relevant sites to the blog
- Create a saved search in the Spider tool to watch progress of
memes
- Log in to Blog tool
- Review the list of sites that were just exported from the
Spider tool
- Make qualitative analysis, storing it as part of the
blog entry
- Evaluate site / page via I4 and other quantitative
methodologies
- Categorize site via datablogging fields (e.g. area = {Middle
East, Indonesia, etc.})
- Publish the blog entries for other analysts to see
- Analysts subscribe to each other's blogs via RSS and read
daily analysis in one location without having to visit
individual blog sites
- Collaboration happens via comments on blogs,
posts-about-posts, etc. This collaboration is on
cultural/contextual elements and is documented, archived, and
searchable.
The deliverable is an installable web application (Tomcat, Java, MySQL)
along with installation, configuration and startup support. I've tried to
pare down the requirements to what I know Nutch can do well out of the box
and what we can wrap fairly quickly. This is a six-week proof of concept, so
I'll need a working beta within four weeks.
Don't worry about the blog collaboration tool at all... that's a piece of
software that we have currently and can export to via the MetaWeblogAPI very
quickly and easily. We need help on the Nutch side but I wanted you to see
the workflow from Nutch to the blogging collaboration system.
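To make the export step a little more concrete, here is a minimal Python sketch of what a MetaWeblog API call for one exported site might look like. It only builds the XML-RPC request body for metaWeblog.newPost and does not send it anywhere. The function name build_new_post_request, the post layout, and all argument values are assumptions for illustration, not details of the existing blog tool.

```python
import xmlrpc.client
from html import escape


def build_new_post_request(blog_id, username, password, title, url, notes,
                           publish=False):
    """Build the XML-RPC payload for metaWeblog.newPost.

    The post is left unpublished by default so that an analyst can review
    it in the blog tool before publishing.
    """
    safe_url = escape(url, quote=True)
    post = {
        "title": title,
        # Hypothetical layout: a link to the exported page plus free-text notes.
        "description": '<p><a href="%s">%s</a></p><p>%s</p>'
                       % (safe_url, safe_url, escape(notes)),
    }
    return xmlrpc.client.dumps(
        (blog_id, username, password, post, publish),
        methodname="metaWeblog.newPost",
    )
```

Sending the payload would normally go through xmlrpc.client.ServerProxy pointed at the blog's XML-RPC endpoint, which is not specified in this message.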
A few questions for you:
1) If you're interested, when can you start? The client is a hurry-up-and-wait
client, but they may be willing to jump quickly in the coming days.
2) What are your hourly consulting/coding rates?
3) How many hours over the course of six weeks do you guess this project
would take? I'm guessing one to two people full time.
4) Is this sort of deliverable something that you feel you can pull off on
your own in four weeks (to beta) or would you recommend I bring in somebody
else with Nutch/web experience? Do you have anybody in mind that you work
well with?
As a software developer myself I know that these are heavily loaded
questions given the lack of exact design requirements. I'm looking for
somebody who feels this is within their capability and is willing to work
hard to make it happen. The client is willing to trust us with many of the
details and if we do a good job this should lead to a much more robust and
dynamic application. And, of course, building the prototype gives you a good
leg up on getting the contract once they move forward. So, please answer to
the best of your ability... this isn't a commitment at this point... just a
ballpark to get me moving forward with the client.
I'm going to speak with the client again later this afternoon and would
like to have a sense of what's possible. I apologize for the urgency... the
client awoke from something of a slumber yesterday.
Best,
Joe Reger
Re: Seeking Nutch Consultant(s) for Six Week Project
Posted by Jack Tang <hi...@gmail.com>.
Hi Joe
Is part-time OK for your project?
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: re-generating a fetchlist
Posted by Gal Nitzan <gn...@usa.net>.
AJ Chen wrote:
> when generating the fetch list, each page's next fetch time is
> advanced by 7 days (hard-coded). thus, the next fetch list can be
> empty if you try to generate a new fetch list. try to use "-adddays 8"
> in your command to trick nutch that next fetch time is up.
>
> -AJ
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> my fetch crashed and by mistake I have deleted the segment with the
>> fetchlist.
>>
>> running bin/nutch generate - generates a new fetchlist with 0 items
>>
>> 050921 004033 Processing page 550000...
>> 050921 004033 Overall processing: Sorted 0 entries in 0.0 seconds.
>> 050921 004033 Overall processing: Sorted NaN entries/second
>> 050921 004033 FetchListTool completed
>>
>>
>> Help...
>>
>> Gal
>>
>
Thanks for the reply. I used Michael's advice and changed the system
time before the generate and back again afterwards. However, adddays
seems to be what I needed...
Regards,
Gal
Re: re-generating a fetchlist
Posted by AJ Chen <an...@sbcglobal.net>.
When the fetch list is generated, each page's next fetch time is advanced
by 7 days (hard-coded). As a result, the next fetch list can be empty if
you try to generate a new one right away. Try adding "-adddays 8" to your
command to trick Nutch into thinking the next fetch time has arrived.
-AJ
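The behaviour described above can be illustrated with a small toy model in Python. This is not Nutch's code: Page, generate and GENERATE_ADVANCE are hypothetical names, and the 7-day figure is taken from the explanation above rather than checked against the source. The sketch only shows why a second generate comes back empty and why looking 8 days ahead brings the pages back.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption taken from the explanation above: generating a fetch list
# pushes each selected page's next fetch time 7 days into the future.
GENERATE_ADVANCE = timedelta(days=7)


@dataclass
class Page:
    url: str
    next_fetch: datetime


def generate(pages, now, add_days=0):
    """Return the URLs that are due and advance their next fetch time.

    add_days plays the role of the -adddays option: it moves the cutoff
    forward so that pages due within that many days are treated as due now.
    """
    cutoff = now + timedelta(days=add_days)
    due = [page for page in pages if page.next_fetch <= cutoff]
    for page in due:
        page.next_fetch = now + GENERATE_ADVANCE
    return [page.url for page in due]


if __name__ == "__main__":
    now = datetime(2005, 9, 21)
    pages = [Page("http://example.org/", now - timedelta(days=1))]
    print(generate(pages, now))              # first run: the page is due
    print(generate(pages, now))              # second run: nothing is due
    print(generate(pages, now, add_days=8))  # looking 8 days ahead: due again
```

In this model a lost segment leaves the pages marked as "scheduled", which matches the empty fetch list in the log, and any look-ahead longer than the 7-day advance makes them eligible again.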
Gal Nitzan wrote:
> Hi,
>
> my fetch crashed and by mistake I have deleted the segment with the
> fetchlist.
>
> running bin/nutch generate - generates a new fetchlist with 0 items
>
> 050921 004033 Processing page 550000...
> 050921 004033 Overall processing: Sorted 0 entries in 0.0 seconds.
> 050921 004033 Overall processing: Sorted NaN entries/second
> 050921 004033 FetchListTool completed
>
>
> Help...
>
> Gal
>
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
Re: re-generating a fetchlist
Posted by Michael Ji <fj...@yahoo.com>.
The fetchlist is generated from the webdb. As long as you
didn't delete the webdb by mistake, there is no problem
generating a new fetchlist from it.
I guess the problem is the fetch interval setting:
you probably have the default 30-day interval, so if you
run generate within that waiting period, nothing will be
added to the fetchlist.
So, change your system time to a date past that period and
run generate again. Hope that works.
Michael Ji,
--- Gal Nitzan <gn...@usa.net> wrote:
> Hi,
>
> my fetch crashed and by mistake I have deleted the
> segment with the
> fetchlist.
>
> running bin/nutch generate - generates a new
> fetchlist with 0 items
>
> 050921 004033 Processing page 550000...
> 050921 004033 Overall processing: Sorted 0 entries
> in 0.0 seconds.
> 050921 004033 Overall processing: Sorted NaN
> entries/second
> 050921 004033 FetchListTool completed
>
>
> Help...
>
> Gal
>
re-generating a fetchlist
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
my fetch crashed and by mistake I have deleted the segment with the
fetchlist.
running bin/nutch generate - generates a new fetchlist with 0 items
050921 004033 Processing page 550000...
050921 004033 Overall processing: Sorted 0 entries in 0.0 seconds.
050921 004033 Overall processing: Sorted NaN entries/second
050921 004033 FetchListTool completed
Help...
Gal