Posted to user@nutch.apache.org by "Joe Reger, Jr." <re...@gmail.com> on 2005/09/20 17:48:53 UTC

Seeking Nutch Consultant(s) for Six Week Project

Hi All!


I hope this email isn't an intrusion. I'm looking for a consultant (or 
consultants plural) to work on a Nutch project. I've sent this out to a few 
of you but I wanted to make sure I got it out to everybody.
 
The client is interested in a "proof of concept." What this boils down to is 
a prototype system that crawls terrorist sites with Nutch, provides a web 
interface for searching, and adds some additional features like saved 
searches, active hot spots and export to blog. I've included a rough outline 
of functionality below:

   - *Spider* – Visit a starting set of seed sites, provided by the 
   analyst, and follow links to other sites … index and store what it finds. 
   Most sites will be in Arabic. 
      - *Geographical Tagging* – When geo.position, 
      DC.coverage.spatial, and/or ICBM information is available for a 
      web page the location is calculated and stored in the index. 
      - *Link-graph* – Each URL is ranked based on a number of 
      parameters like inbound links, etc. 
      - *Query capability* – Query the set of indexed web pages based 
      on keyword, geo-location, and Boolean triggers. 
      - *Saved query capability* – A query can be saved for easy 
      future viewing. 
      - *Hot spots* – A summary report showing which pages are 
      actively being linked to. 
      - *Hot Memes* – A summary report showing phraseology that is 
      currently active. 
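
As a rough illustration of the geographical tagging item above, here is a minimal Python sketch that pulls a latitude/longitude pair out of geo.position or ICBM meta tags. It is not Nutch code: the names GeoMetaParser and extract_lat_lon are hypothetical, it assumes the tag content is two numbers separated by a semicolon, comma, or whitespace, and it leaves DC.coverage.spatial out because that value is free-form.

```python
from html.parser import HTMLParser
import re


class GeoMetaParser(HTMLParser):
    """Collect the content of <meta> tags whose name is geo-related."""

    WANTED = {"geo.position", "icbm"}

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in self.WANTED and attr.get("content"):
            self.found[name] = attr["content"]


def extract_lat_lon(html):
    """Return (lat, lon) from the first usable geo meta tag, else None."""
    parser = GeoMetaParser()
    parser.feed(html)
    for name in ("geo.position", "icbm"):
        value = parser.found.get(name)
        if not value:
            continue
        # Assumed format: two numbers split by ';', ',' or whitespace.
        parts = [p for p in re.split(r"[;,\s]+", value.strip()) if p]
        if len(parts) != 2:
            continue
        try:
            lat, lon = float(parts[0]), float(parts[1])
        except ValueError:
            continue
        # Reject values outside the valid coordinate ranges.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            return lat, lon
    return None


page = '<html><head><meta name="ICBM" content="37.4419, -122.1430"></head></html>'
print(extract_lat_lon(page))
```

In a real crawler the resulting pair would be stored as extra index fields next to the page; how that is wired into Nutch is outside the scope of this sketch.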
   
   *Deliverable Work Flow for Analysts:*
   - Log in to the Spider tool 
      - View daily summary reports, export a few hot sites to the blog 
      - Based on memes appearing in other media (television, radio, 
      print) do some queries to see how those memes are spreading 
      online, export relevant sites to the blog 
      - Create a saved search in the Spider tool to watch progress of 
      memes 
   - Log in to the Blog tool 
      - Review the list of sites that were just exported from the 
      Spider tool 
         - Make qualitative analysis, storing it as part of the 
         blog entry 
         - Evaluate site / page via I4 and other quantitative 
         methodologies 
         - Categorize site via datablogging fields (e.g. area = {Middle 
         East, Indonesia, etc.})
      - Publish the blog entries for other analysts to see 
      - Analysts subscribe to each other's blogs via RSS and read 
      daily analysis in one location without having to visit 
      individual blog sites 
      - Collaboration happens via comments on blogs, 
      posts-about-posts, etc. This collaboration is on 
      cultural/contextual elements and is documented, archived, and 
      searchable.
   
The deliverable is an installable web application (Tomcat, Java, MySQL) 
along with installation, configuration and startup support. I've tried to 
pare down the requirements to what I know Nutch can do well out of the box 
and that we can wrap fairly quickly. This is a six-week proof of concept, so 
I'll need a working beta within four weeks.

Don't worry about the blog collaboration tool at all... that's a piece of 
software that we already have and can export to via the MetaWeblog API very 
quickly and easily. We need help on the Nutch side but I wanted you to see 
the workflow from Nutch to the blogging collaboration system.

A few questions for you:

1) If you're interested, when can you start? The client is a 
hurry-up-and-wait client, but they may be willing to jump quickly in the 
coming days.
2) What are your hourly consulting/coding rates?
3) How many hours over the course of six weeks do you guess this project 
would take? I'm guessing one to two people full time.
4) Is this sort of deliverable something that you feel you can pull off on 
your own in four weeks (to beta) or would you recommend I bring in somebody 
else with Nutch/web experience? Do you have anybody in mind that you work 
well with?

As a software developer myself I know that these are heavily loaded 
questions given the lack of exact design requirements. I'm looking for 
somebody who feels this is within their capability and is willing to work 
hard to make it happen. The client is willing to trust us with many of the 
details and if we do a good job this should lead to a much more robust and 
dynamic application. And, of course, building the prototype gives you a good 
leg up on getting the contract once they move forward. So, please answer to 
the best of your ability... this isn't a commitment at this point... just a 
ballpark to get me moving forward with the client.

I'm going to speak with the client again later this afternoon and would 
like to have a sense of what's possible. I apologize for the urgency... the 
client awoke from something of a slumber yesterday.

Best,
Joe Reger

Re: Seeking Nutch Consultant(s) for Six Week Project

Posted by Jack Tang <hi...@gmail.com>.
Hi Joe

Is part-time OK for your project?

/Jack



-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: re-generating a fetchlist

Posted by Gal Nitzan <gn...@usa.net>.
AJ Chen wrote:
> When generating the fetch list, each page's next fetch time is 
> advanced by 7 days (hard-coded). Thus, the next fetch list can be 
> empty if you try to generate a new one within those 7 days. Try adding 
> "-adddays 8" to your command to trick Nutch into thinking the next 
> fetch time is up.
>
> -AJ
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> my fetch crashed and by mistake I have deleted the segment with the 
>> fetchlist.
>>
>> running bin/nutch generate - generates a new fetchlist with 0 items
>>
>> 050921 004033 Processing page 550000...
>> 050921 004033 Overall processing: Sorted 0 entries in 0.0 seconds.
>> 050921 004033 Overall processing: Sorted NaN entries/second
>> 050921 004033 FetchListTool completed
>>
>>
>> Help...
>>
>> Gal
>>
>
Thanks for the reply. I used Michael's advice and changed the system 
time before and after running generate. However, -adddays seems to be 
what I needed...

Regards,

Gal

Re: re-generating a fetchlist

Posted by AJ Chen <an...@sbcglobal.net>.
When generating the fetch list, each page's next fetch time is advanced 
by 7 days (hard-coded). Thus, the next fetch list can be empty if you 
try to generate a new one within those 7 days. Try adding "-adddays 8" 
to your command to trick Nutch into thinking the next fetch time is up.
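
To make the reasoning concrete, here is a toy Python model of that behaviour. It is only a sketch and not Nutch's actual code: the dict standing in for the webdb, the function name generate_fetchlist, and the add_days parameter are all hypothetical, and the 7-day advance is taken from the description above.

```python
from datetime import datetime, timedelta

# Assumption from the message above: generating a fetchlist pushes each
# selected page's next fetch time 7 days into the future.
GENERATE_DELAY = timedelta(days=7)


def generate_fetchlist(next_fetch, now, add_days=0):
    """Return the URLs that are due and advance their next fetch time.

    next_fetch is a hypothetical stand-in for the webdb (URL -> datetime).
    add_days mimics "-adddays": pages are compared against now + add_days.
    """
    cutoff = now + timedelta(days=add_days)
    due = sorted(url for url, when in next_fetch.items() if when <= cutoff)
    for url in due:
        next_fetch[url] = now + GENERATE_DELAY
    return due


now = datetime(2005, 9, 21, 0, 40, 33)
webdb = {
    "http://a.example/": now - timedelta(days=1),
    "http://b.example/": now - timedelta(hours=2),
}

first = generate_fetchlist(webdb, now)              # both pages are due
second = generate_fetchlist(webdb, now)             # segment lost, retry: nothing is due
third = generate_fetchlist(webdb, now, add_days=8)  # look 8 days ahead: due again
print(len(first), len(second), len(third))          # prints "2 0 2"
```

The second call comes back empty for the same reason the regenerated fetchlist had 0 entries: the pages were already pushed out by the first generate, and only a cutoff more than 7 days ahead reaches them.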

-AJ

Gal Nitzan wrote:

> Hi,
>
> my fetch crashed and by mistake I have deleted the segment with the 
> fetchlist.
>
> running bin/nutch generate - generates a new fetchlist with 0 items
>
> 050921 004033 Processing page 550000...
> 050921 004033 Overall processing: Sorted 0 entries in 0.0 seconds.
> 050921 004033 Overall processing: Sorted NaN entries/second
> 050921 004033 FetchListTool completed
>
>
> Help...
>
> Gal
>

-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------

Re: re-generating a fetchlist

Posted by Michael Ji <fj...@yahoo.com>.
The fetchlist is generated from the webdb. As long as you
didn't delete the webdb by mistake, there is no problem
generating a new fetchlist from it.

I guess the problem is the fetch interval setting. You
probably have the default 30-day interval, so if you
run generate within that waiting period, nothing will
be added to the fetchlist.

So change your system time to a point past that period
and run generate again. Hope that works.

Michael Ji,

--- Gal Nitzan <gn...@usa.net> wrote:

> Hi,
> 
> my fetch crashed and by mistake I have deleted the
> segment with the 
> fetchlist.
> 
> running bin/nutch generate - generates a new
> fetchlist with 0 items
> 
> 050921 004033 Processing page 550000...
> 050921 004033 Overall processing: Sorted 0 entries
> in 0.0 seconds.
> 050921 004033 Overall processing: Sorted NaN
> entries/second
> 050921 004033 FetchListTool completed
> 
> 
> Help...
> 
> Gal
> 



		

re-generating a fetchlist

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

My fetch crashed, and by mistake I deleted the segment with the 
fetchlist.

Running bin/nutch generate now produces a new fetchlist with 0 items:

050921 004033 Processing page 550000...
050921 004033 Overall processing: Sorted 0 entries in 0.0 seconds.
050921 004033 Overall processing: Sorted NaN entries/second
050921 004033 FetchListTool completed


Help...

Gal