Posted to dev@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2015/09/18 11:54:17 UTC

Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

Nutch people,

Just in case you missed the announcement below. As you probably know, CC
uses Nutch for its crawls, so this is a fantastic opportunity to put your
Nutch skills to great use!

Julien

---------- Forwarded message ----------
From: Sara Crouse <sa...@commoncrawl.org>
Date: 17 September 2015 at 22:51
Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
To: Common Crawl <co...@googlegroups.com>


Hello again CC community,

In addition to my appointment, another staff transition is on the horizon,
and I would like to ask for your help finding candidates to fill a critical
role. At the end of this month, Stephen Merity (data scientist, crawl
engineer, and much more!) will leave Common Crawl to work on image
recognition and language understanding using deep learning at MetaMind, a
new startup. Stephen has been a great asset to Common Crawl, and we are
grateful that he wishes to remain engaged with us in a volunteer capacity
going forward.

This week we are therefore launching a search to fill the role of Crawl
Engineer/Data Scientist. The job description is below and also posted at
https://commoncrawl.org/jobs/. We appreciate any help you can provide in
spreading the word about this unique opportunity. If you have specific
referrals, or wish to apply, please contact jobs@commoncrawl.org.

Many thanks,

Sara

-------------------------------------------------------

_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_

*Location*
San Francisco or Remote


*Job Summary*
Common Crawl (CC) is the non-profit organization that builds and maintains
the single largest publicly accessible dataset of the world’s knowledge,
encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering
challenges of working with data at the scale of the web sounds exciting to
you, we would love to hear from you. If you have worked on open source
projects before or can share code samples with us, please don't hesitate to
send relevant links along with your application.


*Description*

/Primary Responsibilities/
_Running the crawl_
* Spinning up and managing Hadoop clusters on Amazon EC2
* Running regular comprehensive crawls of the web using Nutch
* Preparing and publishing crawl data to our data hosting partner, Amazon
Web Services
* Incident response and diagnosis of crawl issues as they occur, e.g.:
** Replacing instances lost to EC2 problems or spot instance terminations
(see the sketch after this list)
** Responding to and remedying webmaster queries and issues
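
To give a flavor of the EC2 side of this work, here is a minimal sketch,
assuming the AWS SDK for Java; the AMI ID, instance type, instance count,
and bid price are placeholders, not our actual configuration.

    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.LaunchSpecification;
    import com.amazonaws.services.ec2.model.RequestSpotInstancesRequest;
    import com.amazonaws.services.ec2.model.SpotInstanceRequest;

    public class ReplaceSpotInstances {
        public static void main(String[] args) {
            // Credentials come from the default provider chain
            // (environment, ~/.aws/credentials, or an instance profile).
            AmazonEC2Client ec2 = new AmazonEC2Client();

            // Placeholder AMI and instance type; a real crawl cluster
            // would launch from a pre-baked Hadoop image.
            LaunchSpecification spec = new LaunchSpecification()
                    .withImageId("ami-xxxxxxxx")
                    .withInstanceType("m3.xlarge");

            // Bid for five replacement spot instances at a capped price.
            RequestSpotInstancesRequest request = new RequestSpotInstancesRequest()
                    .withSpotPrice("0.10")
                    .withInstanceCount(5)
                    .withLaunchSpecification(spec);

            for (SpotInstanceRequest r :
                    ec2.requestSpotInstances(request).getSpotInstanceRequests()) {
                System.out.println("Spot request: " + r.getSpotInstanceRequestId());
            }
        }
    }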

_Crawl engineering_
* Maintaining, developing, and deploying new features needed to run the
Nutch crawler, e.g.:
** Providing netiquette features, such as honoring robots.txt, and
load-balancing a crawl across millions of domains (see the sketch after
this list)
** Implementing and improving ranking algorithms to prioritize the crawling
of popular pages
* Extending existing tools to work efficiently with large datasets
* Working with the Nutch community to push improvements to the crawler to
the public
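
On the robots.txt point, here is a minimal sketch of the kind of check
involved, using the crawler-commons library that Nutch itself relies on
for robots parsing; the robots.txt body, URLs, and agent token are made up.

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // A made-up robots.txt body; a polite crawler fetches this
            // from each host before requesting any of its pages.
            byte[] robotsTxt = ("User-agent: *\n"
                    + "Disallow: /private/\n"
                    + "Crawl-delay: 5\n").getBytes();

            BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                    "http://example.com/robots.txt", robotsTxt,
                    "text/plain", "mycrawler"); // our user-agent token

            System.out.println(rules.isAllowed("http://example.com/private/a.html")); // false
            System.out.println(rules.isAllowed("http://example.com/index.html"));     // true
            System.out.println(rules.getCrawlDelay() + " ms between fetches");        // 5000 ms
        }
    }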

/Other Responsibilities/
* Building support tools and artifacts, including documentation, tutorials,
and example code or supporting frameworks for processing CC data with
different tools (a minimal sketch follows this list)
* Identifying and reporting on research and innovations that result from
analysis and derivative use of CC data
* Community evangelism:
** Collaborating with partners in academia and industry
** Engaging regularly with the user discussion group and responding to
frequent inquiries about how to use CC data
** Writing technical blog posts
** Presenting on or representing CC at conferences, meetups, etc.
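
To illustrate the example-code point above, here is a minimal sketch that
prints the target URI of every record in one gzip-compressed WARC file; a
production job would use a proper WARC library and run as a Hadoop job
over many such files.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class WarcUriLister {
        public static void main(String[] args) throws IOException {
            // args[0] is a local path to a .warc.gz file. WARC record
            // headers are plain text, and GZIPInputStream reads the
            // concatenated gzip members these files consist of, so a
            // simple line scan is enough to pull out the URIs.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(args[0])),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("WARC-Target-URI:")) {
                        System.out.println(
                                line.substring("WARC-Target-URI:".length()).trim());
                    }
                }
            }
        }
    }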


*Qualifications*
/Minimum qualifications/
* Fluent in Java (Nutch and Hadoop are core to our mission)
* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
* Knowledge of the Amazon Web Services (AWS) ecosystem
* Experience with Python
* Basic command line Unix knowledge
* BS in Computer Science or equivalent work experience

/Preferred qualifications/
* Experience with running web crawlers
* Cluster computing experience (Hadoop preferred)
* Experience running parallel jobs over dozens of terabytes of data
* Experience committing to open source projects and participating in open
source forums


*About Common Crawl*
The Common Crawl Foundation is a California 501(c)(3) registered non-profit
with the goal of democratizing access to web information by producing and
maintaining an open repository of web crawl data that is universally
accessible and analyzable.

Our vision is of a truly open web that allows open access to information
and enables greater innovation in research, business and education. We
level the playing field by making wholesale extraction, transformation and
analysis of web data cheap and easy.

The Common Crawl Foundation is an Equal Opportunity Employer.


*To Apply*
Please send your cover letter and résumé to jobs@commoncrawl.org.

-- 
You received this message because you are subscribed to the Google Groups
"Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to common-crawl+unsubscribe@googlegroups.com.
To post to this group, send email to common-crawl@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Awesome, thanks for sharing!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





