Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/08/11 12:07:16 UTC

[jira] Created: (NUTCH-881) Good quality documentation for Nutch

Good quality documentation for Nutch
------------------------------------

                 Key: NUTCH-881
                 URL: https://issues.apache.org/jira/browse/NUTCH-881
             Project: Nutch
          Issue Type: Improvement
          Components: documentation
    Affects Versions: 2.0
            Reporter: Andrzej Bialecki 


This is, and has been, a long-standing request from Nutch users. It becomes an acute need as we redesign Nutch 2.0, because the collective knowledge and the Wiki will no longer be useful without a massive amount of editing.

IMHO the reference documentation should be in SVN, and not on the Wiki - the Wiki is good for casual information and recipes but I think it's too messy and not reliable enough as a reference.

I propose to start with the following:

 1. let's decide on the format of the docs. Each format has its own pros and cons:
  * HTML: easy to work with, but formatting may be messy unless we edit it by hand, at which point it's no longer so easy... Good toolchains to convert to other formats, but limited expressiveness of larger structures (e.g. book, chapters, TOC, multi-column layouts, etc).
  * Docbook: learning curve is higher, but not insurmountable... Naturally yields very good structure. Figures/diagrams may be problematic - different renderers (html, pdf) like to treat the scaling and placing somewhat differently.
  * Wiki-style (Confluence or TWiki): easy to use, but limited control over larger structures. Maven Doxia can format cwiki, twiki, and a host of other formats to e.g. html and pdf.
  * other?

 2. start documenting the main tools and the main APIs (e.g. the plugins and all the extension points). We can of course reuse material from the Wiki and from various presentations (e.g. the ApacheCon slides).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-881) Good quality documentation for Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899815#action_12899815 ] 

Andrzej Bialecki  commented on NUTCH-881:
-----------------------------------------

bq. So what is new in Nutch 2.0 which doesn't appear in Nutch 1.x ? Gora is the main thing which comes to mind.

Yes. We also removed all search-related code from Nutch and rely exclusively on Solr to perform searching. This means that some APIs have been removed (e.g. query filters, text analysis, Lucene indexing backend).

bq. How do the config files differ?

We still use the same nutch-default/nutch-site.xml, plus per-plugin config files. Some properties have changed, e.g. the ones that limit the max. number of urls per host in the generator. We added some Gora-related files, gora.properties and gora-*-mapping.xml, that define what driver to use and how to map webtable columns onto storage-specific columns/fields.
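
To illustrate the nutch-default/nutch-site layering mentioned above, here is a minimal sketch of "site values override defaults". It is not Nutch code: the real files are Hadoop-style XML read through Hadoop's Configuration class, whereas this sketch uses java.util.Properties as a stand-in, and the class name and property names (example.*) are hypothetical.

```java
import java.util.Properties;

class LayeredConfig {
    // Returns a view in which site values win and anything not set
    // in the site file falls back to the defaults.
    static Properties layered(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults); // defaults act as fallback
        for (String key : site.stringPropertyNames()) {
            merged.setProperty(key, site.getProperty(key));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical property names, for illustration only.
        Properties defaults = new Properties();
        defaults.setProperty("example.threads", "10");
        defaults.setProperty("example.agent", "default-agent");

        Properties site = new Properties();
        site.setProperty("example.threads", "50"); // local override

        Properties conf = layered(defaults, site);
        System.out.println("example.threads = " + conf.getProperty("example.threads"));
        System.out.println("example.agent   = " + conf.getProperty("example.agent"));
    }
}
```

The same idea applies to the per-plugin and Gora-related files: each is a separate resource, so the reference documentation would need to say which file a given property belongs in.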

bq. How does Nutch's use of Hadoop differ?

All jobs now use GoraInputFormat / GoraOutputFormat, which hide the details of the actual data storage backend.

bq. How do the command lines differ? (Presumably you need different command lines to say where to store the crawldb, right?)

Yes. Actually, this could be a separate issue to be solved - currently we assume there is one Nutch webtable per storage backend, so we don't specify the "db identifier" anywhere... but this prevents us from defining multiple crawl configs that use the same backend, so it should be addressed.



[jira] Commented: (NUTCH-881) Good quality documentation for Nutch

Posted by "Alex McLintock (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899776#action_12899776 ] 

Alex McLintock commented on NUTCH-881:
--------------------------------------

So what is new in Nutch 2.0 which doesn't appear in Nutch 1.x ?

Gora is the main thing which comes to mind. 

How do the config files differ?
How does Nutch's use of Hadoop differ? 
How do the command lines differ? (Presumably you need different command lines to say *where* to store the crawldb, right?)

Anything else?



[jira] Commented: (NUTCH-881) Good quality documentation for Nutch

Posted by "Alex McLintock (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897929#action_12897929 ] 

Alex McLintock commented on NUTCH-881:
--------------------------------------

The "other?" would be DITA. This is in some ways "DocBook Version 2" in that it seems to have most of the good features of DocBook, but to be better suited to large software projects than to single documents.

Basically the documentation is written and stored in XML (in svn/git/cvs whatever). XSLT / xsl:fo is used to generate html and pdf from that single source. 
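
As a rough illustration of the single-source idea, here is a minimal sketch that turns one XML topic into HTML with an XSLT stylesheet, using only the JDK's javax.xml.transform API. The topic and the stylesheet are hypothetical and heavily simplified; real DITA topics follow the DITA schemas and are normally processed with a dedicated toolchain, and PDF output would need an extra XSL-FO step that is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SingleSourceDocs {
    // Hypothetical, simplified topic; not a valid DITA document.
    static final String TOPIC =
        "<topic id=\"example\"><title>Example topic</title>"
      + "<body><p>One paragraph of body text.</p></body></topic>";

    // Hypothetical stylesheet mapping the topic onto a bare HTML page.
    static final String TO_HTML =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/topic\">"
      + "<html><body><h1><xsl:value-of select=\"title\"/></h1>"
      + "<xsl:for-each select=\"body/p\">"
      + "<p><xsl:value-of select=\".\"/></p>"
      + "</xsl:for-each>"
      + "</body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML source and returns the result as text.
    static String render(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(TOPIC, TO_HTML));
    }
}
```

A second stylesheet over the same topic could target another output format, which is the point of keeping one XML source under version control.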


There is a precedent too. Apache Derby is using DITA for its documentation:

http://db.apache.org/derby/manuals/dita.html

I don't have experience of this, but DITA was recommended to me by a friendly documentation professional. 


I'm happy to learn about this and try to set up a framework



[jira] Commented: (NUTCH-881) Good quality documentation for Nutch

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897218#action_12897218 ] 

Julien Nioche commented on NUTCH-881:
-------------------------------------

+1 for storing the documentation in SVN

As for the format, I don't really like the idea of writing in HTML and would rather use something that could generate different formats (html, pdf). I have no experience of DocBook or Doxia, but would favour anything that can be used easily with Ant so that we could generate the documentation as part of the build process.





[jira] Commented: (NUTCH-881) Good quality documentation for Nutch

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898052#action_12898052 ] 

Chris A. Mattmann commented on NUTCH-881:
-----------------------------------------

{quote}
I'm happy to learn about this and try to set up a framework
{quote}

Alex, +1 from me, worth a try and looking forward to seeing what you come up with.

Cheers,
Chris

