Posted to commits@nutch.apache.org by sn...@apache.org on 2018/06/12 16:04:57 UTC

[nutch] branch master updated (4bcaeeb -> 106df96)

This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


    from 4bcaeeb  Merge pull request #328 from sebastian-nagel/nutch-2576-protocol-okhttp
     add 4cf9682  NUTCH-2549 protocol-http does not behave the same as browsers - add unit test class to emulate bad HTTP server sending erroneous HTTP headers, etc. - add unit tests for processing of chunked content (test NUTCH-2562 and NUTCH-2575)
     add 6239655  NUTCH-2555 URL normalization problem: path not starting with a '/'. For URLs with a query and an empty path (http://example.com?a=1): - fix urlnormalizer-basic to add the missing slash (http://example.com/?a=1) - fix protocol-http to send a correct "GET /?a=1 ..." request
     add 73d082e  NUTCH-2556 protocol-http makes invalid HTTP/1.0 requests - use HTTP/1.1 by default (setting http.useHttp11 = false will send HTTP/1.0 requests)
     add 957306a  NUTCH-2564 protocol-http throws an error when the content-length header is not a number - ignore invalid Content-Length header (log warning instead of throwing exception)
     add 9e212a2  NUTCH-2559 protocol-http cannot handle colons after the HTTP status code (patch contributed by Gerard Bouchar)
     add 146a76c  NUTCH-2558 protocol-http cannot handle a missing HTTP status line; NUTCH-2561 protocol-http can be made to read arbitrarily large HTTP responses - if parsing HTTP status line fails: log warning, push back input, assume status 200 OK (patch contributed by Gerard Bouchar) - limit max. length of HTTP header lines: 2 kB for status line, Http.BUFFER_SIZE (8 kB) for HTTP header field lines; throw exception if header line is longer than limit - fix encoding when pushi [...]
     add 381e82f  NUTCH-2563 HTTP header spellchecking issues ("Client-Transfer-Encoding" erroneously corrected to "Transfer-Encoding") - limit max. Levenshtein distance to 3 edit operations - add "Client-Transfer-Encoding" to known header fields
     add d163512  NUTCH-2557 protocol-http fails to follow redirections when HTTP response body is invalid (patch contributed by Gerard Bouchar) - catch exceptions while reading payload - if response code is not "200 OK": ignore exception but reset content
     add a2771dc  NUTCH-2560 protocol-http throws an error when an http header spans over multiple lines - add unit test to verify that multi-line headers are correctly parsed
     add 2e485cf  NUTCH-2549 protocol-http does not behave the same as browsers - be conformant with RFC 7230 and signal that connection is closed after response (patch contributed by Gerard Bouchar)
     new 106df96  Merge pull request #347 from sebastian-nagel/NUTCH-2549-protocol-http-fixes
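
For readers unfamiliar with NUTCH-2555 above, the following is a minimal sketch of the two fixes it describes: inserting the missing slash when a URL has a query but an empty path, and building a request target that starts with "/". It is not the code from urlnormalizer-basic or protocol-http. The class and method names (EmptyPathSketch, addMissingSlash, requestTarget) are made up for illustration, and the sketch relies only on java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative only; not the code from BasicURLNormalizer or protocol-http. */
class EmptyPathSketch {

  /** Inserts "/" when an absolute URL has an authority but an empty path. */
  static String addMissingSlash(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // this sketch leaves unparsable input untouched
    }
    String path = uri.getRawPath();
    if (uri.getScheme() == null || uri.getRawAuthority() == null
        || (path != null && !path.isEmpty())) {
      return url; // relative, opaque, or already has a path
    }
    StringBuilder sb = new StringBuilder();
    sb.append(uri.getScheme()).append("://")
      .append(uri.getRawAuthority()).append('/');
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    return sb.toString();
  }

  /** Builds the request target for the HTTP request line, e.g. "/?a=1". */
  static String requestTarget(String url) {
    URI uri = URI.create(url);
    String path = uri.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/"; // an empty path must be sent as "/"
    }
    String query = uri.getRawQuery();
    return query == null ? path : path + "?" + query;
  }

  public static void main(String[] args) {
    String url = "http://example.com?a=1";
    System.out.println(addMissingSlash(url));                   // http://example.com/?a=1
    System.out.println("GET " + requestTarget(url) + " HTTP/1.1"); // GET /?a=1 HTTP/1.1
  }
}
```

The sketch also adds the slash when there is no query at all (http://example.com becomes http://example.com/); whether the plugin treated that case separately is not stated in the commit message.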

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
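
The NUTCH-2563 entry in the list above limits header-name spell correction to a Levenshtein distance of 3. The sketch below shows why such a cap prevents "Client-Transfer-Encoding" from being rewritten to "Transfer-Encoding": the two names are 7 edits apart. This is an assumption-laden illustration, not the code in SpellCheckedMetadata. The class name, the MAX_DISTANCE constant, and the case-insensitive comparison are all choices made here, and the distance function is a plain textbook implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative only; not the implementation in SpellCheckedMetadata. */
class HeaderSpellCheckSketch {

  /** Maximum number of edits accepted for a correction (from the commit message). */
  static final int MAX_DISTANCE = 3;

  /** Classic two-row dynamic-programming Levenshtein distance. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                           prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /** Returns the closest known name within MAX_DISTANCE edits, else the input. */
  static String correct(String name, List<String> known) {
    String best = name;
    int bestDistance = MAX_DISTANCE + 1;
    String lower = name.toLowerCase(Locale.ROOT);
    for (String candidate : known) {
      int d = levenshtein(lower, candidate.toLowerCase(Locale.ROOT));
      if (d < bestDistance) {
        best = candidate;
        bestDistance = d;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Hypothetical list of known header names, for demonstration only.
    List<String> known = Arrays.asList("Transfer-Encoding", "Content-Length");
    System.out.println(correct("Content-Lenght", known));           // Content-Length
    System.out.println(correct("Client-Transfer-Encoding", known)); // Client-Transfer-Encoding
  }
}
```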


Summary of changes:
 conf/nutch-default.xml                             |   6 +-
 .../org/apache/nutch/metadata/HttpHeaders.java     |   2 +
 .../nutch/metadata/SpellCheckedMetadata.java       |   7 +-
 .../apache/nutch/protocol/http/HttpResponse.java   | 161 ++++++++++++++-------
 .../src/test/conf/nutch-site-test.xml              |   8 +-
 .../protocol/http}/TestBadServerResponses.java     |  13 +-
 .../urlnormalizer/basic/BasicURLNormalizer.java    |  20 ++-
 .../basic/TestBasicURLNormalizer.java              |   2 +
 8 files changed, 143 insertions(+), 76 deletions(-)
 copy src/plugin/{protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp => protocol-http/src/test/org/apache/nutch/protocol/http}/TestBadServerResponses.java (97%)
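
Most of the HttpResponse.java changes summarized above make header parsing more tolerant. As one example, NUTCH-2564 ignores a Content-Length header that is not a number and logs a warning instead of throwing. A minimal sketch of that behaviour follows. It is not the code from the commit: the class name, the method name, the use of java.util.logging, the -1 "unknown" return value, and the treatment of negative values as invalid are all assumptions made for illustration.

```java
import java.util.logging.Logger;

/** Illustrative only; not the code from HttpResponse. */
class ContentLengthSketch {

  private static final Logger LOG =
      Logger.getLogger(ContentLengthSketch.class.getName());

  /**
   * Returns the declared length, or -1 if the header is missing,
   * negative, or not a valid number (a warning is logged, nothing is thrown).
   */
  static long parseContentLength(String headerValue) {
    if (headerValue == null) {
      return -1;
    }
    try {
      long length = Long.parseLong(headerValue.trim());
      if (length >= 0) {
        return length;
      }
    } catch (NumberFormatException e) {
      // fall through to the warning below
    }
    LOG.warning("Ignoring invalid Content-Length: " + headerValue);
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(parseContentLength("1024")); // 1024
    System.out.println(parseContentLength("abc"));  // -1
  }
}
```

A caller that receives -1 would then read the body until the connection closes or a configured size limit is reached, as it would for a response without any Content-Length header.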

-- 
To stop receiving notification emails like this one, please contact
snagel@apache.org.

[nutch] 01/01: Merge pull request #347 from sebastian-nagel/NUTCH-2549-protocol-http-fixes

Posted by sn...@apache.org.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 106df966444311aa7d35b443105d00173bdc4847
Merge: 4bcaeeb 2e485cf
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Tue Jun 12 18:04:42 2018 +0200

    Merge pull request #347 from sebastian-nagel/NUTCH-2549-protocol-http-fixes
    
    NUTCH-2549  protocol-http does not behave the same as browsers

 conf/nutch-default.xml                             |   6 +-
 .../org/apache/nutch/metadata/HttpHeaders.java     |   2 +
 .../nutch/metadata/SpellCheckedMetadata.java       |   7 +-
 .../apache/nutch/protocol/http/HttpResponse.java   | 161 +++++++----
 .../src/test/conf/nutch-site-test.xml              |   8 +-
 .../protocol/http/TestBadServerResponses.java      | 313 +++++++++++++++++++++
 .../urlnormalizer/basic/BasicURLNormalizer.java    |  20 +-
 .../basic/TestBasicURLNormalizer.java              |   2 +
 8 files changed, 452 insertions(+), 67 deletions(-)
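
The merged commits include NUTCH-2561, which caps the length of HTTP header lines (2 kB for the status line, 8 kB for header field lines) and throws an exception when a line exceeds the limit. The sketch below shows one way such a bounded line reader could look. It is not the code from HttpResponse.java: the class and method names are hypothetical, the exception type is assumed to be a plain IOException, and bytes are mapped to characters one-to-one (ISO-8859-1 style) to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative only; not the code from HttpResponse. */
class BoundedLineReaderSketch {

  /**
   * Reads one line terminated by LF (or end of stream) and strips a trailing CR.
   * Throws IOException if more than maxLength bytes precede the LF.
   */
  static String readLine(InputStream in, int maxLength) throws IOException {
    StringBuilder line = new StringBuilder();
    int c;
    while ((c = in.read()) != -1) {
      if (c == '\n') {
        break;
      }
      if (line.length() >= maxLength) {
        throw new IOException(
            "HTTP header line exceeds " + maxLength + " bytes");
      }
      line.append((char) c); // one byte maps to one char
    }
    int len = line.length();
    if (len > 0 && line.charAt(len - 1) == '\r') {
      line.setLength(len - 1);
    }
    return line.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] response = "HTTP/1.1 200 OK\r\nServer: test\r\n\r\n"
        .getBytes(StandardCharsets.ISO_8859_1);
    InputStream in = new ByteArrayInputStream(response);
    System.out.println(readLine(in, 2048)); // HTTP/1.1 200 OK
    System.out.println(readLine(in, 8192)); // Server: test
  }
}
```

Bounding each line keeps a misbehaving server from forcing the fetcher to buffer an arbitrarily long header before any content limit applies.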
