Posted to commits@nutch.apache.org by sn...@apache.org on 2018/06/12 16:04:57 UTC
[nutch] branch master updated (4bcaeeb -> 106df96)
This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.
from 4bcaeeb Merge pull request #328 from sebastian-nagel/nutch-2576-protocol-okhttp
add 4cf9682 NUTCH-2549 protocol-http does not behave the same as browsers - add unit test class to emulate bad HTTP server sending erroneous HTTP headers, etc. - add unit tests for processing of chunked content (test NUTCH-2562 and NUTCH-2575)
add 6239655 NUTCH-2555 URL normalization problem: path not starting with a '/' For URLs with query and an empty path (http://example.com?a=1): - fix urlnormalizer-basic to add the missing slash (http://example.com/?a=1) - fix protocol-http to send a correct "GET /?a=1 ..." request
add 73d082e NUTCH-2556 protocol-http makes invalid HTTP/1.0 requests - use HTTP/1.1 by default (setting http.useHttp11 = false will send HTTP/1.0 requests)
add 957306a NUTCH-2564 protocol-http throws an error when the content-length header is not a number - ignore invalid Content-Length header (log warning instead of throwing exception)
add 9e212a2 NUTCH-2559 protocol-http cannot handle colons after the HTTP status code (patch contributed by Gerard Bouchar)
add 146a76c NUTCH-2558 protocol-http cannot handle a missing HTTP status line NUTCH-2561 protocol-http can be made to read arbitrarily large HTTP responses - if parsing HTTP status line fails: log warning, push back input, assume status 200 OK (patch contributed by Gerard Bouchar) - limit max. length of HTTP header lines - 2 kB for status line - Http.BUFFER_SIZE (8 kB) for HTTP header field lines - throw exception if header line is longer than limit - fix encoding when pushi [...]
add 381e82f NUTCH-2563 HTTP header spellchecking issues ("Client-Transfer-Encoding" erroneously corrected to "Transfer-Encoding") - limit max. Levenshtein distance to 3 edit operations - add "Client-Transfer-Encoding" to known header fields
add d163512 NUTCH-2557 protocol-http fails to follow redirections when HTTP response body is invalid (patch contributed by Gerard Bouchar) - catch exceptions while reading payload - if response code is not "200 OK": ignore exception but reset content
add a2771dc NUTCH-2560 protocol-http throws an error when an http header spans over multiple lines - add unit test to verify that multi-line headers are correctly parsed
add 2e485cf NUTCH-2549 protocol-http does not behave the same as browsers - be conformant with RFC 7230 and signal that connection is closed after response (patch contributed by Gerard Bouchar)
new 106df96 Merge pull request #347 from sebastian-nagel/NUTCH-2549-protocol-http-fixes
The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
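
For readers who want to see the NUTCH-2555 behavior from the revision list in concrete terms, here is a minimal sketch. It is not the code from urlnormalizer-basic or protocol-http: the class name EmptyPathSketch and both method names are hypothetical, and the sketch uses only java.net.URI from the JDK. It assumes the fix amounts to inserting "/" when a URL has a query but an empty path, as the commit message describes.

```java
import java.net.URI;

// Hypothetical illustration only, not the Nutch implementation.
class EmptyPathSketch {

  /**
   * Inserts "/" as the path when a URL has a scheme, an authority,
   * a query and an empty path. All other URLs are returned unchanged.
   */
  static String addMissingSlash(String url) {
    URI uri = URI.create(url); // throws IllegalArgumentException on invalid syntax
    String path = uri.getRawPath();
    boolean emptyPath = path == null || path.isEmpty();
    if (uri.getScheme() == null || uri.getRawAuthority() == null
        || uri.getRawQuery() == null || !emptyPath) {
      return url;
    }
    StringBuilder sb = new StringBuilder();
    sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority());
    sb.append('/').append('?').append(uri.getRawQuery());
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    return sb.toString();
  }

  /** Builds the request line that would be sent for the normalized URL. */
  static String requestLine(String url) {
    URI uri = URI.create(addMissingSlash(url));
    String target = uri.getRawPath();
    if (target == null || target.isEmpty()) {
      target = "/";
    }
    if (uri.getRawQuery() != null) {
      target += "?" + uri.getRawQuery();
    }
    return "GET " + target + " HTTP/1.1";
  }

  public static void main(String[] args) {
    System.out.println(addMissingSlash("http://example.com?a=1"));
    System.out.println(requestLine("http://example.com?a=1"));
  }
}
```

URLs that already have a path, or that have no query, pass through untouched in this sketch, which matches the narrow scope stated in the commit message.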
Summary of changes:
conf/nutch-default.xml                           |   6 +-
.../org/apache/nutch/metadata/HttpHeaders.java   |   2 +
.../nutch/metadata/SpellCheckedMetadata.java     |   7 +-
.../apache/nutch/protocol/http/HttpResponse.java | 161 ++++++++++++++-------
.../src/test/conf/nutch-site-test.xml            |   8 +-
.../protocol/http}/TestBadServerResponses.java   |  13 +-
.../urlnormalizer/basic/BasicURLNormalizer.java  |  20 ++-
.../basic/TestBasicURLNormalizer.java            |   2 +
8 files changed, 143 insertions(+), 76 deletions(-)
copy src/plugin/{protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp => protocol-http/src/test/org/apache/nutch/protocol/http}/TestBadServerResponses.java (97%)
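
The NUTCH-2563 entry above limits header spell correction to a Levenshtein distance of 3 edit operations. The following sketch illustrates why such a cap keeps "Client-Transfer-Encoding" from being rewritten to "Transfer-Encoding". It is an assumption-laden illustration, not the code in SpellCheckedMetadata.java: the class HeaderSpellCheckSketch, the KNOWN list, and the case-insensitive comparison are all hypothetical choices made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only, not the Nutch implementation.
class HeaderSpellCheckSketch {

  /** Maximum number of edit operations accepted for a correction. */
  static final int MAX_DISTANCE = 3;

  /** Small, made-up sample of known header names. */
  static final List<String> KNOWN = Arrays.asList(
      "Content-Length", "Content-Type", "Transfer-Encoding", "Location");

  /** Classic two-row dynamic-programming Levenshtein distance. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
            prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /**
   * Returns the closest known header name if it is within MAX_DISTANCE
   * edits (ignoring case), otherwise the name unchanged.
   */
  static String correct(String name) {
    String best = name;
    int bestDistance = MAX_DISTANCE + 1;
    String lower = name.toLowerCase(Locale.ROOT);
    for (String known : KNOWN) {
      int d = levenshtein(lower, known.toLowerCase(Locale.ROOT));
      if (d < bestDistance) {
        bestDistance = d;
        best = known;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(correct("Content-Lenght"));
    System.out.println(correct("Client-Transfer-Encoding"));
  }
}
```

In this sketch "Content-Lenght" is two edits away from "Content-Length" and gets corrected, while "Client-Transfer-Encoding" is seven edits away from "Transfer-Encoding" and is left alone.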
--
To stop receiving notification emails like this one, please contact
snagel@apache.org.
[nutch] 01/01: Merge pull request #347 from sebastian-nagel/NUTCH-2549-protocol-http-fixes
Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.
snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git
commit 106df966444311aa7d35b443105d00173bdc4847
Merge: 4bcaeeb 2e485cf
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Tue Jun 12 18:04:42 2018 +0200
Merge pull request #347 from sebastian-nagel/NUTCH-2549-protocol-http-fixes
NUTCH-2549 protocol-http does not behave the same as browsers
conf/nutch-default.xml                           |   6 +-
.../org/apache/nutch/metadata/HttpHeaders.java   |   2 +
.../nutch/metadata/SpellCheckedMetadata.java     |   7 +-
.../apache/nutch/protocol/http/HttpResponse.java | 161 +++++++----
.../src/test/conf/nutch-site-test.xml            |   8 +-
.../protocol/http/TestBadServerResponses.java    | 313 +++++++++++++++++++++
.../urlnormalizer/basic/BasicURLNormalizer.java  |  20 +-
.../basic/TestBasicURLNormalizer.java            |   2 +
8 files changed, 452 insertions(+), 67 deletions(-)
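
Most of the HttpResponse.java changes in this merge concern lenient parsing of bad server responses. The sketch below illustrates two of the behaviors named in the merged commits: tolerating a colon after the status code (NUTCH-2559) and assuming status 200 when no valid status line is present (NUTCH-2558). It is a simplified assumption of how such parsing could look, not the code in HttpResponse.java. The class and method names are hypothetical, and the push-back of the unparsed input mentioned in the commit message is left out.

```java
// Hypothetical illustration only, not the Nutch implementation.
class StatusLineSketch {

  /** Status assumed when the status line is missing or cannot be parsed. */
  static final int DEFAULT_STATUS = 200;

  /**
   * Extracts the status code from an HTTP status line. Digits are read
   * up to the first non-digit, so "200: OK" and "200 OK" both yield 200.
   */
  static int parseStatusCode(String line) {
    if (line == null || !line.startsWith("HTTP/")) {
      return DEFAULT_STATUS; // no status line at all
    }
    int start = line.indexOf(' ');
    if (start < 0) {
      return DEFAULT_STATUS; // protocol version without a code
    }
    while (start < line.length() && line.charAt(start) == ' ') {
      start++;
    }
    int end = start;
    while (end < line.length()
        && line.charAt(end) >= '0' && line.charAt(end) <= '9') {
      end++;
    }
    if (end == start) {
      return DEFAULT_STATUS; // no digits where the code should be
    }
    try {
      return Integer.parseInt(line.substring(start, end));
    } catch (NumberFormatException e) {
      return DEFAULT_STATUS; // digit run too long for an int
    }
  }

  public static void main(String[] args) {
    System.out.println(parseStatusCode("HTTP/1.1 404: Not Found"));
    System.out.println(parseStatusCode("<html><body>no headers</body></html>"));
  }
}
```

A real implementation would also log a warning in the fallback cases, as the commit messages state.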