Posted to commits@kudu.apache.org by ad...@apache.org on 2019/11/08 01:51:14 UTC

[kudu] 01/02: thirdparty: add gumbo and gumbo-query

This is an automated email from the ASF dual-hosted git repository.

adar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 13f226b1400b122af16b9d1edb607fbd492546b8
Author: Adar Dembo <ad...@cloudera.com>
AuthorDate: Mon Oct 28 21:31:39 2019 -0700

    thirdparty: add gumbo and gumbo-query
    
    In a follow-on patch I built a simple web crawler for testing the web UI.
    Although it's possible to parse HTML for links via string::find, doing so
    made for some really ugly and brittle code. So I went searching for a
    proper HTML parser.
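
To illustrate why the string::find approach is brittle, here is a minimal
sketch of what such link extraction might look like. It is not the code from
the follow-on patch; NaiveExtractLinks is a hypothetical helper written only
for this example, and it assumes links always appear as href="...".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of Kudu): collects attribute values by
// scanning for the literal text href=" with std::string::find.
std::vector<std::string> NaiveExtractLinks(const std::string& page) {
  static const std::string kMarker = "href=\"";
  std::vector<std::string> links;
  std::string::size_type pos = 0;
  while ((pos = page.find(kMarker, pos)) != std::string::npos) {
    const std::string::size_type start = pos + kMarker.size();
    const std::string::size_type end = page.find('"', start);
    if (end == std::string::npos) {
      break;  // Unterminated attribute value; stop scanning.
    }
    assert(end >= start);  // find() never returns a position before 'start'.
    links.push_back(page.substr(start, end - start));
    pos = end + 1;
  }
  return links;
}
```

Because the scan knows nothing about HTML structure, it skips single-quoted
values such as href='/logs', skips HREF or href = "..." spellings, and still
returns matches that sit inside comments or on tags other than <a>.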
    
    I settled on Gumbo (known as gumbo-parser on GitHub), a Google C library for
    parsing HTML5 [1]. Although it is quite old and hasn't been updated in some
    time, it has been used on Google's web cache and has passed Google's
    internal security review. Plus it has an intuitive API.
    
    To simplify further, I incorporated the gumbo-query C++ library, which adds
    a CSS selector [2] API for Gumbo. This drastically simplifies an operation
    like finding all the links in a page. Sample code:
    
      string page;  // HTML to parse
      gq::CDocument doc;
      doc.parse(page);
      gq::CSelection sel = doc.find("a");
      for (size_t i = 0; i < sel.nodeNum(); i++) {
        string link = sel.nodeAt(i).attribute("href");
        // do stuff with link
      }
    
    Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
    move all of its functionality into a namespace.
    
    1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
    2. https://www.w3schools.com/cssref/css_selectors.asp
    
    Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
    Reviewed-on: http://gerrit.cloudera.org:8080/14572
    Tested-by: Kudu Jenkins
    Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
 CMakeLists.txt                                     |  14 +
 cmake_modules/FindGumboParser.cmake                |  31 ++
 cmake_modules/FindGumboQuery.cmake                 |  31 ++
 thirdparty/LICENSE.txt                             |  10 +
 thirdparty/build-definitions.sh                    |  32 ++
 thirdparty/build-thirdparty.sh                     |  20 ++
 thirdparty/download-thirdparty.sh                  |  15 +
 thirdparty/patches/gumbo-parser-autoconf-263.patch |  28 ++
 thirdparty/patches/gumbo-query-namespace.patch     | 330 +++++++++++++++++++++
 thirdparty/vars.sh                                 |  22 ++
 10 files changed, 533 insertions(+)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8331aa6..75f9be1 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1209,6 +1209,20 @@ ADD_THIRDPARTY_LIB(yaml
   STATIC_LIB "${YAML_STATIC_LIB}"
   SHARED_LIB "${YAML_SHARED_LIB}")
 
+## gumbo-parser
+find_package(GumboParser REQUIRED)
+include_directories(SYSTEM ${GUMBO_PARSER_INCLUDE_DIR})
+ADD_THIRDPARTY_LIB(gumbo-parser
+  STATIC_LIB "${GUMBO_PARSER_STATIC_LIB}"
+  SHARED_LIB "${GUMBO_PARSER_SHARED_LIB}")
+
+## gumbo-query
+find_package(GumboQuery REQUIRED)
+include_directories(SYSTEM ${GUMBO_QUERY_INCLUDE_DIR})
+ADD_THIRDPARTY_LIB(gumbo-query
+  STATIC_LIB "${GUMBO_QUERY_STATIC_LIB}"
+  SHARED_LIB "${GUMBO_QUERY_SHARED_LIB}")
+
 ## Boost
 
 # We use a custom cmake module and not cmake's FindBoost.
diff --git a/cmake_modules/FindGumboParser.cmake b/cmake_modules/FindGumboParser.cmake
new file mode 100644
index 0000000..1cd0c75
--- /dev/null
+++ b/cmake_modules/FindGumboParser.cmake
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+find_path(GUMBO_PARSER_INCLUDE_DIR gumbo.h
+  # make sure we don't accidentally pick up a different version
+  NO_CMAKE_SYSTEM_PATH
+  NO_SYSTEM_ENVIRONMENT_PATH)
+find_library(GUMBO_PARSER_STATIC_LIB libgumbo.a
+  NO_CMAKE_SYSTEM_PATH
+  NO_SYSTEM_ENVIRONMENT_PATH)
+find_library(GUMBO_PARSER_SHARED_LIB libgumbo.so
+  NO_CMAKE_SYSTEM_PATH
+  NO_SYSTEM_ENVIRONMENT_PATH)
+
+include(FindPackageHandleStandardArgs)
+find_package_handle_standard_args(GUMBO_PARSER REQUIRED_VARS
+    GUMBO_PARSER_STATIC_LIB GUMBO_PARSER_SHARED_LIB GUMBO_PARSER_INCLUDE_DIR)
diff --git a/cmake_modules/FindGumboQuery.cmake b/cmake_modules/FindGumboQuery.cmake
new file mode 100644
index 0000000..be0ea4e
--- /dev/null
+++ b/cmake_modules/FindGumboQuery.cmake
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+find_path(GUMBO_QUERY_INCLUDE_DIR gq/Document.h
+  # make sure we don't accidentally pick up a different version
+  NO_CMAKE_SYSTEM_PATH
+  NO_SYSTEM_ENVIRONMENT_PATH)
+find_library(GUMBO_QUERY_STATIC_LIB libgq.a
+  NO_CMAKE_SYSTEM_PATH
+  NO_SYSTEM_ENVIRONMENT_PATH)
+find_library(GUMBO_QUERY_SHARED_LIB libgq.so
+  NO_CMAKE_SYSTEM_PATH
+  NO_SYSTEM_ENVIRONMENT_PATH)
+
+include(FindPackageHandleStandardArgs)
+find_package_handle_standard_args(GUMBO_QUERY REQUIRED_VARS
+    GUMBO_QUERY_STATIC_LIB GUMBO_QUERY_SHARED_LIB GUMBO_QUERY_INCLUDE_DIR)
diff --git a/thirdparty/LICENSE.txt b/thirdparty/LICENSE.txt
index 7c14d0f..8e0a2ef 100644
--- a/thirdparty/LICENSE.txt
+++ b/thirdparty/LICENSE.txt
@@ -689,3 +689,13 @@ NOTE: build-time dependency
 thirdparty/src/chrony-*/: GPLv2 license (https://www.gnu.org/licenses/gpl.html)
 Source: https://chrony.tuxfamily.org/
 NOTE: build-time dependency
+
+--------------------------------------------------------------------------------
+thirdparty/src/gumbo-parser-*/: Apache 2.0 license
+Source: https://github.com/google/gumbo-parser
+NOTE: build-time dependency
+
+--------------------------------------------------------------------------------
+thirdparty/src/gumbo-query-*/: MIT license
+Source: https://github.com/lazytiger/gumbo-query
+NOTE: build-time dependency
diff --git a/thirdparty/build-definitions.sh b/thirdparty/build-definitions.sh
index b640d75..d81b01c 100644
--- a/thirdparty/build-definitions.sh
+++ b/thirdparty/build-definitions.sh
@@ -976,3 +976,35 @@ build_chrony() {
   make -j$PARALLEL $EXTRA_MAKEFLAGS install
   popd
 }
+
+build_gumbo_parser() {
+  GUMBO_PARSER_BDIR=$TP_BUILD_DIR/$GUMBO_PARSER_NAME$MODE_SUFFIX
+  mkdir -p $GUMBO_PARSER_BDIR
+  pushd $GUMBO_PARSER_BDIR
+  CFLAGS="$EXTRA_CFLAGS" \
+    CXXFLAGS="$EXTRA_CXXFLAGS" \
+    LDFLAGS="$EXTRA_LDFLAGS" \
+    LIBS="$EXTRA_LIBS" \
+    $GUMBO_PARSER_SOURCE/configure \
+    --prefix=$PREFIX
+  make -j$PARALLEL $EXTRA_MAKEFLAGS install
+  popd
+}
+
+build_gumbo_query() {
+  GUMBO_QUERY_BDIR=$TP_BUILD_DIR/$GUMBO_QUERY_NAME$MODE_SUFFIX
+  mkdir -p $GUMBO_QUERY_BDIR
+  pushd $GUMBO_QUERY_BDIR
+  rm -Rf CMakeCache.txt CMakeFiles/
+  cmake \
+    -DCMAKE_BUILD_TYPE=release \
+    -DCMAKE_INSTALL_PREFIX=$PREFIX \
+    -DCMAKE_CXX_FLAGS="$EXTRA_CXXFLAGS" \
+    -DCMAKE_EXE_LINKER_FLAGS="$EXTRA_LDFLAGS $EXTRA_LIBS" \
+    -DCMAKE_MODULE_LINKER_FLAGS="$EXTRA_LDFLAGS $EXTRA_LIBS" \
+    -DCMAKE_SHARED_LINKER_FLAGS="$EXTRA_LDFLAGS $EXTRA_LIBS -L$PREFIX/lib -Wl,-rpath,$PREFIX/lib" \
+    $EXTRA_CMAKE_FLAGS \
+    $GUMBO_QUERY_SOURCE
+  ${NINJA:-make} -j$PARALLEL $EXTRA_MAKEFLAGS install
+  popd
+}
diff --git a/thirdparty/build-thirdparty.sh b/thirdparty/build-thirdparty.sh
index 7ea5898..f37ac54 100755
--- a/thirdparty/build-thirdparty.sh
+++ b/thirdparty/build-thirdparty.sh
@@ -101,6 +101,8 @@ else
       "sentry")       F_SENTRY=1 ;;
       "yaml")         F_YAML=1 ;;
       "chrony")       F_CHRONY=1 ;;
+      "gumbo-parser") F_GUMBO_PARSER=1 ;;
+      "gumbo-query")  F_GUMBO_QUERY=1 ;;
       *)              echo "Unknown module: $arg"; exit 1 ;;
     esac
   done
@@ -376,6 +378,15 @@ if [ -n "$F_UNINSTRUMENTED" -o -n "$F_YAML" ]; then
   build_yaml
 fi
 
+if [ -n "$F_UNINSTRUMENTED" -o -n "$F_GUMBO_PARSER" ]; then
+  # Although it's a C library its tests are written in C++.
+  build_gumbo_parser
+fi
+
+if [ -n "$F_UNINSTRUMENTED" -o -n "$F_GUMBO_QUERY" ]; then
+  build_gumbo_query
+fi
+
 restore_env
 
 # If we're on macOS best to exit here, otherwise single dependency builds will try to
@@ -556,6 +567,15 @@ if [ -n "$F_TSAN" -o -n "$F_YAML" ]; then
   build_yaml
 fi
 
+if [ -n "$F_TSAN" -o -n "$F_GUMBO_PARSER" ]; then
+  # Although it's a C library its tests are written in C++.
+  build_gumbo_parser
+fi
+
+if [ -n "$F_TSAN" -o -n "$F_GUMBO_QUERY" ]; then
+  build_gumbo_query
+fi
+
 restore_env
 
 finish
diff --git a/thirdparty/download-thirdparty.sh b/thirdparty/download-thirdparty.sh
index 5f56c48..9f3e2ee 100755
--- a/thirdparty/download-thirdparty.sh
+++ b/thirdparty/download-thirdparty.sh
@@ -435,5 +435,20 @@ fetch_and_patch \
  "patch -p1 < $TP_DIR/patches/chrony-no-superuser.patch" \
  "patch -p1 < $TP_DIR/patches/chrony-reuseport.patch"
 
+GUMBO_PARSER_PATCHLEVEL=1
+fetch_and_patch \
+ $GUMBO_PARSER_NAME.tar.gz \
+ $GUMBO_PARSER_SOURCE \
+ $GUMBO_PARSER_PATCHLEVEL \
+ "patch -p1 < $TP_DIR/patches/gumbo-parser-autoconf-263.patch" \
+ "autoreconf -fvi"
+
+GUMBO_QUERY_PATCHLEVEL=1
+fetch_and_patch \
+ $GUMBO_QUERY_NAME.tar.gz \
+ $GUMBO_QUERY_SOURCE \
+ $GUMBO_QUERY_PATCHLEVEL \
+ "patch -p1 < $TP_DIR/patches/gumbo-query-namespace.patch"
+
 echo "---------------"
 echo "Thirdparty dependencies downloaded successfully"
diff --git a/thirdparty/patches/gumbo-parser-autoconf-263.patch b/thirdparty/patches/gumbo-parser-autoconf-263.patch
new file mode 100644
index 0000000..b4ef30f
--- /dev/null
+++ b/thirdparty/patches/gumbo-parser-autoconf-263.patch
@@ -0,0 +1,28 @@
+commit 36c07df
+Author: Adar Dembo <ad...@cloudera.com>
+Date:   Tue Oct 29 10:13:37 2019 -0700
+
+    Allow building gumbo-parser with autoconf 2.63
+    
+    This is found on CentOS 6.6.
+
+diff --git a/configure.ac b/configure.ac
+index 32dc9b9..c0fb407 100644
+--- a/configure.ac
++++ b/configure.ac
+@@ -1,7 +1,7 @@
+ #                                               -*- Autoconf -*-
+ # Process this file with autoconf to produce a configure script.
+ 
+-AC_PREREQ([2.65])
++AC_PREREQ([2.63])
+ AC_INIT([gumbo], [0.10.1], [jonathan.d.tang@gmail.com])
+ AC_CONFIG_MACRO_DIR([m4])
+ AC_CONFIG_SRCDIR([src/parser.c])
+diff --git a/m4/created b/m4/created
+new file mode 100644
+index 0000000..5159d8b
+--- /dev/null
++++ b/m4/created
+@@ -0,0 +1 @@
++# This file exists so that the m4 directory can exist in the git repo.
diff --git a/thirdparty/patches/gumbo-query-namespace.patch b/thirdparty/patches/gumbo-query-namespace.patch
new file mode 100644
index 0000000..2497417
--- /dev/null
+++ b/thirdparty/patches/gumbo-query-namespace.patch
@@ -0,0 +1,330 @@
+commit 510d242
+Author: Adar Dembo <ad...@cloudera.com>
+Date:   Mon Oct 28 17:12:33 2019 -0700
+
+    encapsulate all classes in gq namespace
+
+diff --git a/src/Document.cpp b/src/Document.cpp
+index 2b200e0..cb9c7bb 100644
+--- a/src/Document.cpp
++++ b/src/Document.cpp
+@@ -15,6 +15,8 @@
+ 
+ #include "Document.h"
+ 
++namespace gq {
++
+ CDocument::CDocument()
+ {
+ 	mpOutput = NULL;
+@@ -51,5 +53,7 @@ void CDocument::reset()
+ 	}
+ }
+ 
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+ 
+diff --git a/src/Document.h b/src/Document.h
+index 92b44d1..d406f01 100644
+--- a/src/Document.h
++++ b/src/Document.h
+@@ -20,6 +20,8 @@
+ #include <string>
+ #include "Selection.h"
+ 
++namespace gq {
++
+ class CDocument: public CObject
+ {
+ 	public:
+@@ -41,6 +43,8 @@ class CDocument: public CObject
+ 		GumboOutput* mpOutput;
+ };
+ 
++} // namespace gq
++
+ #endif /* DOCUMENT_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/Node.cpp b/src/Node.cpp
+index 917db80..b7676d6 100644
+--- a/src/Node.cpp
++++ b/src/Node.cpp
+@@ -17,6 +17,8 @@
+ #include "Selection.h"
+ #include "QueryUtil.h"
+ 
++namespace gq {
++
+ CNode::CNode(GumboNode* apNode)
+ {
+ 	mpNode = apNode;
+@@ -164,4 +166,7 @@ CSelection CNode::find(std::string aSelector)
+ 	CSelection c(mpNode);
+ 	return c.find(aSelector);
+ }
++
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/Node.h b/src/Node.h
+index b0a41e4..dc3b805 100644
+--- a/src/Node.h
++++ b/src/Node.h
+@@ -19,6 +19,8 @@
+ #include <gumbo.h>
+ #include <string>
+ 
++namespace gq {
++
+ class CSelection;
+ 
+ class CNode
+@@ -66,6 +68,8 @@ class CNode
+ 		GumboNode* mpNode;
+ };
+ 
++} // namespace gq
++
+ #endif /* NODE_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/Object.cpp b/src/Object.cpp
+index 7a0c33b..d3a21f1 100644
+--- a/src/Object.cpp
++++ b/src/Object.cpp
+@@ -15,6 +15,8 @@
+ 
+ #include "Object.h"
+ 
++namespace gq {
++
+ CObject::CObject()
+ {
+ 	mReferences = 1;
+@@ -55,5 +57,7 @@ unsigned int CObject::references()
+ 	return mReferences;
+ }
+ 
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+ 
+diff --git a/src/Object.h b/src/Object.h
+index 40aaf32..1152dc7 100644
+--- a/src/Object.h
++++ b/src/Object.h
+@@ -16,6 +16,8 @@
+ #ifndef OBJECT_H_
+ #define OBJECT_H_
+ 
++namespace gq {
++
+ class CObject
+ {
+ 	public:
+@@ -37,6 +39,8 @@ class CObject
+ 		int mReferences;
+ };
+ 
++} // namespace gq
++
+ #endif /* OBJECT_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/Parser.cpp b/src/Parser.cpp
+index bea39b0..092857c 100644
+--- a/src/Parser.cpp
++++ b/src/Parser.cpp
+@@ -17,6 +17,8 @@
+ #include "Selector.h"
+ #include "QueryUtil.h"
+ 
++namespace gq {
++
+ CParser::CParser(std::string aInput)
+ {
+ 	mInput = aInput;
+@@ -980,5 +982,7 @@ std::string CParser::error(std::string message)
+ 	return ret;
+ }
+ 
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+ 
+diff --git a/src/Parser.h b/src/Parser.h
+index dd42060..31b3a62 100644
+--- a/src/Parser.h
++++ b/src/Parser.h
+@@ -20,6 +20,8 @@
+ #include <gumbo.h>
+ #include "Selector.h"
+ 
++namespace gq {
++
+ class CParser
+ {
+ 	private:
+@@ -85,6 +87,8 @@ class CParser
+ 		size_t mOffset;
+ };
+ 
++} // namespace gq
++
+ #endif /* PARSER_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/QueryUtil.cpp b/src/QueryUtil.cpp
+index 037907c..4205b01 100644
+--- a/src/QueryUtil.cpp
++++ b/src/QueryUtil.cpp
+@@ -15,6 +15,8 @@
+ 
+ #include "QueryUtil.h"
+ 
++namespace gq {
++
+ std::string CQueryUtil::tolower(std::string s)
+ {
+ 	for (unsigned int i = 0; i < s.size(); i++)
+@@ -110,4 +112,6 @@ void CQueryUtil::writeNodeText(GumboNode* apNode, std::string& aText)
+ 	}
+ }
+ 
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/QueryUtil.h b/src/QueryUtil.h
+index d51ac71..eafc5c6 100644
+--- a/src/QueryUtil.h
++++ b/src/QueryUtil.h
+@@ -20,6 +20,8 @@
+ #include <string>
+ #include <vector>
+ 
++namespace gq {
++
+ class CQueryUtil
+ {
+ 	public:
+@@ -42,6 +44,8 @@ class CQueryUtil
+ 
+ };
+ 
++} // namespace gq
++
+ #endif /* QUERYUTIL_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/Selection.cpp b/src/Selection.cpp
+index 1b728e0..7ce447e 100644
+--- a/src/Selection.cpp
++++ b/src/Selection.cpp
+@@ -18,6 +18,8 @@
+ #include "QueryUtil.h"
+ #include "Node.h"
+ 
++namespace gq {
++
+ CSelection::CSelection(GumboNode* apNode)
+ {
+ 	mNodes.push_back(apNode);
+@@ -61,5 +63,7 @@ size_t CSelection::nodeNum()
+ 	return mNodes.size();
+ }
+ 
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+ 
+diff --git a/src/Selection.h b/src/Selection.h
+index e661113..9b86992 100644
+--- a/src/Selection.h
++++ b/src/Selection.h
+@@ -21,6 +21,8 @@
+ #include <string>
+ #include <gumbo.h>
+ 
++namespace gq {
++
+ class CNode;
+ 
+ class CSelection: public CObject
+@@ -47,6 +49,8 @@ class CSelection: public CObject
+ 		std::vector<GumboNode*> mNodes;
+ };
+ 
++} // namespace gq
++
+ #endif /* SELECTION_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/src/Selector.cpp b/src/Selector.cpp
+index e79d68a..8a298ae 100644
+--- a/src/Selector.cpp
++++ b/src/Selector.cpp
+@@ -16,6 +16,8 @@
+ #include "Selector.h"
+ #include "QueryUtil.h"
+ 
++namespace gq {
++
+ bool CSelector::match(GumboNode* apNode)
+ {
+ 	switch (mOp)
+@@ -428,5 +430,7 @@ bool CTextSelector::match(GumboNode* apNode)
+ 	return text.find(mValue) != std::string::npos;
+ }
+ 
++} // namespace gq
++
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+ 
+diff --git a/src/Selector.h b/src/Selector.h
+index ee8d188..ac020a9 100644
+--- a/src/Selector.h
++++ b/src/Selector.h
+@@ -21,6 +21,8 @@
+ #include <vector>
+ #include "Object.h"
+ 
++namespace gq {
++
+ class CSelector: public CObject
+ {
+ 
+@@ -273,6 +275,8 @@ class CTextSelector: public CSelector
+ 		TOperator mOp;
+ };
+ 
++} // namespace gq
++
+ #endif /* SELECTOR_H_ */
+ 
+ /* vim: set ts=4 sw=4 sts=4 tw=100 noet: */
+diff --git a/test/ID.cpp b/test/ID.cpp
+index dc84da1..872df38 100644
+--- a/test/ID.cpp
++++ b/test/ID.cpp
+@@ -9,16 +9,16 @@
+ #include "Document.h"
+ #include "Node.h"
+ 
+- using std::string;
++using std::string;
+ 
+ int main(int argc, char * argv[])
+ {
+ 	string page(file_str("test_page.html"));
+ 
+-	CDocument doc;
++	gq::CDocument doc;
+ 	doc.parse(page.c_str());
+ 
+-	CSelection c = doc.find("#start-of-content");
++	gq::CSelection c = doc.find("#start-of-content");
+ 	if(c.nodeNum() > 0)
+ 		return 0;
+ 	return 1;
diff --git a/thirdparty/vars.sh b/thirdparty/vars.sh
index 00e5f93..d0d2e45 100644
--- a/thirdparty/vars.sh
+++ b/thirdparty/vars.sh
@@ -238,3 +238,25 @@ YAML_SOURCE=$TP_SOURCE_DIR/$YAML_NAME
 CHRONY_VERSION=3.5
 CHRONY_NAME=chrony-$CHRONY_VERSION
 CHRONY_SOURCE=$TP_SOURCE_DIR/$CHRONY_NAME
+
+# Hash of the gumbo-parser git revision to use.
+# (from https://github.com/google/gumbo-parser)
+#
+# To re-build this tarball use the following in the gumbo-parser repo:
+#  export NAME=gumbo-parser-$(git rev-parse HEAD)
+#  git archive HEAD --prefix=$NAME/ -o /tmp/$NAME.tar.gz
+#  s3cmd put -P /tmp/$NAME.tar.gz s3://cloudera-thirdparty-libs/$NAME.tar.gz
+GUMBO_PARSER_VERSION=aa91b27b02c0c80c482e24348a457ed7c3c088e0
+GUMBO_PARSER_NAME=gumbo-parser-$GUMBO_PARSER_VERSION
+GUMBO_PARSER_SOURCE=$TP_SOURCE_DIR/$GUMBO_PARSER_NAME
+
+# Hash of the gumbo-query git revision to use.
+# (from https://github.com/lazytiger/gumbo-query)
+#
+# To re-build this tarball use the following in the gumbo-query repo:
+#  export NAME=gumbo-query-$(git rev-parse HEAD)
+#  git archive HEAD --prefix=$NAME/ -o /tmp/$NAME.tar.gz
+#  s3cmd put -P /tmp/$NAME.tar.gz s3://cloudera-thirdparty-libs/$NAME.tar.gz
+GUMBO_QUERY_VERSION=c9f10880b645afccf4fbcd11d2f62a7c01222d2e
+GUMBO_QUERY_NAME=gumbo-query-$GUMBO_QUERY_VERSION
+GUMBO_QUERY_SOURCE=$TP_SOURCE_DIR/$GUMBO_QUERY_NAME