Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2019/10/29 17:48:08 UTC

[kudu-CR] thirdparty: add gumbo and gumbo-query

Hello Alexey Serbin, Andrew Wong,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/14572

to review the following change.


Change subject: thirdparty: add gumbo and gumbo-query
......................................................................

thirdparty: add gumbo and gumbo-query

In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.

I settled on Gumbo (known as gumbo-parser on GitHub) [1], a Google C library
for parsing HTML5. Although it is quite old and hasn't been updated in some
time, it has been used on Google's web cache and has passed Google's
internal security review. Plus it has an intuitive API.

To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:

  string page;  // Holds the HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");  // CSS selector matching every <a>.
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // Do stuff with link.
  }

Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.

1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp

Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
9 files changed, 521 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/14572/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14572
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14572

to look at the new patch set (#3).

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................

thirdparty: add gumbo and gumbo-query

In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.

I settled on Gumbo (known as gumbo-parser on GitHub) [1], a Google C library
for parsing HTML5. Although it is quite old and hasn't been updated in some
time, it has been used on Google's web cache and has passed Google's
internal security review. Plus it has an intuitive API.

To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:

  string page;  // Holds the HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");  // CSS selector matching every <a>.
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // Do stuff with link.
  }

Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.

While adding the new licenses to thirdparty/LICENSE.txt, I did a bit of
cleanup. The only substantive change was moving curl to the "build-time"
dependencies section; it's not part of the source or binary distribution.

1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp

Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/LICENSE.txt
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
10 files changed, 568 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/14572/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 3
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14572/4/thirdparty/LICENSE.txt
File thirdparty/LICENSE.txt:

PS4: 
> Is it possible to separate the non-gumbo changes into a its own patch?  It 
I'll separate the cleanup.

Basically I was already editing this file to add the gumbo stuff, saw these issues, and decided to clean them up.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Comment-Date: Sat, 02 Nov 2019 05:21:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14572

to look at the new patch set (#5).

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................

thirdparty: add gumbo and gumbo-query

In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.

I settled on Gumbo (known as gumbo-parser on GitHub) [1], a Google C library
for parsing HTML5. Although it is quite old and hasn't been updated in some
time, it has been used on Google's web cache and has passed Google's
internal security review. Plus it has an intuitive API.

To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:

  string page;  // Holds the HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");  // CSS selector matching every <a>.
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // Do stuff with link.
  }

Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.

1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp

Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/LICENSE.txt
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
10 files changed, 533 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/14572/5

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 5
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 4: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14572/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14572/4//COMMIT_MSG@18
PS4, Line 18: gumbo-query C++ library
I'm looking at https://github.com/lazytiger/gumbo-query. Is that the right lib?

Can I assume that no license is safe?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Nov 2019 21:02:26 +0000
Gerrit-HasComments: Yes

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 4: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14572/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14572/4//COMMIT_MSG@18
PS4, Line 18: gumbo-query C++ library
> Yeah that's the right lib, but there is a LICENSE file: https://github.com/
Ah, missed that. Thanks for clarifying.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Nov 2019 23:19:25 +0000
Gerrit-HasComments: Yes

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14572/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14572/4//COMMIT_MSG@18
PS4, Line 18: gumbo-query C++ library
> I'm looking at https://github.com/lazytiger/gumbo-query. Is that the right 
Yeah that's the right lib, but there is a LICENSE file: https://github.com/lazytiger/gumbo-query/blob/master/LICENSE.

It's the MIT license, which is compatible with ASF projects: https://www.apache.org/legal/resolved.html#category-a




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Nov 2019 21:48:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 5: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14572/4/thirdparty/LICENSE.txt
File thirdparty/LICENSE.txt:

PS4: 
> I'll separate the cleanup.
Thank you!




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 5
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Sat, 02 Nov 2019 13:38:47 +0000
Gerrit-HasComments: Yes

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14572

to look at the new patch set (#6).

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................

thirdparty: add gumbo and gumbo-query

In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.

I settled on Gumbo (known as gumbo-parser on GitHub) [1], a Google C library
for parsing HTML5. Although it is quite old and hasn't been updated in some
time, it has been used on Google's web cache and has passed Google's
internal security review. Plus it has an intuitive API.

To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:

  string page;  // Holds the HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");  // CSS selector matching every <a>.
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // Do stuff with link.
  }

Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.

1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp

Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/LICENSE.txt
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
10 files changed, 533 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/14572/6

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14572

to look at the new patch set (#2).

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................

thirdparty: add gumbo and gumbo-query

In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.

I settled on Gumbo (known as gumbo-parser on GitHub) [1], a Google C library
for parsing HTML5. Although it is quite old and hasn't been updated in some
time, it has been used on Google's web cache and has passed Google's
internal security review. Plus it has an intuitive API.

To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:

  string page;  // Holds the HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");  // CSS selector matching every <a>.
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // Do stuff with link.
  }

Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.

While adding the new licenses to thirdparty/LICENSE.txt, I did a bit of
cleanup. The only substantive change was moving curl to the "build-time"
dependencies section; it's not part of the source or binary distribution.

1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp

Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/LICENSE.txt
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
10 files changed, 566 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/14572/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 4: Verified+1

Overriding Jenkins, known unrelated test flake.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 01 Nov 2019 05:46:48 +0000
Gerrit-HasComments: No

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................

thirdparty: add gumbo and gumbo-query

In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.

I settled on Gumbo (known as gumbo-parser on GitHub) [1], a Google C library
for parsing HTML5. Although it is quite old and hasn't been updated in some
time, it has been used on Google's web cache and has passed Google's
internal security review. Plus it has an intuitive API.

To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:

  string page;  // Holds the HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");  // CSS selector matching every <a>.
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // Do stuff with link.
  }

Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.

1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp

Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Reviewed-on: http://gerrit.cloudera.org:8080/14572
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/LICENSE.txt
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
10 files changed, 533 insertions(+), 0 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Alexey Serbin: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 7
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14572/4/thirdparty/LICENSE.txt
File thirdparty/LICENSE.txt:

PS4: 
Is it possible to separate the non-gumbo changes into their own patch?  It might be useful if we want to cherry-pick that clean-up into 1.11.0 and the like.  Also, it's not clear how these non-gumbo changes are related to gumbo.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 4
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Comment-Date: Sat, 02 Nov 2019 00:00:22 +0000
Gerrit-HasComments: Yes

[kudu-CR] thirdparty: add gumbo and gumbo-query

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14572 )

Change subject: thirdparty: add gumbo and gumbo-query
......................................................................


Patch Set 6: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 07 Nov 2019 07:28:23 +0000
Gerrit-HasComments: No