Posted to commits@whimsical.apache.org by se...@apache.org on 2017/04/27 14:35:23 UTC

[whimsy] branch master updated: Sometimes need to fetch the parent node

This is an automated email from the ASF dual-hosted git repository.

sebb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/whimsy.git

The following commit(s) were added to refs/heads/master by this push:
       new  29a9cd0   Sometimes need to fetch the parent node
29a9cd0 is described below

commit 29a9cd0aeb28341af58f21b2146c4e3867eafe68
Author: Sebb <se...@apache.org>
AuthorDate: Thu Apr 27 15:32:32 2017 +0100

    Sometimes need to fetch the parent node
---
 tools/site-scan.rb | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/tools/site-scan.rb b/tools/site-scan.rb
index 6345c7b..7d6feb0 100755
--- a/tools/site-scan.rb
+++ b/tools/site-scan.rb
@@ -87,17 +87,33 @@ def parse(site, name)
   doc.traverse do |node|
     next unless node.is_a?(Nokogiri::XML::Text)
     # scrub is needed as some sites have invalid UTF-8 bytes
-    txt = node.text.scrub.gsub(/\s+/, ' ').strip
-    if txt =~ /trademarks of [Tt]he Apache Software Foundation/
-      data[:trademarks] = txt
+    txt = node.text.scrub
+    if txt =~ / trademarks /
+      t, p = getText(txt, node)
+      data[:trademarks] = t
+      data[:tradeparent] = p if p
     end
-    if txt =~ /Copyright .+ [Tt]he Apache Software Foundation/
-      data[:copyright] = txt
+    if txt =~ /Copyright /
+      t, p = getText(txt, node)
+      data[:copyright] = t
+      data[:copyparent] = p if p
     end
   end
   return data
 end
 
+# get the text; use parent if text does not appear to be complete
+def getText(txt, node)
+  parent = nil # debug to show where parent needed to be fetched
+  if not txt =~ /Apache Software Foundation/i # have we got all the text?
+    txt = node.parent.text.scrub
+    parent = true
+  end
+  # TODO strip extra text where possible.
+  # Note: both copyright and trademark can be in same text (e.g. Cayenne)
+  return txt.gsub(/\s+/, ' ').strip, parent
+end
+
 $verbose = ARGV.delete '--verbose'
 
 results = {}
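
The parent-node fallback introduced by this commit can be tried in isolation. The following is a minimal sketch, not the code from tools/site-scan.rb: it assumes a hypothetical FakeNode struct standing in for a Nokogiri text node (only text and parent are used), and the name get_text and the sample strings are illustrative only.

```ruby
# Hypothetical stand-in for a Nokogiri text node; only #text and #parent are needed.
FakeNode = Struct.new(:text, :parent)

# Same idea as getText in the diff: if the node's own text does not mention
# the ASF, assume the sentence was split across nodes and use the parent's text.
def get_text(txt, node)
  used_parent = nil # set to true only when the parent had to be fetched
  unless txt =~ /Apache Software Foundation/i
    txt = node.parent.text.scrub
    used_parent = true
  end
  # Collapse runs of whitespace and trim, as the original does
  [txt.gsub(/\s+/, ' ').strip, used_parent]
end

# Case 1: the text node holds only part of the notice, so the parent is used.
parent = FakeNode.new("Copyright 2017 \n The Apache Software Foundation.", nil)
child  = FakeNode.new("Copyright 2017 ", parent)
partial_text, partial_flag = get_text(child.text.scrub, child)

# Case 2: the text node already holds the whole notice, so the parent is not touched.
whole = FakeNode.new("Foo is a trademark of The  Apache Software Foundation.", nil)
whole_text, whole_flag = get_text(whole.text.scrub, whole)
```

In the first case the flag comes back as true, which is what the commit stores under :copyparent or :tradeparent for debugging; in the second case it stays nil and no extra key is recorded.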
