Posted to commits@whimsical.apache.org by se...@apache.org on 2017/04/27 15:20:19 UTC

[whimsy] branch master updated: Tweak extraction code

This is an automated email from the ASF dual-hosted git repository.

sebb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/whimsy.git

The following commit(s) were added to refs/heads/master by this push:
       new  f48cec1   Tweak extraction code
f48cec1 is described below

commit f48cec149e609b71f0c3c114cca6736e2b89c962
Author: Sebb <se...@apache.org>
AuthorDate: Thu Apr 27 16:20:18 2017 +0100

    Tweak extraction code
---
 tools/site-scan.rb | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/site-scan.rb b/tools/site-scan.rb
index 81d0285..f7c3468 100755
--- a/tools/site-scan.rb
+++ b/tools/site-scan.rb
@@ -88,10 +88,12 @@ def parse(site, name)
     next unless node.is_a?(Nokogiri::XML::Text)
     # scrub is needed as some sites have invalid UTF-8 bytes
     txt = node.text.scrub
-    if txt =~ / trademarks /
+    # trademarks may appear twice. TODO use array?
+    if txt =~ / trademarks / and not data[:trademarks]
       t, p = getText(txt, node)
       data[:trademarks] = t
       data[:tradeparent] = p if p
+      puts t,p
     end
     if txt =~ /Copyright / or txt =~ /©/
       t, p = getText(txt, node)
@@ -106,7 +108,11 @@ end
 def getText(txt, node)
   parent = nil # debug to show where parent needed to be fetched
   if not txt =~ /Apache Software Foundation/i # have we got all the text?
-    txt = node.parent.text.scrub
+    if node.parent.name == 'a' # e.g. whimsical. such parents don't have extra text.
+      txt = node.parent.parent.text.scrub
+    else
+      txt = node.parent.text.scrub
+    end
     parent = true
   end
   # TODO strip extra text where possible.
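
The getText change above can be illustrated without Nokogiri. The following is only a sketch: Node is a hypothetical Struct standing in for a Nokogiri node (just a tag name, text and parent link), get_text is an illustrative rename, and the sample page content is invented. It shows why a text node wrapped in a link has to look two levels up to find the full sentence.

```ruby
# Hypothetical stand-in for a Nokogiri node: tag name, text content, parent link.
Node = Struct.new(:name, :text, :parent)

# Return [text, parent_used]. If the node's own text does not mention the
# foundation, widen to an enclosing element, as the commit does.
def get_text(txt, node)
  parent_used = nil
  unless txt =~ /Apache Software Foundation/i
    container = node.parent
    # A link that wraps the text carries no extra text, so step up once more.
    container = container.parent if container.name == 'a'
    txt = container.text.scrub
    parent_used = true
  end
  [txt, parent_used]
end

# Invented sample: <p><a>Apache Whimsy</a> ... </p>
para = Node.new('p', 'Apache Whimsy is a trademark of the Apache Software Foundation.', nil)
link = Node.new('a', 'Apache Whimsy', para)
leaf = Node.new('text', 'Apache Whimsy', link)

text, used_parent = get_text(leaf.text, leaf)
puts text
```

With the pre-commit logic the lookup would have stopped at the link and returned only its own text; here the paragraph text is returned and used_parent is set.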

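The first hunk keeps only the first " trademarks " match and leaves a TODO about using an array. As a rough comparison of the two options, assuming two invented text snippets in document order (the hash names first_only and all_matches are illustrative, not from site-scan.rb):

```ruby
# Invented sample of text nodes from one page, in document order.
texts = [
  'Apache and the feather logo are trademarks of the ASF.',
  'Other names may be trademarks of their respective owners.'
]

# Behaviour after the commit: the first match wins, later ones are ignored.
first_only = {}
texts.each do |txt|
  first_only[:trademarks] = txt if txt =~ / trademarks / && !first_only[:trademarks]
end

# Alternative hinted at by the TODO: collect every match in an array.
all_matches = { trademarks: [] }
texts.each do |txt|
  all_matches[:trademarks] << txt if txt =~ / trademarks /
end

puts first_only[:trademarks]
puts all_matches[:trademarks].length
```

The array form keeps every occurrence but changes the shape of the stored value from a string to a list, so anything reading data[:trademarks] would need to be adjusted.
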