Posted to dev@sdap.apache.org by GitBox <gi...@apache.org> on 2018/04/23 17:38:44 UTC

[GitHub] lewismc closed pull request #2: SDAP-63 Submit MUDROD documentation to SDAP Website

lewismc closed pull request #2: SDAP-63 Submit MUDROD documentation to SDAP Website
URL: https://github.com/apache/incubator-sdap-website/pull/2
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/blog.html b/blog.html
index 50d35a9..7b13316 100644
--- a/blog.html
+++ b/blog.html
@@ -56,6 +56,127 @@ <h1>Blog</h1>
 
 
 
+<a href="/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html"><h2>An introduction to MUDROD vocabulary similarity calculation algorithm</h2></a>
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+<p>Big geospatial data have been produced, archived and made available online, but finding the right data for scientific research and decision-support applications remains a significant challenge. A long-standing problem in data discovery is how to locate, assimilate and utilize the semantic context for a given query. Most past research in the geospatial domain attempts to solve this problem through one of two approaches: 1) building a domain-specific ontology manually; 2) discovering semantic relationships automatically from dataset metadata using machine learning techniques. The former contains rich expert knowledge, but it is static, costly, and labour-intensive; the latter is automatic but prone to noise.</p>
+
+<p>An emerging trend in information science is to take advantage of large-scale user search history, which is dynamic but contains user- and crawler-generated noise. To leverage the benefits of all three approaches while avoiding their weaknesses, this article proposes a novel approach that 1) discovers semantic relationships between vocabulary terms from user clickstreams; 2) refines the methods for calculating similarity from an existing ontology; 3) integrates the results of ontology, metadata, user search history and clickstream analysis to better determine semantic relationships.</p>
+
+<center>
+	<img src="/images/vocabulary.png" />
+	Figure 1. System workflow and architecture
+</center>
+
+<p>The system starts by pre-processing raw web logs, metadata, and the ontology (Figure 1). After the pre-processing step, search history and clickstream data are extracted from the raw logs, selected properties are extracted from the metadata, and ocean-related triples are extracted from the SWEET ontology. These four types of processed data are then passed to their corresponding processors. Once all the processors have finished, the results of the different methods are integrated to produce a final list of the most related terms.</p>
+
+
+
+
+<a href="/weekly/update/2018/04/23/recommendation-algorithms.html"><h2>An introduction to MUDROD recommendation algorithm</h2></a>
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+<p>With recent advances in remote sensing satellites and other sensors, geographic datasets have been growing faster than ever. In response, a number of Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) have been developed to archive those datasets and make them available online. However, finding the right data for scientific research and application development is still a challenge due to a lack of data relevancy information.</p>
+
+<p>Recommendation systems have become extremely common in recent years and are used in a variety of areas to help users quickly find useful information. We propose a recommendation system that improves geographic data discovery by mining metadata and usage logs. Metadata abstracts are processed with natural language processing methods to find semantic relationships between metadata records. Metadata variables are used to calculate spatial and temporal similarity between metadata records. In addition, portal logs are analysed to capture user preferences.</p>
+
+<center>
+	<img src="/images/recommendation.png" />
+	Figure 1. Recommendation workflow
+</center>
+
+<p>The system starts by pre-processing raw web logs and metadata (Figure 1). After the pre-processing step, sessions are reconstructed from the raw web logs and then used to calculate session-based metadata similarity. Metadata are harvested from PO.DAAC web service APIs. Metadata variable values are then converted to a unified unit to calculate metadata content similarity. All of these similarities are calculated offline and stored in Elasticsearch. Once a user views a metadata record, the system finds the top-k related metadata records using hybrid recommendation. The hybrid recommendation module integrates results from the content-based and session-based recommendation methods and ranks the final recommendation list in descending order of similarity.</p>
+
+
+
+
+<a href="/weekly/update/2018/04/23/ranking-algorithms.html"><h2>An introduction to MUDROD ranking algorithm</h2></a>
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+<p>When a user types some keywords into a search engine, there are typically hundreds, or even thousands, of datasets related to the given query. Although a high level of recall can be useful in some cases, the user is usually interested in only a much smaller subset. Current search engines in most geospatial data portals tend to induce end users to focus on a single data characteristic or feature dimension (e.g., spatial resolution), which often results in a less than optimal user experience (Ghose, Ipeirotis, and Li 2012).</p>
+
+<p>To overcome this fundamental ranking problem, we 1) identify a number of ranking features of geospatial data that represent users’ multidimensional preferences by considering semantics, user behaviour, spatial similarity, and static dataset metadata attributes; 2) apply a machine learning method to automatically learn, from a training set, a function capable of ranking geospatial data according to these features.</p>
+
+<p>Within the ranking process, each query is associated with a set of datasets, and each dataset is represented as a feature vector. The eleven features listed below were identified by considering user behaviour, query-text match, and common geospatial metadata attributes.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-dependent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Lucene relevance score</td>
+    </tr>
+    <tr>
+      <td>Semantic popularity</td>
+    </tr>
+    <tr>
+      <td>Spatial Similarity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-independent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Release date</td>
+    </tr>
+    <tr>
+      <td>Processing level</td>
+    </tr>
+    <tr>
+      <td>Version number</td>
+    </tr>
+    <tr>
+      <td>Spatial resolution</td>
+    </tr>
+    <tr>
+      <td>Temporal resolution</td>
+    </tr>
+    <tr>
+      <td>All-time popularity</td>
+    </tr>
+    <tr>
+      <td>Monthly popularity</td>
+    </tr>
+    <tr>
+      <td>User popularity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<p>RankSVM, a well-recognized learning-to-rank approach, is selected to learn the feature weights used to rank search results. In RankSVM (Joachims 2002), ranking is transformed into a pairwise classification task in which a classifier is trained to predict the relative order of pairs of data items.</p>
+
+<center>
+	<img src="/images/ranking.png" />
+	Figure 1. System workflow and architecture
+</center>
+
+<p>The proposed architecture consists of six components: a semantic knowledge base, a geocoding service, a search index, a feature extractor, a learning algorithm, and a ranking model (Figure 1). When a user submits a query, it is converted into a semantic query and a geographical bounding box by the semantic knowledge base and the geocoding service. The search index then returns the top K results for the semantic query combined with the bounding box. After that, the feature extractor extracts the ranking features for each of the search results, including the semantic click score. Once all the features are prepared, the top K results are passed to a pre-trained ranking model, which re-ranks them. As the index in this architecture can be any Lucene-based software, the architecture keeps the data portal loosely coupled and avoids the cost of replacing the existing system.</p>
+
+<p>References:</p>
+<ul>
+  <li>
+    <p>Ghose, Anindya, Panagiotis G. Ipeirotis, and Beibei Li. 2012. “Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content.” Marketing Science 31 (3): 493-520.</p>
+  </li>
+  <li>
+    <p>Joachims, Thorsten. 2002. “Optimizing search engines using clickthrough data.” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.</p>
+  </li>
+</ul>
+
+
+
+
       <!-- footer -->
       <nav class="navbar navbar-default">
         <div class="navbar-header">
diff --git a/images/architecture.jpg b/images/architecture.jpg
new file mode 100644
index 0000000..6d14285
Binary files /dev/null and b/images/architecture.jpg differ
diff --git a/images/cover.jpg b/images/cover.jpg
new file mode 100644
index 0000000..aa39689
Binary files /dev/null and b/images/cover.jpg differ
diff --git a/images/ranking.png b/images/ranking.png
new file mode 100644
index 0000000..17dd504
Binary files /dev/null and b/images/ranking.png differ
diff --git a/images/recommendation.png b/images/recommendation.png
new file mode 100644
index 0000000..2c19d2e
Binary files /dev/null and b/images/recommendation.png differ
diff --git a/images/vocabulary.png b/images/vocabulary.png
new file mode 100644
index 0000000..9f7dda8
Binary files /dev/null and b/images/vocabulary.png differ
diff --git a/source/_posts/2018-04-23-ranking-algorithms.markdown b/source/_posts/2018-04-23-ranking-algorithms.markdown
new file mode 100644
index 0000000..7d0ff25
--- /dev/null
+++ b/source/_posts/2018-04-23-ranking-algorithms.markdown
@@ -0,0 +1,46 @@
+---
+layout: post
+title:  "An introduction to MUDROD ranking algorithm"
+categories: weekly update
+author: Lewis John McGibbney
+---
+
+When a user types some keywords into a search engine, there are typically hundreds, or even thousands, of datasets related to the given query. Although a high level of recall can be useful in some cases, the user is usually interested in only a much smaller subset. Current search engines in most geospatial data portals tend to induce end users to focus on a single data characteristic or feature dimension (e.g., spatial resolution), which often results in a less than optimal user experience (Ghose, Ipeirotis, and Li 2012).
+
+To overcome this fundamental ranking problem, we 1) identify a number of ranking features of geospatial data that represent users’ multidimensional preferences by considering semantics, user behaviour, spatial similarity, and static dataset metadata attributes; 2) apply a machine learning method to automatically learn, from a training set, a function capable of ranking geospatial data according to these features.
+
+Within the ranking process, each query is associated with a set of datasets, and each dataset is represented as a feature vector. The eleven features listed below were identified by considering user behaviour, query-text match, and common geospatial metadata attributes.
+
+  | Query-dependent features        | 
+    | --------   | 
+    | Lucene relevance score        | 
+    | Semantic popularity        |
+    | Spatial Similarity        | 
+  |         |
+  
+  | Query-independent features        | 
+	| --------   | 
+	| Release date        | 
+    | Processing level        | 
+    | Version number        | 
+    | Spatial resolution        | 
+    | Temporal resolution        |
+    | All-time popularity        | 
+    | Monthly popularity        | 
+    | User popularity        | 
+  |         |
+	
+	
+RankSVM, a well-recognized learning-to-rank approach, is selected to learn the feature weights used to rank search results. In RankSVM (Joachims 2002), ranking is transformed into a pairwise classification task in which a classifier is trained to predict the relative order of pairs of data items.
+
+<center>
+	<img src="/images/ranking.png">
+	Figure 1. System workflow and architecture
+</center>
+
+The proposed architecture consists of six components: a semantic knowledge base, a geocoding service, a search index, a feature extractor, a learning algorithm, and a ranking model (Figure 1). When a user submits a query, it is converted into a semantic query and a geographical bounding box by the semantic knowledge base and the geocoding service. The search index then returns the top K results for the semantic query combined with the bounding box. After that, the feature extractor extracts the ranking features for each of the search results, including the semantic click score. Once all the features are prepared, the top K results are passed to a pre-trained ranking model, which re-ranks them. As the index in this architecture can be any Lucene-based software, the architecture keeps the data portal loosely coupled and avoids the cost of replacing the existing system.
+
+References:
+* Ghose, Anindya, Panagiotis G. Ipeirotis, and Beibei Li. 2012. "Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content." Marketing Science 31 (3): 493-520.
+
+* Joachims, Thorsten. 2002. "Optimizing search engines using clickthrough data." In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
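
The pairwise idea behind RankSVM in the ranking post above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the MUDROD implementation: the two feature values, the relevance labels, and all function names are hypothetical, and the weights are fitted with plain subgradient descent on a regularised hinge loss rather than a dedicated RankSVM solver.

```python
from itertools import combinations


def pairwise_transform(features, relevance):
    """Turn one query's results into (difference vector, label) pairs."""
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        if relevance[i] == relevance[j]:
            continue  # ties carry no ordering information
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if relevance[i] > relevance[j] else -1
        pairs.append((diff, label))
    return pairs


def train_linear_ranker(pairs, epochs=200, lr=0.1, reg=0.01):
    """Fit weights by subgradient descent on the L2-regularised hinge loss."""
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for diff, label in pairs:
            margin = label * sum(wi * xi for wi, xi in zip(w, diff))
            for k in range(dim):
                grad = reg * w[k]
                if margin < 1:  # pair is misordered or inside the margin
                    grad -= label * diff[k]
                w[k] -= lr * grad
    return w


def score(w, x):
    """Linear ranking score for a single feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical data: [text relevance, all-time popularity], scaled to [0, 1]
features = [[0.9, 0.2], [0.4, 0.8], [0.1, 0.1]]
relevance = [2, 1, 0]  # higher means more relevant to the query

w = train_linear_ranker(pairwise_transform(features, relevance))
ranked = sorted(range(len(features)), key=lambda i: score(w, features[i]), reverse=True)
print(ranked)
```

Each pair of results with different relevance becomes one training example whose input is the difference of the two feature vectors, so a linear classifier trained on pairs yields a weight vector that can score individual datasets.
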
diff --git a/source/_posts/2018-04-23-recommendation-algorithms.markdown b/source/_posts/2018-04-23-recommendation-algorithms.markdown
new file mode 100644
index 0000000..da0713c
--- /dev/null
+++ b/source/_posts/2018-04-23-recommendation-algorithms.markdown
@@ -0,0 +1,18 @@
+---
+layout: post
+title:  "An introduction to MUDROD recommendation algorithm"
+categories: weekly update
+author: Lewis John McGibbney
+---
+
+With recent advances in remote sensing satellites and other sensors, geographic datasets have been growing faster than ever. In response, a number of Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) have been developed to archive those datasets and make them available online. However, finding the right data for scientific research and application development is still a challenge due to a lack of data relevancy information.
+
+Recommendation systems have become extremely common in recent years and are used in a variety of areas to help users quickly find useful information. We propose a recommendation system that improves geographic data discovery by mining metadata and usage logs. Metadata abstracts are processed with natural language processing methods to find semantic relationships between metadata records. Metadata variables are used to calculate spatial and temporal similarity between metadata records. In addition, portal logs are analysed to capture user preferences.
+
+<center>
+	<img src="/images/recommendation.png">
+	Figure 1. Recommendation workflow
+</center>
+
+
+The system starts by pre-processing raw web logs and metadata (Figure 1). After the pre-processing step, sessions are reconstructed from the raw web logs and then used to calculate session-based metadata similarity. Metadata are harvested from PO.DAAC web service APIs. Metadata variable values are then converted to a unified unit to calculate metadata content similarity. All of these similarities are calculated offline and stored in Elasticsearch. Once a user views a metadata record, the system finds the top-k related metadata records using hybrid recommendation. The hybrid recommendation module integrates results from the content-based and session-based recommendation methods and ranks the final recommendation list in descending order of similarity.
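
The hybrid step in the recommendation post above can be sketched in Python. The post does not say how the content-based and session-based results are combined, so this sketch assumes a simple weighted sum; the function name, the `content_weight` parameter, and the dataset names and scores are all hypothetical.

```python
def hybrid_recommend(content_sim, session_sim, k=3, content_weight=0.5):
    """Return the top-k (dataset, score) pairs, highest combined similarity first.

    Assumption: the two similarity sources are merged with a weighted sum.
    A dataset missing from one source contributes 0.0 for that source.
    """
    candidates = set(content_sim) | set(session_sim)
    combined = {
        name: content_weight * content_sim.get(name, 0.0)
        + (1.0 - content_weight) * session_sim.get(name, 0.0)
        for name in candidates
    }
    # Sort by score descending; break ties by name so the output is deterministic
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical precomputed similarities to the dataset the user is viewing
content_sim = {"dataset_a": 0.8, "dataset_b": 0.4, "dataset_c": 0.1}
session_sim = {"dataset_b": 0.9, "dataset_d": 0.6}

top = hybrid_recommend(content_sim, session_sim, k=2)
print([name for name, _ in top])
```

With equal weights, `dataset_b` ranks first because both sources support it, while `dataset_a` ranks second on content similarity alone.
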
diff --git a/source/_posts/2018-04-23-vocabulary-similarity-algorithm.markdown b/source/_posts/2018-04-23-vocabulary-similarity-algorithm.markdown
new file mode 100644
index 0000000..edd7d36
--- /dev/null
+++ b/source/_posts/2018-04-23-vocabulary-similarity-algorithm.markdown
@@ -0,0 +1,18 @@
+---
+layout: post
+title:  "An introduction to MUDROD vocabulary similarity calculation algorithm"
+categories: weekly update
+author: Lewis John McGibbney
+---
+
+Big geospatial data have been produced, archived and made available online, but finding the right data for scientific research and decision-support applications remains a significant challenge. A long-standing problem in data discovery is how to locate, assimilate and utilize the semantic context for a given query. Most past research in the geospatial domain attempts to solve this problem through one of two approaches: 1) building a domain-specific ontology manually; 2) discovering semantic relationships automatically from dataset metadata using machine learning techniques. The former contains rich expert knowledge, but it is static, costly, and labour-intensive; the latter is automatic but prone to noise.
+
+An emerging trend in information science is to take advantage of large-scale user search history, which is dynamic but contains user- and crawler-generated noise. To leverage the benefits of all three approaches while avoiding their weaknesses, this article proposes a novel approach that 1) discovers semantic relationships between vocabulary terms from user clickstreams; 2) refines the methods for calculating similarity from an existing ontology; 3) integrates the results of ontology, metadata, user search history and clickstream analysis to better determine semantic relationships.
+
+<center>
+	<img src="/images/vocabulary.png">
+	Figure 1. System workflow and architecture
+</center>
+
+
+The system starts by pre-processing raw web logs, metadata, and the ontology (Figure 1). After the pre-processing step, search history and clickstream data are extracted from the raw logs, selected properties are extracted from the metadata, and ocean-related triples are extracted from the SWEET ontology. These four types of processed data are then passed to their corresponding processors. Once all the processors have finished, the results of the different methods are integrated to produce a final list of the most related terms.
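
The final integration step in the vocabulary similarity post above can be sketched in Python. The post does not give the integration formula, so this sketch assumes a weighted average over whichever sources report a score for a term; the function name, source keys, weights, terms, and scores are hypothetical.

```python
def integrate_similarities(per_source, weights, top_n=3):
    """Merge per-source term similarities into one ranked list.

    per_source: {source_name: {term: similarity}}
    weights:    {source_name: weight}
    Assumption: a term's final score is the weighted average over the
    sources that actually report it, so a missing source is not penalised.
    """
    totals, weight_sums = {}, {}
    for source, scores in per_source.items():
        w = weights[source]
        for term, sim in scores.items():
            totals[term] = totals.get(term, 0.0) + w * sim
            weight_sums[term] = weight_sums.get(term, 0.0) + w
    merged = {term: totals[term] / weight_sums[term] for term in totals}
    # Highest similarity first; ties broken alphabetically for determinism
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical similarities to one query term from the four processors
per_source = {
    "search_history": {"wind speed": 0.8, "quikscat": 0.6},
    "clickstream": {"wind speed": 0.6, "ascat": 0.3},
    "metadata": {"quikscat": 0.4},
    "ontology": {"wind speed": 1.0},
}
weights = {name: 1.0 for name in per_source}

related = integrate_similarities(per_source, weights)
print([term for term, _ in related])
```

Averaging only over the sources that report a term is one design choice among several; summing over all four sources would instead favour terms that appear in many of them.
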
diff --git a/source/images/architecture.jpg b/source/images/architecture.jpg
new file mode 100644
index 0000000..6d14285
Binary files /dev/null and b/source/images/architecture.jpg differ
diff --git a/source/images/cover.jpg b/source/images/cover.jpg
new file mode 100644
index 0000000..aa39689
Binary files /dev/null and b/source/images/cover.jpg differ
diff --git a/source/images/ranking.png b/source/images/ranking.png
new file mode 100644
index 0000000..17dd504
Binary files /dev/null and b/source/images/ranking.png differ
diff --git a/source/images/recommendation.png b/source/images/recommendation.png
new file mode 100644
index 0000000..2c19d2e
Binary files /dev/null and b/source/images/recommendation.png differ
diff --git a/source/images/vocabulary.png b/source/images/vocabulary.png
new file mode 100644
index 0000000..9f7dda8
Binary files /dev/null and b/source/images/vocabulary.png differ
diff --git a/weekly/update/2018/04/23/ranking-algorithms.html b/weekly/update/2018/04/23/ranking-algorithms.html
new file mode 100644
index 0000000..8417575
--- /dev/null
+++ b/weekly/update/2018/04/23/ranking-algorithms.html
@@ -0,0 +1,169 @@
+<!DOCTYPE html>
+
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Apache SDAP - Science Data Analytics Platform</title>
+    <link rel="shortcut icon" href="favicon.ico" />
+    <link rel="icon" type="image/png" href="favicon.png" />
+    <link rel="stylesheet" href="css/bootstrap.min.css" />
+    <link rel="stylesheet" href="css/style.css" />
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.min.css" />
+    <!--[if lt IE 9]>
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.ie.min.css" />
+    <![endif]--> 
+  </head>
+  <body>
+    <a class="github-fork-ribbon" href="https://github.com/apache?utf8=✓&q=incubator-sdap&type=&language=" title="Fork me on GitHub" target="_blank" >Fork me on GitHub</a> 
+    <div class="container">
+
+      <div class="logos">
+        <a href="https://incubator.apache.org">
+          <img src="/images/egg-logo.png" class="pull-left" />
+        </a>
+      </div>
+
+      <!-- navigation bar -->
+      <nav class="navbar navbar-default">
+        <div class="container-fluid">
+          <div class="navbar-header">
+            <a class="navbar-brand" href="/">SDAP</a>
+          </div>
+          <div class="navbar-right">
+            <ul class="nav navbar-nav">
+              <li><a href="/docs">Docs</a></li>
+              <li><a href="/blog">Blog</a></li>
+              <li><a href="/team">Team &amp; Community</a></li>
+              <li><a href="/resources">Resources</a></li>
+              <li class="dropdown toggle">
+              	<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache <span class="caret"></span></a>
+                <ul class="dropdown-menu">
+                  <li><a href="http://www.apache.org/foundation/how-it-works.html">Apache Software Foundation</a></li>
+                  <li><a href="http://www.apache.org/licenses/">Apache License</a></li>
+                  <li><a href="http://www.apache.org/foundation/sponsorship">Sponsorship</a></li>
+                  <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+                </ul>
+              </li>
+            </ul>
+          </div>
+        </div>
+      </nav>
+
+
+<h1>An introduction to MUDROD ranking algorithm</h1>
+
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+
+<p>When a user types some keywords into a search engine, there are typically hundreds, or even thousands, of datasets related to the given query. Although a high level of recall can be useful in some cases, the user is usually interested in only a much smaller subset. Current search engines in most geospatial data portals tend to induce end users to focus on a single data characteristic or feature dimension (e.g., spatial resolution), which often results in a less than optimal user experience (Ghose, Ipeirotis, and Li 2012).</p>
+
+<p>To overcome this fundamental ranking problem, we 1) identify a number of ranking features of geospatial data that represent users’ multidimensional preferences by considering semantics, user behaviour, spatial similarity, and static dataset metadata attributes; 2) apply a machine learning method to automatically learn, from a training set, a function capable of ranking geospatial data according to these features.</p>
+
+<p>Within the ranking process, each query is associated with a set of datasets, and each dataset is represented as a feature vector. The eleven features listed below were identified by considering user behaviour, query-text match, and common geospatial metadata attributes.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-dependent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Lucene relevance score</td>
+    </tr>
+    <tr>
+      <td>Semantic popularity</td>
+    </tr>
+    <tr>
+      <td>Spatial Similarity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-independent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Release date</td>
+    </tr>
+    <tr>
+      <td>Processing level</td>
+    </tr>
+    <tr>
+      <td>Version number</td>
+    </tr>
+    <tr>
+      <td>Spatial resolution</td>
+    </tr>
+    <tr>
+      <td>Temporal resolution</td>
+    </tr>
+    <tr>
+      <td>All-time popularity</td>
+    </tr>
+    <tr>
+      <td>Monthly popularity</td>
+    </tr>
+    <tr>
+      <td>User popularity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<p>RankSVM, a well-recognized learning-to-rank approach, is selected to learn the feature weights used to rank search results. In RankSVM (Joachims 2002), ranking is transformed into a pairwise classification task in which a classifier is trained to predict the relative order of pairs of data items.</p>
+
+<center>
+	<img src="/images/ranking.png" />
+	Figure 1. System workflow and architecture
+</center>
+
+<p>The proposed architecture consists of six components: a semantic knowledge base, a geocoding service, a search index, a feature extractor, a learning algorithm, and a ranking model (Figure 1). When a user submits a query, it is converted into a semantic query and a geographical bounding box by the semantic knowledge base and the geocoding service. The search index then returns the top K results for the semantic query combined with the bounding box. After that, the feature extractor extracts the ranking features for each of the search results, including the semantic click score. Once all the features are prepared, the top K results are passed to a pre-trained ranking model, which re-ranks them. As the index in this architecture can be any Lucene-based software, the architecture keeps the data portal loosely coupled and avoids the cost of replacing the existing system.</p>
+
+<p>References:</p>
+<ul>
+  <li>
+    <p>Ghose, Anindya, Panagiotis G. Ipeirotis, and Beibei Li. 2012. “Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content.” Marketing Science 31 (3): 493-520.</p>
+  </li>
+  <li>
+    <p>Joachims, Thorsten. 2002. “Optimizing search engines using clickthrough data.” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.</p>
+  </li>
+</ul>
+
+
+<div>
+
+</div>
+
+<div>
+
+<b>Next:</b> <a href="/weekly/update/2018/04/23/recommendation-algorithms.html">An introduction to MUDROD recommendation algorithm</a>
+
+</div>
+
+      <!-- footer -->
+      <nav class="navbar navbar-default">
+        <div class="navbar-header">
+          <a class="navbar-brand" href="">SDAP</a>
+        </div>
+        <div class="navbar-text pull-right">&copy; 2017 The Apache Software Foundation. Licensed under <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.<br/>
+        Apache SDAP, SDAP, Apache, the Apache feather logo, and the Apache SDAP project logo are trademarks of The Apache Software Foundation.</div>
+        <div class="navbar-text pull-right">Apache SDAP is an effort undergoing <a href="https://incubator.apache.org/">Incubation</a> at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</div>
+      </nav>
+
+      <script src="js/jquery.min.js"></script>
+      <script src="js/bootstrap.min.js"></script>
+    </div>
+  </body>
+</html>
+
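
The query-time flow described in the ranking page above (the index returns the top K results, features are extracted for each, and a pre-trained model re-ranks them) can be sketched in Python. This is a rough sketch that assumes a linear ranking model; the result dictionaries, feature names, weights, and helper names are hypothetical and do not come from the MUDROD code base.

```python
def rerank(results, extract_features, weights):
    """Re-order search results by a linear ranking model's score, best first."""

    def model_score(result):
        feats = extract_features(result)
        return sum(weights[name] * value for name, value in feats.items())

    return sorted(results, key=model_score, reverse=True)


# Hypothetical top-K results in the order a search index might return them
top_k = [
    {"id": "ds1", "lucene_score": 0.9, "monthly_popularity": 0.1},
    {"id": "ds2", "lucene_score": 0.7, "monthly_popularity": 0.9},
    {"id": "ds3", "lucene_score": 0.2, "monthly_popularity": 0.3},
]

# Hypothetical weights standing in for a pre-trained ranking model
weights = {"lucene_score": 1.0, "monthly_popularity": 0.5}


def extract_features(result):
    """Pick out the features the model knows about (stand-in for the feature extractor)."""
    return {name: result[name] for name in weights}


reranked_ids = [r["id"] for r in rerank(top_k, extract_features, weights)]
print(reranked_ids)
```

Because the re-ranker only needs the index's top K results and a feature extractor, it can sit behind any search backend, which is the loose coupling the page describes.
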
diff --git a/weekly/update/2018/04/23/recommendation-algorithms.html b/weekly/update/2018/04/23/recommendation-algorithms.html
new file mode 100644
index 0000000..07181b5
--- /dev/null
+++ b/weekly/update/2018/04/23/recommendation-algorithms.html
@@ -0,0 +1,98 @@
+<!DOCTYPE html>
+
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Apache SDAP - Science Data Analytics Platform</title>
+    <link rel="shortcut icon" href="favicon.ico" />
+    <link rel="icon" type="image/png" href="favicon.png" />
+    <link rel="stylesheet" href="css/bootstrap.min.css" />
+    <link rel="stylesheet" href="css/style.css" />
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.min.css" />
+    <!--[if lt IE 9]>
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.ie.min.css" />
+    <![endif]--> 
+  </head>
+  <body>
+    <a class="github-fork-ribbon" href="https://github.com/apache?utf8=✓&q=incubator-sdap&type=&language=" title="Fork me on GitHub" target="_blank" >Fork me on GitHub</a> 
+    <div class="container">
+
+      <div class="logos">
+        <a href="https://incubator.apache.org">
+          <img src="/images/egg-logo.png" class="pull-left" />
+        </a>
+      </div>
+
+      <!-- navigation bar -->
+      <nav class="navbar navbar-default">
+        <div class="container-fluid">
+          <div class="navbar-header">
+            <a class="navbar-brand" href="/">SDAP</a>
+          </div>
+          <div class="navbar-right">
+            <ul class="nav navbar-nav">
+              <li><a href="/docs">Docs</a></li>
+              <li><a href="/blog">Blog</a></li>
+              <li><a href="/team">Team &amp; Community</a></li>
+              <li><a href="/resources">Resources</a></li>
+              <li class="dropdown toggle">
+              	<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache <span class="caret"></span></a>
+                <ul class="dropdown-menu">
+                  <li><a href="http://www.apache.org/foundation/how-it-works.html">Apache Software Foundation</a></li>
+                  <li><a href="http://www.apache.org/licenses/">Apache License</a></li>
+                  <li><a href="http://www.apache.org/foundation/sponsorship">Sponsorship</a></li>
+                  <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+                </ul>
+              </li>
+            </ul>
+          </div>
+        </div>
+      </nav>
+
+
+<h1>An introduction to MUDROD recommendation algorithm</h1>
+
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+
+<p>With recent advances in remote sensing satellites and other sensors, geographic datasets have been growing faster than ever. In response, a number of Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) have been developed to archive those datasets and make them available online. However, finding the right data for scientific research and application development is still a challenge due to a lack of data relevancy information.</p>
+
+<p>Recommendation systems have become extremely common in recent years and are used in a variety of areas to help users quickly find useful information. We propose a recommendation system that improves geographic data discovery by mining metadata and usage logs. Metadata abstracts are processed with natural language processing methods to find semantic relationships between metadata records. Metadata variables are used to calculate spatial and temporal similarity between metadata records. In addition, portal logs are analysed to capture user preferences.</p>
+
+<center>
+	<img src="/images/recommendation.png" />
+	Figure 1. Recommendation workflow
+</center>
+
+<p>The system starts by pre-processing raw web logs and metadata (Figure 1). After the pre-processing step, sessions are reconstructed from the raw web logs and then used to calculate session-based metadata similarity. Metadata are harvested from PO.DAAC web service APIs. Metadata variable values are then converted to a unified unit so that metadata content similarity can be calculated. All of the above similarities are calculated offline and stored in Elasticsearch. Once a user views a metadata record, the system finds the top-k related metadata records with hybrid recommendation. The hybrid recommendation module integrates results from the content-based and session-based recommendation methods and ranks the final recommendation list in descending order of similarity.</p>
+
+
+<div>
+
+<b>Previous:</b> <a href="/weekly/update/2018/04/23/ranking-algorithms.html">An introduction to MUDROD ranking algorithm</a>
+
+</div>
+
+<div>
+
+<b>Next:</b> <a href="/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html">An introduction to MUDROD vocabulary similarity calculation algorithm</a>
+
+</div>
+
+      <!-- footer -->
+      <nav class="navbar navbar-default">
+        <div class="navbar-header">
+          <a class="navbar-brand" href="">SDAP</a>
+        </div>
+        <div class="navbar-text pull-right">&copy; 2017 The Apache Software Foundation. Licensed under <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.<br/>
+        Apache SDAP, SDAP, Apache, the Apache feather logo, and the Apache SDAP project logo are trademarks of The Apache Software Foundation.</div>
+        <div class="navbar-text pull-right">Apache SDAP is an effort undergoing <a href="https://incubator.apache.org/">Incubation</a> at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</div>
+      </nav>
+
+      <script src="js/jquery.min.js"></script>
+      <script src="js/bootstrap.min.js"></script>
+    </div>
+  </body>
+</html>
+
diff --git a/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html b/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html
new file mode 100644
index 0000000..45d0d2d
--- /dev/null
+++ b/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html
@@ -0,0 +1,96 @@
+<!DOCTYPE html>
+
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Apache SDAP - Science Data Analytics Platform</title>
+    <link rel="shortcut icon" href="favicon.ico" />
+    <link rel="icon" type="image/png" href="favicon.png" />
+    <link rel="stylesheet" href="css/bootstrap.min.css" />
+    <link rel="stylesheet" href="css/style.css" />
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.min.css" />
+    <!--[if lt IE 9]>
+    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.ie.min.css" />
+    <![endif]--> 
+  </head>
+  <body>
+    <a class="github-fork-ribbon" href="https://github.com/apache?utf8=✓&q=incubator-sdap&type=&language=" title="Fork me on GitHub" target="_blank" >Fork me on GitHub</a> 
+    <div class="container">
+
+      <div class="logos">
+        <a href="https://incubator.apache.org">
+          <img src="/images/egg-logo.png" class="pull-left" />
+        </a>
+      </div>
+
+      <!-- navigation bar -->
+      <nav class="navbar navbar-default">
+        <div class="container-fluid">
+          <div class="navbar-header">
+            <a class="navbar-brand" href="/">SDAP</a>
+          </div>
+          <div class="navbar-right">
+            <ul class="nav navbar-nav">
+              <li><a href="/docs">Docs</a></li>
+              <li><a href="/blog">Blog</a></li>
+              <li><a href="/team">Team &amp; Community</a></li>
+              <li><a href="/resources">Resources</a></li>
+              <li class="dropdown toggle">
+              	<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache <span class="caret"></span></a>
+                <ul class="dropdown-menu">
+                  <li><a href="http://www.apache.org/foundation/how-it-works.html">Apache Software Foundation</a></li>
+                  <li><a href="http://www.apache.org/licenses/">Apache License</a></li>
+                  <li><a href="http://www.apache.org/foundation/sponsorship">Sponsorship</a></li>
+                  <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+                </ul>
+              </li>
+            </ul>
+          </div>
+        </div>
+      </nav>
+
+
+<h1>An introduction to MUDROD vocabulary similarity calculation algorithm</h1>
+
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+
+<p>Big geospatial data have been produced, archived and made available online, but finding the right data for scientific research and decision-support applications remains a significant challenge. A long-standing problem in data discovery is how to locate, assimilate and utilize the semantic context for a given query. Most past research in the geospatial domain attempts to solve this problem through two approaches: 1) building a domain-specific ontology manually; 2) discovering semantic relationships through dataset metadata automatically using machine learning techniques. The former contains rich expert knowledge, but it is static, costly, and labour intensive; the latter is automatic but prone to noise.</p>
+
+<p>An emerging trend in information science is to take advantage of large-scale user search history, which is dynamic but contains user and crawler generated noise. Leveraging the benefits of all three approaches and avoiding their weaknesses, a novel approach is proposed in this article to 1) discover vocabulary semantic relationships from user clickstream; 2) refine the similarity calculation methods from existing ontology; 3) integrate the results of ontology, metadata, user search history and clickstream analysis to better determine the semantic relationships.</p>
+
+<center>
+	<img src="/images/vocabulary.png" />
+	Figure 1. System workflow and architecture
+</center>
+
+<p>The system starts by pre-processing raw web logs, metadata, and ontology (Figure 1). After the pre-processing step, search history and clickstream data are extracted from the raw logs, selected properties are extracted from metadata, and ocean-related triples are extracted from the SWEET ontology. These four types of processed data are then put into their corresponding processors as discussed in the last section. Once all the processors finish their jobs, the results of the different methods are integrated to produce a final list of the most related terms.</p>
+
+
+<div>
+
+<b>Previous:</b> <a href="/weekly/update/2018/04/23/recommendation-algorithms.html">An introduction to MUDROD recommendation algorithm</a>
+
+</div>
+
+<div>
+
+</div>
+
+      <!-- footer -->
+      <nav class="navbar navbar-default">
+        <div class="navbar-header">
+          <a class="navbar-brand" href="">SDAP</a>
+        </div>
+        <div class="navbar-text pull-right">&copy; 2017 The Apache Software Foundation. Licensed under <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.<br/>
+        Apache SDAP, SDAP, Apache, the Apache feather logo, and the Apache SDAP project logo are trademarks of The Apache Software Foundation.</div>
+        <div class="navbar-text pull-right">Apache SDAP is an effort undergoing <a href="https://incubator.apache.org/">Incubation</a> at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</div>
+      </nav>
+
+      <script src="js/jquery.min.js"></script>
+      <script src="js/bootstrap.min.js"></script>
+    </div>
+  </body>
+</html>
+