You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oodt.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2014/10/17 01:46:50 UTC
Fwd: Directed Research Weekly Report from 2014/09/29 - 2014/10/05


Sent from my iPhone

Begin forwarded message:

From: MengYing Wang <me...@gmail.com>>
Date: October 16, 2014 at 4:45:56 PM PDT
To: "Verma, Rishi (398J)" <Ri...@jpl.nasa.gov>>
Cc: Christian Alan Mattmann <ma...@usc.edu>>, "Mcgibbney, Lewis J (398J)" <Le...@jpl.nasa.gov>>, "Bryant, Ann C (398J-Affiliate)" <an...@gmail.com>>, "Ramirez, Paul M (398J)" <pa...@jpl.nasa.gov>>, "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>>, Tyler Palsulich <tp...@gmail.com>>
Subject: Re: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

Dear Rishi,

When I try to use the OODT RADiX using the command "mvn clean package -Pfm-solr-catalog", I get the "profile with id: 'fm-solr-catalog' has not been activated" error. Have you by any chance seen this error before? Thank you! Also after the installation, no solr directory is found in my machine too.

$ mvn clean package -Pfm-solr-catalog

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

[INFO] Reactor build order:

[INFO]   Data Management System

......

Best,
Mengying Wang

On Sun, Oct 5, 2014 at 3:05 PM, Verma, Rishi (398J) <Ri...@jpl.nasa.gov>> wrote:
Hi Mengying,

For integrating OODT File Manager with Solr, you have a couple options depending on the type of deployment you are doing and what stage your software is at:

If you’re starting from scratch:
1. Use Vagrant Virtual Machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT
2. Use OODT RADiX for a pre-built deployment directory containing OODT File Manager, Workflow, Resource etc and Solr pre-integrated. RADiX allows for pre-configured OODT deployments, abstracting you from checking out individual OODT modules via source and building them.
   See: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands
   Make sure to build with the command: mvn -Pfm-solr-catalog package  (see read me: http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/README.txt)
3. Connect OODT FM with Solr manually, see: https://cwiki.apache.org/confluence/display/OODT/Integrating+Solr+with+OODT+RADiX

If you already have a deployed OODT FM:
1. Follow these directions: https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide
2. If the above doesn’t work, then use OODT RADiX to create a FM and Solr deployment that works, and copy those directories to your currently deployed production directory.

Thanks - hope that helps!
Rishi

On Oct 5, 2014, at 10:56 AM, MengYing Wang <me...@gmail.com>> wrote:

Dear Prof. Mattmann and Rishi,

Attached is the nutch and solr directories.

[https://ssl.gstatic.com/docs/doclist/images/icon_9_archive_list.png] nutch_solr.zip<https://docs.google.com/file/d/0B7PYVKDpy0jlSnI3U1lFcGY0WnM/edit?usp=drive_web>

As for problem (6), I could use SolrIndexer instead. Following is my File Manager directory.

https://drive.google.com/file/d/0B7PYVKDpy0jlVTk2NWFFY2sycW8/view?usp=sharing

Thank you!

Best,
Mengying Wang



On Sun, Oct 5, 2014 at 9:25 AM, Christian Alan Mattmann <ma...@usc.edu>> wrote:
Thanks Angela. Great work!

Some comments/feedback:

(1) According to
https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
 use the Apache OODT Pushpull to crawl data files from
a remote server to the local machine [Failed, no data files downloaded at
all].
- This problem is not so urgent. Maybe I should use some ftp client tools,
e.g., FileZilla, to download data files in the remote ftp servers.

MY COMMENT: Please send me your PushPull directory zipped up. I will
take a look - Tyler can you also look?

(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use
the Apache Nutch and Solr to crawl and index local data files [Failed,
No data is indexed in Solr].
- This problem is not so urgent. Maybe this feature
only works for the Nutch 2.x. My Nutch version is 1.9. Also I could use
the OODT Crawler to ingest local files.


MY COMMENT: Please send me your nutch + solr directories, zipped up.
I will take a look.

(6) According to
https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+St
art+Guide, integrate the Apache OODT File Manager
with the Apache Solr [Failed, No product information available in the
Solr].
- It doesn't work out. However, I could use (5) to integrate OODT File
Manager and the Solr.


MY COMMENT: Rishi, can you guys help Angela with OODT + Solr + FM? It¹s not
working for her.

Thanks!



++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattmann@usc.edu<ma...@usc.edu>
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: MengYing Wang <me...@gmail.com>>
Date: Saturday, October 4, 2014 at 9:12 PM
To: Chris Mattmann <ma...@usc.edu>>, "Mcgibbney, Lewis J (398J)"
<Le...@jpl.nasa.gov>>
Cc: Annie Bryant <an...@gmail.com>>, "Ramirez, Paul M (398J)"
<pa...@jpl.nasa.gov>>, Chris Mattmann
<Ch...@jpl.nasa.gov>>
Subject: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

>Dear Prof. Mattmann,
>
>
>New status of the previous failed problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>, use the Apache OODT Pushpull to crawl data files from
> a remote server to the local machine [Failed, no data files downloaded
>at all].
>
>
>- This problem is not so urgent. Maybe I should use some ftp client
>tools, e.g., FileZilla, to download data files in the remote ftp servers.
>
>
>(2) Use the Apache OODT Pushpull to crawl webpages [Succeed].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use the Apache Nutch and Solr to crawl and index local data files [Failed,
> No data is indexed in Solr].
>
>
>- This problem is not so urgent. Maybe this feature
> only works for the Nutch 2.x. My Nutch version is 1.9. Also I could use
>the OODT Crawler to ingest local files.
>
>
>(4) Integrate the tike parser with the Apache Nutch [Failed, No tike
>fields available in the Solr].
>
>
>- Still in progress.
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+
>dump+a+File+Manager+Catalog,
> use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to the Apache Solr [Succeed].
>
>
>(6) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, integrate the Apache OODT File Manager
> with the Apache Solr [Failed, No product information available in the
>Solr].
>
>
>- It doesn't work out. However, I could use (5) to integrate OODT File
>Manager and the Solr.
>
>
>So far, I have two ways to crawl remote data and construct indexes in the
>Solr:
>
>
>(1) moving data to the local machine using the FileZilla -> developing
>metadata extractor using the Tika -> crawling the data directory using
>the OODT Crawler -> migrating product information to the Solr uing the
>SolrIndexer
>
>
>(2) crawling websites using the Nutch -> indexing some basic metadata in
>the Solr
>
>
>
>
>Thanks.
>
>
>Best,
>Mengying (Angela) Wang
>
>On Mon, Sep 29, 2014 at 12:22 PM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>In the previous two weeks, I was trying to solve the following problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>, use the Apache OODT Pushpull to crawl data files from
> a remote server to the local machine [Failed, couldn't find the data
>files].
>
>
>(2) Use the Apache OODT Pushpull to crawl webpages [Failed, HttpClient
>ClassNotFoundException].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use the Apache Nutch and Solr to crawl and index local data files [Failed,
> No data files found in Solr].
>
>
>(4) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, search and delete redundant products
> in the Apache OODT File Manager [Succeed].
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help, use
>the Apache OODT Crawler and Tika to extract metadata and then query
> the metadata in the Apache OODT File Manager [Succeed].
>
>
>(6) According to https://wiki.apache.org/nutch/IndexMetatags, use the
>plugins to parse HTML meta tags into separate fields in the Solr index
>[Succeed].
>
>
>(7) Integrate the tike parser with the Apache Nutch to extract metadata
>information which would be indexed in the Solr [Failed, No tike fields
>available in the Solr].
>
>
>(8) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+
>dump+a+File+Manager+Catalog,
> use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to the Apache Solr [Failed, No product information available
>in the Solr].
>
>
>(9) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, integrate the
> Apache OODT File Manager with the Apache Solr [Failed, No product
>information available in the Solr].
>
>
>(10) According to https://lucene.apache.org/solr/4_10_0/tutorial.html,
>explore a simple command line tool for posting, deleting, updating and
>querying
> raw XMLs to the solr server [Succeed].
>
>
>Thank you.
>
>
>
>Best,
>Mengying Wang
>
>
>On Wed, Sep 17, 2014 at 11:44 AM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>For the last week, I was learning the various apache tool tutorials, and
>trying to figure out how to crawl data files in the web, and then build
>up a metadata index for future queries. So far, I have found the
>following two approaches:
>
>
>1: Use the Apache OODT Pushpull to crawl a bunch of data files from some
>remote server to localhost ->  Use the Apache Tika to extract the
>metadata information for each data file ->  Use the Apache OODT File
>Manager to ingest the metadata files ->  Use
> the query_tool script to query the metadata information stored in the
>Apache OODT File Manager
>
>
>We could also achieve the above process by employing the Apache OODT
>CAS-Curator to automatically call the Apache Tika and the Apache File
>Manager, for the details you could refer to
>http://oodt.apache.org/components/maven/curator/user/basic.html
>
>
>2: Use the Apache Nutch to crawl a number of webpages -> Use the Apache
>Solr to do the text queries.
>
>
>However, there are some problems that I am still trying to solve:
>
>
>(1) According to the Apache OODT Pushpull user guide
>(https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guid
>e), data files should
> be downloaded to the staging area. However, when I started the pushpull
>script, I have waited for at least 15 minutes but nothing was downloaded.
>I have checked the remote FTP server, there indeed are some data files.
>-_-!
>
>
>**************************************************************************
>***********
>guest-wireless-207-151-035-013:bin AngelaWang$ ./pushpull
>TRANSFER:
>org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>^C
>**************************************************************************
>***********
>
>
>Also, url-downloader script cannot work because of the java
>   NoClassDefFoundError.
>
>
>**************************************************************************
>***********
>
>guest-wireless-207-151-035-013:bin AngelaWang$ ./url-downloader
>
>http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT
><http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT>
> .
>Exception in thread "main" java.lang.NoClassDefFoundError:
>org/apache/oodt/cas/pushpull/protocol/http/HttpClient
>Caused by: java.lang.ClassNotFoundException:
>org.apache.oodt.cas.pushpull.protocol.http.HttpClient
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>
>**************************************************************************
>***********
>
>
>
>2: According to the Apache OODT Crawler Help
>(https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help), the
>Apache OODT Crawler could be integrated
> with the Apache Tika. However, there is no
>org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor class in
>my Apache OODT Crawler package.
>
>
>3: How to dump the metadata in the Apache OODT File Manager to the Apache
>Solr using the Apache OODT Workflow Manager? I still have no clear answer
>yet.
>
>
>4: According to the Apache Solr Tutorial
>(https://lucene.apache.org/solr/4_10_0/tutorial.html), users should be
>able to add/delete/update documents using post.jar script.
> However, it doesn't work in my machine.
>
>
>**************************************************************************
>***********
>
>guest-wireless-207-151-035-013:exampledocs AngelaWang$ java -jar post.jar
>solr.xml
>SimplePostTool version 1.5
>Posting files to base url
>http://localhost:8983/solr/update <http://localhost:8983/solr/update>
>using content-type application/xml..
>POSTing file solr.xml
>SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for
>url:
>http://localhost:8983/solr/update
>SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
><response>
><lst name="responseHeader"><int name="status">400</int><int
>name="QTime">1</int></lst><lst name="error"><str name="msg">ERROR:
>[doc=SOLR1000] unknown field 'name'</str><int name="code">400</int></lst>
></response>
>SimplePostTool: WARNING: IOException while reading response:
>java.io.IOException: Server returned HTTP response code: 400 for URL:
>http://localhost:8983/solr/update
>1 files indexed.
>COMMITting Solr index changes to
>http://localhost:8983/solr/update..
>Time spent: 0:00:00.032
>
>**************************************************************************
>***********
>
>
>
>Solr logs:
>
>
>**************************************************************************
>***********
>
>6506114 [qtp1314570047-14] ERROR org.apache.solr.core.SolrCore  
>org.apache.solr.common.SolrException: ERROR: [doc=SOLR1000] unknown field
>'name'
>at
>org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185
>)
>at
>org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand
>.java:78)
>at
>org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.j
>ava:238)
>at
>org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.ja
>va:164)
>at
>org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdatePr
>ocessorFactory.java:69)
>.......
>**************************************************************************
>***********
>
>
>
>I will continue to solve the above problems this week, and could we
>discuss the two approaches this Thursday after the class? Many thanks!
>Have a good day!
>
>
>Best,
>Mengying (Angela) Wang
>
>
>
>On Mon, Sep 8, 2014 at 10:32 PM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>For the previous week, I have successfully installed the following
>softwares in my personal computer:
>
>
>1: Apache OODT Catalog and Archive File Management Component:
>http://oodt.apache.org/components/maven/filemgr/user/basic.html
>2: Apache OODT Catalog and Archive Crawling Framework:
>http://oodt.apache.org/components/maven/crawler/user/
>
>3: Apache OODT Catalog and Archive Workflow Management Component:
>http://oodt.apache.org/components/maven/workflow/user/basic.html
>
>4: Apache Solr:
>https://cwiki.apache.org/confluence/display/solr/Installing+Solr
>5: Apache Nutch:
>http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>6: Apache Tika: http://tika.apache.org/0.9/gettingstarted.html
>
>
>This week I will continue to playing with these softwares to figure out
>the following three questions:
>(1) how to get the metadata using Apache OODT or Apache Nutch?
>(2) how to dump the metadata from Apache OODT to Apache Solr?
>(3) how to query the metadata stored in Solr?
>
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>--
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>--
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>--
>Best,
>Mengying (Angela) Wang
>
>
>
>




--
Best,
Mengying (Angela) Wang

---
Rishi Verma
NASA Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive, M/S 158-248
Pasadena, CA 91109
Tel: 1-818-393-5826<tel:1-818-393-5826>




--
Best,
Mengying (Angela) Wang