Posted to user@oodt.apache.org by "Verma, Rishi (398M)" <Ri...@jpl.nasa.gov> on 2014/10/17 03:09:51 UTC

Re: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

Hey Mengying,

That error usually gets thrown if you invoke a Maven build from a subdirectory not containing the profile definition.

Two things to check for:
* Are you calling ‘mvn clean package -Pfm-solr-catalog’ from the top-level directory of your RADiX installation? i.e. the directory containing a pom.xml file and folders like ‘crawler’, ‘distribution’, ‘extensions’, etc.?
* Are you running OODT version 0.7+?
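One quick way to check the first point is to look for the profile definition in the poms (a sketch; the helper name below is made up):

```shell
# Hypothetical helper: list the pom.xml files (top level and one level down)
# that define the fm-solr-catalog profile. Empty output from your build
# directory means Maven cannot activate that profile there.
find_fm_solr_profile() {
  grep -l '<id>fm-solr-catalog</id>' pom.xml */pom.xml 2>/dev/null
}
```

You can also ask Maven directly with ‘mvn help:all-profiles’, which lists every profile visible from the current directory.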

Thanks,
rishi

On Oct 16, 2014, at 4:45 PM, MengYing Wang <me...@gmail.com> wrote:

Dear Rishi,

When I try to build OODT RADiX using the command "mvn clean package -Pfm-solr-catalog", I get the "profile with id: 'fm-solr-catalog' has not been activated" error. Have you by any chance seen this error before? Also, after the installation, no solr directory is found on my machine either. Thank you!

$ mvn clean package -Pfm-solr-catalog

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

[INFO] Reactor build order:

[INFO]   Data Management System

......

Best,
Mengying Wang

On Sun, Oct 5, 2014 at 3:05 PM, Verma, Rishi (398J) <Ri...@jpl.nasa.gov> wrote:
Hi Mengying,

For integrating OODT File Manager with Solr, you have a couple options depending on the type of deployment you are doing and what stage your software is at:

If you’re starting from scratch:
1. Use Vagrant Virtual Machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT
2. Use OODT RADiX for a pre-built deployment directory containing OODT File Manager, Workflow, Resource, etc., with Solr pre-integrated. RADiX provides pre-configured OODT deployments, so you don't have to check out and build individual OODT modules from source.
   See: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands
   Make sure to build with the command: mvn -Pfm-solr-catalog package  (see the README: http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/README.txt)
3. Connect OODT FM with Solr manually, see: https://cwiki.apache.org/confluence/display/OODT/Integrating+Solr+with+OODT+RADiX

If you already have a deployed OODT FM:
1. Follow these directions: https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide
2. If the above doesn’t work, then use OODT RADiX to create a FM and Solr deployment that works, and copy those directories to your currently deployed production directory.

Thanks - hope that helps!
Rishi

On Oct 5, 2014, at 10:56 AM, MengYing Wang <me...@gmail.com> wrote:

Dear Prof. Mattmann and Rishi,

Attached are the nutch and solr directories: nutch_solr.zip<https://docs.google.com/file/d/0B7PYVKDpy0jlSnI3U1lFcGY0WnM/edit?usp=drive_web>
As for problem (6), I could use SolrIndexer instead. Following is my File Manager directory.

https://drive.google.com/file/d/0B7PYVKDpy0jlVTk2NWFFY2sycW8/view?usp=sharing

Thank you!

Best,
Mengying Wang



On Sun, Oct 5, 2014 at 9:25 AM, Christian Alan Mattmann <ma...@usc.edu> wrote:
Thanks Angela. Great work!

Some comments/feedback:

(1) According to
https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
 use the Apache OODT Pushpull to crawl data files from
a remote server to the local machine [Failed, no data files downloaded at
all].
- This problem is not so urgent. Maybe I should use some ftp client tools,
e.g., FileZilla, to download data files in the remote ftp servers.

MY COMMENT: Please send me your PushPull directory zipped up. I will
take a look - Tyler can you also look?

(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use
the Apache Nutch and Solr to crawl and index local data files [Failed,
No data is indexed in Solr].
- This problem is not so urgent. Maybe this feature
only works for the Nutch 2.x. My Nutch version is 1.9. Also I could use
the OODT Crawler to ingest local files.


MY COMMENT: Please send me your nutch + solr directories, zipped up.
I will take a look.

(6) According to
https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+St
art+Guide, integrate the Apache OODT File Manager
with the Apache Solr [Failed, No product information available in the
Solr].
- It doesn't work out. However, I could use (5) to integrate OODT File
Manager and the Solr.


MY COMMENT: Rishi, can you guys help Angela with OODT + Solr + FM? It's not
working for her.

Thanks!



++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattmann@usc.edu
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: MengYing Wang <me...@gmail.com>
Date: Saturday, October 4, 2014 at 9:12 PM
To: Chris Mattmann <ma...@usc.edu>, "Mcgibbney, Lewis J (398J)"
<Le...@jpl.nasa.gov>
Cc: Annie Bryant <an...@gmail.com>, "Ramirez, Paul M (398J)"
<pa...@jpl.nasa.gov>, Chris Mattmann
<Ch...@jpl.nasa.gov>
Subject: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

>Dear Prof. Mattmann,
>
>
>New status of the previous failed problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>, use the Apache OODT Pushpull to crawl data files from
> a remote server to the local machine [Failed, no data files downloaded
>at all].
>
>
>- This problem is not so urgent. Maybe I should use some ftp client
>tools, e.g., FileZilla, to download data files in the remote ftp servers.
>
>
>(2) Use the Apache OODT Pushpull to crawl webpages [Succeed].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use the Apache Nutch and Solr to crawl and index local data files [Failed,
> No data is indexed in Solr].
>
>
>- This problem is not so urgent. Maybe this feature
> only works for the Nutch 2.x. My Nutch version is 1.9. Also I could use
>the OODT Crawler to ingest local files.
>
>
>(4) Integrate the Tika parser with the Apache Nutch [Failed, No Tika
>fields available in the Solr].
>
>
>- Still in progress.
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+
>dump+a+File+Manager+Catalog,
> use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to the Apache Solr [Succeed].
>
>
>(6) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, integrate the Apache OODT File Manager
> with the Apache Solr [Failed, No product information available in the
>Solr].
>
>
>- It doesn't work out. However, I could use (5) to integrate OODT File
>Manager and the Solr.
>
>
>So far, I have two ways to crawl remote data and construct indexes in the
>Solr:
>
>
>(1) moving data to the local machine using the FileZilla -> developing
>metadata extractor using the Tika -> crawling the data directory using
>the OODT Crawler -> migrating product information to the Solr using the
>SolrIndexer
>
>
>(2) crawling websites using the Nutch -> indexing some basic metadata in
>the Solr
>
>
>
>
>Thanks.
>
>
>Best,
>Mengying (Angela) Wang
>
>On Mon, Sep 29, 2014 at 12:22 PM, MengYing Wang
><me...@gmail.com> wrote:
>
>Dear Prof. Mattmann,
>
>
>In the previous two weeks, I was trying to solve the following problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>, use the Apache OODT Pushpull to crawl data files from
> a remote server to the local machine [Failed, couldn't find the data
>files].
>
>
>(2) Use the Apache OODT Pushpull to crawl webpages [Failed, HttpClient
>ClassNotFoundException].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use the Apache Nutch and Solr to crawl and index local data files [Failed,
> No data files found in Solr].
>
>
>(4) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, search and delete redundant products
> in the Apache OODT File Manager [Succeed].
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help, use
>the Apache OODT Crawler and Tika to extract metadata and then query
> the metadata in the Apache OODT File Manager [Succeed].
>
>
>(6) According to https://wiki.apache.org/nutch/IndexMetatags, use the
>plugins to parse HTML meta tags into separate fields in the Solr index
>[Succeed].
>
>
>(7) Integrate the Tika parser with the Apache Nutch to extract metadata
>information which would be indexed in the Solr [Failed, No Tika fields
>available in the Solr].
>
>
>(8) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+
>dump+a+File+Manager+Catalog,
> use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to the Apache Solr [Failed, No product information available
>in the Solr].
>
>
>(9) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, integrate the
> Apache OODT File Manager with the Apache Solr [Failed, No product
>information available in the Solr].
>
>
>(10) According to https://lucene.apache.org/solr/4_10_0/tutorial.html,
>explore a simple command line tool for posting, deleting, updating and
>querying
> raw XMLs to the solr server [Succeed].
>
>
>Thank you.
>
>
>
>Best,
>Mengying Wang
>
>
>On Wed, Sep 17, 2014 at 11:44 AM, MengYing Wang
><me...@gmail.com> wrote:
>
>Dear Prof. Mattmann,
>
>
>For the last week, I was learning the various Apache tool tutorials, and
>trying to figure out how to crawl data files on the web, and then build
>up a metadata index for future queries. So far, I have found the
>following two approaches:
>
>
>1: Use the Apache OODT Pushpull to crawl a bunch of data files from some
>remote server to localhost ->  Use the Apache Tika to extract the
>metadata information for each data file ->  Use the Apache OODT File
>Manager to ingest the metadata files ->  Use
> the query_tool script to query the metadata information stored in the
>Apache OODT File Manager
>
>
>We could also achieve the above process by employing the Apache OODT
>CAS-Curator to automatically call the Apache Tika and the Apache File
>Manager; for details, refer to
>http://oodt.apache.org/components/maven/curator/user/basic.html
>
>
>2: Use the Apache Nutch to crawl a number of webpages -> Use the Apache
>Solr to do the text queries.
>
>
>However, there are some problems that I am still trying to solve:
>
>
>(1) According to the Apache OODT Pushpull user guide
>(https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guid
>e), data files should
> be downloaded to the staging area. However, when I started the pushpull
>script, I waited for at least 15 minutes but nothing was downloaded.
>I have checked the remote FTP server; there are indeed some data files.
>-_-!
>
>
>**************************************************************************
>***********
>guest-wireless-207-151-035-013:bin AngelaWang$ ./pushpull
>TRANSFER:
>org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>^C
>**************************************************************************
>***********
>
>
>Also, the url-downloader script does not work because of a Java
>NoClassDefFoundError.
>
>
>**************************************************************************
>***********
>
>guest-wireless-207-151-035-013:bin AngelaWang$ ./url-downloader
>
>http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT
>Exception in thread "main" java.lang.NoClassDefFoundError:
>org/apache/oodt/cas/pushpull/protocol/http/HttpClient
>Caused by: java.lang.ClassNotFoundException:
>org.apache.oodt.cas.pushpull.protocol.http.HttpClient
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>
>**************************************************************************
>***********
>
>
>
>2: According to the Apache OODT Crawler Help
>(https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help), the
>Apache OODT Crawler could be integrated
> with the Apache Tika. However, there is no
>org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor class in
>my Apache OODT Crawler package.
>
>
>3: How to dump the metadata in the Apache OODT File Manager to the Apache
>Solr using the Apache OODT Workflow Manager? I still have no clear answer
>yet.
>
>
>4: According to the Apache Solr Tutorial
>(https://lucene.apache.org/solr/4_10_0/tutorial.html), users should be
>able to add/delete/update documents using post.jar script.
> However, it doesn't work on my machine.
>
>
>**************************************************************************
>***********
>
>guest-wireless-207-151-035-013:exampledocs AngelaWang$ java -jar post.jar
>solr.xml
>SimplePostTool version 1.5
>Posting files to base url
>http://localhost:8983/solr/update
>using content-type application/xml..
>POSTing file solr.xml
>SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for
>url:
>http://localhost:8983/solr/update
>SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
><response>
><lst name="responseHeader"><int name="status">400</int><int
>name="QTime">1</int></lst><lst name="error"><str name="msg">ERROR:
>[doc=SOLR1000] unknown field 'name'</str><int name="code">400</int></lst>
></response>
>SimplePostTool: WARNING: IOException while reading response:
>java.io.IOException: Server returned HTTP response code: 400 for URL:
>http://localhost:8983/solr/update
>1 files indexed.
>COMMITting Solr index changes to
>http://localhost:8983/solr/update..
>Time spent: 0:00:00.032
>
>**************************************************************************
>***********
>
>
>
>Solr logs:
>
>
>**************************************************************************
>***********
>
>6506114 [qtp1314570047-14] ERROR org.apache.solr.core.SolrCore -
>org.apache.solr.common.SolrException: ERROR: [doc=SOLR1000] unknown field
>'name'
>at
>org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185
>)
>at
>org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand
>.java:78)
>at
>org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.j
>ava:238)
>at
>org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.ja
>va:164)
>at
>org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdatePr
>ocessorFactory.java:69)
>.......
>**************************************************************************
>***********
>
>
>
>I will continue to solve the above problems this week, and could we
>discuss the two approaches this Thursday after the class? Many thanks!
>Have a good day!
>
>
>Best,
>Mengying (Angela) Wang
>
>
>
>On Mon, Sep 8, 2014 at 10:32 PM, MengYing Wang
><me...@gmail.com> wrote:
>
>Dear Prof. Mattmann,
>
>
>For the previous week, I have successfully installed the following
>software packages on my personal computer:
>
>
>1: Apache OODT Catalog and Archive File Management Component:
>http://oodt.apache.org/components/maven/filemgr/user/basic.html
>2: Apache OODT Catalog and Archive Crawling Framework:
>http://oodt.apache.org/components/maven/crawler/user/
>
>3: Apache OODT Catalog and Archive Workflow Management Component:
>http://oodt.apache.org/components/maven/workflow/user/basic.html
>
>4: Apache Solr:
>https://cwiki.apache.org/confluence/display/solr/Installing+Solr
>5: Apache Nutch:
>http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>6: Apache Tika: http://tika.apache.org/0.9/gettingstarted.html
>
>
>This week I will continue playing with these software packages to figure
>out the following three questions:
>(1) how to get the metadata using Apache OODT or Apache Nutch?
>(2) how to dump the metadata from Apache OODT to Apache Solr?
>(3) how to query the metadata stored in Solr?
>
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>--
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>--
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>--
>Best,
>Mengying (Angela) Wang
>
>
>
>




--
Best,
Mengying (Angela) Wang

---
Rishi Verma
NASA Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive, M/S 158-248
Pasadena, CA 91109
Tel: 1-818-393-5826




--
Best,
Mengying (Angela) Wang



Re: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

Posted by "Verma, Rishi (398M)" <Ri...@jpl.nasa.gov>.
Hi MengYing,

Your CMD1 should not have the ‘-Pfm-solr-catalog’ argument, because that command generates a new project for you, whereas ‘-Pfm-solr-catalog’ should only be used to build the project once it has been generated. You might want to read up a bit on Maven archetypes, which is what OODT RADiX is.
http://maven.apache.org/guides/introduction/introduction-to-archetypes.html

Let me explain it this way; here are the steps to using OODT RADiX:
1. Get a hold of the latest OODT RADiX Maven Archetype (you might have already done this if you have the full OODT source)
    i.e. download the full OODT source and invoke ‘mvn install’ so that you can use the latest RADiX archetype
    http://svn.apache.org/repos/asf/oodt/trunk/
2. Use the OODT RADiX Maven Archetype to generate a new OODT project source folder structure for you (this is the source for your new project!)
    i.e. invoke the command:
    > mvn archetype:generate
(select RADiX from the list of archetypes you see, and follow the prompts)
3. Change into the newly generated directory from above, and build a runnable tar-ball distribution of OODT from the source folder structure you generated
> mvn clean package -Pfm-solr-catalog
4. Take the built tar-ball distribution, and extract it somewhere else for launching OODT
> tar zxf distribution/target/oodt-*.tar.gz -C /usr/local/my-oodt-project
5. Run OODT
> cd /usr/local/my-oodt-project/bin
> ./oodt start

That’s the typical workflow for using RADiX. So the key here is: only use the ‘-Pfm-solr-catalog’ argument when building OODT, not when generating the folder structure.
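Put together, the five steps above look roughly like this (a sketch, not a script to run verbatim: it assumes the RADiX archetype is already installed in your local Maven repo, that the generated project is named ‘oodt’, and that the distribution archive matches oodt-*.tar.gz, which varies by OODT version):

```shell
# Sketch of the RADiX workflow described above. The generated directory
# name comes from the -DartifactId you choose, and the archive name and
# target path are assumptions, not fixed by OODT.
build_and_run_radix() {
  mvn archetype:generate              # step 2: pick RADiX from the list, follow prompts
  cd oodt                             # the generated project directory
  mvn clean package -Pfm-solr-catalog # step 3: build WITH the Solr profile
  mkdir -p /usr/local/my-oodt-project
  tar zxf distribution/target/oodt-*.tar.gz -C /usr/local/my-oodt-project  # step 4
  cd /usr/local/my-oodt-project/bin && ./oodt start   # step 5
}
```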

If you’re starting from scratch:
1. Use Vagrant Virtual Machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT

[ I didn't try this approach ]

You should try this! Because all five steps above are automated for you via the Vagrant machine.

Thanks,
rishi

On Oct 17, 2014, at 11:29 AM, MengYing Wang <me...@gmail.com> wrote:

Dear Rishi,

Actually, in the first command of the tutorial<https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands>: curl -s http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/bin/radix | bash, the "default" profile, rather than "fm-solr-catalog", is activated. So the Solr component is not built.

guest-wireless-207-151-035-005:Downloads AngelaWang$ curl -s http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/bin/radix | bash
[INFO] Scanning for projects...
[INFO] Searching repository for plugin with prefix: 'archetype'.
[INFO] ------------------------------------------------------------------------
[INFO] Building Maven Default Project
[INFO]    task-segment: [archetype:generate] (aggregator-style)
[INFO] ------------------------------------------------------------------------
[INFO] Preparing archetype:generate
[INFO] No goals needed for project - skipping
......


BTW, the "fm-solr-catalog" profile is defined in the filemgr pom.xml<http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/pom.xml> and distribution pom.xml<http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/distribution/pom.xml>.

Best,
Mengying (Angela) Wang

On Thu, Oct 16, 2014 at 7:42 PM, MengYing Wang <me...@gmail.com> wrote:
Dear Rishi,

Thank you for your help.

Yes, I am using OODT 0.7 and running the 'mvn package -Pfm-solr-catalog' command from the top-level directory.

Following are the commands and logs:

Cmd 1:

guest-wireless-207-151-035-005:Downloads AngelaWang$ mvn archetype:generate -Pfm-solr-catalog -DarchetypeGroupId=org.apache.oodt -DarchetypeArtifactId=radix-archetype -DarchetypeVersion=0.6 -Doodt=0.7 -DgroupId=com.mycompany -DartifactId=oodt -Dversion=0.1

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

......

[INFO] BUILD SUCCESSFUL

......

Cmd 2:



guest-wireless-207-151-035-005:Downloads AngelaWang$ cd oodt

Cmd 3:

guest-wireless-207-151-035-005:oodt AngelaWang$ mvn clean package -Pfm-solr-catalog

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

[INFO] Reactor build order:

[INFO]   Data Management System


[INFO]   Extensions

......


Thank you.

Best,

Mengying Wang





On Thu, Oct 16, 2014 at 6:09 PM, Verma, Rishi (398M) <Ri...@jpl.nasa.gov>> wrote:
Hey Mengying,

That error usually gets thrown if you invoke a Maven build from a subdirectory not containing the profile definition.

Two things to check for:
* Are you calling ‘mvn clean package -Pfm-solr-catalog from the top-level directory of your RADiX installation? i.e. the directory containing a pom.xml file and folders like ‘crawler’, ‘distribution’, ‘extensions’, etc ...
* Are you running an OODT version 0.7+?

Thanks,
rishi

On Oct 16, 2014, at 4:45 PM, MengYing Wang <me...@gmail.com>> wrote:

Dear Rishi,

When I try to use the OODT RADiX using the command "mvn clean package -Pfm-solr-catalog", I get the "profile with id: 'fm-solr-catalog' has not been activated" error. Have you by any chance seen this error before? Thank you! Also after the installation, no solr directory is found in my machine too.

$ mvn clean package -Pfm-solr-catalog

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

[INFO] Reactor build order:

[INFO]   Data Management System

......

Best,
Mengying Wang

On Sun, Oct 5, 2014 at 3:05 PM, Verma, Rishi (398J) <Ri...@jpl.nasa.gov>> wrote:
Hi Mengying,

For integrating OODT File Manager with Solr, you have a couple options depending on the type of deployment you are doing and what stage your software is at:

If you’re starting from scratch:
1. Use Vagrant Virtual Machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT
2. Use OODT RADiX for a pre-built deployment directory containing OODT File Manager, Workflow, Resource etc and Solr pre-integrated. RADiX allows for pre-configured OODT deployments, abstracting you from checking out individual OODT modules via source and building them.
   See: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands
   Make sure to build with the command: mvn -Pfm-solr-catalog package  (see read me: http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/README.txt)
3. Connect OODT FM with Solr manually, see: https://cwiki.apache.org/confluence/display/OODT/Integrating+Solr+with+OODT+RADiX

If you already have a deployed OODT FM:
1. Follow these directions: https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide
2. If the above doesn’t work, then use OODT RADiX to create a FM and Solr deployment that works, and copy those directories to your currently deployed production directory.

Thanks - hope that helps!
Rishi

On Oct 5, 2014, at 10:56 AM, MengYing Wang <me...@gmail.com>> wrote:

Dear Prof. Mattmann and Rishi,

Attached is the nutch and solr directories.
​
[https://ssl.gstatic.com/docs/doclist/images/icon_9_archive_list.png] nutch_solr.zip<https://docs.google.com/file/d/0B7PYVKDpy0jlSnI3U1lFcGY0WnM/edit?usp=drive_web>
​
As for problem (6), I could use SolrIndexer instead. Following is my File Manager directory.

https://drive.google.com/file/d/0B7PYVKDpy0jlVTk2NWFFY2sycW8/view?usp=sharing

Thank you!

Best,
Mengying Wang



On Sun, Oct 5, 2014 at 9:25 AM, Christian Alan Mattmann <ma...@usc.edu>> wrote:
Thanks Angela. Great work!

Some comments/feedback:

(1) According to
https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
 use the Apache OODT Pushpull to crawl data files from
a remote server to the local machine [Failed, no data files downloaded at
all].
- This problem is not so urgent. Maybe I should use some ftp client tools,
e.g., FileZilla, to download data files in the remote ftp servers.

MY COMMENT: Please send me your PushPull directory zipped up. I will
take a look - Tyler can you also look?

(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use
the Apache Nutch and Solr to crawl and index local data files [Failed,
No data is indexed in Solr].
- This problem is not so urgent. Maybe this feature
only works for the Nutch 2.x. My Nutch version is 1.9. Also I could use
the OODT Crawler to ingest local files.


MY COMMENT: Please send me your nutch + solr directories, zipped up.
I will take a look.

(6) According to
https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+St
art+Guide, integrate the Apache OODT File Manager
with the Apache Solr [Failed, No product information available in the
Solr].
- It doesn't work out. However, I could use (5) to integrate OODT File
Manager and the Solr.


MY COMMENT: Rishi, can you guys help Angela with OODT + Solr + FM? It¹s not
working for her.

Thanks!



++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattmann@usc.edu<ma...@usc.edu>
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: MengYing Wang <me...@gmail.com>>
Date: Saturday, October 4, 2014 at 9:12 PM
To: Chris Mattmann <ma...@usc.edu>>, "Mcgibbney, Lewis J (398J)"
<Le...@jpl.nasa.gov>>
Cc: Annie Bryant <an...@gmail.com>>, "Ramirez, Paul M (398J)"
<pa...@jpl.nasa.gov>>, Chris Mattmann
<Ch...@jpl.nasa.gov>>
Subject: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

>Dear Prof. Mattmann,
>
>
>New status of the previous failed problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>, use the Apache OODT Pushpull to crawl data files from
> a remote server to the local machine [Failed, no data files downloaded
>at all].
>
>
>- This problem is not so urgent. Maybe I should use some ftp client
>tools, e.g., FileZilla, to download data files in the remote ftp servers.
>
>
>(2) Use the Apache OODT Pushpull to crawl webpages [Succeed].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use the Apache Nutch and Solr to crawl and index local data files [Failed,
> No data is indexed in Solr].
>
>
>- This problem is not so urgent. Maybe this feature
> only works for the Nutch 2.x. My Nutch version is 1.9. Also I could use
>the OODT Crawler to ingest local files.
>
>
>(4) Integrate the tike parser with the Apache Nutch [Failed, No tike
>fields available in the Solr].
>
>
>- Still in progress.
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+
>dump+a+File+Manager+Catalog,
> use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to the Apache Solr [Succeed].
>
>
>(6) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, integrate the Apache OODT File Manager
> with the Apache Solr [Failed, No product information available in the
>Solr].
>
>
>- It doesn't work out. However, I could use (5) to integrate OODT File
>Manager and the Solr.
>
>
>So far, I have two ways to crawl remote data and construct indexes in the
>Solr:
>
>
>(1) moving data to the local machine using the FileZilla -> developing
>metadata extractor using the Tika -> crawling the data directory using
>the OODT Crawler -> migrating product information to the Solr uing the
>SolrIndexer
>
>
>(2) crawling websites using the Nutch -> indexing some basic metadata in
>the Solr
>
>
>
>
>Thanks.
>
>
>Best,
>Mengying (Angela) Wang
>
>On Mon, Sep 29, 2014 at 12:22 PM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>In the previous two weeks, I was trying to solve the following problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>, use the Apache OODT Pushpull to crawl data files from
> a remote server to the local machine [Failed, couldn't find the data
>files].
>
>
>(2) Use the Apache OODT Pushpull to crawl webpages [Failed, HttpClient
>ClassNotFoundException].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use the Apache Nutch and Solr to crawl and index local data files [Failed,
> No data files found in Solr].
>
>
>(4) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, search and delete redundant products
> in the Apache OODT File Manager [Succeed].
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help, use
>the Apache OODT Crawler and Tika to extract metadata and then query
> the metadata in the Apache OODT File Manager [Succeed].
>
>
>(6) According to https://wiki.apache.org/nutch/IndexMetatags, use the
>plugins to parse HTML meta tags into separate fields in the Solr index
>[Succeed].
>
>
>(7) Integrate the tike parser with the Apache Nutch to extract metadata
>information which would be indexed in the Solr [Failed, No tike fields
>available in the Solr].
>
>
>(8) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+
>dump+a+File+Manager+Catalog,
> use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to the Apache Solr [Failed, No product information available
>in the Solr].
>
>
>(9) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+S
>tart+Guide, integrate the
> Apache OODT File Manager with the Apache Solr [Failed, No product
>information available in the Solr].
>
>
>(10) According to https://lucene.apache.org/solr/4_10_0/tutorial.html,
>explore a simple command line tool for posting, deleting, updating and
>querying
> raw XMLs to the solr server [Succeed].
>
>
>Thank you.
>
>
>
>Best,
>Mengying Wang
>
>
>On Wed, Sep 17, 2014 at 11:44 AM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>For the last week, I was learning the various apache tool tutorials, and
>trying to figure out how to crawl data files in the web, and then build
>up a metadata index for future queries. So far, I have found the
>following two approaches:
>
>
>1: Use Apache OODT Push-Pull to crawl a set of data files from a remote
>server to localhost -> Use Apache Tika to extract the metadata for each
>data file -> Use the Apache OODT File Manager to ingest the metadata
>files -> Use the query_tool script to query the metadata stored in the
>Apache OODT File Manager
>
>
>We could also achieve the above process by employing the Apache OODT
>CAS-Curator to call Apache Tika and the File Manager automatically; for
>details, see
>http://oodt.apache.org/components/maven/curator/user/basic.html
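[Editor's note: a minimal sketch of pipeline 1 as shell commands. Everything here is illustrative: the staging path, product name, product type, and server URLs are assumptions, and the client flags follow the OODT 0.x command-line tools but should be checked against your installed version.]

```shell
# Pipeline sketch -- all paths, URLs, and type names below are placeholders.

# 1. Pull files from the remote site into the local staging area
#    (requires a configured remote-site entry in Push-Pull's policy files).
cd "$PUSHPULL_HOME/bin" && ./pushpull

# 2. Extract metadata for a staged file with the Tika CLI.
java -jar tika-app.jar --metadata /data/staging/CATINFO.TXT \
  > /data/staging/CATINFO.TXT.met

# 3. Ingest the file and its metadata companion into the File Manager.
cd "$FILEMGR_HOME/bin" && ./filemgr-client --url http://localhost:9000 \
  --operation --ingestProduct --productName CATINFO.TXT \
  --productStructure Flat --productTypeName GenericFile \
  --metadataFile file:///data/staging/CATINFO.TXT.met \
  --refs file:///data/staging/CATINFO.TXT

# 4. Query the ingested metadata back out of the catalog.
./query_tool --url http://localhost:9000 \
  --sql -query "SELECT * FROM GenericFile"
```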
>
>
>2: Use Apache Nutch to crawl a number of webpages -> Use Apache Solr to
>do the text queries.
>
>
>However, there are some problems that I am still trying to solve:
>
>
>(1) According to the Apache OODT Push-Pull user guide
>(https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide),
>data files should be downloaded to the staging area. However, after
>starting the pushpull script I waited at least 15 minutes and nothing was
>downloaded. I checked the remote FTP server, and there are indeed data
>files on it.
>
>
>*************************************************************************************
>guest-wireless-207-151-035-013:bin AngelaWang$ ./pushpull
>TRANSFER:
>org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>^C
>*************************************************************************************
>
>
>Also, the url-downloader script does not work, because of a Java
>NoClassDefFoundError:
>
>
>*************************************************************************************
>
>guest-wireless-207-151-035-013:bin AngelaWang$ ./url-downloader
>http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT .
>Exception in thread "main" java.lang.NoClassDefFoundError:
>org/apache/oodt/cas/pushpull/protocol/http/HttpClient
>Caused by: java.lang.ClassNotFoundException:
>org.apache.oodt.cas.pushpull.protocol.http.HttpClient
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>
>*************************************************************************************
>
>
>
>(2) According to the Apache OODT Crawler Help
>(https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help), the
>Apache OODT Crawler can be integrated with Apache Tika. However, there is
>no org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor class
>in my Apache OODT Crawler package.
>
>
>(3) How do I dump the metadata in the Apache OODT File Manager to Apache
>Solr using the Apache OODT Workflow Manager? I have no clear answer yet.
>
>
>(4) According to the Apache Solr Tutorial
>(https://lucene.apache.org/solr/4_10_0/tutorial.html), users should be
>able to add/delete/update documents using the post.jar script. However,
>it doesn't work on my machine.
>
>
>*************************************************************************************
>
>guest-wireless-207-151-035-013:exampledocs AngelaWang$ java -jar post.jar solr.xml
>SimplePostTool version 1.5
>Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
>POSTing file solr.xml
>SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/update
>SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
><response>
><lst name="responseHeader"><int name="status">400</int><int name="QTime">1</int></lst><lst name="error"><str name="msg">ERROR: [doc=SOLR1000] unknown field 'name'</str><int name="code">400</int></lst>
></response>
>SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
>1 files indexed.
>COMMITting Solr index changes to http://localhost:8983/solr/update..
>Time spent: 0:00:00.032
>
>*************************************************************************************
>
>
>
>Solr logs:
>
>
>*************************************************************************************
>
>6506114 [qtp1314570047-14] ERROR org.apache.solr.core.SolrCore -
>org.apache.solr.common.SolrException: ERROR: [doc=SOLR1000] unknown field 'name'
>at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
>at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
>at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
>at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
>at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>.......
>*************************************************************************************
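[Editor's note: the "unknown field 'name'" 400 above means the Solr core's schema.xml has no declaration for a field the example document carries. A hedged sketch of the kind of declaration that would be needed follows; the field type text_general is an assumption and should be matched to a type actually defined in your schema.]

```xml
<!-- Hypothetical schema.xml addition: declare the 'name' field that
     document SOLR1000 tries to index; reload or restart the core after
     editing the schema. -->
<field name="name" type="text_general" indexed="true" stored="true"/>
```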
>
>
>
>I will continue working on the above problems this week; could we
>discuss the two approaches this Thursday after class? Many thanks! Have
>a good day!
>
>
>Best,
>Mengying (Angela) Wang
>
>
>
>On Mon, Sep 8, 2014 at 10:32 PM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>For the previous week, I successfully installed the following software
>packages on my personal computer:
>
>
>1: Apache OODT Catalog and Archive File Management Component:
>http://oodt.apache.org/components/maven/filemgr/user/basic.html
>2: Apache OODT Catalog and Archive Crawling Framework:
>http://oodt.apache.org/components/maven/crawler/user/
>
>3: Apache OODT Catalog and Archive Workflow Management Component:
>http://oodt.apache.org/components/maven/workflow/user/basic.html
>
>4: Apache Solr:
>https://cwiki.apache.org/confluence/display/solr/Installing+Solr
>5: Apache Nutch:
>http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>6: Apache Tika: http://tika.apache.org/0.9/gettingstarted.html
>
>
>This week I will continue playing with this software to figure out the
>following three questions:
>(1) How do I get the metadata using Apache OODT or Apache Nutch?
>(2) How do I dump the metadata from Apache OODT to Apache Solr?
>(3) How do I query the metadata stored in Solr?
>
>Best,
>Mengying (Angela) Wang
>
>
>
>
>
>
>
>
>
>
>
>
>
>




--
Best,
Mengying (Angela) Wang

---
Rishi Verma
NASA Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive, M/S 158-248
Pasadena, CA 91109
Tel: 1-818-393-5826<tel:1-818-393-5826>






Re: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

Posted by "Verma, Rishi (398M)" <Ri...@jpl.nasa.gov>.
Hi MengYing,

Your CMD1 should not have the ‘-Pfm-solr-catalog’ argument, because that command generates a new project for you, whereas ‘-Pfm-solr-catalog’ should only be used to build the project once it has been generated. You might want to read up a bit on Maven archetypes, which is what OODT RADiX is:
http://maven.apache.org/guides/introduction/introduction-to-archetypes.html

Let me explain it this way; here are the typical steps for using OODT RADiX:
1. Get hold of the latest OODT RADiX Maven archetype (you may have already done this if you have the full OODT source)
    i.e. download the full OODT source and invoke ‘mvn install’ so that you can use the latest RADiX archetype
    http://svn.apache.org/repos/asf/oodt/trunk/
2. Use the OODT RADiX Maven archetype to generate a new OODT project source folder structure (this is the source for your new project!)
    i.e. invoke the command:
    > mvn archetype:generate
(select RADiX from the list of archetypes you see, and follow the prompts)
3. Change into the newly generated directory, and build a tarball distribution of OODT from the source folder structure you generated:
> mvn clean package -Pfm-solr-catalog
4. Take the built tarball distribution, and extract it somewhere else for launching OODT:
> tar zxf distribution/target/oodt-*.tar.gz -C /usr/local/my-oodt-project
5. Run OODT:
> cd /usr/local/my-oodt-project/bin
> ./oodt start

That’s the typical workflow for using RADiX. So the key here is, only use the ‘-Pfm-solr-catalog’ argument when building OODT, not when generating the folder structure.
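[Editor's note: the five steps above, consolidated into one shell session for reference. The project directory name, install path, and tarball name are placeholders; the actual names depend on the artifactId you choose at the archetype prompts.]

```shell
# Sketch of the RADiX workflow above -- all names and paths are placeholders.
svn co http://svn.apache.org/repos/asf/oodt/trunk/ oodt-src
(cd oodt-src && mvn install)          # step 1: install the RADiX archetype

mvn archetype:generate                # step 2: generate the project
# (pick the RADiX archetype from the list and answer the prompts,
#  e.g. groupId, artifactId, version)

cd my-oodt-project                    # the directory the archetype created
mvn clean package -Pfm-solr-catalog   # step 3: build with the Solr profile

mkdir -p /usr/local/my-oodt-project   # step 4: unpack the distribution
tar zxf distribution/target/oodt-*.tar.gz -C /usr/local/my-oodt-project

cd /usr/local/my-oodt-project/bin && ./oodt start   # step 5: run OODT
```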

If you’re starting from scratch:
1. Use Vagrant Virtual Machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT

[ I didn't try this approach ]

You should try this, because all five steps above are automated for you by the Vagrant machine.

Thanks,
rishi

On Oct 17, 2014, at 11:29 AM, MengYing Wang <me...@gmail.com>> wrote:

Dear Rishi,

Actually, in the first command of the tutorial<https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands>: curl -s http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/bin/radix | bash, the "default" profile is activated instead of the "fm-solr-catalog" profile, so the Solr component is not built.

guest-wireless-207-151-035-005:Downloads AngelaWang$ curl -s http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/bin/radix | bash
[INFO] Scanning for projects...
[INFO] Searching repository for plugin with prefix: 'archetype'.
[INFO] ------------------------------------------------------------------------
[INFO] Building Maven Default Project
[INFO]    task-segment: [archetype:generate] (aggregator-style)
[INFO] ------------------------------------------------------------------------
[INFO] Preparing archetype:generate
[INFO] No goals needed for project - skipping
......


BTW, the "fm-solr-catalog" profile is defined in the filemgr pom.xml<http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/pom.xml> and distribution pom.xml<http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/distribution/pom.xml>.

Best,
Mengying (Angela) Wang

On Thu, Oct 16, 2014 at 7:42 PM, MengYing Wang <me...@gmail.com>> wrote:
Dear Rishi,

Thank you for your help.

Yes, I am using OODT 0.7, and I am running the 'mvn package -Pfm-solr-catalog' command from the top-level directory.

Following are the commands and logs:

Cmd 1:

guest-wireless-207-151-035-005:Downloads AngelaWang$ mvn archetype:generate -Pfm-solr-catalog -DarchetypeGroupId=org.apache.oodt -DarchetypeArtifactId=radix-archetype -DarchetypeVersion=0.6 -Doodt=0.7 -DgroupId=com.mycompany -DartifactId=oodt -Dversion=0.1

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

......

[INFO] BUILD SUCCESSFUL

......

Cmd 2:



guest-wireless-207-151-035-005:Downloads AngelaWang$ cd oodt

Cmd 3:

guest-wireless-207-151-035-005:oodt AngelaWang$ mvn clean package -Pfm-solr-catalog

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

[INFO] Reactor build order:

[INFO]   Data Management System


[INFO]   Extensions

......


Thank you.

Best,

Mengying Wang





On Thu, Oct 16, 2014 at 6:09 PM, Verma, Rishi (398M) <Ri...@jpl.nasa.gov>> wrote:
Hey Mengying,

That error usually gets thrown if you invoke a Maven build from a subdirectory not containing the profile definition.

Two things to check for:
* Are you calling ‘mvn clean package -Pfm-solr-catalog’ from the top-level directory of your RADiX installation? i.e. the directory containing a pom.xml file and folders like ‘crawler’, ‘distribution’, ‘extensions’, etc.
* Are you running an OODT version 0.7+?

Thanks,
rishi

On Oct 16, 2014, at 4:45 PM, MengYing Wang <me...@gmail.com>> wrote:

Dear Rishi,

When I try to build OODT RADiX with the command "mvn clean package -Pfm-solr-catalog", I get the "profile with id: 'fm-solr-catalog' has not been activated" error. Have you by any chance seen this error before? Thank you! Also, after the installation, no solr directory is found on my machine.

$ mvn clean package -Pfm-solr-catalog

[INFO] Scanning for projects...

[WARNING]

Profile with id: 'fm-solr-catalog' has not been activated.

[INFO] Reactor build order:

[INFO]   Data Management System

......

Best,
Mengying Wang

On Sun, Oct 5, 2014 at 3:05 PM, Verma, Rishi (398J) <Ri...@jpl.nasa.gov>> wrote:
Hi Mengying,

For integrating the OODT File Manager with Solr, you have a couple of options, depending on the type of deployment you are doing and what stage your software is at:

If you’re starting from scratch:
1. Use Vagrant Virtual Machine technology to get a pre-built OODT deployment connected to Solr in one command: https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT
2. Use OODT RADiX for a pre-built deployment directory containing the OODT File Manager, Workflow, Resource, etc., with Solr pre-integrated. RADiX provides pre-configured OODT deployments, so you don't need to check out and build individual OODT modules from source.
   See: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands
   Make sure to build with the command: mvn -Pfm-solr-catalog package  (see read me: http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/README.txt)
3. Connect OODT FM with Solr manually, see: https://cwiki.apache.org/confluence/display/OODT/Integrating+Solr+with+OODT+RADiX

If you already have a deployed OODT FM:
1. Follow these directions: https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide
2. If the above doesn’t work, then use OODT RADiX to create a FM and Solr deployment that works, and copy those directories to your currently deployed production directory.

Thanks - hope that helps!
Rishi

On Oct 5, 2014, at 10:56 AM, MengYing Wang <me...@gmail.com>> wrote:

Dear Prof. Mattmann and Rishi,

Attached are the nutch and solr directories:
nutch_solr.zip<https://docs.google.com/file/d/0B7PYVKDpy0jlSnI3U1lFcGY0WnM/edit?usp=drive_web>

As for problem (6), I could use the SolrIndexer instead. Following is my File Manager directory:

https://drive.google.com/file/d/0B7PYVKDpy0jlVTk2NWFFY2sycW8/view?usp=sharing

Thank you!

Best,
Mengying Wang



On Sun, Oct 5, 2014 at 9:25 AM, Christian Alan Mattmann <ma...@usc.edu>> wrote:
Thanks Angela. Great work!

Some comments/feedback:

(1) According to
https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
 use the Apache OODT Pushpull to crawl data files from
a remote server to the local machine [Failed, no data files downloaded at
all].
- This problem is not so urgent. Maybe I should use an FTP client tool,
e.g., FileZilla, to download data files from the remote FTP servers.

MY COMMENT: Please send me your PushPull directory zipped up. I will
take a look - Tyler, can you also look?

(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch, use
the Apache Nutch and Solr to crawl and index local data files [Failed,
No data is indexed in Solr].
- This problem is not so urgent. Maybe this feature only works for Nutch
2.x; my Nutch version is 1.9. Also, I could use the OODT Crawler to
ingest local files.


MY COMMENT: Please send me your nutch + solr directories, zipped up.
I will take a look.

(6) According to
https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide,
integrate the Apache OODT File Manager
with the Apache Solr [Failed, No product information available in the
Solr].
- It doesn't work. However, I could use (5) to integrate the OODT File
Manager and Solr.


MY COMMENT: Rishi, can you guys help Angela with OODT + Solr + FM? It's not
working for her.

Thanks!



++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattmann@usc.edu<ma...@usc.edu>
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: MengYing Wang <me...@gmail.com>>
Date: Saturday, October 4, 2014 at 9:12 PM
To: Chris Mattmann <ma...@usc.edu>>, "Mcgibbney, Lewis J (398J)"
<Le...@jpl.nasa.gov>>
Cc: Annie Bryant <an...@gmail.com>>, "Ramirez, Paul M (398J)"
<pa...@jpl.nasa.gov>>, Chris Mattmann
<Ch...@jpl.nasa.gov>>
Subject: Directed Research Weekly Report from 2014/09/29 - 2014/10/05

>Dear Prof. Mattmann,
>
>
>New status of the previous failed problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
>use the Apache OODT Push-Pull to crawl data files from a remote server to
>the local machine [Failed, no data files downloaded at all].
>
>
>- This problem is not so urgent. Maybe I should use an FTP client tool,
>e.g., FileZilla, to download data files from the remote FTP servers.
>
>
>(2) Use the Apache OODT Push-Pull to crawl webpages [Succeeded].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use Apache Nutch and Solr to crawl and index local data files [Failed, no
>data is indexed in Solr].
>
>
>- This problem is not so urgent. Maybe this feature only works for Nutch
>2.x; my Nutch version is 1.9. Also, I could use the OODT Crawler to
>ingest local files.
>
>
>(4) Integrate the Tika parser with Apache Nutch [Failed, no Tika fields
>available in Solr].
>
>
>- Still in progress.
>
>
>(5) According to
>https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+dump+a+File+Manager+Catalog,
>use the SolrIndexer to dump all product information from the Apache OODT
>File Manager to Apache Solr [Succeeded].
>
>
>(6) According to
>https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide,
>integrate the Apache OODT File Manager with Apache Solr [Failed, no
>product information available in Solr].
>
>
>- It doesn't work. However, I could use (5) to integrate the OODT File
>Manager and Solr.
>
>
>So far, I have two ways to crawl remote data and construct indexes in
>Solr:
>
>
>(1) moving data to the local machine using FileZilla -> developing a
>metadata extractor using Tika -> crawling the data directory using the
>OODT Crawler -> migrating product information to Solr using the
>SolrIndexer
>
>
>(2) crawling websites using Nutch -> indexing some basic metadata in Solr
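[Editor's note: the SolrIndexer step in pipeline (1) can be sketched as below. This is illustrative only: the class name comes from the wiki page cited in (5), but the classpath, property-file path, and both server URLs are placeholders for a particular deployment, and the exact flags should be checked against that page.]

```shell
# Illustrative SolrIndexer run -- paths and URLs are placeholders.
java -cp "$FILEMGR_HOME/lib/*" \
  -Dorg.apache.oodt.cas.filemgr.properties="$FILEMGR_HOME/etc/filemgr.properties" \
  org.apache.oodt.cas.filemgr.tools.SolrIndexer \
  --all \
  --fmUrl http://localhost:9000 \
  --solrUrl http://localhost:8983/solr
```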
>
>
>
>
>Thanks.
>
>
>Best,
>Mengying (Angela) Wang
>
>On Mon, Sep 29, 2014 at 12:22 PM, MengYing Wang
><me...@gmail.com>> wrote:
>
>Dear Prof. Mattmann,
>
>
>In the previous two weeks, I was trying to solve the following problems:
>
>
>(1) According to
>https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
>use the Apache OODT Push-Pull to crawl data files from a remote server to
>the local machine [Failed, couldn't find the data files].
>
>
>(2) Use the Apache OODT Push-Pull to crawl webpages [Failed, HttpClient
>ClassNotFoundException].
>
>
>(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>use Apache Nutch and Solr to crawl and index local data files [Failed, no
>data files found in Solr].
>
>
>Thank you.
>
>
>
>Best,
>Mengying Wang
>
>

---
Rishi Verma
NASA Jet Propulsion Laboratory
California Institute of Technology