Posted to solr-user@lucene.apache.org by ahammad <ah...@gmail.com> on 2012/01/17 14:38:48 UTC

How can I index this?

Hello,

I am looking into indexing two data sources. One of those is a standard
website and the other is a SharePoint site. The problem is that I have no
direct database access. Normally I would just use the DIH and get what I
need from the DB. I do have a Java DAO (data access object) class that I am
using to fetch information directly for a different purpose.

In cases like this, what would be the best way to index the data? Should I
somehow integrate Nutch as the crawler? Should I write a custom DIH? Can I
use the DAO that I have in conjunction with the DIH?

I am really looking for some recommendations here. I do have a few hacks
that can be done (copy the data in a DB and index with DIH), but I am
interested in the proper way. Any insight will be greatly appreciated.

Cheers

--
View this message in context: http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666106.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How can I index this?

Posted by Matthew Parker <mp...@apogeeintegration.com>.

I just started trying Apache ManifoldCF, which has a SharePoint connector
that appears to integrate through SharePoint's web services.

Nutch also has a SharePoint connector, and it can publish documents into
Solr for indexing.

On Wed, Jan 18, 2012 at 3:34 PM, ahammad <ah...@gmail.com> wrote:

> That would certainly work.
>
> Just as a general thing, how would one go about indexing SharePoint content
> anyway? I heard about the SharePoint connector for Lucene but I know nothing
> about it. Is there a standard best practice method?
>
> Also, what are your thoughts on extending the DIH? Is that recommended?
>
> Thanks for the input :)



-- 
Regards,

Matt Parker (CTR)
Senior Software Architect
Apogee Integration, LLC
5180 Parkstone Drive, Suite #160
Chantilly, Virginia 20151
703.272.4797 (site)
703.474.1918 (cell)
www.apogeeintegration.com

Re: How can I index this?

Posted by ahammad <ah...@gmail.com>.

That would certainly work.

Just as a general thing, how would one go about indexing SharePoint content
anyway? I heard about the SharePoint connector for Lucene but I know nothing
about it. Is there a standard best practice method?

Also, what are your thoughts on extending the DIH? Is that recommended?

Thanks for the input :)

Re: How can I index this?

Posted by Erick Erickson <er...@gmail.com>.

Well, if you can make an HTTP request, you can parse the return and
stuff it into a SolrInputDocument in SolrJ and then send it to Solr. At
least that seems possible if I'm understanding your setup. There are
other Solr clients that allow similar processes, but the Java version is
the one I know best.
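
A rough sketch of that flow, for illustration only. To stay free of extra
dependencies it does not use SolrJ: it builds Solr's XML update message by
hand and POSTs it with the JDK's own HTTP client (Java 11+), which is a
swapped-in route to the same result. With SolrJ you would fill a
SolrInputDocument instead. The class name, the field names, and the update
URL are all assumptions, and the fields would have to exist in your schema.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: class, field names, and URL are illustrative assumptions.
class SolrUpdateSketch {

    // Escape the characters that are special in XML text and attribute values.
    // Control characters that are illegal in XML are not handled here.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Build a Solr XML update message (<add><doc>...</doc></add>) for one document.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
               .append(escapeXml(e.getValue())).append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    // POST the message to Solr's update handler and return the HTTP status code.
    // commit=true asks Solr to commit as part of this request.
    static int post(String solrUpdateUrl, String xml) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(solrUpdateUrl + "?commit=true"))
                .header("Content-Type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(xml))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // In practice these values would come from whatever your existing
        // class parses out of its HTTP responses.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "page-1");
        fields.put("title", "Q&A <draft>");
        String xml = toAddXml(fields);
        System.out.println(xml);
        // Only talk to Solr when an update URL is passed on the command line.
        if (args.length > 0) {
            System.out.println("HTTP " + post(args[0], xml));
        }
    }
}
```

Run without arguments it only prints the message. Pass your own update URL
(something like http://localhost:8983/solr/update, adjusted to your install)
to actually send it. For real volumes you would batch several docs per
request and commit once at the end rather than on every add.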

Best
Erick

On Tue, Jan 17, 2012 at 11:10 AM, ahammad <ah...@gmail.com> wrote:
> Perhaps I was a little confusing...
>
> Normally when I have DB access, I do a regular indexing process using DIH.
> For these two sources, I do not have direct DB access. I can only view the
> two sources like any end-user would.
>
> I do have a Java class that can get the information that I need. That class
> gets that information (through HTTP requests) and does not have DB access.
> That class is currently being used for other purposes but I can take it and
> use it for Solr as well. Does that make sense?
>
> Knowing all that, namely the fact that I cannot directly access the DB, and
> I can make HTTP requests to get the info, how can I index that info?
>
> Please let me know if this clarifies what I am trying to do.
>
> Regards

Re: How can I index this?

Posted by ahammad <ah...@gmail.com>.

Perhaps I was a little confusing...

Normally when I have DB access, I do a regular indexing process using DIH.
For these two sources, I do not have direct DB access. I can only view the
two sources like any end-user would.

I do have a Java class that can get the information that I need. That class
gets that information (through HTTP requests) and does not have DB access.
That class is currently being used for other purposes but I can take it and
use it for Solr as well. Does that make sense?

Knowing all that, namely the fact that I cannot directly access the DB, and
I can make HTTP requests to get the info, how can I index that info? 

Please let me know if this clarifies what I am trying to do.

Regards

Re: How can I index this?

Posted by Erick Erickson <er...@gmail.com>.

This sounds like, for the database source, that using SolrJ would
be the way to go. Assuming you can access the database from
Java this is pretty easy.

As for the website, Nutch is certainly an option...

But I'm a little puzzled. You mention a website and SharePoint
as your sources, then ask about accessing the DB. How are
all these related?

Best
Erick

On Tue, Jan 17, 2012 at 8:38 AM, ahammad <ah...@gmail.com> wrote:
> Hello,
>
> I am looking into indexing two data sources. One of those is a standard
> website and the other is a SharePoint site. The problem is that I have no
> direct database access. Normally I would just use the DIH and get what I
> need from the DB. I do have a Java DAO (data access object) class that I am
> using to fetch information directly for a different purpose.
>
> In cases like this, what would be the best way to index the data? Should I
> somehow integrate Nutch as the crawler? Should I write a custom DIH? Can I
> use the DAO that I have in conjunction with the DIH?
>
> I am really looking for some recommendations here. I do have a few hacks
> that can be done (copy the data in a DB and index with DIH), but I am
> interested in the proper way. Any insight will be greatly appreciated.
>
> Cheers