Posted to dev@lucenenet.apache.org by Todd McIndoo <tm...@speedyscan.biz> on 2010/01/06 19:10:32 UTC

Question

Sorry if this is a duplicate.

We are using Lucene.Net version 2.0.0.4. I am trying to search a set of
documents that includes many PDFs, looking for the documents that contain
a specific word. We get results from text documents but not from PDFs. Is
there something we have to do to be able to search in PDF documents? All
IFilters have been installed on the computer, so I do not think that is
the issue.

Regards,

SPEEDY SOLUTIONS

Todd McIndoo


RE: Question

Posted by Karell Ste-Marie <st...@brain-bank.com>.
Hi Erik,

While I have no doubt that Solr is a capable product, I would like to
point out that the question may not be whether Solr can talk to .NET
(anything can talk to anything when you know what you are doing) but
rather the comfort level that an individual may have in committing to
support a product based on a platform (Java) that they don't use
regularly.

What attracted me to Lucene.NET is not the fact that it is based on
Lucene, which is a top product, but primarily the fact that it uses
technology that I am comfortable with on a day-to-day basis, is built
using source code that I am used to reading, and doesn't require me to
install "Yet Another Framework" on a production server and expect an
MCSE (who openly admits being allergic to Java, if only for "religious"
reasons) to then administer it.

Ed,

A few years back a gentleman assembled a lucenenet site, but
unfortunately it no longer exists. That site had several examples of
how to use IFilter to index just about anything and store it in Lucene.
Lucene, however, is not exactly what I would call a search
infrastructure [quickly puts on bulletproof vest and prepares to dodge
bullets and random objects] (it does not do the indexing) but a very
well designed repository (database) and search engine. Beyond storing
content and allowing you to search it, its responsibilities stop there.
However, from what I've seen in past source code, IFilter is quite easy
to implement. I'm sure if you use a combination of the Adapter and
Strategy patterns this can become trivial in any language.
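
To make the Adapter and Strategy idea a little more concrete, here is a
minimal VB.NET sketch. It is only an illustration under stated
assumptions: the names ITextExtractor, PlainTextExtractor,
DelegateExtractorAdapter and ExtractorRegistry are hypothetical and not
part of Lucene.NET, and the lambda registered for ".pdf" merely stands in
for a real IFilter call, which is not shown.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

' Strategy: one common interface for "turn a file into plain text".
Public Interface ITextExtractor
    Function ExtractText(ByVal fileName As String) As String
End Interface

' Strategy for files that are already plain text.
Public Class PlainTextExtractor
    Implements ITextExtractor

    Public Function ExtractText(ByVal fileName As String) As String Implements ITextExtractor.ExtractText
        Return File.ReadAllText(fileName)
    End Function
End Class

' Adapter: wraps a differently shaped text source behind the same interface.
' The delegate is a placeholder for a real IFilter wrapper (hypothetical).
Public Class DelegateExtractorAdapter
    Implements ITextExtractor

    Private ReadOnly _extract As Func(Of String, String)

    Public Sub New(ByVal extract As Func(Of String, String))
        _extract = extract
    End Sub

    Public Function ExtractText(ByVal fileName As String) As String Implements ITextExtractor.ExtractText
        Return _extract(fileName)
    End Function
End Class

' Chooses the extraction strategy by file extension (case-insensitive).
Public Class ExtractorRegistry
    Private ReadOnly _byExtension As New Dictionary(Of String, ITextExtractor)(StringComparer.OrdinalIgnoreCase)

    Public Sub Register(ByVal extension As String, ByVal extractor As ITextExtractor)
        _byExtension(extension) = extractor
    End Sub

    Public Function ExtractText(ByVal fileName As String) As String
        Dim extractor As ITextExtractor = Nothing
        If _byExtension.TryGetValue(Path.GetExtension(fileName), extractor) Then
            Return extractor.ExtractText(fileName)
        End If
        Throw New NotSupportedException("No extractor registered for " & fileName)
    End Function
End Class

Module ExtractorDemo
    Sub Main()
        Dim registry As New ExtractorRegistry()
        registry.Register(".txt", New PlainTextExtractor())
        ' Placeholder only: swap this lambda for a call into your own IFilter wrapper.
        registry.Register(".pdf", New DelegateExtractorAdapter(Function(f) "text extracted from " & f))

        ' The returned text is what you would then hand to Lucene as a field value.
        Console.WriteLine(registry.ExtractText("report.pdf"))
    End Sub
End Module
```

The point of the design is that the indexing code only ever sees plain
text; supporting a new format means registering one more extractor
rather than changing the code that talks to Lucene.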



Karell Ste-Marie
C.I.O. - BrainBank Inc



RE: Question

Posted by "Nicholas Paldino [.NET/C# MVP]" <ca...@caspershouse.com>.
Erik,

	It's the fact that the API is exactly the same (as well as the lines
of code, practically) which causes many of the issues in Lucene.NET (not
only in use but in implementation), as while Java and C# are very similar,
that doesn't guarantee the same results.

	But that's an issue for another email, one which many (including
myself) have dealt with.

	That being said, I would go with a Solr provider if such a thing
existed.  I'm debating whether or not to use the Lucene.NET library in my
application, or to try and find a preexisting Solr provider.  However, it
doesn't seem that there are many, and I really don't have the luxury of
setting up the environment myself (although I'm interested in using it,
since I can very easily talk whatever language it does over the wire with
.NET).

		- Nick


Re: Question

Posted by Erik Hatcher <er...@gmail.com>.
Ed - that's a reasonable critique, but the API is practically the same
between Lucene.Net and Lucene Java. There is a section contributed by
George in the upcoming 2nd edition of Lucene in Action - it's short and
says basically that.

But, rather than buy a commercial search engine, consider Solr!

I don't want to come here and steal any of Lucene.Net's thunder by
mentioning Solr, as no doubt Lucene.Net is the right fit for many
projects. Solr, though, is so much more than just Lucene, providing
enterprisey features (replication, distributed search, facets, and
more) that just can't be trivially/naively built on top of any flavor
of Lucene. And Solr is easily interfaced with .NET as a client. Of
course the hurdle then is "does Solr, a Java-based app, fit into the
operations of your deployment environment?". It's another technology
to add if the shop is purely .NET currently. But then again, it
literally does run everywhere quite easily.

	Erik



RE: Question

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.
I have no examples prepared, but they can be easily created as questions occur.  Here's a very simple example that creates an in-memory index of three documents and then reports the results of several searches.  When run from the command line, this is the result:

C:\>vb001
Query for cyan found 2 hits
   color set 1
   color set 2
Query for red but not green found 1 hits
   color set 3
Query for red or blue or magenta found 3 hits
   color set 3
   color set 2
   color set 1

---------------------------------------------------------------
Here's the program:


Imports Lucene.Net.Documents
Imports Lucene.Net.Index
Imports Lucene.Net.Search


Module Module1

    Sub Main()

        REM -- Create a simple in-memory index with three documents
        REM -- each document has name and color fields.

        Dim index As Lucene.Net.Store.RAMDirectory = New Lucene.Net.Store.RAMDirectory()
        Dim analyzer As Lucene.Net.Analysis.Standard.StandardAnalyzer = New Lucene.Net.Analysis.Standard.StandardAnalyzer()
        Dim writer As Lucene.Net.Index.IndexWriter = New Lucene.Net.Index.IndexWriter(index, analyzer, True, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED)
        Dim doc As Lucene.Net.Documents.Document

        doc = New Lucene.Net.Documents.Document()
        doc.Add(New Field("color", "red cyan green", Field.Store.YES, Field.Index.TOKENIZED))
        doc.Add(New Field("name", "color set 1", Field.Store.YES, Field.Index.TOKENIZED))

        writer.AddDocument(doc)

        doc = New Lucene.Net.Documents.Document()
        doc.Add(New Field("color", "cyan yellow magenta", Field.Store.YES, Field.Index.TOKENIZED))
        doc.Add(New Field("name", "color set 2", Field.Store.YES, Field.Index.TOKENIZED))

        writer.AddDocument(doc)

        doc = New Lucene.Net.Documents.Document()
        doc.Add(New Field("color", "blue yellow red", Field.Store.YES, Field.Index.TOKENIZED))
        doc.Add(New Field("name", "color set 3", Field.Store.YES, Field.Index.TOKENIZED))

        writer.AddDocument(doc)

        writer.Commit()
        writer.Close()

        REM ------------- Search the index

        Dim ixSearcher As IndexSearcher = New IndexSearcher(index)
        Dim qryParse As Lucene.Net.QueryParsers.QueryParser = New Lucene.Net.QueryParsers.QueryParser("color", analyzer)
        Dim testQry As Query
        Dim hits As Hits

        testQry = qryParse.Parse("cyan")
        hits = ixSearcher.Search(testQry)

        Console.WriteLine("Query for cyan found " + hits.Length().ToString() + " hits")

        Dim hitIterator As HitIterator = hits.Iterator
        Dim hitCurrent As Hit
        Dim foundDoc As Document

        While hitIterator.MoveNext = True
            hitCurrent = hitIterator.Current()
            foundDoc = hitCurrent.GetDocument()

            Console.WriteLine("   " + foundDoc.GetValues("name")(0))
        End While


        REM ------------- second search

        testQry = qryParse.Parse("red NOT green")
        hits = ixSearcher.Search(testQry)

        Console.WriteLine("Query for red but not green found " + hits.Length().ToString() + " hits")

        hitIterator = hits.Iterator
        While hitIterator.MoveNext = True
            hitCurrent = hitIterator.Current()
            foundDoc = hitCurrent.GetDocument()

            Console.WriteLine("   " + foundDoc.GetValues("name")(0))
        End While


        REM ------------- third search

        testQry = qryParse.Parse("red OR blue OR magenta")
        hits = ixSearcher.Search(testQry)

        Console.WriteLine("Query for red or blue or magenta found " + hits.Length().ToString() + " hits")

        hitIterator = hits.Iterator
        While hitIterator.MoveNext = True
            hitCurrent = hitIterator.Current()
            foundDoc = hitCurrent.GetDocument()

            Console.WriteLine("   " + foundDoc.GetValues("name")(0))
        End While

        ixSearcher.Close()


    End Sub

End Module


- Neal

-----Original Message-----
From: tony njedeh [mailto:njedeh@yahoo.com] 
Sent: Thursday, January 07, 2010 4:30 PM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

Hi Neal,
 
I would like to see the examples you have, using Lucene.NET from VB ?

Njedeh

--- On Thu, 1/7/10, Granroth, Neal V. <ne...@thermofisher.com> wrote:


From: Granroth, Neal V. <ne...@thermofisher.com>
Subject: RE: Question
To: "lucene-net-dev@lucene.apache.org" <lu...@lucene.apache.org>
Date: Thursday, January 7, 2010, 3:05 PM


IFilter is a Microsoft COM interface implemented by components that extract searchable content from a specific document format (Word, PDF, etc.). Lucene.NET does not use these components directly; they are used by whatever software you construct to populate the Lucene index with searchable content.

There is a lot of information on IFilter on Microsoft's site; and I think their optional use is beyond the scope of the Lucene.NET project.
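
Since the extraction step lives in your own software rather than in Lucene.NET, it can help to look at what that step produces before anything reaches the index. The following is only a small VB.NET sketch of such a check; HasIndexableText and the sample file names are made up for illustration, and the text passed in is assumed to come from whatever extractor (IFilter or otherwise) you use.

```vbnet
Imports System

Module ExtractionCheck

    ' Returns True when the extracted text is worth handing to the index.
    ' An empty result for a PDF suggests looking at the extraction step
    ' before looking at the Lucene.NET calls.
    Function HasIndexableText(ByVal fileName As String, ByVal extractedText As String) As Boolean
        If String.IsNullOrEmpty(extractedText) OrElse extractedText.Trim().Length = 0 Then
            Console.WriteLine("No text extracted from " & fileName & "; nothing to index.")
            Return False
        End If

        Console.WriteLine(fileName & ": " & extractedText.Length.ToString() & " characters extracted.")
        Return True
    End Function

    Sub Main()
        ' Sample inputs only; in real code the second argument is the extractor's output.
        HasIndexableText("notes.txt", "red cyan green")
        HasIndexableText("scan.pdf", "")
    End Sub

End Module
```

Running the same check over a TXT file and a PDF makes it easy to see whether the two cases differ before the text ever reaches the IndexWriter.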

Would it help if I put together some simple examples of using Lucene.NET from VB ?

- Neal

-----Original Message-----
From: Ed Jones [mailto:Edmund.Jones@warc.com] 
Sent: Thursday, January 07, 2010 1:39 PM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

Remember that not everyone uses C#; many people use VB.NET, and although it's relatively simple to move it over to C#, moving from C# to Java is just one extra step where things can go wrong.

At the time (3 years ago) I offered to spend time trying to make a set of examples such as how to use iFilters (I think that was the term) but nobody was interested so my attention moved elsewhere.

-----Original Message-----
From: Granroth, Neal V. [mailto:neal.granroth@thermofisher.com] 
Sent: 07 January 2010 19:37
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

I am very surprised by this comment.
There is so much similarity between Java and C# that I found absolutely no difficulty with the discussion and examples in "Lucene in Action" and in directly applying the techniques to my C#/.NET projects.

Maybe it would be helpful for some of those who find the java examples confusing to explain specifically why they are confusing.  Then we might consider putting together some type of short "Guide to understanding Lucene for C# developers" or FAQ on the web site.

- Neal

-----Original Message-----
From: Ed Jones [mailto:Edmund.Jones@warc.com] 
Sent: Thursday, January 07, 2010 3:57 AM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

All I can say is that we found the lack of examples for .NET problematic, because when you are not up to speed with Java there are a lot of basic hurdles to overcome.

-----Original Message-----
From: Olivier Spinelli [mailto:olivier.spinelli@invenietis.fr] 
Sent: 07 January 2010 09:55
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

<quote>
Lucene.Net sticks to the APIs and classes used in the original Java
implementation of Lucene. The API names as well as class names are preserved
with the intention of giving Lucene.Net the look and feel of the C# language
and the .NET Framework. For example, the method Hits.length() in the Java
implementation now reads Hits.Length() in the C# port. 

In addition to the APIs and classes port to C#, the algorithm of Java Lucene
is ported to C# Lucene. This means an index created with Java Lucene is
back-and-forth compatible with the C# Lucene; both at reading, writing and
updating. In fact a Lucene index can be concurrently searched and updated
using Java Lucene and C# Lucene processes. 
</quote>

It's merely all about switching from camelCase to PascalCase...

HTH

Spi


-----Original Message-----
From: Ed Jones [mailto:Edmund.Jones@warc.com] 
Sent: Thursday, January 7, 2010 10:27
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

My problem with Lucene in Action and all the examples on the internet is
that they were all in Java, and you have to understand exactly what Java
is doing to understand it all properly. It's for this very reason we had
to shun using Lucene.Net in major projects. I wanted dearly to use it,
but the learning curve was far too steep and there appear to be very
few .NET code examples or sources of help.

Instead we have invested a significant amount of money in buying a
much more commercial search engine.

I am keeping an eye on the Lucene.Net project though, in case it can be
used in other parts of our business, but again the same will apply: we
will need more non-Java examples.

Ed

-----Original Message-----
From: Roger Chapman [mailto:roger@stormid.com] 
Sent: 07 January 2010 09:21
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

From what I can remember the book Lucene in Action has a good section on
indexing documents and PDFs: http://www.manning.com/hatcher2/

Roger.

-----Original Message-----
From: Ben Martz [mailto:benmartz@gmail.com]
Sent: 06 January 2010 19:51
To: lucene-net-dev@lucene.apache.org
Cc: <lu...@lucene.apache.org>
Subject: Re: Question



Todd,

I would definitely take Michael's advice to learn more about the
overall issue before you get too far.

A quick answer that may help is Windows does not ship with an iFilter
for PDF built-in. Installing Adobe Reader 8 or higher will install a
decent PDF iFilter.

I am a little surprised by your question though - I assume that you
have access to your own source code and could examine the result from
the iFilter that's being fed to the IndexWriter and compare the
behavior in the TXT case with the behavior in the PDF case?

Cheers,
Ben

Sent from my iPhone

On Jan 6, 2010, at 10:13, Michael Garski <mg...@myspace-inc.com> wrote:

> Todd,
>
> You'll need some way to extract the text from the PDF prior to
> indexing.  I'm not familiar with any packages that can do that but I
> have heard of them.  You may want to try searching the mailing list
> to see if there has been mention of one previously.  Lucid
> Imagination hosts a great mailing list search tool at
> http://www.lucidimagination.com/search/
>
> Michael




RE: Question

Posted by Ed Jones <Ed...@warc.com>.
Thanks for the offer but this was needed a few years ago. We've since gone with a high end search platform.




RE: Question

Posted by tony njedeh <nj...@yahoo.com>.
Hi Neal,
 
I would like to see the examples you have, using Lucene.NET from VB ?

Njedeh

--- On Thu, 1/7/10, Granroth, Neal V. <ne...@thermofisher.com> wrote:


From: Granroth, Neal V. <ne...@thermofisher.com>
Subject: RE: Question
To: "lucene-net-dev@lucene.apache.org" <lu...@lucene.apache.org>
Date: Thursday, January 7, 2010, 3:05 PM


IFilter is a Microsoft COM interface implemented by components that extract searchable content from a specific document format (Word, PDF, etc.) Lucene.NET does not use these components directly, they are used by whatever software you construct to populate the Lucene index with searchable content.

There is a lot of information on IFilter on Microsoft's site; and I think their optional use is beyond the scope of the Lucene.NET project.

Would it help if I put together some simple examples of using Lucene.NET from VB ?

- Neal

-----Original Message-----
From: Ed Jones [mailto:Edmund.Jones@warc.com] 
Sent: Thursday, January 07, 2010 1:39 PM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

Remember that not everyone uses c#, many people use VB.net and although it's relatively simple to move it over to c#, moving from c# to Java is just one extra step where things can go wrong.

At the time (3 years ago) I offered to spend time trying to make a set of examples such as how to use iFilters (I think that was the term) but nobody was interested so my attention moved elsewhere.

-----Original Message-----
From: Granroth, Neal V. [mailto:neal.granroth@thermofisher.com] 
Sent: 07 January 2010 19:37
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

I am very surprised by this comment.
There is so much similarity between Java and C# that I found absolutely no difficulty with the discussion and examples in "Lucene in Action" and in directly applying the techniques to my C#/.NET projects.

Maybe it would be helpful if some of those who find the Java examples confusing explained specifically why they are confusing. Then we might consider putting together a short "Guide to understanding Lucene for C# developers" or a FAQ on the web site.

- Neal

-----Original Message-----
From: Ed Jones [mailto:Edmund.Jones@warc.com] 
Sent: Thursday, January 07, 2010 3:57 AM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

All I can say is that we found the lack of examples for .NET problematic: when you are not up to speed with Java, there are a lot of basic hurdles to overcome.

-----Original Message-----
From: Olivier Spinelli [mailto:olivier.spinelli@invenietis.fr] 
Sent: 07 January 2010 09:55
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

<quote>
Lucene.Net sticks to the APIs and classes used in the original Java
implementation of Lucene. The API names as well as class names are preserved
with the intention of giving Lucene.Net the look and feel of the C# language
and the .NET Framework. For example, the method Hits.length() in the Java
implementation now reads Hits.Length() in the C# port. 

In addition to the APIs and classes port to C#, the algorithm of Java Lucene
is ported to C# Lucene. This means an index created with Java Lucene is
back-and-forth compatible with the C# Lucene; both at reading, writing and
updating. In fact a Lucene index can be concurrently searched and updated
using Java Lucene and C# Lucene processes. 
</quote>

It's merely all about switching from camelCase to PascalCase...

HTH

Spi


-----Original Message-----
From: Ed Jones [mailto:Edmund.Jones@warc.com] 
Sent: Thursday, 7 January 2010 10:27
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

My problem with Lucene in Action and all the examples on the internet is
that they are all in Java, and you have to understand exactly what the Java
is doing to follow them properly. It's for this very reason we had
to shun Lucene.net in major projects. I dearly wanted to use it,
but the learning curve was far too steep and there appear to be very
few .NET code examples or other help.

Instead we have invested a significant amount of money in buying
a commercial search engine.

I am keeping an eye on the Lucene.net project, though, in case it can be
used in other parts of our business, but the same will apply: we
will need more non-Java examples.

Ed

-----Original Message-----
From: Roger Chapman [mailto:roger@stormid.com] 
Sent: 07 January 2010 09:21
To: lucene-net-dev@lucene.apache.org
Subject: RE: Question

From what I can remember, the book Lucene in Action has a good section on
indexing documents and PDFs: http://www.manning.com/hatcher2/



Roger.





-----Original Message-----
From: Ben Martz [mailto:benmartz@gmail.com]
Sent: 06 January 2010 19:51
To: lucene-net-dev@lucene.apache.org
Cc: <lu...@lucene.apache.org>
Subject: Re: Question



Todd,

I would definitely take Michael's advice to learn more about the
overall issue before you get too far.

A quick answer that may help is Windows does not ship with an iFilter
for PDF built-in. Installing Adobe Reader 8 or higher will install a
decent PDF iFilter.

I am a little surprised by your question though - I assume that you
have access to your own source code and could examine the result from
the iFilter that's being fed to the IndexWriter and compare the
behavior in the TXT case with the behavior in the PDF case?

Cheers,
Ben

Sent from my iPhone

On Jan 6, 2010, at 10:13, Michael Garski <mg...@myspace-inc.com> wrote:

> Todd,
>
> You'll need some way to extract the text from the PDF prior to
> indexing.  I'm not familiar with any packages that can do that but I
> have heard of them.  You may want to try searching the mailing list
> to see if there has been mention of one previously.  Lucid
> Imagination hosts a great mailing list search tool at
> http://www.lucidimagination.com/search/
>
> Michael
>
> -----Original Message-----
> From: Todd McIndoo [mailto:tmcindoo@speedyscan.biz]
> Sent: Wednesday, January 06, 2010 10:11 AM
> To: lucene-net-dev@lucene.apache.org
> Subject: Question
>
> Sorry if this is duplicate
>
> We are using Lucene.net of version 2.0.0.4. I am trying to search a document
> which contains lots of PDFs. I want to search a document, which contains a
> specific word, using Lucene.net. We are yielding results in text documents
> but not in PDF. Is there something we have to do to be able to search in PDF
> Documents. All ifilters have been installed on the computer so I do not
> think that is the issue.
>
> Regards,
> SPEEDY SOLUTIONS
>
> Todd McIndoo



RE: Question

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.
IFilter is a Microsoft COM interface implemented by components that extract searchable content from a specific document format (Word, PDF, etc.). Lucene.NET does not use these components directly; they are used by whatever software you write to populate the Lucene index with searchable content.

Microsoft's site has a lot of information on IFilter, and I think the optional use of IFilters is beyond the scope of the Lucene.NET project.

Would it help if I put together some simple examples of using Lucene.NET from VB?

- Neal
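
To illustrate the division of labour described above (text extraction happens in your own code, and only the extracted text reaches the index), here is a minimal sketch in plain Java using only the standard library. Everything in it is an assumption made for illustration: TextExtractor is a hypothetical stand-in for an IFilter-style component, ToyIndex is a toy in-memory index standing in for a Lucene index, and neither is part of Lucene or Lucene.NET.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExtractThenIndex {
    // Chooses an extractor by file extension; returns false if nothing was indexed.
    static boolean indexFile(ToyIndex index, Map<String, TextExtractor> extractors,
                             String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase();
        TextExtractor extractor = extractors.get(ext);
        if (extractor == null) {
            return false; // no extractor registered for this format
        }
        String text = extractor.extract(fileName, content);
        if (text == null || text.trim().isEmpty()) {
            return false; // the extractor ran but produced no text
        }
        index.add(fileName, text);
        return true;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, TextExtractor> extractors = new HashMap<>();
        // Only plain text is registered here, so PDFs never reach the index.
        extractors.put("txt", (name, bytes) -> new String(bytes, StandardCharsets.UTF_8));

        byte[] body = "Invoice for scanning services".getBytes(StandardCharsets.UTF_8);
        System.out.println(indexFile(index, extractors, "notes.txt", body)); // indexed
        System.out.println(indexFile(index, extractors, "scan.pdf", body));  // skipped
        System.out.println(index.search("invoice"));
    }
}

// Hypothetical stand-in for an IFilter-style component: turns file bytes into plain text.
interface TextExtractor {
    String extract(String fileName, byte[] content);
}

// Toy in-memory inverted index; it stands in for a Lucene index and is NOT the Lucene API.
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits the text into lower-case terms and records which document contains each one.
    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents that contain the given word.
    Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }
}
```

The boolean returned by indexFile is the useful part when debugging a case like Todd's: logging it per file shows whether a PDF was skipped for lack of an extractor or because the extractor returned no text, which is the same comparison Ben suggests between the TXT and PDF cases.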




RE: Question

Posted by Ed Jones <Ed...@warc.com>.
Remember that not everyone uses C#; many people use VB.NET, and although it's relatively simple to move code over to C#, moving from C# to Java is just one extra step where things can go wrong.

At the time (3 years ago) I offered to put together a set of examples, such as how to use iFilters (I think that was the term), but nobody was interested, so my attention moved elsewhere.




RE: Question

Posted by "Granroth, Neal V." <ne...@thermofisher.com>.
I am very surprised by this comment.
There is so much similarity between Java and C# that I found absolutely no difficulty with the discussion and examples in "Lucene in Action" and in directly applying the techniques to my C#/.NET projects.

Maybe it would be helpful if some of those who find the Java examples confusing explained specifically why they are confusing. Then we might consider putting together a short "Guide to understanding Lucene for C# developers" or a FAQ on the web site.

- Neal




RE: Question

Posted by Ed Jones <Ed...@warc.com>.
All I can say is that we found the lack of examples for .NET problematic: when you are not up to speed with Java, there are a lot of basic hurdles to overcome.




RE: Question

Posted by Olivier Spinelli <ol...@invenietis.fr>.
<quote>
Lucene.Net sticks to the APIs and classes used in the original Java
implementation of Lucene. The API names as well as class names are preserved
with the intention of giving Lucene.Net the look and feel of the C# language
and the .NET Framework. For example, the method Hits.length() in the Java
implementation now reads Hits.Length() in the C# port. 

In addition to the APIs and classes port to C#, the algorithm of Java Lucene
is ported to C# Lucene. This means an index created with Java Lucene is
back-and-forth compatible with the C# Lucene; both at reading, writing and
updating. In fact a Lucene index can be concurrently searched and updated
using Java Lucene and C# Lucene processes. 
</quote>

It's merely all about switching from camelCase to PascalCase...

HTH

Spi
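
As a small illustration of the naming convention quoted above, the sketch below (plain Java, standard library only) upper-cases the first letter of a Java-style member name to get the name used in the C# port. The class NameMapping and its method are hypothetical helpers written for this note; only the Hits.length() to Hits.Length() pair comes from the quoted text, the other names are illustrative, and individual members may differ between releases.

```java
class NameMapping {
    // Upper-cases the first character of a Java-style (camelCase) member name.
    static String toPascalCase(String javaName) {
        if (javaName == null || javaName.isEmpty()) {
            return javaName;
        }
        return Character.toUpperCase(javaName.charAt(0)) + javaName.substring(1);
    }

    public static void main(String[] args) {
        // "length" is the example from the quoted text; the others are illustrative.
        for (String name : new String[] {"length", "addDocument", "close"}) {
            System.out.println(name + " -> " + toPascalCase(name));
        }
    }
}
```

When reading a Java example, applying this rule to each method name is usually enough to find the corresponding member in the port; the call structure around it stays the same.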





RE: Question

Posted by Ed Jones <Ed...@warc.com>.
My problem with Lucene in Action and all the examples on the internet is
that they are all in Java, and you have to understand exactly what the Java
is doing to follow them properly. It's for this very reason we had
to shun Lucene.net in major projects. I dearly wanted to use it,
but the learning curve was far too steep and there appear to be very
few .NET code examples or other help.

Instead we have invested a significant amount of money in buying
a commercial search engine.

I am keeping an eye on the Lucene.net project, though, in case it can be
used in other parts of our business, but the same will apply: we
will need more non-Java examples.

Ed




RE: Question

Posted by Roger Chapman <ro...@stormid.com>.
From what I can remember, the book Lucene in Action has a good section on indexing documents and PDFs: http://www.manning.com/hatcher2/



Roger.







Re: Question

Posted by Ben Martz <be...@gmail.com>.
Todd,

I would definitely take Michael's advice to learn more about the  
overall issue before you get too far.

A quick answer that may help: Windows does not ship with a built-in
iFilter for PDF. Installing Adobe Reader 8 or higher will install a
decent PDF iFilter.

I am a little surprised by your question though - I assume that you  
have access to your own source code and could examine the result from  
the iFilter that's being fed to the IndexWriter and compare the  
behavior in the TXT case with the behavior in the PDF case?
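
As an illustration of the comparison suggested above, here is a minimal Python sketch (not Lucene.NET code). It assumes the extraction step has already produced one string per file; the file names, the sample text, and the helpers `tokenize`, `build_index`, and `files_without_terms` are all hypothetical. A toy inverted index stands in for the real index: a file whose extracted text is empty contributes no terms, so it can never match a query, which would be consistent with TXT files returning hits while PDFs do not.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(extracted):
    """Build a toy inverted index mapping each term to the set of files containing it.

    `extracted` maps a file name to the text the extraction step returned for it.
    """
    index = {}
    for name, text in extracted.items():
        for term in tokenize(text or ""):
            index.setdefault(term, set()).add(name)
    return index


def files_without_terms(extracted):
    """Return the files whose extracted text yields no tokens at all."""
    return sorted(name for name, text in extracted.items() if not tokenize(text or ""))


# Hypothetical extraction results: the filter returned nothing for the PDF.
extracted = {
    "notes.txt": "Invoice for scanning services",
    "report.pdf": "",
}

index = build_index(extracted)
hits = sorted(index.get("invoice", set()))
empty = files_without_terms(extracted)

print("files matching 'invoice':", hits)
print("files that produced no terms:", empty)
```

If the PDF shows up in the second list, the problem lies in the extraction step rather than in the search itself, and the text handed to the indexer for that file is the place to look first.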

Cheers,
Ben

Sent from my iPhone


RE: Question

Posted by Michael Garski <mg...@myspace-inc.com>.
Todd,

You'll need some way to extract the text from the PDF prior to indexing.  I'm not familiar with any specific packages that can do that, but I have heard they exist.  You may want to search the mailing list to see if one has been mentioned previously.  Lucid Imagination hosts a great mailing list search tool at http://www.lucidimagination.com/search/

Michael
