
Posted to solr-user@lucene.apache.org by Fatima Issawi <is...@qu.edu.qa> on 2013/12/26 06:24:41 UTC

How to use Solr in my project

Hello,

First off, I apologize if this was sent twice. I was having issues subscribing to the list.

I'm a complete noob in Solr (and indexing), so I'm hoping someone can help me figure out how to implement Solr in my project. I have gone through some tutorials online and I was able to import and query text in some Arabic PDF documents.

We have some scans of historical handwritten Arabic documents that will have text extracted into a database (or PDF). We would like the user to be able to search the document for text, then have the scanned image show up in a viewer with the text highlighted. I would like to use Solr to index the text in the documents, but I'm unsure how to store and get the "word location" (the area of text that needs to be highlighted) in Solr.

Do I index and store the full document in Solr? How do I link the "search term" to the "word location" on the page?
The only way I can figure out how to do this involves querying the database for the "word" and "location" after querying Solr for the search term, but does that defeat the purpose of using Solr?

I would really appreciate help figuring this out.

Thank you,
Fatima

RE: How to use Solr in my project

Posted by Fatima Issawi <is...@qu.edu.qa>.

I think we may have up to 100,000 books, but I don't think the site will have a lot of traffic.

Thank you for your help. I think it is a little clearer now, and I will try to implement it.


Re: How to use Solr in my project

Posted by Gora Mohanty <go...@mimirtech.com>.

On 30 December 2013 11:27, Fatima Issawi <is...@qu.edu.qa> wrote:
> Hi again,
>
> We have another program that will be extracting the text, and it will be extracting the top right and bottom left corners of the words. You are right, I do expect to have a lot of data.
>
> When would solr start experiencing issues in performance? Is it better to:
>
> INDEX:
> - document metadata
> - words
>
> STORE:
> - document metadata
> - words
> - coordinates
>
> in Solr rather than in the database? How would I set up the schema in order to store the coordinates?

You do not mention the number of documents, but for a few
tens of thousands of documents, your problem should be tractable
in Solr. Not sure what document metadata you have, and if you need
to search through it, but what I would do is index the words, and
store the coordinates in Solr, the assumption being that words are
searched but not retrieved from Solr, while coordinates are retrieved
but never searched.

Off the top of my head, each record can be:
<doc1> <pg1> <word1> <coord_x1> <coord_y1> <coord_x2> <coord_y2>
<doc1> <pg1> <word2> ....
...
<doc1> <pg2> ...
...
<doc2> ...

* <doc_id> and <pg_id> from Solr search results let you retrieve the image
  from the filesystem
* The coordinates allow post-processing to highlight the word in the image
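
A minimal Python sketch of the per-word records described above. The field names (doc_id, page_id, word, x1, y1, x2, y2) and the format of the "id" key are illustrative assumptions, not something fixed in this thread. In a matching schema, "word" would be indexed and the coordinate fields would be stored only. The resulting JSON array has the general shape that Solr's JSON update handler accepts.

```python
import json

def make_word_records(doc_id, page_id, words):
    """Build one flat record per word occurrence on a page.

    `words` is a list of (word, x1, y1, x2, y2) tuples, where (x1, y1) and
    (x2, y2) are the two stored corners of the word in the scanned image.
    """
    records = []
    for position, (word, x1, y1, x2, y2) in enumerate(words):
        records.append({
            # Hypothetical unique key: document, page, and position on the page.
            "id": f"{doc_id}_{page_id}_{position}",
            "doc_id": doc_id,
            "page_id": page_id,
            "word": word,        # searched, so it would be an indexed field
            "x1": x1, "y1": y1,  # only retrieved, so stored but not indexed
            "x2": x2, "y2": y2,
        })
    return records

# Made-up sample data for two words on one page.
records = make_word_records("doc1", "pg1", [
    ("كتاب", 940, 120, 860, 160),
    ("قديم", 850, 120, 780, 160),
])
payload = json.dumps(records, ensure_ascii=False)
```

Each search hit then carries doc_id and page_id (to find the image) together with the four coordinates (to draw the highlight), so no second lookup is needed.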

As always, set up a prototype system with a subset of the records in order
to measure performance.

> If storing the coordinates in solr is not recommended, what would be the best process to get the coordinates after indexing the words and metadata? Do I search in solr and then use the documentID to then search the database for the words and coordinates?

You could do that, but Solr by itself should be fine.
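
For comparison, the two-step alternative (search in Solr, then fetch coordinates from the database) could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the real one; the table name word_coords, its columns, and the shape of the Solr hits are all assumptions made for illustration.

```python
import sqlite3

def lookup_coordinates(conn, term, hits):
    """Return (doc_id, page_id, x1, y1, x2, y2) rows for `term` on each hit page."""
    rows = []
    for doc_id, page_id in hits:
        rows.extend(conn.execute(
            "SELECT doc_id, page_id, x1, y1, x2, y2 FROM word_coords "
            "WHERE doc_id = ? AND page_id = ? AND word = ?",
            (doc_id, page_id, term),
        ).fetchall())
    return rows

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_coords ("
    "doc_id TEXT, page_id TEXT, word TEXT, "
    "x1 INTEGER, y1 INTEGER, x2 INTEGER, y2 INTEGER)"
)
conn.executemany(
    "INSERT INTO word_coords VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("doc1", "pg1", "كتاب", 940, 120, 860, 160),
        ("doc1", "pg2", "قديم", 850, 120, 780, 160),
    ],
)

# Stand-in for the (doc_id, page_id) pairs that a Solr search would return.
solr_hits = [("doc1", "pg1")]
coords = lookup_coordinates(conn, "كتاب", solr_hits)
```

Note that the exact string match in the SQL query will not necessarily agree with what Solr matched after its own text analysis, which is one reason keeping the coordinates in Solr is simpler.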

Regards,
Gora

RE: How to use Solr in my project

Posted by Fatima Issawi <is...@qu.edu.qa>.

Hi again,

We have another program that will be extracting the text, and it will be extracting the top right and bottom left corners of the words. You are right, I do expect to have a lot of data.
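
As a rough illustration of what the two corners give you, the Python sketch below turns a (top right, bottom left) pair into a highlight rectangle. It assumes pixel coordinates with the origin at the top-left of the image and y increasing downwards; the function name and sample values are made up.

```python
def to_box(top_right, bottom_left):
    """Convert (top-right, bottom-left) corners to (left, top, width, height).

    Assumes image coordinates: origin at the top-left corner, y grows downwards.
    """
    right, top = top_right
    left, bottom = bottom_left
    return (left, top, right - left, bottom - top)

# A word whose top-right corner is at (940, 120) and bottom-left at (860, 160).
box = to_box((940, 120), (860, 160))
```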

When would solr start experiencing issues in performance? Is it better to:

INDEX: 
- document metadata 
- words  

STORE: 
- document metadata
- words 
- coordinates 

in Solr rather than in the database? How would I set up the schema in order to store the coordinates?

If storing the coordinates in solr is not recommended, what would be the best process to get the coordinates after indexing the words and metadata? Do I search in solr and then use the documentID to then search the database for the words and coordinates?

Thanks for your patience. I don't have much choice in the use case. 



Re: How to use Solr in my project

Posted by Gora Mohanty <go...@mimirtech.com>.

On 29 December 2013 11:10, Fatima Issawi <is...@qu.edu.qa> wrote:
[...]
> We will have the full text stored, but we want to highlight the text in the original image. I expect to process the image after retrieval. We do plan on storing the (x, y) coordinates of the words in a database - I suspected that it would be too expensive to store them in Solr. I guess I'm still confused about how to use Solr to index the document, but then retrieve the (x, y) coordinates of the search term from the database. Is this possible? If it is, can you give an example of how this can be done?

Storing and retrieving the coordinates from Solr will likely be
faster than from the database. However, I still think that you
should think more carefully about your use case of highlighting
the images. It can be done, but it is a significant amount of work,
and will need storage and computational resources.
1. For highlighting in the image, you will need to store two sets
    of coordinates (e.g., top right and bottom left corners) as you
    do not know the length of the word in the image. Thus, say with
    15 words per line, 50 lines per page, 100 pages per document,
    you will need to store:
      4 x 15 x 50 x 100 = 300,000 coordinates/document
2. Also, how are you going to get the coordinates in the first
    place?
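
The estimate above can be reproduced with a few lines of Python. The function and parameter names are illustrative; the four values per word are the x and y of each of the two corners.

```python
def coordinates_per_document(words_per_line, lines_per_page, pages_per_document,
                             values_per_word=4):
    """Number of coordinate values stored for one document.

    Each word needs two corners, i.e. four numbers (x1, y1, x2, y2).
    """
    words = words_per_line * lines_per_page * pages_per_document
    return values_per_word * words

per_document = coordinates_per_document(15, 50, 100)
words_per_document = per_document // 4
```

With these assumed page sizes, one document has 75,000 words and therefore 300,000 coordinate values.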

Regards,
Gora

RE: How to use Solr in my project

Posted by Fatima Issawi <is...@qu.edu.qa>.

> What do you mean by "word location"? The number on the page? What
> purpose would this serve?

I mean the (x, y) coordinates of the word on the page. We want to be able to highlight the image of the word that was extracted from the text.

> I think that you might be confusing things:
> * If you have the full-text, you can highlight where the word was found. Solr
>   highlighting handles this for you, and there is no need to store word location
> * You can have different images (presumably, individual scanned pages) linked
>   to different sections of text, and show the entire image. Highlighting in the
>   image is not possible, unless by "word location" you mean the (x, y)
>   coordinates of the word on the page. Even then:
>   - It will be prohibitively expensive to store the location of every word in
>     every image for a large number of documents
>   - Some image processing will be required to handle the highlighting after
>     the scanned image is retrieved

We will have the full text stored, but we want to highlight the text in the original image. I expect to process the image after retrieval. We do plan on storing the (x, y) coordinates of the words in a database - I suspected that it would be too expensive to store them in Solr. I guess I'm still confused about how to use Solr to index the document, but then retrieve the (x, y) coordinates of the search term from the database. Is this possible? If it is, can you give an example of how this can be done?

Thank you!

RE: How to use Solr in my project

Posted by Fatima Issawi <is...@qu.edu.qa>.

Hello,

Our pages are images of handwritten text in Arabic so OCR'ing is not possible. We will be extracting the text during pre-processing and storing the words and (x, y) coordinates in a database. Would your process apply to our images?

> Step 1:
> For sending the extracted text content from the text PDF to Solr, use a
> low-level PDF converter such as poppler-utils (pdftotext or pdftohtml) to
> correctly get the coordinates and page number of each word. Store this in a
> separate file as a word map. This word map will map the occurrence number
> of each word to its page and coordinates.

Can we generate a word map manually? Is it used by Solr, and does it require a specific format?

> Step 2:
> The Solr highlighter needs to be changed to return the words and their
> occurrence numbers in the text document, rather than the character offsets
> for each hit.

How is this done? I read the solr highlighting wiki, but don't see how this can be done.

> Step 3:
> Combine the Solr output with the word map created in step 1, and the PDF
> page and coordinates can be generated for the original PDF document, which
> can then be highlighted by any viewer.

Can I get more information about how to do this?

Thanks!

Re: How to use Solr in my project

Posted by Gopal Agarwal <go...@gmail.com>.

Highlighting can be done as a three-step process:

Prerequisite: Get the PDF with text after OCR of the image PDF.

Step 1:
For sending the extracted text content from the text PDF to Solr, use a
low-level PDF converter such as poppler-utils (pdftotext or pdftohtml) to
correctly get the coordinates and page number of each word. Store this in a
separate file as a word map. This word map will map the occurrence number
of each word to its page and coordinates.

Step 2:
The Solr highlighter needs to be changed to return the words and their
occurrence numbers in the text document, rather than the character offsets
for each hit.

Step 3:
Combine the Solr output with the word map created in step 1, and the PDF
page and coordinates can be generated for the original PDF document, which
can then be highlighted by any viewer.
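
Steps 1 and 3 could be sketched in Python roughly as follows. The word map format is not specified in this message, so the in-memory layout here, keyed by (word, occurrence number), is an assumption. So is the form of the hits, which stand in for whatever a modified highlighter from step 2 would return.

```python
def build_word_map(extracted_words):
    """Map (word, occurrence number) -> (page, bounding box).

    `extracted_words` is a list of (page, word, box) tuples in reading order.
    Occurrence numbers start at 1 and count repeats across the whole document.
    """
    counts = {}
    word_map = {}
    for page, word, box in extracted_words:
        counts[word] = counts.get(word, 0) + 1
        word_map[(word, counts[word])] = (page, box)
    return word_map

def locate_hits(word_map, hits):
    """Resolve (word, occurrence) hits to (page, box) pairs, skipping unknown hits."""
    return [word_map[hit] for hit in hits if hit in word_map]

# Made-up extraction output: (page, word, (x1, y1, x2, y2)).
extracted = [
    (1, "solr", (10, 10, 50, 20)),
    (1, "index", (60, 10, 110, 20)),
    (2, "solr", (10, 40, 50, 50)),
]
word_map = build_word_map(extracted)

# Stand-in for step 2: the second occurrence of "solr" was matched.
regions = locate_hits(word_map, [("solr", 2)])
```

Here the hit resolves to page 2 and the box (10, 40, 50, 50), which is what a viewer would need to draw the highlight.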

We have successfully implemented this for our own application.

Thanks,
Gopal


Re: How to use Solr in my project

Posted by Gora Mohanty <go...@mimirtech.com>.

On 26 December 2013 15:44, Fatima Issawi <is...@qu.edu.qa> wrote:
> Hi,
>
> I should clarify. We have another application extracting the text from the document. The full text from each document will be stored in a database either at the document level or page level (this hasn't been decided yet). We will also be storing word location of each word on the page in the database.

What do you mean by "word location"? The number on the page? What purpose
would this serve?

> What I'm having problems with is deciding on the schema. We want a user to be able to search for a word in the database, and get a list of documents that the word is located in, along with the location in the document where the word is found. When the user selects a search result, we want the scanned picture to have that word highlighted on the page.
[...]

I think that you might be confusing things:
* If you have the full-text, you can highlight where the word was found. Solr
  highlighting handles this for you, and there is no need to store word location
* You can have different images (presumably, individual scanned pages) linked
  to different sections of text, and show the entire image. Highlighting in the
  image is not possible, unless by "word location" you mean the (x, y)
  coordinates of the word on the page. Even then:
  - It will be prohibitively expensive to store the location of every word in
    every image for a large number of documents
  - Some image processing will be required to handle the highlighting after
    the scanned image is retrieved

Regards,
Gora

RE: How to use Solr in my project

Posted by Fatima Issawi <is...@qu.edu.qa>.

Hi,

I should clarify. We have another application extracting the text from the document. The full text from each document will be stored in a database either at the document level or page level (this hasn't been decided yet). We will also be storing word location of each word on the page in the database. 

What I'm having problems with is deciding on the schema. We want a user to be able to search for a word in the database, and get a list of documents that the word is located in, along with the location in the document where the word is found. When the user selects a search result, we want the scanned picture to have that word highlighted on the page.

I want to index the document using Solr, but I'm having trouble figuring out how to design the schema to return that "word location" of a search term on the scanned picture in order to highlight it.

Does this make more sense?

Fatima


Re: How to use Solr in my project

Posted by Gora Mohanty <go...@mimirtech.com>.

On 26 December 2013 10:54, Fatima Issawi <is...@qu.edu.qa> wrote:
> Hello,
>
> First off, I apologize if this was sent twice. I was having issues subscribing to the list.
>
> I'm a complete noob in Solr (and indexing), so I'm hoping someone can help me figure out how to implement Solr in my project. I have gone through some tutorials online and I was able to import and query text in some Arabic PDF documents.
>
> We have some scans of Historical Handwritten Arabic documents that will have text extracted into a database (or PDF). We would like the user to be able to search the document for text, then have the scanned image show up in a viewer with the text highlighted.

This will not work for scanned images which do not actually contain the
text. If you have the text of the documents, the best that you can do is
break the text into pages corresponding to the scanned images, and
index into Solr the text from the pages and the scanned image that should
be linked to the text. For a user search, you will need to show the scanned
image for the entire page: Highlighting of the search term in an image is not
possible without optical character recognition (OCR).
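
A minimal Python sketch of the page-level approach described above: one record per page, holding the page text and a link to the corresponding scanned image. The field names, the "id" format, and the image path layout are all hypothetical.

```python
def make_page_documents(doc_id, page_texts, image_dir="scans"):
    """Build one record per page: the page text plus a link to its scanned image."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}_p{page_number}",  # hypothetical unique key
            "doc_id": doc_id,
            "page": page_number,
            "text": text,                      # the searchable page text
            # Assumed layout of the scanned images on the filesystem.
            "image_path": f"{image_dir}/{doc_id}/page_{page_number:04d}.png",
        })
    return docs

page_docs = make_page_documents("doc1", ["first page text", "second page text"])
```

A search hit on the text field then identifies which page image to show in full.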

Similarly, if you are indexing from PDFs, you will need to ensure that they
contain text, and not just images.

Regards,
Gora