Posted to solr-user@lucene.apache.org by Daniel Rosher <ro...@googlemail.com> on 2008/11/20 15:20:37 UTC

solr.WordDelimiterFilterFactory

Hi,

I'm trying to index some content that has things like 'java/J2EE' but with
solr.WordDelimiterFilterFactory and parameters [generateWordParts="1"
generateNumberParts="0" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0"] this ends up tokenized as
'java','j','2','EE'.

Does anyone know a way of having this tokenized as 'java','j2ee'?

Perhaps this filter needs something like a protected list of tokens not to
tokenize, like EnglishPorterFilter has?

Cheers,
Dan

Re: solr.WordDelimiterFilterFactory

Posted by Yonik Seeley <yo...@apache.org>.

On Thu, Nov 20, 2008 at 9:20 AM, Daniel Rosher <ro...@googlemail.com> wrote:
> I'm trying to index some content that has things like 'java/J2EE' but with
> solr.WordDelimiterFilterFactory and parameters [generateWordParts="1"
> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"] this ends up tokenized as
> 'java','j','2','EE'
>
> Does anyone know a way of having this tokenized as 'java','j2ee'?
>
> Perhaps this filter needs something like a protected list of tokens not to
> tokenize, like EnglishPorterFilter has?

In addition to the other replies, you could use the SynonymFilter to
normalize certain terms before the WDF (assuming you want to keep the
WDF for other things).

Perhaps try the following synonym rules at both index and query time:

j2ee => javatwoee
java/j2ee => java javatwoee
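
To illustrate why these rules help, here is a minimal Python sketch of the
two stages (a model, not Solr code). It assumes a whitespace tokenizer
upstream, so that java/J2EE reaches the synonym stage as a single token, and
case-insensitive synonym matching. The word_delimiter function is a
hypothetical stand-in that only splits on non-alphanumerics and letter/digit
boundaries; it ignores the generate*/catenate* flags.

```python
import re

# Synonym rules from the message above, keyed by lower-cased input token.
SYNONYMS = {
    "j2ee": ["javatwoee"],
    "java/j2ee": ["java", "javatwoee"],
}

def synonym_filter(tokens):
    """Replace each token that has a rule; pass everything else through."""
    out = []
    for tok in tokens:
        out.extend(SYNONYMS.get(tok.lower(), [tok]))
    return out

def word_delimiter(tokens):
    """Hypothetical stand-in for the WDF: emit runs of letters or digits."""
    out = []
    for tok in tokens:
        out.extend(re.findall(r"[A-Za-z]+|[0-9]+", tok))
    return out

tokens = "java/J2EE".split()  # whitespace tokenizer (assumed)
without_synonyms = word_delimiter(tokens)
with_synonyms = word_delimiter(synonym_filter(tokens))
print(without_synonyms)  # ['java', 'J', '2', 'EE']
print(with_synonyms)     # ['java', 'javatwoee']
```

Because javatwoee contains no digits or delimiters, the splitting stage has
nothing to break apart. Applying the same rules at query time maps a query
for j2ee onto the same term.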

-Yonik

Re: solr.WordDelimiterFilterFactory

Posted by Mike Klaas <mi...@gmail.com>.

On 20-Nov-08, at 6:20 AM, Daniel Rosher wrote:

> Hi,
>
> I'm trying to index some content that has things like 'java/J2EE' but with
> solr.WordDelimiterFilterFactory and parameters [generateWordParts="1"
> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"] this ends up tokenized as
> 'java','j','2','EE'
>
> Does anyone know a way of having this tokenized as 'java','j2ee'?
>
> Perhaps this filter needs something like a protected list of tokens not to
> tokenize, like EnglishPorterFilter has?

That's a possibility.  Another is to add code that prevents short tokens
from being generated, and to use catenateAll=true.

-Mike

Re: solr.WordDelimiterFilterFactory

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm trying to index some content that has things like 'java/J2EE' but with
: solr.WordDelimiterFilterFactory and parameters [generateWordParts="1"
: generateNumberParts="0" catenateWords="0" catenateNumbers="0"
: catenateAll="0" splitOnCaseChange="0"] this ends up tokenized as
: 'java','j','2','EE'
: 
: Does anyone know a way of having this tokenized as 'java','j2ee'?

WDF was really designed around the assumption that if java/j2ee was 
something like a product sku, people might query it as javaj2ee 
or java-j2ee or java-j-2-ee, or java/j2-ee etc...

for more generic text, you may want to use a tokenizer that splits on "/" 
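
As a rough illustration of that idea (a Python model, not a Solr tokenizer),
the sketch below splits on whitespace and "/" and then lower-cases. The
lower-casing step and the absence of a WDF afterwards are assumptions.

```python
import re

def tokenize(text):
    """Split on whitespace and '/', then lower-case each token."""
    return [tok.lower() for tok in re.split(r"[\s/]+", text) if tok]

print(tokenize("java/J2EE developer"))  # ['java', 'j2ee', 'developer']
```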




-Hoss

Re: Phrase query search with stopwords

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: Phrase query search with stopwords
: In-Reply-To: <1c...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss

RE: Using Solr with Hadoop ....

Posted by souravm <SO...@infosys.com>.

Yonik,

I already tried with around 200M docs on a desktop-type box with 2 GB of memory. Simple queries (like getting data for a date range, queries without wildcards, etc.) work fine, with response times of 10-20 seconds, provided the number of records hit is low (within a couple of thousand docs). However, sorting does not work there because of the memory limitation. I'm also sure any complex query (involving processing like group by, unique, etc.) would be challenging to handle and would perform badly.

So given all this, I thought exploiting HDFS and MapReduce might be worthwhile, combining Solr/Lucene's indexing power with Hadoop's parallel processing capability.

Regards,
Sourav

**************** CAUTION - Disclaimer *****************
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely 
for the use of the addressee(s). If you are not the intended recipient, please 
notify the sender by e-mail and delete the original message. Further, you are not 
to copy, disclose, or distribute this e-mail or its contents to any other person and 
any such actions are unlawful. This e-mail may contain viruses. Infosys has taken 
every reasonable precaution to minimize this risk, but is not liable for any damage 
you may sustain as a result of any virus in this e-mail. You should carry out your 
own virus checks before opening the e-mail or attachment. Infosys reserves the 
right to monitor and review the content of all messages sent to or from this e-mail 
address. Messages sent to or from this e-mail address may be stored on the 
Infosys e-mail system.
***INFOSYS******** End of Disclaimer ********INFOSYS***

Re: Using Solr with Hadoop ....

Posted by Yonik Seeley <ys...@gmail.com>.

Ah sorry, I had misread your original post.  3-6M docs per hour can be
challenging.
Using the CSV loader, I've indexed 4000 docs per second (14M per hour)
on a 2.6GHz Athlon, but they were relatively simple and small docs.

On Fri, Nov 28, 2008 at 9:54 PM, souravm <SO...@infosys.com> wrote:
> There is a case where, at peak season, I'm expecting around 36M docs per day, with the hourly rate peaking at 2-3M per hour. I need to do some processing of those docs before I index them. Based on the indexing performance figures I saw in http://wiki.apache.org/solr/SolrPerformanceFactors (the embedded vs. HTTP post section), it looks like it would take more than 2 hours to index 3M records using 4 machines. So I thought it would be difficult to achieve my goal through Solr alone, and that I need something else to further increase the parallel processing.
>
> Altogether, the targeted number of docs would average around 3B (the size would be around 300 GB).

You definitely need distributed search.  Don't try to search this on a
single box.

> Docs would be constantly added and deleted on a daily basis, at an average rate of 8M per day,
> peaking at 36M. Considering around 10 boxes, every box would need to store around 250M docs.

250M docs per box is probably too high, even for distributed search,
unless your query throughput and latency requirements are very low.
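
The figures in this exchange can be checked with a little arithmetic. This
is only a back-of-the-envelope sketch in Python: the even split across boxes
and the even spread over 24 hours are assumptions, and real throughput
depends on document size and analysis cost.

```python
SECONDS_PER_HOUR = 3600

# Rate reported above for the CSV loader (simple, small docs).
csv_docs_per_hour = 4000 * SECONDS_PER_HOUR

# Peak load from the quoted message: up to 3M docs per hour.
required_docs_per_sec = 3_000_000 / SECONDS_PER_HOUR

# 36M docs on a peak day, spread evenly over 24 hours (assumption).
avg_docs_per_hour_on_peak_day = 36_000_000 / 24

# Roughly 3B docs split evenly over 10 boxes (assumption).
docs_per_box = 3_000_000_000 // 10

print(csv_docs_per_hour)              # 14400000
print(round(required_docs_per_sec))   # 833
print(avg_docs_per_hour_on_peak_day)  # 1500000.0
print(docs_per_box)                   # 300000000
```

So a 3M-per-hour peak needs roughly 833 docs per second, well below the 4000
per second quoted for simple docs, while an even split of 3B docs gives about
300M per box, a little above the 250M mentioned.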

-Yonik

RE: Using Solr with Hadoop ....

Posted by souravm <SO...@infosys.com>.

Hi Yonik,

There is a case where, at peak season, I'm expecting around 36M docs per day, with the hourly rate peaking at 2-3M per hour. I need to do some processing of those docs before I index them. Based on the indexing performance figures I saw in http://wiki.apache.org/solr/SolrPerformanceFactors (the embedded vs. HTTP post section), it looks like it would take more than 2 hours to index 3M records using 4 machines. So I thought it would be difficult to achieve my goal through Solr alone, and that I need something else to further increase the parallel processing.

Altogether, the targeted number of docs would average around 3B (the size would be around 300 GB). Docs would be constantly added and deleted on a daily basis, at an average rate of 8M per day, peaking at 36M. Considering around 10 boxes, every box would need to store around 250M docs.

Regards,
Sourav

Re: Using Solr with Hadoop ....

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

On Sat, Nov 29, 2008 at 7:26 PM, Jon Baer <jo...@gmail.com> wrote:
> HadoopEntityProcessor for the DIH?
Reading data from Hadoop with DIH could be really cool.
There are a few very useful entity processors that are badly needed. The
most useful one would be a TikaEntityProcessor.

But I do not see it solving the scalability problem (the original post).
>
> I've wondered about this, as they make HadoopCluster LiveCDs and EC2 has
> images, but the best way to make use of them is always a challenge.
>
> - Jon
>
> On Nov 29, 2008, at 3:34 AM, Erik Hatcher wrote:
>
>>
>> On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:
>>>
>>> Or, it would be relatively trivial to write a Lucene program
>>> to merge the indexes.
>>
>> FYI, such a tool exists in Lucene's API already:
>>
>>
>>  <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/misc/IndexMergeTool.html>
>>
>>        Erik
>>
>
>



-- 
--Noble Paul

Re: Using Solr with Hadoop ....

Posted by Jon Baer <jo...@gmail.com>.

HadoopEntityProcessor for the DIH?

I've wondered about this, as they make HadoopCluster LiveCDs and EC2
has images, but the best way to make use of them is always a challenge.

- Jon

On Nov 29, 2008, at 3:34 AM, Erik Hatcher wrote:

>
> On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:
>> Or, it would be relatively trivial to write a Lucene program
>> to merge the indexes.
>
> FYI, such a tool exists in Lucene's API already:
>
>  <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/misc/IndexMergeTool.html>
>
> 	Erik
>

Re: Using Solr with Hadoop ....

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:
> Or, it would be relatively trivial to write a Lucene program
> to merge the indexes.

FYI, such a tool exists in Lucene's API already:

   <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/misc/IndexMergeTool.html>

	Erik

Re: Using Solr with Hadoop ....

Posted by Yonik Seeley <yo...@apache.org>.

The indexing rate you need to achieve should be equal to the rate that
new documents are produced.  It shouldn't matter much how long it
takes to index 3-6M documents the first time (within reason), given
that you only need to do it once/occasionally.  What is that rate
(i.e. why do you think you can't do it on a single box)?

For the scale of documents you are talking about, hadoop sounds like
it would complicate things more than simplify them.

There is a pending Solr patch for using custom IndexReader factories
that could easily open multiple indexes to search across (no optimize
needed).  Or, it would be relatively trivial to write a Lucene program
to merge the indexes.  You could also leave the indexes on multiple
boxes and use Solr's distributed search to search across them
(assuming you didn't really need everything on a single box).

-Yonik

RE: Using Solr with Hadoop ....

Posted by souravm <SO...@infosys.com>.

Hi Yonik,

Let me explain why I thought using Hadoop would help achieve parallel indexing better.

Here is the set of requirements and constraints:

1. The 3-6M documents (around 300 to 600 MB in size) would belong to the same schema.
2. The resulting index of those 3-6M documents has to reside on a single box (the target box).
3. I have to use desktop-grade servers with limited RAM (say 2 GB maximum) and a single CPU, but with ample disk space (above 100 GB).

Now if I try to index the 3-6M records by running a single thread on each of those servers, the steps are:

1. Create an index on each of the N boxes.
2. Merge those indexes on the target box.
3. Optimize the resulting index on the target box.

In the Hadoop way, what I need to do is:

1. Use those 'N' servers to create the HDFS.
2. Copy the raw data (3-6M records) to the HDFS.
3. Then use Map/Reduce to index those documents and optimize.

I think that in this way the index merging and optimization time would be less, as those steps would not be limited by my single server's CPU and memory; instead, through Map/Reduce, they would happen on multiple boxes, using their CPUs and memory in parallel. As far as I know, this is how Rackspace implemented Solr's integration with Hadoop and benefited from it. But I realize that this integration is not available as open source.

Also, please let me know if there is any other option within Solr to reduce indexing time in my case, given the limited capabilities of the servers.

Regards,
Sourav

Re: Using Solr with Hadoop ....

Posted by Yonik Seeley <yo...@apache.org>.

While future Solr-hadoop integration is a definite possibility (and
will enable other cool stuff), it doesn't necessarily seem needed for
the problem you are trying to solve.

> indexing them in parallel is not an option as my target doc size per hr itself can be very huge (3-6M)

I'm not sure I understand... the bigger the indexing job, the more it
makes sense to do in parallel.  If you're not doing any link inversion
for web search, it doesn't seem like hadoop is needed for parallelism.
 If you are doing web crawling, perhaps look to nutch, not hadoop.

-Yonik

Using Solr with Hadoop ....

Posted by souravm <SO...@infosys.com>.

Hi All,

I have a huge number of documents to index (say, per hour), and I cannot complete the job within an hour using a single machine. Distributing them across multiple boxes and indexing them in parallel is not an option, as my target doc count per hour can itself be very large (3-6M). So I am considering using HDFS and MapReduce to finish the indexing job in time.

In that regard, I have the following questions about using Solr with Hadoop.

1. After creating the index using Hadoop, would storing it in HDFS again for querying mean additional performance overhead (compared to storing it on an actual disk on one machine)?

2. What kind of change is needed to make a Solr query read from an index that is stored in HDFS?

Regards,
Sourav

Re: Phrase query search with stopwords

Posted by Yonik Seeley <yo...@apache.org>.

See https://issues.apache.org/jira/browse/SOLR-879
We never enabled position increments in the query parser.
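
A rough way to see the effect, as a Python sketch rather than Lucene's actual
implementation: the index analyzer output quoted further down leaves a gap
where "by" was removed (positions 1, 2, 4, 5), while a query parser that
ignores position increments asks for four consecutive positions. The
phrase_matches helper and its simplified slop rule are assumptions made for
illustration.

```python
from itertools import product

def phrase_matches(doc_positions, query_terms, slop=0):
    """Simplified phrase check.

    doc_positions maps term -> list of positions in the document.
    query_terms is a list of (term, position-in-query) pairs.
    A combination matches when every term is shifted by the same amount
    relative to the query, allowing a spread of at most `slop`.
    """
    try:
        candidates = [doc_positions[term] for term, _ in query_terms]
    except KeyError:
        return False  # a query term is missing from the document
    offsets = [qpos for _, qpos in query_terms]
    for combo in product(*candidates):
        shifts = [d - q for d, q in zip(combo, offsets)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False

# Indexed positions from the analysis page: "by" was dropped at position 3.
doc = {"three": [1], "film": [2], "loui": [4], "mall": [5]}
terms = ["three", "film", "loui", "mall"]

query_without_gaps = list(zip(terms, [0, 1, 2, 3]))  # increments ignored
query_with_gaps = list(zip(terms, [0, 1, 3, 4]))     # increments honored

print(phrase_matches(doc, query_without_gaps))          # False
print(phrase_matches(doc, query_without_gaps, slop=1))  # True
print(phrase_matches(doc, query_with_gaps))             # True
```

This is consistent with the report that the exact phrase finds nothing while
the same phrase with ~1 finds the records.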

-Yonik

On Mon, Nov 24, 2008 at 9:48 PM, Yonik Seeley <yo...@apache.org> wrote:
> Ack!  I tried it too, and it failed for me also.
> The analysis page indicates that the tokens are all in the same
> positions... need to look into this deeper.
> Could you open up a JIRA issue?
>
> -Yonik
>
> On Mon, Nov 24, 2008 at 5:58 PM, Robert Haschart <rh...@virginia.edu> wrote:
>> Yonik,
>>
>> I did make sure enablePositionIncrements="true"  for both indexing and
>> queries and just did a test where I  re-indexed a couple of test record
>> sets, and submitted a query from the solr admin page, this time searching
>> for  title_text:"gone with the wind"  which should return three hits, and
>> again it returns 0 hits.
>>
>> I also tried modifying SolrQueryParser to set setEnablePositionIncrements
>> to true, thinking that would fix the problem, but it doesn't seem to.
>>
>>
>> -Bob
>>
>>
>> Yonik Seeley wrote:
>>
>>> Robert,
>>>
>>> I've reproduced (sort of) this bad behavior with the example schema.
>>> There was an example configuration "bug" introduced in SOLR-521
>>> where enablePositionIncrements="true" was only set on the index
>>> analyzer but not the query analyzer for the "text" fieldType.
>>>
>>> A query on the example data of
>>> features:"Optimized for High Volume Web Traffic"
>>> will not match any documents.
>>>
>>> You seem to indicate that enablePositionIncrements="true" is set for
>>> both your index and query analyzer.  Can you verify that, and verify
>>> that you restarted solr and reindexed after that change was made?
>>>
>>> -Yonik
>>>
>>>
>>>
>>> On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <rh...@virginia.edu>
>>> wrote:
>>>
>>>>
>>>> Greetings all,
>>>>
>>>> I'm having trouble tracking down why a particular query is not working.
>>>> A user is trying to do a search for
>>>> alternate_form_title_text:"three films by louis malle"
>>>> specifically to find the 4 records that contain the phrase
>>>> "Three films by Louis Malle" in their alternate_form_title_text field.
>>>> However, the search returns 0 records.
>>>>
>>>> The modified searches:
>>>>
>>>> alternate_form_title_text:"three films by louis malle"~1
>>>>
>>>> or
>>>>
>>>> alternate_form_title_text:"three films" AND alternate_form_title_text:"louis malle"
>>>>
>>>> both return the 4 records.   So it seems that the word "by", which is
>>>> listed in the stopword filter list, is causing the problem.
>>>>
>>>> The analyzer/filter sequence for indexing the alternate_form_title_text
>>>> field is _almost_ exactly the same as the sequence for querying that field.
>>>>
>>>> for indexing the sequence is:
>>>>
>>>> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
>>>> schema.UnicodeNormalizationFilterFactory   {composed=false, remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>>> schema.CJKFilterFactory   {bigrams=false}
>>>> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>> org.apache.solr.analysis.WordDelimiterFilterFactory   {generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>>> org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>>> org.apache.solr.analysis.EnglishPorterFilterFactory   {protected=protwords.txt}
>>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>>>
>>>> for querying the sequence is:
>>>>
>>>> org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
>>>> schema.UnicodeNormalizationFilterFactory {composed=false,
>>>> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>>> schema.CJKFilterFactory   {bigrams=false}
>>>> org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
>>>> expand=true, ignoreCase=true}
>>>> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>>> ignoreCase=true, enablePositionIncrements=true}
>>>>
>>>> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>>> catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
>>>> org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>>> org.apache.solr.analysis.EnglishPorterFilterFactory
>>>> {protected=protwords.txt}
>>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>>>
>>>>
>>>> If I run a test through the field analysis admin page, submitting the
>>>> string *three films by louis malle* through both the Field value (Index)
>>>> and the Field value (query), the results (shown below) seem to indicate
>>>> that the query ought to find the 4 records in question, but it does not,
>>>> and I'm at a loss to explain why.
>>>>
>>>>
>>>>   Index Analyzer
>>>>
>>>> term position   1       2       4       5
>>>> term text       three   film    loui    mall
>>>> term type       word    word    word    word
>>>> source start,end        0,5     6,11    15,20   21,26
>>>>
>>>>
>>>>
>>>>   Query Analyzer
>>>>
>>>> term position   1       2       4       5
>>>> term text       three   film    loui    mall
>>>> term type       word    word    word    word
>>>> source start,end        0,5     6,11    15,20   21,26
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>

Re: Phrase query search with stopwords

Posted by Yonik Seeley <yo...@apache.org>.

Ack!  I tried it too, and it failed for me also.
The analysis page indicates that the tokens are all in the same
positions... need to look into this deeper.
Could you open up a JIRA issue?

-Yonik

On Mon, Nov 24, 2008 at 5:58 PM, Robert Haschart <rh...@virginia.edu> wrote:
> Yonik,
>
> I did make sure enablePositionIncrements="true" is set for both indexing
> and queries. I then re-indexed a couple of test record sets and submitted a
> query from the Solr admin page, this time searching for
> title_text:"gone with the wind", which should return three hits, and again
> it returns 0 hits.
>
> I also tried modifying SolrQueryParser to set setEnablePositionIncrements
> to true, thinking that would fix the problem, but it doesn't seem to.
>
>
> -Bob
>
>
> Yonik Seeley wrote:
>
>> Robert,
>>
>> I've reproduced (sort of) this bad behavior with the example schema.
>> There was an example configuration "bug" introduced in SOLR-521
>> where enablePositionIncrements="true" was only set on the index
>> analyzer but not the query analyzer for the "text" fieldType.
>>
>> A query on the example data of
>> features:"Optimized for High Volume Web Traffic"
>> will not match any documents.
>>
>> You seem to indicate that enablePositionIncrements="true" is set for
>> both your index and query analyzer.  Can you verify that, and verify
>> that you restarted solr and reindexed after that change was made?
>>
>> -Yonik
>>
>>
>>
>> On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <rh...@virginia.edu>
>> wrote:
>>
>>>
>>> Greetings all,
>>>
>>> I'm having trouble tracking down why a particular query is not working.
>>> A user is trying to do a search for
>>> alternate_form_title_text:"three films by louis malle"
>>> specifically to find the 4 records that contain the phrase
>>> "Three films by Louis Malle" in their alternate_form_title_text field.
>>> However, the search returns 0 records.
>>>
>>> The modified searches:
>>>
>>> alternate_form_title_text:"three films by louis malle"~1
>>>
>>> or
>>>
>>> alternate_form_title_text:"three films" AND
>>> alternate_form_title_text:"louis
>>> malle"
>>>
>>> both return the 4 records.   So it seems that the word "by", which is
>>> listed in the stopword filter list, is causing the problem.
>>>
>>> The analyzer/filter sequence for indexing the alternate_form_title_text
>>> field is _almost_ exactly the same as the sequence for querying that
>>> field.
>>>
>>> for indexing the sequence is:
>>>
>>> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
>>> schema.UnicodeNormalizationFilterFactory {composed=false,
>>> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>> schema.CJKFilterFactory   {bigrams=false}
>>> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>> ignoreCase=true, enablePositionIncrements=true}
>>>
>>> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>> catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>> org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>> org.apache.solr.analysis.EnglishPorterFilterFactory
>>> {protected=protwords.txt}
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>>
>>> for querying the sequence is:
>>>
>>> org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
>>> schema.UnicodeNormalizationFilterFactory {composed=false,
>>> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>> schema.CJKFilterFactory   {bigrams=false}
>>> org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
>>> expand=true, ignoreCase=true}
>>> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>> ignoreCase=true, enablePositionIncrements=true}
>>>
>>> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>> catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
>>> org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>> org.apache.solr.analysis.EnglishPorterFilterFactory
>>> {protected=protwords.txt}
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>>
>>>
>>> If I run a test through the field analysis admin page, submitting the
>>> string *three films by louis malle* through both the Field value (Index)
>>> and the Field value (query), the results (shown below) seem to indicate
>>> that the query ought to find the 4 records in question, but it does not,
>>> and I'm at a loss to explain why.
>>>
>>>
>>>   Index Analyzer
>>>
>>> term position   1       2       4       5
>>> term text       three   film    loui    mall
>>> term type       word    word    word    word
>>> source start,end        0,5     6,11    15,20   21,26
>>>
>>>
>>>
>>>   Query Analyzer
>>>
>>> term position   1       2       4       5
>>> term text       three   film    loui    mall
>>> term type       word    word    word    word
>>> source start,end        0,5     6,11    15,20   21,26
>>>
>>>
>>>
>>>
>>>
>
>
>

Re: Phrase query search with stopwords

Posted by Robert Haschart <rh...@virginia.edu>.

Yonik,

I did make sure enablePositionIncrements="true" is set for both indexing
and queries. I then re-indexed a couple of test record sets and submitted
a query from the Solr admin page, this time searching for
title_text:"gone with the wind", which should return three hits, and
again it returns 0 hits.

I also tried modifying SolrQueryParser to set
setEnablePositionIncrements to true, thinking that would fix the problem,
but it doesn't seem to.


-Bob


Yonik Seeley wrote:

>Robert,
>
>I've reproduced (sort of) this bad behavior with the example schema.
>There was an example configuration "bug" introduced in SOLR-521
>where enablePositionIncrements="true" was only set on the index
>analyzer but not the query analyzer for the "text" fieldType.
>
>A query on the example data of
>features:"Optimized for High Volume Web Traffic"
>will not match any documents.
>
>You seem to indicate that enablePositionIncrements="true" is set for
>both your index and query analyzer.  Can you verify that, and verify
>that you restarted solr and reindexed after that change was made?
>
>-Yonik
>
>
>
>On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <rh...@virginia.edu> wrote:
>  
>
>>Greetings all,
>>
>>I'm having trouble tracking down why a particular query is not working.   A
>>user is trying to do a search for alternate_form_title_text:"three films by
>>louis malle"  specifically to find the 4 records that contain the phrase
>>"Three films by Louis Malle" in their alternate_form_title_text field.
>>However, the search returns 0 records.
>>
>>The modified searches:
>>
>>alternate_form_title_text:"three films by louis malle"~1
>>
>>or
>>
>>alternate_form_title_text:"three films" AND alternate_form_title_text:"louis
>>malle"
>>
>>both return the 4 records.   So it seems that the word "by", which is
>>listed in the stopword filter list, is causing the problem.
>>
>>The analyzer/filter sequence for indexing the alternate_form_title_text
>>field is _almost_ exactly the same as the sequence for querying that field.
>>
>>for indexing the sequence is:
>>
>>org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
>>schema.UnicodeNormalizationFilterFactory {composed=false,
>>remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>schema.CJKFilterFactory   {bigrams=false}
>>org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>ignoreCase=true, enablePositionIncrements=true}
>>org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>org.apache.solr.analysis.EnglishPorterFilterFactory
>>{protected=protwords.txt}
>>org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>
>>for querying the sequence is:
>>
>>org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
>>schema.UnicodeNormalizationFilterFactory {composed=false,
>>remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
>>schema.CJKFilterFactory   {bigrams=false}
>>org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
>>expand=true, ignoreCase=true}
>>org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
>>ignoreCase=true, enablePositionIncrements=true}
>>org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
>>catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
>>org.apache.solr.analysis.LowerCaseFilterFactory   {}
>>org.apache.solr.analysis.EnglishPorterFilterFactory
>>{protected=protwords.txt}
>>org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>>
>>
>>If I run a test through the field analysis admin page, submitting the
>>string *three films by louis malle* through both the Field value (Index) and
>>the Field value (query), the results (shown below) seem to indicate that the
>>query ought to find the 4 records in question, but it does not, and I'm at a
>>loss to explain why.
>>
>>
>>    Index Analyzer
>>
>>term position   1       2       4       5
>>term text       three   film    loui    mall
>>term type       word    word    word    word
>>source start,end        0,5     6,11    15,20   21,26
>>
>>
>>
>>    Query Analyzer
>>
>>term position   1       2       4       5
>>term text       three   film    loui    mall
>>term type       word    word    word    word
>>source start,end        0,5     6,11    15,20   21,26
>>
>>
>>
>>
>>    
>>

Re: Phrase query search with stopwords

Posted by Yonik Seeley <yo...@apache.org>.

Robert,

I've reproduced (sort of) this bad behavior with the example schema.
There was an example configuration "bug" introduced in SOLR-521
where enablePositionIncrements="true" was only set on the index
analyzer but not the query analyzer for the "text" fieldType.

A query on the example data of
features:"Optimized for High Volume Web Traffic"
will not match any documents.

You seem to indicate that enablePositionIncrements="true" is set for
both your index and query analyzer.  Can you verify that, and verify
that you restarted solr and reindexed after that change was made?

-Yonik
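
To illustrate the index/query mismatch described above, here is a minimal
Python sketch. It is a toy model, not Solr or Lucene code: the function
stop_filter and its parameters are made up for illustration, and it only
assumes that a stop filter with position increments enabled leaves a gap
where each removed stopword stood, while one without them closes the gap.

```python
def stop_filter(tokens, stopwords, enable_position_increments):
    """Drop stopwords and return (term, position) pairs.

    Positions are 1-based, like the "term position" row on the
    Solr analysis page. This is a simplified model, not Solr code.
    """
    out = []
    position = 0
    for term in tokens:
        if term.lower() in stopwords:
            if enable_position_increments:
                position += 1  # keep a hole where the stopword was
            continue
        position += 1
        out.append((term.lower(), position))
    return out


tokens = "three films by louis malle".split()
stopwords = {"by"}  # assumed to be in stopwords.txt, as in the report

index_side = stop_filter(tokens, stopwords, enable_position_increments=True)
query_side = stop_filter(tokens, stopwords, enable_position_increments=False)

print([pos for _, pos in index_side])  # [1, 2, 4, 5]
print([pos for _, pos in query_side])  # [1, 2, 3, 4]
```

If one side keeps the gap and the other does not, an exact phrase query
would be looking for "louis" one position earlier than where it was
indexed, which is the kind of mismatch the example schema had.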



On Thu, Nov 20, 2008 at 1:30 PM, Robert Haschart <rh...@virginia.edu> wrote:
> Greetings all,
>
> I'm having trouble tracking down why a particular query is not working.   A
> user is trying to do a search for alternate_form_title_text:"three films by
> louis malle"  specifically to find the 4 records that contain the phrase
> "Three films by Louis Malle" in their alternate_form_title_text field.
> However, the search returns 0 records.
>
> The modified searches:
>
> alternate_form_title_text:"three films by louis malle"~1
>
> or
>
> alternate_form_title_text:"three films" AND alternate_form_title_text:"louis
> malle"
>
> both return the 4 records.   So it seems that the word "by", which is
> listed in the stopword filter list, is causing the problem.
>
> The analyzer/filter sequence for indexing the alternate_form_title_text
> field is _almost_ exactly the same as the sequence for querying that field.
>
> for indexing the sequence is:
>
> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
> schema.UnicodeNormalizationFilterFactory {composed=false,
> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
> schema.CJKFilterFactory   {bigrams=false}
> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
> ignoreCase=true, enablePositionIncrements=true}
> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
> catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
> org.apache.solr.analysis.LowerCaseFilterFactory   {}
> org.apache.solr.analysis.EnglishPorterFilterFactory
> {protected=protwords.txt}
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>
> for querying the sequence is:
>
> org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
> schema.UnicodeNormalizationFilterFactory {composed=false,
> remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
> schema.CJKFilterFactory   {bigrams=false}
> org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt,
> expand=true, ignoreCase=true}
> org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt,
> ignoreCase=true, enablePositionIncrements=true}
> org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1,
> catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
> org.apache.solr.analysis.LowerCaseFilterFactory   {}
> org.apache.solr.analysis.EnglishPorterFilterFactory
> {protected=protwords.txt}
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}
>
>
> If I run a test through the field analysis admin page, submitting the
> string *three films by louis malle* through both the Field value (Index) and
> the Field value (query), the results (shown below) seem to indicate that the
> query ought to find the 4 records in question, but it does not, and I'm at a
> loss to explain why.
>
>
>     Index Analyzer
>
> term position   1       2       4       5
> term text       three   film    loui    mall
> term type       word    word    word    word
> source start,end        0,5     6,11    15,20   21,26
>
>
>
>     Query Analyzer
>
> term position   1       2       4       5
> term text       three   film    loui    mall
> term type       word    word    word    word
> source start,end        0,5     6,11    15,20   21,26
>
>
>
>

Phrase query search with stopwords

Posted by Robert Haschart <rh...@virginia.edu>.

Greetings all,

I'm having trouble tracking down why a particular query is not 
working.   A user is trying to do a search for 
alternate_form_title_text:"three films by louis malle"  specifically to 
find the 4 records that contain the phrase "Three films by Louis Malle" 
in their alternate_form_title_text field. 

However, the search returns 0 records.

The modified searches:

alternate_form_title_text:"three films by louis malle"~1

or

alternate_form_title_text:"three films" AND 
alternate_form_title_text:"louis malle"

both return the 4 records.   So it seems that the word "by", which is
listed in the stopword filter list, is causing the problem.

The analyzer/filter sequence for indexing the alternate_form_title_text 
field is _almost_ exactly the same as the sequence for querying that field.

for indexing the sequence is:

org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory   {}
schema.UnicodeNormalizationFilterFactory {composed=false, remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
schema.CJKFilterFactory   {bigrams=false}
org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
org.apache.solr.analysis.LowerCaseFilterFactory   {}
org.apache.solr.analysis.EnglishPorterFilterFactory   {protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}

for querying the sequence is:

org.apache.solr.analysis.WhitespaceTokenizerFactory   {}
schema.UnicodeNormalizationFilterFactory {composed=false, remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
schema.CJKFilterFactory   {bigrams=false}
org.apache.solr.analysis.SynonymFilterFactory   {synonyms=synonyms.txt, expand=true, ignoreCase=true}
org.apache.solr.analysis.StopFilterFactory   {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
org.apache.solr.analysis.WordDelimiterFilterFactory{generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
org.apache.solr.analysis.LowerCaseFilterFactory   {}
org.apache.solr.analysis.EnglishPorterFilterFactory   {protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory   {}


If I run a test through the field analysis admin page, submitting the
string *three films by louis malle* through both the Field value (Index)
and the Field value (query), the results (shown below) seem to indicate
that the query ought to find the 4 records in question, but it does not,
and I'm at a loss to explain why.


      Index Analyzer

term position 	1 	2 	4 	5
term text 	three 	film 	loui 	mall
term type 	word 	word 	word 	word
source start,end 	0,5 	6,11 	15,20 	21,26



      Query Analyzer

term position 	1 	2 	4 	5
term text 	three 	film 	loui 	mall
term type 	word 	word 	word 	word
source start,end 	0,5 	6,11 	15,20 	21,26
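
The positions in the two tables above can be fed into a toy model of
phrase matching. The sketch below is Python, not Lucene code, and
phrase_matches and required_slop are hypothetical helpers. It rests on an
assumption that this thread does not confirm: that the phrase query
actually sent to the index uses consecutive term positions (ignoring the
gap left by "by"), even though the analysis page shows the gap.

```python
def phrase_matches(doc_positions, terms, offsets):
    """Exact phrase match: each term must sit at base + its offset."""
    for start in doc_positions.get(terms[0], []):
        base = start - offsets[0]
        if all(base + off in doc_positions.get(term, [])
               for term, off in zip(terms, offsets)):
            return True
    return False


def required_slop(doc_positions, terms, offsets):
    """Simplified slop estimate, assuming each term occurs exactly once.

    This is a stand-in for sloppy phrase scoring, not Lucene's
    actual implementation.
    """
    shifts = [doc_positions[t][0] - off for t, off in zip(terms, offsets)]
    return max(shifts) - min(shifts)


# Indexed positions taken from the "Index Analyzer" table.
doc = {"three": [1], "film": [2], "loui": [4], "mall": [5]}
terms = ["three", "film", "loui", "mall"]

gap_preserving = [0, 1, 3, 4]  # query keeps the hole left by "by"
gap_ignoring = [0, 1, 2, 3]    # query treats the terms as adjacent

print(phrase_matches(doc, terms, gap_preserving))  # True
print(phrase_matches(doc, terms, gap_ignoring))    # False
print(required_slop(doc, terms, gap_ignoring))     # 1
```

Under that assumption the exact phrase fails while a slop of 1 is enough
to bridge the gap, which would be consistent with the "~1" query above
returning the 4 records.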