Posted to java-user@lucene.apache.org by Michael Sokolov <ms...@gmail.com> on 2018/07/03 12:00:37 UTC

WordDelimiterGraphFilter swallows emojis

WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
like punctuation and thus remove them, but we would like to be able to
search for emoji and use this filter for handling dashes, dots and other
intra-word punctuation.

These filters identify non-word and non-digit characters by two mechanisms:
direct lookup in a character table, and fallback to Unicode class. The
character table can't easily be used to handle emoji since it would need to
be populated with the entire Unicode character set in order to reach
emoji-land. On the other hand, if we change the handling of emoji by class,
and say treat them as word-characters, this will also end up pulling in all
the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
some of these other symbols are more like punctuation (this class is a grab
bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
https://www.compart.com/en/unicode/category/So). On the other other hand,
how do we even identify emoji? I don't think the Java Character API is
adequate to the task. Perhaps we must incorporate a table.
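
To make the general-category problem concrete, here is a small JDK-only sketch (the class and method names are made up for illustration). It asks the Java Character API which sample characters are OTHER_SYMBOL. An emoji and the trademark sign get the same answer, so the category alone cannot single out emoji.

```java
// Illustrative sketch only; GeneralCategoryDemo and isOtherSymbol are hypothetical names.
final class GeneralCategoryDemo {

    // True if the JDK assigns this code point the Unicode general category "So".
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        // GRINNING FACE, TRADE MARK SIGN, DEGREE SIGN, COPYRIGHT SIGN
        int[] samples = {0x1F600, 0x2122, 0x00B0, 0x00A9};
        for (int cp : samples) {
            System.out.printf("U+%04X %s otherSymbol=%b%n",
                    cp, Character.getName(cp), isOtherSymbol(cp));
        }
    }
}
```

Treating the whole class as word characters would therefore change the handling of the trademark, degree and copyright signs along with the emoji.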

Suppose we come up with a good way to classify emoji; then how should they
be treated in this class? Sometimes they may be embedded in tokens with
other characters: I see people using emoji and other symbols as part of
their names, and sometimes they stand alone (with whitespace separation). I
think one way forward here would be to treat these as a special class akin
to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
CATENATE_EMOJI) as we have for those classes.

Or maybe as a convenience, we provide a way to get a table that encodes the
default classifications of all characters up to some given limit, and then
let the caller modify it? That would at least provide an easy way to treat
emoji as letters.

Any thoughts?

Re: WordDelimiterGraphFilter swallows emojis

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Jul 3, 2018 at 8:00 AM, Michael Sokolov <ms...@gmail.com> wrote:
> WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> like punctuation and thus remove them, but we would like to be able to
> search for emoji and use this filter for handling dashes, dots and other
> intra-word punctuation.
>
> These filters identify non-word and non-digit characters by two mechanisms:
> direct lookup in a character table, and fallback to Unicode class. The
> character table can't easily be used to handle emoji since it would need to
> be populated with the entire Unicode character set in order to reach
> emoji-land. On the other hand, if we change the handling of emoji by class,
> and say treat them as word-characters, this will also end up pulling in all
> the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
> some of these other symbols are more like punctuation (this class is a grab
> bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
> https://www.compart.com/en/unicode/category/So). On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.

There are several Unicode properties for handling emoji (see e.g. the Unicode
segmentation algorithms, and the tagging function in ICUTokenizer), but
they are not based on general category. Additionally, an emoji may not be a
single character but a sequence, so it's more involved than what
WordDelimiterFilter is really ready for. I also don't think we should
start storing/maintaining Unicode property tables ourselves; if we
want to fix WordDelimiterFilter, it should just depend on ICU instead.

> Suppose we come up with a good way to classify emoji; then how should they
> be treated in this class? Sometimes they may be embedded in tokens with
> other characters: I see people using emoji and other symbols as part of
> their names, and sometimes they stand alone (with whitespace separation). I
> think one way forward here would be to treat these as a special class akin
> to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> CATENATE_EMOJI) as we have for those classes.
>
> Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it? That would at least provide an easy way to treat
> emoji as letters.

There is already a way to provide a table to this thing. But one
bigger issue is that WordDelimiterFilter doesn't operate on Unicode
code points, so I don't think you are going to be able to do what you
want, since most emoji are not in the BMP. WordDelimiterFilter is
really only suitable for categorizing characters in the BMP; it just
doesn't split surrogates.
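
The BMP point can be illustrated with the JDK alone. This is a minimal sketch with hypothetical class and method names, not code from WordDelimiterFilter. It shows that an emoji outside the BMP occupies two Java chars, and that classifying the first char on its own yields SURROGATE rather than a symbol category.

```java
// Illustrative sketch only; SurrogateDemo and its methods are hypothetical names.
final class SurrogateDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // General category seen when classifying the first char on its own.
    static int typeOfFirstChar(String s) {
        return Character.getType(s.charAt(0));
    }

    // General category seen when classifying the first whole code point.
    static int typeOfFirstCodePoint(String s) {
        return Character.getType(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String grin = new String(Character.toChars(0x1F600)); // GRINNING FACE
        System.out.println("chars=" + charCount(grin)
                + " codePoints=" + codePointCount(grin));
        System.out.println("first char is a surrogate: "
                + (typeOfFirstChar(grin) == Character.SURROGATE));
        System.out.println("first code point is OTHER_SYMBOL: "
                + (typeOfFirstCodePoint(grin) == Character.OTHER_SYMBOL));
    }
}
```

Any per-char lookup table, however large, only ever sees the two surrogate halves of such a character.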

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: WordDelimiterGraphFilter swallows emojis

Posted by Michael Sokolov <ms...@gmail.com>.

Ah I see -- there is \p{Emoji} to start with, which is nice, but also this
extended pictographic -- I'll read more, and get back if I have questions.
It might be a little while before I dig into this though. Thanks again.

On Tue, Jul 3, 2018 at 11:25 AM Robert Muir <rc...@gmail.com> wrote:

> If you customized the rules, maybe have a look at
> https://issues.apache.org/jira/browse/LUCENE-8366
>
> The rules got simpler and we also updated the customization example
> used for the factory's test.

Re: WordDelimiterGraphFilter swallows emojis

Posted by Robert Muir <rc...@gmail.com>.

If you customized the rules, maybe have a look at
https://issues.apache.org/jira/browse/LUCENE-8366

The rules got simpler and we also updated the customization example
used for the factory's test.

On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov <ms...@gmail.com> wrote:
> Yes that sounds good -- this ConditionalTokenFilter is going to be very
> helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke
> around and see about incorporating the emoji rules from there.  Thanks
> Robert

Re: WordDelimiterGraphFilter swallows emojis

Posted by Michael Sokolov <ms...@gmail.com>.

Yes that sounds good -- this ConditionalTokenFilter is going to be very
helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke
around and see about incorporating the emoji rules from there.  Thanks
Robert

On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rc...@gmail.com> wrote:

> > Any thoughts?
>
> best idea I have would be to tokenize with ICUTokenizer, which will
> tag emoji sequences as "<EMOJI>" token type, then use
> ConditionalTokenFilter to send all tokens EXCEPT those with token type
> of  "<EMOJI>" to your WordDelimiterFilter. This way
> WordDelimiterFilter never sees the emoji at all and can't screw them
> up.

Re: Size of Document

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: Size of Document
: To: java-user@lucene.apache.org
: References:
:     <CA...@mail.gmail.com>
:  <CA...@mail.gmail.com>
: Message-ID: <be...@bammers.net>
: In-Reply-To:
:     <CA...@mail.gmail.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.




-Hoss
http://www.lucidworks.com/

Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.

It was called IndexWriter.ramSizeInBytes() in 4.10.3.

On Wed, Jul 4, 2018 at 15:35, Chris Bamford <ch...@bammers.net> wrote:

> > IndexWriter.ramBytesUsed() gives you access to the current memory usage of
> > IndexWriter's buffers, but it can't tell you by how much it increased for a
> > given document assuming concurrent access to the IndexWriter.
> >
> Thanks, although I can’t find that API. Is there an equivalent call for
> Lucene 4.10.3?

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.

> IndexWriter.ramBytesUsed() gives you access to the current memory usage of
> IndexWriter's buffers, but it can't tell you by how much it increased for a
> given document assuming concurrent access to the IndexWriter.
> 
Thanks, although I can’t find that API. Is there an equivalent call for Lucene 4.10.3?



Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.

IndexWriter.ramBytesUsed() gives you access to the current memory usage of
IndexWriter's buffers, but it can't tell you by how much it increased for a
given document assuming concurrent access to the IndexWriter.
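
The before-and-after idea, and its caveat, can be sketched without Lucene. This is only an illustration: `BufferDeltaSketch`, `deltaAround` and the fake buffer are hypothetical names. In a real program the gauge would wrap the `IndexWriter.ramBytesUsed()` call mentioned above, and the write would be the document addition.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Illustrative sketch only; BufferDeltaSketch and deltaAround are hypothetical names.
final class BufferDeltaSketch {

    // Reads the gauge, runs the write, reads the gauge again, returns the difference.
    // If other threads write (or the buffer is flushed) in between, the result
    // includes their effect too, so it is not a per-document figure.
    static long deltaAround(LongSupplier gauge, Runnable write) {
        long before = gauge.getAsLong();
        write.run();
        return gauge.getAsLong() - before;
    }

    public static void main(String[] args) {
        // Stand-in for the writer's buffer size, in bytes.
        AtomicLong fakeBuffer = new AtomicLong(1_000);
        long delta = deltaAround(fakeBuffer::get, () -> fakeBuffer.addAndGet(42));
        System.out.println("delta=" + delta);
    }
}
```

With a single indexing thread the difference approximates the buffered cost of one document. It can even go negative if the buffer shrinks between the two reads, and it says nothing about the eventual size on disk.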

On Wed, Jul 4, 2018 at 15:13, Chris Bamford <ch...@bammers.net> wrote:

> Hello Adrien,
>
>
> >
> > There is no way to compute the byte size of a document.
>
> I feared that!
>
> > Also note that the
> > relationship between the size of a document and how much space it will use
> > in the Lucene index is quite complex.
> >
> I understand. I was wondering if there was maybe some sneaky way of
> peeking inside the IndexWriter before and after a write to compare buffer
> sizes?
>
> Thanks
> Chris

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.

Hello Adrien,


> 
> There is no way to compute the byte size of a document.

I feared that!

> Also note that the
> relationship between the size of a document and how much space it will use
> in the Lucene index is quite complex.
> 
I understand. I was wondering if there was maybe some sneaky way of peeking inside the IndexWriter before and after a write to compare buffer sizes?

Thanks 
Chris

Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.

Hello,

There is no way to compute the byte size of a document. Also note that the
relationship between the size of a document and how much space it will use
in the Lucene index is quite complex.

On Wed, Jul 4, 2018 at 11:26, Chris and Helen Bamford <ch...@bammers.net> wrote:

> Hi there,
>
> How can I calculate the total size of a Lucene Document that I'm about
> to write to an index so I know how many bytes I am writing please?  I
> need it for some external metrics collection.
>
> Thanks
>
> - Chris

Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.

For the record, this is made even more complex by the fact that the disk
footprint of a document depends on other documents that are indexed nearby
in the same segment, and can change over merges.

On Thu, Jul 5, 2018 at 08:22, Chris Bamford <ch...@bammers.net> wrote:

> Yes I see, I originally missed Terry’s response which is probably the
> source of the confusion.
>
> So to clarify: I already know the size of the source document. As you say,
> this bears little resemblance to what actually gets written when indexed.
> It is this latter figure I was hoping to get.
>
> Thanks everyone.
>
> Chris

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.

Yes I see, I originally missed Terry’s response which is probably the source of the confusion.

So to clarify: I already know the size of the source document. As you say, this bears little resemblance to what actually gets written when indexed. It is this latter figure I was hoping to get.

Thanks everyone.

Chris



> On 5 Jul 2018, at 03:31, Erick Erickson <er...@gmail.com> wrote:
> 
> I think we're not talking about the same thing.
> 
> You asked "How can I calculate the total size of a Lucene Document"...
> 
> I was responding to the Terry's comment "In the document types I
> usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
> called "stream_size" that contains the size of the document on disk. "
> 
> Two totally different beasts. One is the source document, the other is
> what you choose to put into the index from that document. Not to even
> mention that you could, for instance, choose to index only the title
> and throw everything else away so the size of the raw document on disk
> doesn't seem useful for your case.
> 
> Best,
> Erick

Re: Size of Document

Posted by Erick Erickson <er...@gmail.com>.

I think we're not talking about the same thing.

You asked "How can I calculate the total size of a Lucene Document"...

I was responding to Terry's comment "In the document types I
usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
called "stream_size" that contains the size of the document on disk. "

Two totally different beasts. One is the source document, the other is
what you choose to put into the index from that document. Not to even
mention that you could, for instance, choose to index only the title
and throw everything else away so the size of the raw document on disk
doesn't seem useful for your case.

Best,
Erick

On Wed, Jul 4, 2018 at 9:24 AM, Chris Bamford <ch...@bammers.net> wrote:
> Hi Erick
>
> Yes, size on disk is what I’m after as it will feed into an eventual calculation regarding actual bytes written (not interested in the source data document size, just real disk usage).
> Thanks
>
> Chris
>
> Sent from my iPhone

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.

Hi Erick

Yes, size on disk is what I’m after as it will feed into an eventual calculation regarding actual bytes written (not interested in the source data document size, just real disk usage).
Thanks 

Chris

Sent from my iPhone

> On 4 Jul 2018, at 17:08, Erick Erickson <er...@gmail.com> wrote:
> 
> But does size on disk help? If the doc has a zillion
> images in it, those aren't part of the resulting index
> (I'm excluding stored data here)....

Re: Size of Document

Posted by Erick Erickson <er...@gmail.com>.

But does size on disk help? If the doc has a zillion
images in it, those aren't part of the resulting index
(I'm excluding stored data here)....

On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen <te...@net-frame.com> wrote:
> In the document types I usually index (.pdf, .docx/.doc, .eml), there
> exists a metadata field called "stream_size" that contains the size of
> the document on disk.  You don't have to compute it.  Thus, when you
> retrieve each document you can pull out the contents of this field and,
> if you like, include it in each hitlist entry.

Re: Size of Document

Posted by Terry Steichen <te...@net-frame.com>.

In the document types I usually index (.pdf, .docx/.doc, .eml), there
exists a metadata field called "stream_size" that contains the size of
the document on disk.  You don't have to compute it.  Thus, when you
retrieve each document you can pull out the contents of this field and,
if you like, include it in each hitlist entry.

On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
> Hi there,
>
> How can I calculate the total size of a Lucene Document that I'm about
> to write to an index so I know how many bytes I am writing please?  I
> need it for some external metrics collection.
>
> Thanks
>
> - Chris

Size of Document

Posted by Chris and Helen Bamford <ch...@bammers.net>.

Hi there,

How can I calculate the total size of a Lucene Document that I'm about 
to write to an index so I know how many bytes I am writing please?  I 
need it for some external metrics collection.

Thanks

- Chris

Re: WordDelimiterGraphFilter swallows emojis

Posted by Robert Muir <rc...@gmail.com>.

> Any thoughts?

The best idea I have is to tokenize with ICUTokenizer, which tags
emoji sequences with the "<EMOJI>" token type, and then use
ConditionalTokenFilter to send all tokens EXCEPT those with a token type
of "<EMOJI>" to your WordDelimiterFilter. This way
WordDelimiterFilter never sees the emoji at all and can't screw them
up.
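
A rough, untested sketch of that suggestion follows. It assumes Lucene 7.4
or later with the analyzers-common and analyzers-icu modules on the
classpath. The class names EmojiSafeAnalyzer and SkipEmojiFilter and the
choice of WordDelimiterGraphFilter flags are illustrative and not taken
from this thread.

```java
import java.io.IOException;
import java.util.function.Function;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Hypothetical analyzer: word-delimiter handling for everything except emoji. */
public class EmojiSafeAnalyzer extends Analyzer {

  /** Routes a token through the wrapped filter unless its type is "<EMOJI>". */
  static final class SkipEmojiFilter extends ConditionalTokenFilter {
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    SkipEmojiFilter(TokenStream input, Function<TokenStream, TokenStream> factory) {
      super(input, factory);
    }

    @Override
    protected boolean shouldFilter() throws IOException {
      // true  -> the current token is passed to the wrapped filter
      // false -> the current token bypasses it unchanged
      return !"<EMOJI>".equals(typeAtt.type());
    }
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new ICUTokenizer();
    // Flag selection is an assumption; use whatever your filter already uses.
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
        | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
        | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
    TokenStream stream = new SkipEmojiFilter(
        tokenizer, in -> new WordDelimiterGraphFilter(in, flags, null));
    return new TokenStreamComponents(tokenizer, stream);
  }
}
```

Because the filter checks only the token type, emoji that the tokenizer
emits as part of a larger token of some other type would still reach the
word delimiter filter.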

Re: WordDelimiterGraphFilter swallows emojis

Posted by Michael Sokolov <ms...@gmail.com>.

Thanks for the pointer.

On Tue, Jul 3, 2018 at 9:04 AM julien Blaize <ju...@gmail.com>
wrote:

> Hello Michael,
>
> I previously worked on emoji detection with Lucene.
>
> I had to extend the Tokenizer class (rather than TokenFilter, as
> WordDelimiterFilter does) to preserve the delimiter attribute.
> I also had to keep track of consecutive delimiters in the character stream,
> because Lucene's default implementation only keeps the last one.
>
> Looking at the Tokenizer instead of the TokenFilter may put you on the
> right track.
>
> By the way, I used the emoji list from this project to detect sequences of
> characters:
> https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
> I scan the character stream and keep tracking for as long as the sequence
> could still be an emoji; once I have a complete emoji, I put it in the
> CharTermAttribute so that it is treated as a word and not as a delimiter.
>
> Regards
> --
> Julien Blaize

Re: WordDelimiterGraphFilter swallows emojis

Posted by julien Blaize <ju...@gmail.com>.

Hello Michael,

I previously worked on emoji detection with Lucene.

I had to extend the Tokenizer class (rather than TokenFilter, as
WordDelimiterFilter does) to preserve the delimiter attribute.
I also had to keep track of consecutive delimiters in the character stream,
because Lucene's default implementation only keeps the last one.

Looking at the Tokenizer instead of the TokenFilter may put you on the
right track.

By the way, I used the emoji list from this project to detect sequences of
characters:
https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
I scan the character stream and keep tracking for as long as the sequence
could still be an emoji; once I have a complete emoji, I put it in the
CharTermAttribute so that it is treated as a word and not as a delimiter.
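
The table-driven detection described above could look roughly like the
sketch below. It is not the code from that project: it is a simplified
longest-match variant in plain Java, the class name EmojiSequenceScanner is
made up, and the three-entry table is a stand-in for a real emoji list.
Hooking the result into a Tokenizer and its CharTermAttribute is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: finds known emoji sequences in a string. */
public class EmojiSequenceScanner {

  // Sample table only (an assumption); a real one would be loaded from a list.
  static final Set<String> EMOJI = new HashSet<>(Arrays.asList(
      "\uD83D\uDE00",                 // U+1F600 grinning face
      "\uD83D\uDC4D",                 // U+1F44D thumbs up
      "\uD83D\uDC4D\uD83C\uDFFD"));   // thumbs up + U+1F3FD skin tone modifier

  // Longest table entry, measured in UTF-16 chars.
  static final int MAX_LEN =
      EMOJI.stream().mapToInt(String::length).max().orElse(0);

  /** Returns the emoji found in text, taking the longest match at each position. */
  static List<String> findEmoji(String text) {
    List<String> found = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      String match = null;
      int limit = Math.min(text.length(), i + MAX_LEN);
      // Try the longest candidate first so multi-code-point sequences win.
      for (int end = limit; end > i; end--) {
        String candidate = text.substring(i, end);
        if (EMOJI.contains(candidate)) {
          match = candidate;
          break;
        }
      }
      if (match != null) {
        found.add(match);
        i += match.length();
      } else {
        // No emoji starts here; skip one whole code point.
        i += Character.charCount(text.codePointAt(i));
      }
    }
    return found;
  }
}
```

Trying the longest candidate first keeps a base emoji and its modifier
together as a single match instead of splitting them.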

Regards
--
Julien Blaize


On Tue, Jul 3, 2018 at 14:00, Michael Sokolov <ms...@gmail.com> wrote:

> WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> like punctuation and thus remove them, but we would like to be able to
> search for emoji and use this filter for handling dashes, dots and other
> intra-word punctuation.
>
> These filters identify non-word and non-digit characters by two mechanisms:
> direct lookup in a character table, and fallback to Unicode class. The
> character table can't easily be used to handle emoji since it would need to
> be populated with the entire Unicode character set in order to reach
> emoji-land. On the other hand, if we change the handling of emoji by class,
> and say treat them as word-characters, this will also end up pulling in all
> the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
> some of these other symbols are more like punctuation (this class is a grab
> bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
> https://www.compart.com/en/unicode/category/So). On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.
>
> Suppose we come up with a good way to classify emoji; then how should they
> be treated in this class? Sometimes they may be embedded in tokens with
> other characters: I see people using emoji and other symbols as part of
> their names, and sometimes they stand alone (with whitespace separation). I
> think one way forward here would be to treat these as a special class akin
> to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> CATENATE_EMOJI) as we have for those classes.
>
> Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it? That would at least provide an easy way to treat
> emoji as letters.
>
> Any thoughts?
>