Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2024/04/08 16:29:27 UTC

Document chunking

Not sure we should jump on the bandwagon, but anything we can do to support
smart chunking would benefit us.

Could just be more integrations with parsers that turn out to be useful. I
haven’t had much joy with some of them. Here’s one that I haven’t evaluated yet:
https://github.com/Filimoa/open-parse

Re: Document chunking

Posted by Nicholas DiPiazza <ni...@gmail.com>.
I am also very interested in this vector-based search. Indexes are a big
thing right now.

On Mon, Apr 8, 2024, 4:16 PM Michael Wechner <mi...@wyona.com>
wrote:

> It would be great to have good "semantic chunking" in order to generate
> vector embeddings.
>
> Thanks for the link below, will try to test it.
>
> Thanks
>
> Michael
>
>
>
> On 08.04.24 at 18:29, Tim Allison wrote:
> > Not sure we should jump on the bandwagon, but anything we can do to
> support
> > smart chunking would benefit us.
> >
> > Could just be more integrations with parsers that turn out to be useful.
> I
> > haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
> > https://github.com/Filimoa/open-parse
> >
>
>

Re: Document chunking

Posted by Michael Wechner <mi...@wyona.com>.
It would be great to have good "semantic chunking" in order to generate 
vector embeddings.

Thanks for the link below; I will try to test it.

Thanks

Michael



On 08.04.24 at 18:29, Tim Allison wrote:
> Not sure we should jump on the bandwagon, but anything we can do to support
> smart chunking would benefit us.
>
> Could just be more integrations with parsers that turn out to be useful. I
> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
> https://github.com/Filimoa/open-parse
>


Re: Document chunking

Posted by Tim Allison <ta...@apache.org>.
My $0.02...

1) It is important that we do what we can to make it easy for people
to integrate Tika into the dense-vector/LLM/RAG landscape. I see A LOT
of projects reinventing the wheel (without multi-parser full recursion
like we have), or just running pdftotext and declaring victory. So, if
we can implement Nick's algorithm in a submodule, great!

2) No matter how ingenious Nick's algorithm is, different people will
want different behavior. We barely have enough resources to keep the
project going (IMHO), let alone to handle requests from people who want
captions included as part of the figure, etc., in our chunking
algorithm. So we have to be careful about what we promise.

3) I think it should be a _core_ capability of Tika to emit the
XHTML structural markup across all file formats, as we do now for
"paragraph"-based file formats (e.g. office and other documents).
Where things get murky is in PDF remediation (guessing the structure
from a rendering, or from x,y coordinates and font information) and/or
OCR processing of images. The tech is getting better and better at
this, and we need to integrate whatever makes sense, likely as an
external parser.
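As a rough illustration of point 3, here is a minimal sketch of walking that XHTML structural markup and collecting the elements a downstream chunker would key on. The `StructureCollector` class and the sample XHTML string are invented for this sketch; real input would be Tika's XHTML output for a document, whose element names (h1..h6, p, div, td, th) this follows.

```python
# Minimal sketch: walking Tika-style XHTML and collecting the structural
# elements (headings, paragraphs, table cells) that a chunker would key on.
# StructureCollector and the sample XHTML are invented for this illustration.
from html.parser import HTMLParser

class StructureCollector(HTMLParser):
    """Records (tag, text) pairs for structural elements, in document order."""
    STRUCTURAL = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "td", "th"}

    def __init__(self):
        super().__init__()
        self.elements = []   # collected (tag, text) pairs
        self._open = []      # stack of currently open structural tags
        self._buf = []       # text seen inside the innermost structural tag

    def handle_starttag(self, tag, attrs):
        if tag in self.STRUCTURAL:
            self._open.append(tag)
            self._buf = []

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._open and tag == self._open[-1]:
            self._open.pop()
            text = "".join(self._buf).strip()
            if text:
                self.elements.append((tag, text))
            self._buf = []

sample = """<html><body>
<h1>Report</h1>
<h2>Background</h2>
<p>Some prose.</p>
</body></html>"""

collector = StructureCollector()
collector.feed(sample)
print(collector.elements)   # [('h1', 'Report'), ('h2', 'Background'), ('p', 'Some prose.')]
```

A fuller version would handle nested structural elements (e.g. p inside div) rather than resetting the buffer per tag, but the flat (tag, text) stream is enough for heading-based chunking.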

On Tue, Apr 9, 2024 at 7:48 AM Eric Pugh
<ep...@opensourceconnections.com> wrote:
>
> Your approach sounds great as well Nick….
>
> > On Apr 9, 2024, at 2:21 AM, Michael Wechner <mi...@wyona.com> wrote:
> >
> > Thanks for sharing your approach!
> >
> > Do you already have some code to share?
> >
> > Today I read about https://github.com/infiniflow/ragflow which might also have some interesting chunking approaches.
> >
> > Thanks
> >
> > Michael
> >
> > On 09.04.24 at 01:25, Nick Burch wrote:
> >> On Mon, 8 Apr 2024, Tim Allison wrote:
> >>> Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us.
> >>>
> >>> Could just be more integrations with parsers that turn out to be useful. I
> >>> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
> >>> https://github.com/Filimoa/open-parse
> >>
> >> I played around with chunking a bit late last year, but owing to not getting any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can say that most people are doing a terrible job in their out-of-the-box configs...
> >>
> >> My current suggested (but not fully tested) approach is:
> >>  * Define a range of chunk sizes that you'd like (min / ideal / max)
> >>  * Parse as XHTML with Tika
> >>  * Keep track of headings and table headers
> >>  * Break on headings
> >>  * If a chunk is too big, break on other elements (e.g. div or p)
> >>  * If a chunk is too small, and near other small chunks, join them
> >>  * Include 1-2 headings above the current one at the top,
> >>    as a targeted bit of Table of Contents. (e.g. chunk starts on H3, put
> >>    the H2 in as well)
> >>  * If you broke up a huge table, repeat the table headers at the
> >>    start of every chunk
> >>  * When you're done chunking + adding bits back at the top, convert
> >>    to markdown on output
> >>
> >> Happy to explain more! But sadly lacking time right now to do much on that
> >>
> >> Nick
> >
>
>

Re: Document chunking

Posted by Eric Pugh <ep...@opensourceconnections.com>.
Your approach sounds great as well, Nick….

> On Apr 9, 2024, at 2:21 AM, Michael Wechner <mi...@wyona.com> wrote:
> 
> Thanks for sharing your approach!
> 
> Do you already have some code to share?
> 
> Today I read about https://github.com/infiniflow/ragflow which might also have some interesting chunking approaches.
> 
> Thanks
> 
> Michael
> 
> On 09.04.24 at 01:25, Nick Burch wrote:
>> On Mon, 8 Apr 2024, Tim Allison wrote:
>>> Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us.
>>> 
>>> Could just be more integrations with parsers that turn out to be useful. I
>>> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
>>> https://github.com/Filimoa/open-parse
>> 
>> I played around with chunking a bit late last year, but owing to not getting any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can say that most people are doing a terrible job in their out-of-the-box configs...
>> 
>> My current suggested (but not fully tested) approach is:
>>  * Define a range of chunk sizes that you'd like (min / ideal / max)
>>  * Parse as XHTML with Tika
>>  * Keep track of headings and table headers
>>  * Break on headings
>>  * If a chunk is too big, break on other elements (e.g. div or p)
>>  * If a chunk is too small, and near other small chunks, join them
>>  * Include 1-2 headings above the current one at the top,
>>    as a targeted bit of Table of Contents. (e.g. chunk starts on H3, put
>>    the H2 in as well)
>>  * If you broke up a huge table, repeat the table headers at the
>>    start of every chunk
>>  * When you're done chunking + adding bits back at the top, convert
>>    to markdown on output
>> 
>> Happy to explain more! But sadly lacking time right now to do much on that
>> 
>> Nick
> 

_______________________
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: Document chunking

Posted by Michael Wechner <mi...@wyona.com>.
Thanks for sharing your approach!

Do you already have some code to share?

Today I read about https://github.com/infiniflow/ragflow which might 
also have some interesting chunking approaches.

Thanks

Michael

On 09.04.24 at 01:25, Nick Burch wrote:
> On Mon, 8 Apr 2024, Tim Allison wrote:
>> Not sure we should jump on the bandwagon, but anything we can do to 
>> support smart chunking would benefit us.
>>
>> Could just be more integrations with parsers that turn out to be 
>> useful. I
>> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
>> https://github.com/Filimoa/open-parse
>
> I played around with chunking a bit late last year, but owing to not
> getting any of the AI jobs I went for, I didn't get it beyond a rough
> prototype. I can say that most people are doing a terrible job in their
> out-of-the-box configs...
>
> My current suggested (but not fully tested) approach is:
>  * Define a range of chunk sizes that you'd like (min / ideal / max)
>  * Parse as XHTML with Tika
>  * Keep track of headings and table headers
>  * Break on headings
>  * If a chunk is too big, break on other elements (e.g. div or p)
>  * If a chunk is too small, and near other small chunks, join them
>  * Include 1-2 headings above the current one at the top,
>    as a targeted bit of Table of Contents. (e.g. chunk starts on H3, put
>    the H2 in as well)
>  * If you broke up a huge table, repeat the table headers at the
>    start of every chunk
>  * When you're done chunking + adding bits back at the top, convert
>    to markdown on output
>
> Happy to explain more! But sadly lacking time right now to do much on 
> that
>
> Nick


Re: Document chunking

Posted by Nick Burch <ni...@apache.org>.
On Mon, 8 Apr 2024, Tim Allison wrote:
> Not sure we should jump on the bandwagon, but anything we can do to 
> support smart chunking would benefit us.
>
> Could just be more integrations with parsers that turn out to be useful. I
> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
> https://github.com/Filimoa/open-parse

I played around with chunking a bit late last year, but owing to not
getting any of the AI jobs I went for, I didn't get it beyond a rough
prototype. I can say that most people are doing a terrible job in their
out-of-the-box configs...

My current suggested (but not fully tested) approach is:
  * Define a range of chunk sizes that you'd like (min / ideal / max)
  * Parse as XHTML with Tika
  * Keep track of headings and table headers
  * Break on headings
  * If a chunk is too big, break on other elements (e.g. div or p)
  * If a chunk is too small, and near other small chunks, join them
  * Include 1-2 headings above the current one at the top,
    as a targeted bit of Table of Contents. (e.g. chunk starts on H3, put
    the H2 in as well)
  * If you broke up a huge table, repeat the table headers at the
    start of every chunk
  * When you're done chunking + adding bits back at the top, convert
    to markdown on output

Happy to explain more! But sadly lacking time right now to do much on that

Nick
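Nick's outline can be sketched in code. The following is a minimal, untested illustration, not anything Tika ships: all names (`chunk_elements`, the size constants) are invented; it prepends only one parent heading rather than 1-2; and the table-header-repetition and markdown-conversion steps are omitted for brevity. Input is a flat list of (tag, text) pairs such as an XHTML parse of the document might yield.

```python
# Sketch of the heading-based chunking outline above. All names are
# invented for illustration; input is a flat list of (tag, text) pairs
# from an XHTML parse. Table-header repetition is omitted for brevity.

MIN_SIZE, MAX_SIZE = 200, 1000   # characters; tune for your embedding model
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def chunk_size(chunk):
    return sum(len(text) for _, text in chunk)

def chunk_elements(elements):
    """elements: list of (tag, text) pairs. Returns a list of chunk strings."""
    chunks = []            # (parent_heading_or_None, [(tag, text), ...])
    current = []
    trail = {}             # heading level -> latest heading text seen
    parent = None          # heading one level above the current chunk's start

    def flush():
        nonlocal current
        if current:
            chunks.append((parent, current))
            current = []

    for tag, text in elements:
        if tag in HEADINGS:
            flush()                                 # break on headings
            level = int(tag[1])
            trail[level] = text
            for deeper in [l for l in trail if l > level]:
                del trail[deeper]                   # forget deeper headings
            parent = trail.get(level - 1)           # an H3 chunk keeps its H2
        current.append((tag, text))
        if chunk_size(current) > MAX_SIZE:          # too big: break on any element
            flush()
    flush()

    # join runs of adjacent too-small chunks
    merged = []
    for par, chunk in chunks:
        if merged and chunk_size(chunk) < MIN_SIZE and chunk_size(merged[-1][1]) < MIN_SIZE:
            merged[-1][1].extend(chunk)
        else:
            merged.append((par, list(chunk)))

    # prepend the parent heading as a targeted bit of table of contents
    return ["\n".join(([par] if par else []) + [text for _, text in chunk])
            for par, chunk in merged]
```

For continuation chunks produced by a size-based split, a fuller version would also repeat the current section heading and any table headers, per the last bullets of the outline.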