Intelligent Transcript Chunking
This article addresses a common problem in document chunking where speaker attribution is lost when processing long monologues.
Instead of using fixed-length prefixes, the approach described here captures speaker information and sufficient context while chunking transcripts, making the output more suitable for LLM processing.
Losing Speakers in Long Monologues
Consider a simple meeting transcript in which a change of speaker is indicated by a double line break. We want to split it into chunks, each sized under an arbitrary token limit and carrying a prefix from the previous chunk to provide contextual information.
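For illustration, assume the transcript and the token limits look something like the following. The speaker name and timestamp sit on their own line, a blank line separates speakers, and the limit values here are arbitrary examples rather than anything prescribed.
MAX_CHUNK_SIZE = 512      # example per-chunk token budget
MAX_PREFIX_TOKENS = 64    # example budget for the carried-over prefix

text = """Alice Smith 14:22
Let me walk you through our infrastructure design in detail. We've implemented a multi-layered approach that addresses several key requirements.

Bob Jones 14:25
Okay.

Alice Smith 14:26
First, we have our core service layer which handles all the primary business logic."""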
Using a library such as Chonkie, we can chunk the document with custom rules that preserve as much of the document structure as possible.
import tiktoken
from chonkie import RecursiveChunker, RecursiveLevel, RecursiveRules
from chonkie.refinery import OverlapRefinery

tokenizer = tiktoken.encoding_for_model("gpt-4o")
def create_transcript_rules():
    # First level: Split by speaker changes (double newline)
    speaker_level = RecursiveLevel(
        delimiters=["\n\n"],
        whitespace=False
    )
    # Second level: Split by single newlines within speaker blocks
    # This helps separate the speaker marker/timestamp from content
    timestamp_level = RecursiveLevel(
        delimiters=["\n", "\r\n"],
        whitespace=False
    )
    # Third level: Split by sentences within speaker blocks
    sentence_level = RecursiveLevel(
        delimiters=[".", "?", "!"],
        whitespace=False
    )
    # Fourth level: Split by natural pauses
    pause_level = RecursiveLevel(
        delimiters=[",", ";", ":", "-", "..."],
        whitespace=False
    )
    # Fifth level: Words (only used if needed)
    word_level = RecursiveLevel(
        delimiters=None,
        whitespace=True
    )
    # Final level: Token-based splits (only used if all above levels still result in too-long chunks)
    token_level = RecursiveLevel(
        delimiters=None,
        whitespace=False
    )
    return RecursiveRules(levels=[
        speaker_level,
        timestamp_level,
        sentence_level,
        pause_level,
        word_level,
        token_level
    ])
chunker = RecursiveChunker(
    tokenizer=tokenizer,
    chunk_size=MAX_CHUNK_SIZE,
    rules=create_transcript_rules(),
    min_characters_per_chunk=12,
)
chunks = chunker.chunk(text)

# Initialize the overlap refinery
refinery = OverlapRefinery(
    context_size=MAX_PREFIX_TOKENS,
    tokenizer=tokenizer,
    merge_context=False,
    approximate=False
)

# Refine chunks to add overlap
refined_chunks = refinery.refine(chunks)
Our first two chunks would look something like this.
Chunk N:
Alice Smith 14:22
Let me walk you through our infrastructure design in detail. We’ve implemented a multi-layered approach that addresses several key requirements. First, we have our core service layer which handles all the primary business logic. This connects to our data persistence layer, which we’ve carefully designed to handle both transactional and analytical workloads. The system uses a combination of relational databases for consistent ACID compliance and document stores for flexibility.
[chunk continues for several hundred tokens]
Chunk N+1 (including our prefix):
[Prefix: “…and this architecture ensures we maintain six-nines of availability across all regions. The redundancy patterns we’ve implemented mean that any single point of failure can be automatically”]
replicated, with automated failover procedures that activate within milliseconds of detecting any degradation in primary performance.
What’s wrong here?
Provide an LLM with our second chunk, and it will not be able to identify who is speaking or establish the context of why this information is being presented.
This is because we used a fixed-length prefix (via OverlapRefinery) to grab tokens from the previous chunk without any awareness of the document structure. As a result, both the chunk and its prefix lack speaker attribution, leaving us with disembodied text.
However, by extending the OverlapRefinery class, we can modify the overlap behaviour to adhere better to the document structure, helping us preserve critical speaker context.
Hierarchical Overlaps
When a speaker goes into a lengthy explanation that exceeds our chunk size, we can end up with contextless chunks. The default OverlapRefinery class in Chonkie tries to help by adding context, but its approach is token-centric rather than structure-aware:
refinery = OverlapRefinery(
    context_size=MAX_PREFIX_TOKENS,  # Fixed token length
    tokenizer=tokenizer,
    merge_context=False,
    approximate=False
)
Instead of doing this, I wanted to:
- Apply the same recursive rules to the prefix as we did to the chunk, but this time in reverse so that we always try to capture the last speaker.
- Enforce a minimum token size, so that we still retrieve sufficient context even if the previous chunk ended with a speaker simply saying “Okay”.
- If, even in the previous chunk, there is no identifiable speaker, return the entire chunk so that we can at least establish maximum context.
I implemented these features in a new child class called HierarchicalOverlapRefinery.
from typing import Any, Optional, Tuple

# Chunk and Context are Chonkie's data classes; the exact import path may vary by version
from chonkie.types import Chunk, Context

class HierarchicalOverlapRefinery(OverlapRefinery):
    def __init__(
        self,
        context_size: int = 128,
        min_tokens: int = 50,
        tokenizer: Any = None,
        rules: RecursiveRules = None,
        merge_context: bool = True,
        inplace: bool = True,
        approximate: bool = True,
    ) -> None:
        super().__init__(
            context_size=context_size,
            tokenizer=tokenizer,
            merge_context=merge_context,
            inplace=inplace,
            approximate=approximate
        )
        self.min_tokens = min_tokens
        self.rules = rules
        self.mode = "prefix"

    def _get_token_count(self, text: str) -> int:
        if hasattr(self, "tokenizer") and not self.approximate:
            return len(self.tokenizer.encode(text))
        return int(len(text) / self._AVG_CHAR_PER_TOKEN)

    def _find_minimal_speaker_boundary(self, text: str) -> Optional[Tuple[str, int]]:
        """Find the smallest speaker chunk from the end that meets min_tokens."""
        if not self.rules or not self.rules.levels:
            return None
        speaker_rule = self.rules.levels[0]
        if not speaker_rule.delimiters:
            return None
        # Split text by speaker delimiter
        for delimiter in speaker_rule.delimiters:
            # Split text preserving delimiters
            parts = text.split(delimiter)
            if len(parts) <= 1:
                continue
            # Build chunks from the end until we meet min_tokens
            current_text = ""
            accumulated_parts = []
            # Work backwards through parts
            for part in reversed(parts):
                if part:  # Skip empty parts
                    test_text = part + delimiter + current_text
                    token_count = self._get_token_count(test_text)
                    if token_count >= self.min_tokens:
                        # Found smallest valid chunk
                        if accumulated_parts:
                            final_text = part + delimiter + delimiter.join(accumulated_parts)
                        else:
                            final_text = test_text
                        # Find position in original text
                        start_pos = text.rindex(final_text)
                        return final_text, start_pos
                    # Keep accumulating if under min_tokens
                    accumulated_parts.insert(0, part)
                    current_text = test_text
        return None

    def _ensure_min_tokens(self, text: str, start_level: int = 1) -> Optional[str]:
        """Recursively ensure text meets minimum token requirement using rules."""
        current_tokens = self._get_token_count(text)
        if current_tokens >= self.min_tokens:
            return text
        # Try each remaining rule level to extend context
        for level in range(start_level, len(self.rules.levels)):
            rule = self.rules.levels[level]
            if not rule.delimiters:
                continue
            for delimiter in rule.delimiters:
                prefix_text = text
                pos = text.rfind(delimiter)
                while pos > 0 and self._get_token_count(prefix_text) < self.min_tokens:
                    prefix_text = text[pos:]
                    pos = text[:pos].rfind(delimiter)
                if self._get_token_count(prefix_text) >= self.min_tokens:
                    return prefix_text
        return None

    def _hierarchical_prefix_context(self, chunk: Chunk) -> Optional[Context]:
        """Get prefix context using hierarchical rules."""
        if not self.rules or not self.rules.levels:
            return self._prefix_overlap_token(chunk)
        # First try to find minimal speaker boundary
        speaker_result = self._find_minimal_speaker_boundary(chunk.text)
        if speaker_result:
            context_text, start_pos = speaker_result
        else:
            # No speaker boundary found - use entire chunk
            context_text = chunk.text
            start_pos = 0
        try:
            context_tokens = self._get_token_count(context_text)
            return Context(
                text=context_text,
                token_count=context_tokens,
                start_index=chunk.start_index + start_pos,
                end_index=chunk.end_index
            )
        except ValueError:
            print("Warning: Could not find context in original text")
            return self._prefix_overlap_token(chunk)

    def _get_prefix_overlap_context(self, chunk: Chunk) -> Optional[Context]:
        return self._hierarchical_prefix_context(chunk)
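Plugging the new refinery into the earlier pipeline is then a drop-in replacement for the stock OverlapRefinery. The sketch below reuses the chunker, rules, tokenizer, and limits defined above; min_tokens=50 is simply the class default rather than a tuned value.
# Re-chunk the transcript so we start from fresh chunks
chunks = chunker.chunk(text)

hier_refinery = HierarchicalOverlapRefinery(
    context_size=MAX_PREFIX_TOKENS,
    min_tokens=50,                       # minimum context to pull back from the previous chunk
    tokenizer=tokenizer,
    rules=create_transcript_rules(),     # same rules the chunker used, walked in reverse for prefixes
    merge_context=False,
    approximate=False,
)
refined_chunks = hier_refinery.refine(chunks)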
Impact and Limitations
This chunking strategy guarantees that prefixes contain speaker information, as long as a speaker is mentioned somewhere in the previous chunk.
Chunk N:
Alice Smith 14:22 Let me walk you through our infrastructure design in detail. We’ve implemented a multi-layered approach that addresses several key requirements. First, we have our core service layer which handles all the primary business logic…
[chunk continues for several hundred tokens]
Chunk N+1 (with hierarchical prefix):
[Prefix: Alice Smith 14:22 Let me walk you through our infrastructure design in detail. We’ve implemented a multi-layered approach that addresses several key requirements. First, we have our core service layer which handles all the primary business logic… Our infrastructure design uses a multi-layered approach. We’ve implemented redundant systems across regions, and this architecture ensures we maintain six-nines of availability. The redundancy patterns we’ve implemented mean that any single point of failure can be automatically]
replicated, with automated failover procedures that activate within milliseconds of detecting any degradation in primary performance…
We now have an improved set of chunks, which are much more likely to contain speaker context!
The biggest limitation of this approach is that our prefixes are now longer. However, because we only provide the most immediate context necessary (the nearest speaker boundary above our minimum token count), prefixes tend to be fairly small on average. We can also still define an upper token limit on prefix sizes, though I recommend setting this to the same value as your chunk size.
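If you want to sanity-check both claims (that prefixes now carry a speaker, and that they stay reasonably short), a rough scan over the refined chunks works. This sketch assumes the refinery attaches a Context object to each chunk's context attribute, as in the examples above, and the speaker regex is a hypothetical pattern for the "Name HH:MM" markers used in this transcript.
import re

# Hypothetical pattern for "Firstname Lastname HH:MM" speaker markers
speaker_pattern = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+ \d{1,2}:\d{2}")

prefix_sizes = []
for chunk in refined_chunks:
    context = getattr(chunk, "context", None)
    if context is None:
        continue
    prefix_sizes.append(context.token_count)
    if not speaker_pattern.search(context.text):
        print(f"Prefix without speaker attribution: {context.text[:60]}...")

if prefix_sizes:
    print(f"Average prefix size: {sum(prefix_sizes) / len(prefix_sizes):.1f} tokens")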
Remember, if the prefix is large, it is now only because it would benefit the LLM to have the additional context. It’s a good trade.