intelligent-text-chunking

yourusername | 274 | MIT | 1.0.3 | TypeScript support: included

An intelligent text chunking library that respects document structure and semantic boundaries

text-chunking, nlp, document-processing, machine-learning, text-analysis, chunking, intelligent-chunking

Intelligent Text Chunking

A powerful TypeScript library for intelligent text chunking with advanced document structure recognition, PDF support, and semantic boundary preservation.

Features

  • 🧠 Intelligent Structure Recognition: Automatically detects headings, sections, and document patterns
  • 📄 PDF Support: Page-aware chunking with page number metadata
  • 🎯 Semantic Boundaries: Respects sentence, paragraph, and heading boundaries
  • 📊 Rich Metadata: Comprehensive chunk information including headings, sections, and statistics
  • 🔧 Flexible Configuration: Customizable chunk sizes, overlap, and boundary preferences
  • 📚 Multiple Document Types: Supports academic papers, legal documents, technical docs, and more

Installation

npm install intelligent-text-chunking

Quick Start

Basic Usage

import { chunkTextIntelligently, ChunkingOptions } from 'intelligent-text-chunking';

const text = `
# Introduction
This is the introduction section with some content.

## Methodology
Here we describe our methodology in detail.

### Data Collection
We collected data from various sources.

## Results
Our results show significant improvements.
`;

const options: ChunkingOptions = {
  maxChunkSize: 500,
  overlapSize: 50,
  respectHeadingBoundaries: true
};

const chunks = chunkTextIntelligently(text, options);

console.log(`Generated ${chunks.length} chunks`);
chunks.forEach((chunk, index) => {
  console.log(`Chunk ${index + 1}:`);
  console.log(`  Heading: ${chunk.metadata.heading || 'None'}`);
  console.log(`  Level: ${chunk.metadata.headingLevel || 'N/A'}`);
  console.log(`  Words: ${chunk.metadata.wordCount}`);
  console.log(`  Text: ${chunk.text.substring(0, 100)}...`);
});

PDF-Specific Chunking

import { chunkPDFTextIntelligently } from 'intelligent-text-chunking';

// Extract text from PDF (using pdf2json or similar)
const pdfText = "Your PDF text content...";
const pageBreaks = [1000, 2000, 3000]; // Character positions of page breaks

const chunks = chunkPDFTextIntelligently(pdfText, pageBreaks, {
  maxChunkSize: 800,
  respectParagraphBoundaries: true
});

chunks.forEach(chunk => {
  console.log(`Page ${chunk.metadata.pageNumber}: ${chunk.text.substring(0, 50)}...`);
});
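
If your PDF extractor returns one string per page, the page-break offsets can be derived while joining the pages. The sketch below is an illustration only: joinPages and pageForPosition are hypothetical helpers, not part of this library, and it assumes each entry in pageBreaks is the character offset at which a new page begins in the joined text.

```typescript
// Hypothetical helper (not part of intelligent-text-chunking).
// Joins per-page strings and records the offset where each later page starts.
function joinPages(
  pages: string[],
  separator = "\n"
): { joinedText: string; pageBreaks: number[] } {
  let joinedText = "";
  const pageBreaks: number[] = [];
  pages.forEach((page, i) => {
    if (i > 0) {
      joinedText += separator;
      pageBreaks.push(joinedText.length); // first character of page i + 1
    }
    joinedText += page;
  });
  return { joinedText, pageBreaks };
}

// Hypothetical lookup: 1-based page number for a character position,
// assuming pageBreaks is sorted in ascending order.
function pageForPosition(position: number, pageBreaks: number[]): number {
  let page = 1;
  for (const breakPosition of pageBreaks) {
    if (position >= breakPosition) page++;
    else break;
  }
  return page;
}

const joinedDemo = joinPages(["First page.", "Second page.", "Third page."]);
console.log(joinedDemo.pageBreaks); // [12, 25]
console.log(pageForPosition(0, joinedDemo.pageBreaks)); // 1
console.log(pageForPosition(30, joinedDemo.pageBreaks)); // 3
```

The resulting joinedText and pageBreaks could then be passed to chunkPDFTextIntelligently, provided the library interprets the offsets the same way.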

Advanced Configuration

import { IntelligentChunker, ChunkingOptions } from 'intelligent-text-chunking';

const options: ChunkingOptions = {
  maxChunkSize: 1000,        // Maximum characters per chunk
  minChunkSize: 200,         // Minimum characters per chunk
  overlapSize: 100,          // Overlap between chunks
  respectSentenceBoundaries: true,    // Don't break mid-sentence
  respectParagraphBoundaries: true,  // Don't break mid-paragraph
  respectHeadingBoundaries: true,    // Don't break across headings
  preserveHeadingHierarchy: true,    // Maintain heading structure
  maxHeadingLevel: 6         // Maximum heading level to recognize
};

const yourText = 'Your document text...'; // Placeholder input

const chunker = new IntelligentChunker(options);
const chunks = chunker.chunkText(yourText);
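
To build intuition for how maxChunkSize, overlapSize, and sentence boundaries interact, here is a naive sliding-window sketch. It is not the library's algorithm: slidingWindowChunks is a hypothetical function, and treating ". ", "! ", and "? " as sentence ends is a simplifying assumption.

```typescript
// Illustrative only: a naive sliding window, not the library's implementation.
function slidingWindowChunks(
  source: string,
  maxChunkSize: number,
  overlapSize: number
): string[] {
  if (maxChunkSize <= 0 || overlapSize < 0 || overlapSize >= maxChunkSize) {
    throw new Error("Require 0 <= overlapSize < maxChunkSize");
  }
  const pieces: string[] = [];
  let start = 0;
  while (start < source.length) {
    let end = Math.min(start + maxChunkSize, source.length);
    if (end < source.length) {
      // Prefer to cut just after the last sentence terminator in the window.
      const candidate = source.slice(start, end);
      const lastStop = Math.max(
        candidate.lastIndexOf(". "),
        candidate.lastIndexOf("! "),
        candidate.lastIndexOf("? ")
      );
      // Back off only if the chunk stays longer than the overlap,
      // which guarantees that the next start moves forward.
      if (lastStop + 1 > overlapSize) end = start + lastStop + 1;
    }
    pieces.push(source.slice(start, end));
    if (end === source.length) break;
    start = end - overlapSize; // the next chunk repeats the last overlapSize characters
  }
  return pieces;
}

const windowDemo = slidingWindowChunks(
  "One two three. Four five six. Seven eight nine.",
  20,
  5
);
console.log(windowDemo.length, windowDemo[0]);
```

Each chunk after the first begins with the final overlapSize characters of the previous one, so dropping those prefixes and concatenating reproduces the input.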

API Reference

Types

ChunkingOptions

interface ChunkingOptions {
  maxChunkSize?: number;           // Default: 1000
  minChunkSize?: number;           // Default: 200
  overlapSize?: number;            // Default: 100
  respectSentenceBoundaries?: boolean;    // Default: true
  respectParagraphBoundaries?: boolean;   // Default: true
  respectHeadingBoundaries?: boolean;     // Default: true
  preserveHeadingHierarchy?: boolean;     // Default: true
  maxHeadingLevel?: number;               // Default: 6
}
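
Because every field is optional, a caller only supplies the values that differ from the defaults. The sketch below shows one way such defaults could be merged; ChunkingOptionsSketch, DOCUMENTED_DEFAULTS, and resolveOptions are hypothetical names, and the two sanity checks are assumptions rather than documented library behavior.

```typescript
// Local copy of the documented option shape so the sketch is self-contained.
interface ChunkingOptionsSketch {
  maxChunkSize?: number;
  minChunkSize?: number;
  overlapSize?: number;
  respectSentenceBoundaries?: boolean;
  respectParagraphBoundaries?: boolean;
  respectHeadingBoundaries?: boolean;
  preserveHeadingHierarchy?: boolean;
  maxHeadingLevel?: number;
}

// Default values as listed in the interface comments above.
const DOCUMENTED_DEFAULTS: Required<ChunkingOptionsSketch> = {
  maxChunkSize: 1000,
  minChunkSize: 200,
  overlapSize: 100,
  respectSentenceBoundaries: true,
  respectParagraphBoundaries: true,
  respectHeadingBoundaries: true,
  preserveHeadingHierarchy: true,
  maxHeadingLevel: 6,
};

// Hypothetical merge step: caller-supplied values override the defaults.
function resolveOptions(
  overrides: ChunkingOptionsSketch = {}
): Required<ChunkingOptionsSketch> {
  const resolved = { ...DOCUMENTED_DEFAULTS, ...overrides };
  // Assumed sanity checks; the library may validate differently or not at all.
  if (resolved.minChunkSize > resolved.maxChunkSize) {
    throw new Error("minChunkSize must not exceed maxChunkSize");
  }
  if (resolved.overlapSize >= resolved.maxChunkSize) {
    throw new Error("overlapSize must be smaller than maxChunkSize");
  }
  return resolved;
}

const resolvedDemo = resolveOptions({ maxChunkSize: 500, overlapSize: 50 });
console.log(resolvedDemo.minChunkSize); // 200
```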

IntelligentChunk

interface IntelligentChunk {
  text: string;
  metadata: ChunkMetadata;
}

interface ChunkMetadata {
  heading?: string;           // Detected heading text
  headingLevel?: number;      // Heading level (1-6)
  section?: string;           // Section name
  pageNumber?: number;        // Page number (for PDFs)
  chunkIndex: number;         // Index of this chunk
  totalChunks: number;       // Total number of chunks
  wordCount: number;         // Word count in chunk
  charCount: number;         // Character count in chunk
  startPosition: number;     // Start position in original text
  endPosition: number;       // End position in original text
}
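
The statistics fields can be read as simple functions of a slice of the original text. The following sketch makes that concrete under two assumptions that this README does not confirm: endPosition is exclusive, and a word is a whitespace-delimited token. describeSlice and ChunkStatsSketch are hypothetical names.

```typescript
// Subset of the documented metadata fields, redeclared for a self-contained sketch.
interface ChunkStatsSketch {
  wordCount: number;
  charCount: number;
  startPosition: number;
  endPosition: number;
}

// Hypothetical helper: derives the statistics for source[startPosition, endPosition).
function describeSlice(
  source: string,
  startPosition: number,
  endPosition: number
): ChunkStatsSketch {
  const piece = source.slice(startPosition, endPosition);
  const trimmed = piece.trim();
  // Assumption: words are runs of non-whitespace characters.
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return { wordCount, charCount: piece.length, startPosition, endPosition };
}

const statsSource = "Alpha beta gamma. Delta epsilon.";
const firstStats = describeSlice(statsSource, 0, 17);
console.log(firstStats.wordCount, firstStats.charCount); // 3 17
```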

Functions

chunkTextIntelligently(text: string, options?: ChunkingOptions): IntelligentChunk[]

Splits plain text into chunks along the detected document structure (headings, paragraphs, sentences) and returns each chunk with its metadata.

chunkPDFTextIntelligently(text: string, pageBreaks?: number[], options?: ChunkingOptions): IntelligentChunk[]

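Splits text extracted from a PDF and records the page number of each chunk in metadata.pageNumber. pageBreaks lists the character positions of the page breaks in text.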

IntelligentChunker

Main class for advanced chunking operations. Construct it with a ChunkingOptions object, then call chunkText(text) to chunk a document.

Supported Document Patterns

The library recognizes various document structures:

Academic Papers

  • Abstract, Introduction, Conclusion
  • References, Bibliography
  • Numbered sections (1., 1.1, 1.1.1)

Legal Documents

  • Articles, Sections, Chapters
  • Roman numerals (I., II., III.)
  • Lettered sections (A., B., C.)

Technical Documentation

  • Overview, Implementation
  • API Reference, Configuration
  • Markdown headings (# ## ###)

General Documents

  • All caps headings
  • Title case with colons
  • Table of contents patterns
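
As a rough illustration of how line-based heading heuristics of this kind can work, the sketch below classifies a single line with a few regular expressions. These patterns are assumptions written for this example, not the expressions the library uses, and detectHeading is a hypothetical function. Heuristics like these produce false positives (a sentence starting with a year would look like a numbered section).

```typescript
interface HeadingGuess {
  kind: "markdown" | "numbered" | "legal" | "all-caps";
  title: string;
  level?: number; // only set where the pattern implies a depth
}

// Hypothetical classifier: returns a guess, or null for ordinary text.
function detectHeading(line: string): HeadingGuess | null {
  const trimmed = line.trim();

  // Markdown headings: "# Title" through "###### Title".
  const markdown = trimmed.match(/^(#{1,6})\s+(.+)$/);
  if (markdown) {
    return { kind: "markdown", level: markdown[1].length, title: markdown[2] };
  }

  // Numbered sections: "1. Title", "1.1 Title", "1.1.1 Title".
  const numbered = trimmed.match(/^(\d+(?:\.\d+)*)\.?\s+(\S.*)$/);
  if (numbered) {
    return {
      kind: "numbered",
      level: numbered[1].split(".").length, // depth of the section number
      title: numbered[2],
    };
  }

  // Legal keywords: "Article 1. ...", "Section 2.1. ...", "Chapter III. ...".
  const legal = trimmed.match(
    /^(?:Article|Section|Chapter)\s+[\dIVXLCM]+(?:\.\d+)*\.?\s+(.+)$/
  );
  if (legal) {
    return { kind: "legal", title: legal[1] };
  }

  // All-caps lines such as "RESULTS AND DISCUSSION".
  if (/^[A-Z][A-Z ]{2,}$/.test(trimmed)) {
    return { kind: "all-caps", title: trimmed };
  }
  return null;
}

console.log(detectHeading("## Methodology"));
console.log(detectHeading("1.1 Background"));
console.log(detectHeading("Our results show significant improvements."));
```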

Use Cases

  • RAG Systems: Create semantic chunks for retrieval-augmented generation
  • Document Analysis: Process and analyze structured documents
  • Search Systems: Build searchable document chunks with metadata
  • Content Management: Organize and structure document content
  • AI Training: Prepare text data for machine learning models
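
For the RAG and search use cases, chunks usually need to be turned into records for an index. The sketch below shows one possible shape. ChunkLike mirrors only the documented text and metadata fields it needs; RetrievalRecord, toRetrievalRecords, the id format, and the choice to prefix the heading to the content are all assumptions made for this example.

```typescript
// Minimal structural type covering the documented fields used here.
interface ChunkLike {
  text: string;
  metadata: { heading?: string; pageNumber?: number; chunkIndex: number };
}

// Hypothetical record shape for a search or vector index.
interface RetrievalRecord {
  id: string;
  content: string;
  heading: string | null;
  pageNumber: number | null;
}

// Hypothetical adapter: one record per chunk, with the heading prepended
// so that the indexed text keeps its section context.
function toRetrievalRecords(
  documentId: string,
  chunkList: ChunkLike[]
): RetrievalRecord[] {
  return chunkList.map((chunk) => {
    const heading = chunk.metadata.heading ?? null;
    return {
      id: `${documentId}#${chunk.metadata.chunkIndex}`,
      content: heading ? `${heading}\n\n${chunk.text}` : chunk.text,
      heading,
      pageNumber: chunk.metadata.pageNumber ?? null,
    };
  });
}

const sampleRecords = toRetrievalRecords("paper-1", [
  {
    text: "We collected data from various sources.",
    metadata: { heading: "Data Collection", chunkIndex: 0 },
  },
  {
    text: "Our results show significant improvements.",
    metadata: { chunkIndex: 1 },
  },
]);
console.log(sampleRecords.map((record) => record.id)); // ["paper-1#0", "paper-1#1"]
```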

Examples

Academic Paper Processing

const academicText = `
Abstract
This paper presents a novel approach to text processing.

1. Introduction
Text processing is a fundamental task in NLP.

1.1 Background
Previous work has shown...

2. Methodology
We propose a new algorithm...

3. Results
Our experiments demonstrate...

References
[1] Smith, J. (2023). Text Processing...
`;

const chunks = chunkTextIntelligently(academicText);
// Automatically detects Abstract, Introduction, Methodology, Results, References

Legal Document Processing

const legalText = `
Article 1. Definitions
For the purposes of this agreement...

Section 2.1. Rights and Obligations
Each party shall have the right to...

Chapter III. Termination
This agreement may be terminated...
`;

const chunks = chunkTextIntelligently(legalText);
// Recognizes Article, Section, Chapter structure

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Changelog

1.0.0

  • Initial release
  • Intelligent text chunking with structure recognition
  • PDF support with page awareness
  • Comprehensive metadata support
  • TypeScript definitions included