StringZilla 🦖
Strings are the first fundamental data type every programming language implements in software rather than hardware, so dedicated CPU instructions are rare - and the few that exist are hardly ideal.
That's why most languages lean on the C standard library (libc) for their string operations, which, despite its name, ships its hottest code in hand-tuned assembly.
It does exploit SIMD, but it isn't perfect.
1️⃣ Even on ubiquitous hardware - over a billion 64-bit ARM CPUs - routines such as `strstr` and `memmem` top out at roughly one-third of available throughput.
2️⃣ SIMD coverage is uneven: fast forward scans don't guarantee speedy reverse searches.
3️⃣ Many higher-level languages can't rely on libc at all because their strings aren't NUL-terminated - or may even contain embedded zeroes.
That's why StringZilla exists: predictable, high performance on every modern platform, OS, and programming language.
StringZilla is the GodZilla of string libraries, using SIMD and SWAR to accelerate string operations on modern CPUs and GPUs. It delivers up to 10x higher CPU throughput in C, C++, and Python and can be 100x faster than existing GPU kernels, covering a broad range of functionality. It accelerates exact and fuzzy string matching, hashing, edit distance computations, sorting, provides allocation-free lazily-evaluated smart-iterators, and even random-string generators.
- 🐂 C: Upgrade LibC's `<string.h>` to `<stringzilla/stringzilla.h>` in C 99
- 🐉 C++: Upgrade STL's `<string>` to `<stringzilla/stringzilla.hpp>` in C++ 11
- 🧮 CUDA: Process in-bulk with `<stringzillas/stringzillas.cuh>` in CUDA C++ 17
- 🐍 Python: Upgrade your `str` to a faster `Str`
- 🦀 Rust: Use the `StringZilla` traits crate
- 🦫 Go: Use the `StringZilla` cGo module
- 🍎 Swift: Use the `String+StringZilla` extension
- 🟨 JavaScript: Use the `StringZilla` library
- 🐚 Shell: Accelerate common CLI tools with the `sz_` prefix
- 📚 Researcher? Jump to Algorithms & Design Decisions
- 💡 Thinking to contribute? Look for "good first issues"
- 🤝 And check the guide to set up the environment
- Want more bindings or features? Let me know!
Who is this for?
- For data-engineers parsing large datasets, like the CommonCrawl, RedPajama, or LAION.
- For software engineers optimizing strings in their apps and services.
- For bioinformaticians and search engineers looking for edit-distances for USearch.
- For DBMS devs, optimizing `LIKE`, `ORDER BY`, and `GROUP BY` operations.
- For hardware designers, needing a SWAR baseline for string-processing functionality.
- For students studying SIMD/SWAR applications to non-data-parallel operations.
Performance
| C | C++ | Python | StringZilla |
| :--- | :--- | :--- | :--- |
| find the first occurrence of a random word from text, ≅ 5 bytes long | | | |
| `strstr` ¹<br/>x86: 7.4 · arm: 2.0 GB/s | `.find`<br/>x86: 2.9 · arm: 1.6 GB/s | `.find`<br/>x86: 1.1 · arm: 0.6 GB/s | `sz_find`<br/>x86: 10.6 · arm: 7.1 GB/s |
| find the last occurrence of a random word from text, ≅ 5 bytes long | | | |
| ⚪ | `.rfind`<br/>x86: 0.5 · arm: 0.4 GB/s | `.rfind`<br/>x86: 0.9 · arm: 0.5 GB/s | `sz_rfind`<br/>x86: 10.8 · arm: 6.7 GB/s |
| split lines separated by `\n` or `\r` ² | | | |
| `strcspn` ¹<br/>x86: 5.42 · arm: 2.19 GB/s | `.find_first_of`<br/>x86: 0.59 · arm: 0.46 GB/s | `re.finditer`<br/>x86: 0.06 · arm: 0.02 GB/s | `sz_find_byteset`<br/>x86: 4.08 · arm: 3.22 GB/s |
| find the last occurrence of any of 6 whitespaces ² | | | |
| ⚪ | `.find_last_of`<br/>x86: 0.25 · arm: 0.25 GB/s | ⚪ | `sz_rfind_byteset`<br/>x86: 0.43 · arm: 0.23 GB/s |
| Random string from a given alphabet, 20 bytes long ⁵ | | | |
| `rand() % n`<br/>x86: 18.0 · arm: 9.4 MB/s | `std::uniform_int_distribution`<br/>x86: 47.2 · arm: 20.4 MB/s | `join(random.choices(...))`<br/>x86: 13.3 · arm: 5.9 MB/s | `sz_fill_random`<br/>x86: 56.2 · arm: 25.8 MB/s |
| Mapping characters with lookup table transforms | | | |
| ⚪ | `std::transform`<br/>x86: 3.81 · arm: 2.65 GB/s | `str.translate`<br/>x86: 260.0 · arm: 140.0 MB/s | `sz_lookup`<br/>x86: 21.2 · arm: 8.5 GB/s |
| Get sorted order, ≅ 8 million English words ⁶ | | | |
| `qsort_r`<br/>x86: 3.55 · arm: 5.77 s | `std::sort`<br/>x86: 2.79 · arm: 4.02 s | `numpy.argsort`<br/>x86: 7.58 · arm: 13.00 s | `sz_sequence_argsort`<br/>x86: 1.91 · arm: 2.37 s |
| Levenshtein edit distance, text lines ≅ 100 bytes long | | | |
| ⚪ | ⚪ | via NLTK ³ and CuDF<br/>x86: 1,615,306 · arm: 1,349,980 · cuda: 6,532,411,354 CUPS | `szs_levenshtein_distances_t`<br/>x86: 3,434,427,548 · arm: 1,605,340,403 · cuda: 93,662,026,653 CUPS |
| Needleman-Wunsch alignment scores, proteins ≅ 1 K amino acids long | | | |
| ⚪ | ⚪ | via biopython ⁴<br/>x86: 575,981,513 · arm: 436,350,732 CUPS | `szs_needleman_wunsch_scores_t`<br/>x86: 452,629,942 · arm: 520,170,239 · cuda: 9,017,327,818 CUPS |
Most StringZilla modules ship ready-to-run benchmarks for C, C++, Python, and more.
Grab them from `./scripts`, and see `CONTRIBUTING.md` for instructions.
On CPUs that permit misaligned loads, even the 64-bit SWAR baseline outruns both libc and the STL.
For wider head-to-heads against Rust and Python favorites, browse the StringWars repository.
To inspect collision resistance and distribution shapes for our hashers, see HashEvals.
Most benchmarks were conducted on a 1 GB English text corpus, with an average word length of 6 characters. The code was compiled with GCC 12, using `glibc` v2.35. The benchmarks were performed on Arm-based Graviton3 AWS `c7g` instances and `r7iz` Intel Sapphire Rapids instances. Most modern Arm-based 64-bit CPUs will have similar relative speedups. Variance within x86 CPUs will be larger. For CUDA benchmarks, Nvidia H100 GPUs were used.

¹ Unlike other libraries, LibC requires strings to be NULL-terminated.
² The six whitespaces in the ASCII set are: ` \t\n\v\f\r`. Python's and other standard libraries have specialized functions for those.
³ Most Python libraries for strings are also implemented in C.
⁴ Unlike the rest of BioPython, the alignment score computation is implemented in C.
⁵ All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities. The C++ STL and StringZilla benchmarks used a 64-bit Mersenne Twister as the generator. For C, C++, and StringZilla, an in-place update of the string was used. In Python, every string had to be allocated as a new object, which makes the comparison less fair.
⁶ Contrary to popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries. That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation. The latter is very common in database engines and is most similar to `numpy.argsort`. The current StringZilla solution can be at least 4x faster without loss of generality.
Functionality
StringZilla is compatible with most modern CPUs, and provides a broad range of functionality. It's split into 2 layers:
- StringZilla: single-header C library and C++ wrapper for high-performance string operations.
- StringZillas: parallel CPU/GPU backends used for large-batch operations and accelerators.
Having a second C++/CUDA layer greatly simplifies the implementation of similarity scoring and fingerprinting functions, which would otherwise require too much error-prone boilerplate code in pure C. Both layers are designed to be extremely portable:
- [x] across both little-endian and big-endian architectures.
- [x] across 32-bit and 64-bit hardware architectures.
- [x] across operating systems and compilers.
- [x] across ASCII and UTF-8 encoded inputs.
Not all features are available across all bindings. Consider contributing if you need a feature that's not yet implemented.
| | Maturity | C | C++ | Python | Rust | JS | Swift | Go |
| :----------------------------- | :------: | :---: | :---: | :----: | :---: | :---: | :---: | :---: |
| Substring Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Character Set Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Sorting & Sequence Operations | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
| Streaming Hashes | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Small String Class | 🧐 | ✅ | ✅ | ❌ | ⚪ | ❌ | ❌ | ❌ |
| Lazy Ranges, Compressed Arrays | 🌳 | ❌ | ✅ | ✅ | ✅ | ❌ | ⚪ | ⚪ |
| | | | | | | | | |
| Parallel Similarity Scoring | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
| Parallel Rolling Fingerprints | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
🌳 parts are used in production. 🧐 parts are in beta. 🚧 parts are under active development, and are likely to break in subsequent releases. ✅ are implemented. ⚪ are considered. ❌ are not intended.
Quick Start: Python 🐍
Python bindings are available on PyPI for Python 3.8+, and can be installed with `pip`.
pip install stringzilla # for serial algorithms
pip install stringzillas-cpus # for parallel multi-CPU backends
pip install stringzillas-cuda # for parallel Nvidia GPU backend
You can immediately check the installed version and the used hardware capabilities with the following commands:
python -c "import stringzilla; print(stringzilla.__version__)"
python -c "import stringzillas; print(stringzillas.__version__)"
python -c "import stringzilla; print(stringzilla.__capabilities__)" # for serial algorithms
python -c "import stringzillas; print(stringzillas.__capabilities__)" # for parallel algorithms
Basic Usage
If you've ever used the Python `str`, `bytes`, `bytearray`, or `memoryview` classes, you'll know what to expect.
StringZilla's `Str` class is a hybrid of the above, providing a `str`-like interface to byte arrays.
from stringzilla import Str, File
text_from_str = Str('some-string') # no copies, just a view
text_from_bytes = Str(b'some-array') # no copies, just a view
text_from_file = Str(File('some-file.txt')) # memory-mapped file
import numpy as np
alphabet_array = np.arange(ord("a"), ord("z"), dtype=np.uint8)
text_from_array = Str(memoryview(alphabet_array))
The `File` class memory-maps a file from persistent storage without loading a copy into RAM.
The contents of the file remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case is to map a sizable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
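Below is a minimal sketch of that pattern; the `crawl.txt` file name and the `count_matches` helper are hypothetical, while `Str`, `File`, slicing, and `splitlines` behave as described in this README.

```python
from multiprocessing import Pool

from stringzilla import Str, File

def count_matches(shard: int, shards: int = 4, term: str = "https://") -> int:
    # Each worker memory-maps the same file; the OS shares the underlying pages
    lines = Str(File("crawl.txt")).splitlines()
    # Slices of the resulting collection are views, so sharding costs no copies
    return sum(1 for line in lines[shard::shards] if term in line)

if __name__ == "__main__":
    with Pool(4) as pool:
        print(sum(pool.map(count_matches, range(4))))
```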
Basic Operations
- Length: `len(text) -> int`
- Indexing: `text[42] -> str`
- Slicing: `text[42:46] -> Str`
- Substring check: `'substring' in text -> bool`
- Hashing: `hash(text) -> int`
- String conversion: `str(text) -> str`
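All of those compose naturally, for example:

```python
from stringzilla import Str

text = Str("Hello, world!")
assert len(text) == 13                 # length
assert text[7] == "w"                  # indexing yields a one-character `str`
assert str(text[7:12]) == "world"      # slicing yields a `Str` view
assert "world" in text                 # substring check
assert isinstance(hash(text), int)     # hashing
assert str(text) == "Hello, world!"    # conversion back to `str`
```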
Advanced Operations
import sys
x: bool = text.contains('substring', start=0, end=sys.maxsize)
x: int = text.find('substring', start=0, end=sys.maxsize)
x: int = text.count('substring', start=0, end=sys.maxsize, allowoverlap=False)
x: str = text.decode(encoding='utf-8', errors='strict')
x: Strs = text.split(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.splitlines(keeplinebreaks=False, maxsplit=sys.maxsize)
It's important to note that the last function's behavior is slightly different from Python's `str.splitlines`.
The native version matches `\n`, `\r`, `\v` or `\x0b`, `\f` or `\x0c`, `\x1c`, `\x1d`, `\x1e`, `\x85`, `\r\n`, `\u2028`, and `\u2029`, including three two-byte-long runes.
The StringZilla version matches only `\n`, `\v`, `\f`, `\r`, `\x1c`, `\x1d`, `\x1e`, and `\x85`, avoiding two-byte-long runes.
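A small illustration of the difference, assuming the matching rules above — the two-byte U+2028 rune splits the native `str`, but not the byte-oriented `Str`:

```python
from stringzilla import Str

text = "first\u2028second"              # U+2028 is the Unicode LINE SEPARATOR
assert len(text.splitlines()) == 2       # native `str.splitlines` splits on it
assert len(Str(text).splitlines()) == 1  # StringZilla's version does not
```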
Character Set Operations
Python strings don't natively support character set operations.
This forces people to use regular expressions, which are slow and hard to read.
To avoid the need for `re.finditer`, StringZilla provides the following interfaces:
x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
x: Strs = text.split_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
This would result in a copy: `str` for `str` inputs and `bytes` for other types.
x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)
For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary. This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing. Here is an example:
import stringzilla as sz
look_up_table = bytes(range(256)) # Identity LUT
image = open("/image/path.jpeg", "rb").read()
sz.translate(image, look_up_table, inplace=True)
Hash
Single-shot and incremental hashing are both supported:
import stringzilla as sz
# One-shot - stable 64-bit output across all platforms!
one = sz.hash(b"Hello, world!", seed=42)
# Incremental updates return itself; digest does not consume state
hasher = sz.Hasher(seed=42)
hasher.update(b"Hello, ").update(b"world!")
streamed = hasher.digest() # or `hexdigest()` for a string
assert one == streamed
Collection-Level Operations
Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices with minimal memory footprint.
If all the chunks are located in consecutive memory regions, the memory overhead can be as low as 4 bytes per chunk.
lines: Strs = text.split(separator='\n') # 4 bytes per line overhead for under 4 GB of text
batch: Strs = lines.sample(seed=42) # 10x faster than `random.choices`
lines.shuffle(seed=42) # or shuffle all lines in place and shard with slices
lines_sorted: Strs = lines.sorted() # returns a new Strs in sorted order
order: tuple = lines.argsort() # similar to `numpy.argsort`
Working on RedPajama, addressing 20 billion annotated English documents, one will need only 160 GB of RAM instead of terabytes. Once loaded, the data will be memory-mapped, and can be reused between multiple Python processes without copies. And of course, you can use slices to navigate the dataset and shard it between multiple workers.
lines[::3] # every third line
lines[1::1] # every odd line
lines[:-100:-1] # last 100 lines in reverse order
Iterators and Memory Efficiency
Python's operations like `split()` and `readlines()` immediately materialize a `list` of copied parts.
This can be very memory-inefficient for large datasets.
StringZilla saves a lot of memory by viewing existing memory regions as substrings, but even more memory can be saved by using lazily evaluated iterators.
x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.split_byteset_iter(separator='chars', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_byteset_iter(separator='chars', keepseparator=False)
StringZilla can easily be 10x more memory efficient than native Python classes for tokenization. With lazy operations, it practically becomes free.
import stringzilla as sz
%load_ext memory_profiler
text = open("enwik9.txt", "r").read() # 1 GB, mean word length 7.73 bytes
%memit text.split() # increment: 8670.12 MiB (152 ms)
%memit sz.split(text) # increment: 530.75 MiB (25 ms)
%memit sum(1 for _ in sz.split_iter(text)) # increment: 0.00 MiB
Low-Level Python API
Aside from calling the methods on the `Str` and `Strs` classes, you can also call the global functions directly on `str` and `bytes` instances.
Since the StringZilla CPython bindings are implemented without any intermediate tools like SWIG or PyBind11, the call latency should be similar to that of native classes.
import stringzilla as sz
contains: bool = sz.contains("haystack", "needle", start=0, end=sys.maxsize)
offset: int = sz.find("haystack", "needle", start=0, end=sys.maxsize)
count: int = sz.count("haystack", "needle", start=0, end=sys.maxsize, allowoverlap=False)
Similarity Scores
StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas` module.
Use `DeviceScope` to pick hardware and optionally limit capabilities per engine.
import stringzilla as sz
import stringzillas as szs
cpu_scope = szs.DeviceScope(cpu_cores=4) # force CPU-only
gpu_scope = szs.DeviceScope(gpu_device=0) # pick GPU 0 if available
strings_a = sz.Strs(["kitten", "flaw"])
strings_b = sz.Strs(["sitting", "lawn"])
strings_a = szs.to_device(strings_a) # optional ahead of time transfer
strings_b = szs.to_device(strings_b) # optional ahead of time transfer
engine = szs.LevenshteinDistances(
match=0, mismatch=2, # costs don't have to be 1
open=3, extend=1, # may be different in Bio
capabilities=("serial",) # avoid SIMD 🤭
)
distances = engine(strings_a, strings_b, device=cpu_scope)
assert int(distances[0]) == 3 and int(distances[1]) == 2
Note that this computes byte-level distances. For UTF-8 codepoints, use a different engine class:
strings_a = sz.Strs(["café", "αβγδ"])
strings_b = sz.Strs(["cafe", "αγδ"])
engine = szs.LevenshteinDistancesUTF8(capabilities=("serial",))
distances = engine(strings_a, strings_b, device=cpu_scope)
assert int(distances[0]) == 1 and int(distances[1]) == 1
For alignment scoring, provide a 256×256 substitution matrix using NumPy:
import numpy as np
import stringzilla as sz
import stringzillas as szs
substitution_matrix = np.zeros((256, 256), dtype=np.int8)
substitution_matrix.fill(-1) # mismatch score
np.fill_diagonal(substitution_matrix, 0) # match score
engine = szs.NeedlemanWunsch(substitution_matrix=substitution_matrix, open=1, extend=1)
scores = engine(strings_a, strings_b, device=cpu_scope)
Several Python libraries provide edit distance computation. Most are implemented in C but may be slower than StringZilla on large inputs. For 100 pairs of proteins, each ≅ 10k characters long:
- JellyFish: 62.3s
- EditDistance: 32.9s
- StringZilla: 0.8s
Using the same proteins for Needleman-Wunsch alignment scores:
- BioPython: 25.8s
- StringZilla: 7.8s
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices
aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = 1
aligner.extend_gap_score = 1
# Convert the matrix to NumPy
subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
subs_reconstructed = np.zeros((256, 256), dtype=np.int8)
# Initialize all banned characters to the largest possible penalty
subs_reconstructed.fill(127)
for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
reconstructed_row = ord(packed_row_aminoacid)
reconstructed_column = ord(packed_column_aminoacid)
subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]
# Let's pick two examples of tripeptides (made of 3 amino acids)
glutathione = "ECG" # Need to rebuild human tissue?
thyrotropin_releasing_hormone = "QHP" # Or to regulate your metabolism?
import stringzillas as szs
engine = szs.NeedlemanWunsch(substitution_matrix=subs_reconstructed, open=1, extend=1)
score = int(engine(sz.Strs([glutathione]), sz.Strs([thyrotropin_releasing_hormone]))[0])
assert score == aligner.score(glutathione, thyrotropin_releasing_hormone) # Equal to 6
Rolling Fingerprints
MinHashing is a common technique for Information Retrieval, producing compact representations of large documents. For $D$ hash-functions and a text of length $L$, in the worst case it involves computing $O(D \cdot L)$ hashes.
import numpy as np
import stringzilla as sz
import stringzillas as szs
texts = sz.Strs([
"quick brown fox jumps over the lazy dog",
"quick brown fox jumped over a very lazy dog",
])
cpu = szs.DeviceScope(cpu_cores=4)
ndim = 1024
window_widths = np.array([4, 6, 8, 10], dtype=np.uint64)
engine = szs.Fingerprints(
ndim=ndim,
window_widths=window_widths, # optional
alphabet_size=256, # default for byte strings
capabilities=("serial",), # defaults to all, can also pass a `DeviceScope`
)
hashes, counts = engine(texts, device=cpu)
assert hashes.shape == (len(texts), ndim)
assert counts.shape == (len(texts), ndim)
assert hashes.dtype == np.uint32 and counts.dtype == np.uint32
Serialization
Filesystem
Similar to how `File` can be used to read a large file, other interfaces can be used to dump strings to disk faster.
The `Str` class has `write_to` to write the string to a file, and `offset_within` to obtain the integer offset of a substring view within a larger string for navigation.
web_archive = Str("<html>...</html><html>...</html>")
_, end_tag, next_doc = web_archive.partition("</html>") # or use `find`
next_doc_offset = next_doc.offset_within(web_archive)
web_archive.write_to("next_doc.html") # no GIL, no copies, just a view
PyArrow
A `Str` is easy to cast to PyArrow buffers.
from pyarrow import foreign_buffer
from stringzilla import Strs
strs = Strs(["alpha", "beta", "gamma"])
arrow = foreign_buffer(strs.address, strs.nbytes, strs)
And only slightly harder to convert in the reverse direction:
import pyarrow as pa
arr = pa.Array.from_buffers(
pa.large_string() if strs.offsets_are_large else pa.string(),
len(strs),
[None,
pa.foreign_buffer(strs.offsets_address, strs.offsets_nbytes, strs),
pa.foreign_buffer(strs.tape_address, strs.tape_nbytes, strs)],
)
That means you can convert a `Str` to a `pyarrow.Buffer` and a `Strs` to a `pyarrow.Array` without extra copies.
For more details on the tape-like layouts, refer to the StringTape repository.
Quick Start: C/C++ 🛠️
The C library is header-only, so you can just copy the `stringzilla.h` header into your project.
The same applies to C++, where you would copy the `stringzilla.hpp` header.
Alternatively, add it as a submodule, and include it in your build system.
git submodule add https://github.com/ashvardanian/StringZilla.git external/stringzilla
git submodule update --init --recursive
Or using a pure CMake approach:
include(FetchContent)
FetchContent_Declare(
stringzilla
GIT_REPOSITORY https://github.com/ashvardanian/StringZilla.git
GIT_TAG main # or specify a version tag
)
FetchContent_MakeAvailable(stringzilla)
Last but not least, you can also install it as a library and link against it. This approach is worse for inlining, but brings dynamic runtime dispatch for the most advanced CPU features.
Basic Usage with C 99 and Newer
There is a stable C 99 interface, where all function names are prefixed with `sz_`.
Most interfaces are well documented and come with self-explanatory names and examples.
In some cases, hardware-specific overloads are available, like `sz_find_skylake` or `sz_find_neon`.
Both are companions of `sz_find`: the first for x86 CPUs with AVX-512 support, and the second for Arm NEON-capable CPUs.
#include <stringzilla/stringzilla.h>
// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};
// Perform string-level operations auto-picking the backend or dispatching manually
sz_cptr_t ptr = sz_find(haystack.start, haystack.length, needle.start, needle.length);
sz_size_t substring_position = ptr ? (sz_size_t)(ptr - haystack.start) : SZ_SIZE_MAX; // SZ_SIZE_MAX if not found
// Backend-specific variants return pointers as well
ptr = sz_find_skylake(haystack.start, haystack.length, needle.start, needle.length);
ptr = sz_find_haswell(haystack.start, haystack.length, needle.start, needle.length);
ptr = sz_find_westmere(haystack.start, haystack.length, needle.start, needle.length);
ptr = sz_find_neon(haystack.start, haystack.length, needle.start, needle.length);
// Hash strings at once
sz_u64_t hash = sz_hash(haystack.start, haystack.length, 42); // 42 is the seed
sz_u64_t checksum = sz_bytesum(haystack.start, haystack.length); // or accumulate byte values
// Hash strings incrementally with "init", "update", and "digest":
sz_hash_state_t state;
sz_hash_state_init(&state, 42);
sz_hash_state_update(&state, haystack.start, 1); // first char
sz_hash_state_update(&state, haystack.start + 1, haystack.length - 1); // rest of the string
sz_u64_t streamed_hash = sz_hash_state_digest(&state);
// Perform collection level operations
sz_sequence_t array = {your_handle, your_count, your_get_start, your_get_length};
sz_sequence_argsort(&array, &your_config);
By design, StringZilla has a couple of notable differences from LibC:
- all strings are expected to have a length, and are not necessarily null-terminated.
- every operation has a reverse-order counterpart.
That way, `sz_find` and `sz_rfind` are similar to `strstr` and `strrstr` in LibC.
Similarly, `sz_find_byte` and `sz_rfind_byte` replace `memchr` and `memrchr`.
The `sz_find_byteset` maps to `strspn` and `strcspn`, while `sz_rfind_byteset` has no sibling in LibC.
| LibC Functionality | StringZilla Equivalents |
| :--- | :--- |
| `memchr(haystack, needle, haystack_length)`, `strchr` | `sz_find_byte(haystack, haystack_length, needle)` |
| `memrchr(haystack, needle, haystack_length)` | `sz_rfind_byte(haystack, haystack_length, needle)` |
| `memcmp`, `strcmp` | `sz_order`, `sz_equal` |
| `strlen(haystack)` | `sz_find_byte(haystack, haystack_length, needle)` |
| `strcspn(haystack, needles)` | `sz_rfind_byteset(haystack, haystack_length, needles_bitset)` |
| `strspn(haystack, needles)` | `sz_find_byteset(haystack, haystack_length, needles_bitset)` |
| `memmem(haystack, haystack_length, needle, needle_length)`, `strstr` | `sz_find(haystack, haystack_length, needle, needle_length)` |
| `memcpy(destination, source, destination_length)` | `sz_copy(destination, source, destination_length)` |
| `memmove(destination, source, destination_length)` | `sz_move(destination, source, destination_length)` |
| `memset(destination, value, destination_length)` | `sz_fill(destination, destination_length, value)` |
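For example, finding the last occurrence of a multi-byte needle — something LibC has no routine for — is a one-liner. A minimal self-contained sketch:

```c
#include <stdio.h>
#include <stringzilla/stringzilla.h>

int main(void) {
    char const *haystack = "abc--abc--abc";
    char const *needle = "abc";
    // Forward search: first occurrence, like `strstr` or `memmem`
    sz_cptr_t first = sz_find(haystack, 13, needle, 3);
    // Reverse search: last occurrence, with no LibC equivalent for substrings
    sz_cptr_t last = sz_rfind(haystack, 13, needle, 3);
    printf("first at %td, last at %td\n", first - haystack, last - haystack);
    return 0;
}
```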
Basic Usage with C++ 11 and Newer
There is a stable C++ 11 interface available in the `ashvardanian::stringzilla` namespace.
It comes with two STL-like classes: `string_view` and `string`.
The first is a non-owning view of a string, and the second is a mutable string with a Small String Optimization.
#include <stringzilla/stringzilla.hpp>
namespace sz = ashvardanian::stringzilla;
sz::string haystack = "some string";
sz::string_view needle = sz::string_view(haystack).substr(0, 4);
auto substring_position = haystack.find(needle); // Or `rfind`
auto hash = std::hash<sz::string_view>{}(haystack); // Compatible with STL's `std::hash`
haystack.end() - haystack.begin() == haystack.size(); // Or `rbegin`, `rend`
haystack.find_first_of(" \v\t") == 4; // Or `find_last_of`, `find_first_not_of`, `find_last_not_of`
haystack.starts_with(needle) == true; // Or `ends_with`
haystack.remove_prefix(needle.size()); // Why is this operation in-place?!
haystack.contains(needle) == true; // STL has this only from C++ 23 onwards
haystack.compare(needle) == 1; // Or `haystack <=> needle` in C++ 20 and beyond
StringZilla also provides string literals for automatic type resolution, similar to STL:
using sz::literals::operator""_sv;
using std::literals::operator""sv;
auto a = "some string";    // char const *
auto b = "some string"sv;  // std::string_view
auto c = "some string"_sv; // sz::string_view
Similarity Scores
StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas/stringzillas.h` header.
Use `szs_device_scope_t` to pick hardware and optionally limit capabilities per engine.
#include <stringzillas/stringzillas.h>
szs_device_scope_t device = NULL;
szs_device_scope_init_default(&device);
szs_levenshtein_distances_t engine = NULL;
szs_levenshtein_distances_init(0, 1, 1, 1, /*alloc*/ NULL, /*caps*/ sz_cap_serial_k, &engine);
sz_sequence_u32tape_t strings_a = {data_a, offsets_a, count}; // or `sz_sequence_u64tape_t` for large inputs
sz_sequence_u32tape_t strings_b = {data_b, offsets_b, count}; // or `sz_sequence_t` to pass generic containers
sz_size_t distances[count];
szs_levenshtein_distances_u32tape(engine, device, &strings_a, &strings_b, distances, sizeof(distances[0]));
szs_levenshtein_distances_free(engine);
szs_device_scope_free(device);
To target a different device, use the appropriate `szs_device_scope_init_{cpu_cores,gpu_device}` function.
When dealing with GPU backends, make sure to use the "unified memory" allocators exposed as `szs_unified_{alloc,free}`.
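A rough sketch of that GPU path, reusing the `engine` and tapes from the snippet above; the exact argument order of `szs_device_scope_init_gpu_device` and the signature of `szs_unified_free` are assumptions, so check `stringzillas.h` before copying:

```c
szs_device_scope_t gpu = NULL;
szs_device_scope_init_gpu_device(/*gpu index*/ 0, &gpu); // assumed: index first, handle last

// GPU-visible output buffer allocated from unified memory
sz_size_t *gpu_distances = (sz_size_t *)szs_unified_alloc(count * sizeof(sz_size_t));
szs_levenshtein_distances_u32tape(engine, gpu, &strings_a, &strings_b, gpu_distances, sizeof(gpu_distances[0]));

szs_unified_free(gpu_distances); // assumed single-argument form
szs_device_scope_free(gpu);
```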
Similar stable C ABIs are exposed for other workloads as well.
- UTF-8: `szs_levenshtein_distances_utf8_{sequence,u32tape,u64tape}`
- Needleman-Wunsch: `szs_needleman_wunsch_scores_{sequence,u32tape,u64tape}`
- Smith-Waterman: `szs_smith_waterman_scores_{sequence,u32tape,u64tape}`
Moreover, in C++ codebases one can tap into the raw templates implementing that functionality, customizing them with custom executors, SIMD plugins, etc.
For that, include `stringzillas/similarities.hpp` for C++ and `stringzillas/similarities.cuh` for CUDA.
#include <stringzillas/similarities.hpp>
#include <stringzilla/types.hpp> // tape of strings
#include <fork_union.hpp> // optional thread pool
namespace sz = ashvardanian::stringzilla;
namespace szs = ashvardanian::stringzillas;
// Pack strings into an Arrow-like tape
std::vector<std::string> left = {"kitten", "flaw"};
std::vector<std::string> right = {"sitting", "lawn"};
sz::arrow_strings_tape<char, sz::size_t, std::allocator<char>> tape_a, tape_b;
auto _ = tape_a.try_assign(left.begin(), left.end());
auto _ = tape_b.try_assign(right.begin(), right.end());
// Run on the current thread
using levenshtein_t = szs::levenshtein_distances<char, szs::linear_gap_costs_t, std::allocator<char>, sz_cap_serial_k>;
levenshtein_t engine {szs::uniform_substitution_costs_t{0,1}, szs::linear_gap_costs_t{1}};
std::size_t distances[2];
auto _ = engine(tape_a, tape_b, distances);
// Or run in parallel with a pool
fork_union::basic_pool_t pool;
auto _ = pool.try_spawn(std::thread::hardware_concurrency());
auto _ = engine(tape_a, tape_b, distances, pool);
All of the potentially failing StringZillas interfaces return error codes, and none raise C++ exceptions. Parallelism is enabled both at the collection level and within individual pairs of large inputs.
Rolling Fingerprints
StringZilla exposes parallel fingerprinting (Min-Hashes or Count-Min-Sketches) via the `stringzillas/stringzillas.h` header.
Use `szs_device_scope_t` to pick hardware and optionally limit capabilities per engine.
#include <stringzillas/stringzillas.h>
szs_device_scope_t device = NULL;
szs_device_scope_init_default(&device);
szs_fingerprints_t engine = NULL;
sz_size_t const dims = 1024;
sz_size_t const window_widths[] = {4, 6, 8, 10};
szs_fingerprints_init(dims, /*alphabet*/ 256, window_widths, 4, /*alloc*/ NULL, /*caps*/ sz_cap_serial_k, &engine);
sz_sequence_u32tape_t texts = {data, offsets, count};
sz_u32_t *min_hashes = (sz_u32_t*)szs_unified_alloc(count * dims * sizeof(*min_hashes));
sz_u32_t *min_counts = (sz_u32_t*)szs_unified_alloc(count * dims * sizeof(*min_counts));
szs_fingerprints_u32tape(engine, device, &texts,
min_hashes, dims * sizeof(*min_hashes), // support strided matrices
min_counts, dims * sizeof(*min_counts)); // for both output arguments
szs_fingerprints_free(engine);
szs_device_scope_free(device);
Moreover, in C++ codebases one can tap into the raw templates implementing that functionality, customizing them with custom executors, SIMD plugins, etc.
For that, include `stringzillas/fingerprints.hpp` for C++ and `stringzillas/fingerprints.cuh` for CUDA.
#include <stringzillas/fingerprints.hpp>
#include <stringzilla/types.hpp> // tape of strings
#include <fork_union.hpp> // optional thread pool
namespace sz = ashvardanian::stringzilla;
namespace szs = ashvardanian::stringzillas;
// Pack strings into an Arrow-like tape
std::vector<std::string> docs = {"alpha beta", "alpha betta"};
sz::arrow_strings_tape<char, sz::size_t, std::allocator<char>> tape;
auto _ = tape.try_assign(docs.begin(), docs.end());
// Run on the current thread with a Rabin-Karp family hasher
constexpr std::size_t dimensions_k = 256;
constexpr std::size_t window_width_k = 7;
using row_t = std::array<sz_u32_t, 256>;
using fingerprinter_t = szs::floating_rolling_hashers<sz_cap_serial_k, dimensions_k>;
fingerprinter_t engine;
auto _ = engine.try_extend(window_width_k, dimensions_k);
std::vector<row_t> hashes(docs.size()), counts(docs.size());
auto _ = engine(tape, hashes, counts);
// Or run in parallel with a pool
fork_union::basic_pool_t pool;
auto _ = pool.try_spawn(std::thread::hardware_concurrency());
auto _ = engine(tape, hashes, counts, pool);
CUDA
StringZilla provides CUDA C++ templates for composable string batch-processing operations.
Different GPUs have varying warp sizes, shared-memory capacities, and register counts, affecting algorithm selection, so it's important to query the `gpu_specs_t` via `gpu_specs_fetch`.
For memory management, ensure that you use GPU-visible "unified memory", exposed in an STL-compatible manner as the `unified_alloc` template class.
For error handling, `cuda_status_t` extends the traditional `status_t` with GPU-specific information.
It's implicitly convertible to `status_t`, so you can use it in places expecting a `status_t`.
Most algorithms can load-balance both a large number of small strings and a small number of large strings. Still, with large H100-scale GPUs, it's best to submit thousands of inputs at once.
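The following hypothetical sketch shows how those pieces are meant to fit together; the type and function names come from the paragraph above, but their exact signatures are assumptions, so treat it as pseudocode against the current headers:

```cpp
#include <vector>
#include <stringzillas/similarities.cuh>

namespace szs = ashvardanian::stringzillas;

int main() {
    // Warp size, shared memory, and register counts drive kernel selection
    szs::gpu_specs_t specs;
    szs::cuda_status_t fetched = szs::gpu_specs_fetch(specs); // assumed out-parameter signature
    szs::status_t generic = fetched;                          // implicitly convertible to `status_t`

    // Inputs and outputs visible to both host and device via unified memory
    std::vector<char, szs::unified_alloc<char>> tape_buffer;
    std::vector<std::size_t, szs::unified_alloc<std::size_t>> distances(1024);

    // ... pack thousands of strings into tapes and submit them to a CUDA engine in one call ...
    (void)specs; (void)generic; (void)tape_buffer; (void)distances;
    return 0;
}
```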
Memory Ownership and Small String Optimization
Most operations in StringZilla don't assume any memory ownership. But in addition to the read-only search-like operations, StringZilla provides minimalistic C and C++ implementations of a memory-owning string "class". Like other efficient string implementations, it uses the Small String Optimization (SSO) to avoid heap allocations for short strings.
typedef union sz_string_t {
struct internal {
sz_ptr_t start;
sz_u8_t length;
char chars[SZ_STRING_INTERNAL_SPACE]; /// Ends with a null-terminator.
} internal;
struct external {
sz_ptr_t start;
sz_size_t length;
sz_size_t space; /// The length of the heap-allocated buffer.
sz_size_t padding;
} external;
} sz_string_t;
As one can see, a short string can be kept on the stack, if it fits within the `internal.chars` array.
Before 2015, the GCC string implementation was just 8 bytes in size and could only fit 7 characters.
Different STL implementations today have different thresholds for the Small String Optimization.
Similar to GCC, StringZilla is 32 bytes in size, and similar to Clang, it can fit 22 characters on the stack.
Our layout might be preferable if you want to avoid branches.
If you use a different compiler, you may want to check its SSO buffer size with a simple Gist.
| | `libstdc++` in GCC 13 | `libc++` in Clang 17 | StringZilla |
| :-------------------- | --------------------: | -------------------: | ----------: |
| `sizeof(std::string)` | 32 | 24 | 32 |
| Small String Capacity | 15 | 22 | 22 |
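If you want to check your own toolchain, a default-constructed `std::string` reveals both numbers:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string s;
    std::cout << "sizeof(std::string): " << sizeof(s) << "\n";
    std::cout << "SSO capacity: " << s.capacity() << "\n"; // capacity before the first heap allocation
    return 0;
}
```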
This design has since been ported to many high-level programming languages.
Swift, for example, can store 15 bytes in the `String` instance itself.
StringZilla implements SSO at the C level, providing the `sz_string_t` union and a simple API for primary operations.
sz_memory_allocator_t allocator;
sz_string_t string;
// Init and make sure we are on stack
sz_string_init(&string);
sz_string_is_on_stack(&string); // == sz_true_k
// Optionally pre-allocate space on the heap for future insertions.
sz_string_grow(&string, 100, &allocator); // == sz_true_k
// Append, erase, insert into the string.
sz_string_expand(&string, 0, "_Hello_", 7, &allocator); // == sz_true_k
sz_string_expand(&string, SZ_SIZE_MAX, "world", 5, &allocator); // == sz_true_k
sz_string_erase(&string, 0, 1);
// Unpacking & introspection.
sz_ptr_t string_start;
sz_size_t string_length;
sz_size_t string_space;
sz_bool_t string_is_external;
sz_string_unpack(string, &string_start, &string_length, &string_space, &string_is_external);
sz_equal(string_start, "Hello_world", 11); // == sz_true_k
// Reclaim some memory.
sz_string_shrink_to_fit(&string, &allocator); // == sz_true_k
sz_string_free(&string, &allocator);
Unlike the conventional C strings, the `sz_string_t` is allowed to contain null characters.
To safely print those, pass the `string_length` to `printf` as well.
printf("%.*s\n", (int)string_length, string_start);
What's Wrong with the C Standard Library?
StringZilla is not a drop-in replacement for the C Standard Library. It's designed to be a safer and more modern alternative. Conceptually:
- LibC strings are expected to be null-terminated, so to use the efficient LibC implementations on slices of larger strings, you'd have to copy them, which is more expensive than the original string operation.
- LibC functionality is asymmetric - you can find the first and the last occurrence of a character within a string, but you can't find the last occurrence of a substring.
- LibC function names are typically very short and cryptic.
- LibC lacks crucial functionality like hashing and doesn't provide primitives for less critical but relevant operations like fuzzy matching.
Something has to be said about its support for UTF-8.
Aside from a single-byte `char` type, LibC provides `wchar_t`:
- The size of `wchar_t` is not consistent across platforms. On Windows, it's typically 16 bits (suitable for UTF-16), while on Unix-like systems, it's usually 32 bits (suitable for UTF-32). This inconsistency can lead to portability issues when writing cross-platform code.
- `wchar_t` is designed to represent wide characters in a fixed-width format (UTF-16 or UTF-32). In contrast, UTF-8 is a variable-length encoding, where each character can take from 1 to 4 bytes. This fundamental difference means that `wchar_t` and UTF-8 are incompatible.
StringZilla partially addresses those issues.
What's Wrong with the C++ Standard Library?
| C++ Code | Evaluation Result | Invoked Signature |
| :--- | :--- | :--- |
| `"Loose"s.replace(2, 2, "vath"s, 1)` | `"Loathe"` 🤢 | `(pos1, count1, str2, pos2)` |
| `"Loose"s.replace(2, 2, "vath", 1)` | `"Love"` 🥰 | `(pos1, count1, str2, count2)` |
StringZilla is designed to be a drop-in replacement for the C++ Standard Templates Library. That said, some of the design decisions of STL strings are highly controversial, error-prone, and expensive. Most notably:
- Argument order for `replace`, `insert`, `erase`, and similar functions is impossible to guess.
- Bounds-checking exceptions for `substr`-like functions are only thrown for one side of the range.
- Returning string copies in `substr`-like functions results in an absurd volume of allocations.
- Incremental construction via `push_back`-like functions goes through too many branches.
- Inconsistency between `string` and `string_view` methods, like the lack of `remove_prefix` and `remove_suffix`.
Check the following set of asserts validating the `std::string` specification.
It's not realistic to expect the average developer to remember the 14 overloads of `std::string::replace`.
using str = std::string;
assert(str("hello world").substr(6) == "world");
assert(str("hello world").substr(6, 100) == "world"); // 106 is beyond the length of the string, but its OK
assert_throws(str("hello world").substr(100), std::out_of_range); // 100 is beyond the length of the string
assert_throws(str("hello world").substr(20, 5), std::out_of_range); // 20 is beyond the length of the string
assert_throws(str("hello world").substr(-1, 5), std::out_of_range); // -1 casts to unsigned without any warnings...
assert(str("hello world").substr(0, -1) == "hello world"); // -1 casts to unsigned without any warnings...
assert(str("hello").replace(1, 2, "123") == "h123lo");
assert(str("hello").replace(1, 2, str("123"), 1) == "h23lo");
assert(str("hello").replace(1, 2, "123", 1) == "h1lo");
assert(str("hello").replace(1, 2, "123", 1, 1) == "h2lo");
assert(str("hello").replace(1, 2, str("123"), 1, 1) == "h2lo");
assert(str("hello").replace(1, 2, 3, 'a') == "haaalo");
assert(str("hello").replace(1, 2, {'a', 'b'}) == "hablo");
To avoid those issues, StringZilla provides an alternative consistent interface.
It supports signed arguments and doesn't have more than 3 arguments per function.
The standard API and our alternative can be conditionally toggled with `SZ_SAFETY_OVER_COMPATIBILITY=1`.
When it's enabled, the subjectively risky overloads from the Standard are disabled.
using str = sz::string;
str("a:b").front(1) == "a"; // no checks, unlike `substr`
str("a:b").front(2) == "2"; // take first 2 characters
str("a:b").back(-1) == "b"; // accepting negative indices
str("a:b").back(-2) == ":b"; // similar to Python's `"a:b"[-2:]`
str("a:b").sub(1, -1) == ":"; // similar to Python's `"a:b"[1:-1]`
str("a:b").sub(-2, -1) == ":"; // similar to Python's `"a:b"[-2:-1]`
str("a:b").sub(-2, 1) == ""; // similar to Python's `"a:b"[-2:1]`
"a:b"_sv[{-2, -1}] == ":"; // works on views and overloads `operator[]`
Since StringZilla is a header-only library, you can use the full API in some translation units and gradually transition to the safer restricted API in others. Bonus: all the bounds checking is branchless, so it has a constant cost and won't hurt your branch predictor.
Beyond the C++ Standard Library - Learning from Python
Python is arguably the most popular programming language for data science. In part, that's due to the simplicity of its standard interfaces. StringZilla brings some of that functionality to C++.
- Content checks: `isalnum`, `isalpha`, `isascii`, `isdigit`, `islower`, `isspace`, `isupper`.
- Trimming character sets: `lstrip`, `rstrip`, `strip`.
- Trimming string matches: `remove_prefix`, `remove_suffix`.
- Ranges of search results: `splitlines`, `split`, `rsplit`.
- Number of non-overlapping substring matches: `count`.
- Partitioning: `partition`, `rpartition`.
For example, when parsing documents, it is often useful to split them into substrings.
Most often, after that, you would compute the length of the skipped part, the offset, and the length of the remaining part.
This results in a lot of pointer arithmetic and is error-prone.
StringZilla provides a convenient `partition` function, which returns a tuple of three string views, making the code cleaner.
auto parts = haystack.partition(':'); // Matching a character
auto [before, match, after] = haystack.partition(':'); // Structure unpacking
auto [before, match, after] = haystack.partition(sz::byteset(":;")); // Character-set argument
auto [before, match, after] = haystack.partition(" : "); // String argument
auto [before, match, after] = haystack.rpartition(sz::whitespaces_set()); // Split around the last whitespace
Combining those with the `split` function, one can easily parse a CSV file or HTTP headers.
for (auto line : haystack.split("\r\n")) {
auto [key, _, value] = line.partition(':');
headers[key.strip()] = value.strip();
}
Some other extensions are not present in the Python standard library either. Let's go through the C++ functionality category by category.
- Splits and Ranges.
- Concatenating Strings without Allocations.
- Random Generation.
- Edit Distances and Fuzzy Search.
Some of the StringZilla interfaces are not available even in Python's native `str` class.
Here is a sneak peek of the most useful ones.
text.hash(); // -> 64 bit unsigned integer
text.ssize(); // -> 64 bit signed length to avoid `static_cast<std::ssize_t>(text.size())`
text.contains_only(" \w\t"); // == text.find_first_not_of(sz::byteset(" \w\t")) == npos;
text.contains(sz::whitespaces_set()); // == text.find(sz::byteset(sz::whitespaces_set())) != npos;
// Simpler slicing than `substr`
text.front(10); // -> sz::string_view
text.back(10); // -> sz::string_view
// Safe variants, which clamp the range into the string bounds
using sz::string::cap;
text.front(10, cap) == text.front(std::min(10, text.size()));
text.back(10, cap) == text.back(std::min(10, text.size()));
// Character set filtering
text.lstrip(sz::whitespaces_set()).rstrip(sz::newlines_set()); // like Python
text.front(sz::whitespaces_set()); // all leading whitespaces
text.back(sz::digits_set()); // all numerical symbols forming the suffix
// Incremental construction
using sz::string::unchecked;
text.push_back('x'); // no surprises here
text.push_back('x', unchecked); // no bounds checking, Rust style
text.try_push_back('x'); // returns `false` if the string is full and the allocation failed
sz::concatenate(text, "@", domain, ".", tld); // No allocations
Splits and Ranges
One of the most common use cases is to split a string into a collection of substrings, which often results in StackOverflow lookups and snippets like the one below.
std::vector<std::string> lines = split(haystack, "\r\n"); // string delimiter
std::vector<std::string> words = split(lines, ' '); // character delimiter
Those allocate memory for each string and the temporary vectors.
Each allocation can be orders of magnitude more expensive than even a serial `for`-loop over characters.
To avoid those, StringZilla provides lazily-evaluated ranges, compatible with the Range-v3 library.
for (auto line : haystack.split("\r\n"))
for (auto word : line.split(sz::byteset(" \w\t.,;:!?")))
std::cout << word << std::endl;
Each of those is available in reverse order as well.
It also allows interleaving matches, if you want both inclusions of `xx` in `xxx`.
Debugging pointer offsets is not a pleasant exercise, so keep the following functions in mind.
haystack.[r]find_all(needle, interleaving)
haystack.[r]find_all(sz::byteset(""))
haystack.[r]split(needle)
haystack.[r]split(sz::byteset(""))
For $N$ matches the split functions will report $N+1$ matches, potentially including empty strings. Ranges have a few convenience methods as well:
range.size(); // -> std::size_t
range.empty(); // -> bool
range.template to<std::set<std::string>>();
range.template to<std::vector<std::string_view>>();
Concatenating Strings without Allocations
Another common string operation is concatenation.
The STL provides `std::string::operator+` and `std::string::append`, but those are not very efficient if multiple invocations are performed.
std::string name, domain, tld;
auto email = name + "@" + domain + "." + tld; // 4 allocations
The efficient approach would be to pre-allocate the memory and copy the strings into it.
std::string email;
email.reserve(name.size() + domain.size() + tld.size() + 2);
email.append(name), email.append("@"), email.append(domain), email.append("."), email.append(tld);
That's a mouthful and error-prone.
StringZilla provides a more convenient `concatenate` function, which takes a variadic number of arguments.
It also overloads `operator|` to concatenate strings lazily, without any allocations.
auto email = sz::concatenate(name, "@", domain, ".", tld); // 0 allocations
auto email = name | "@" | domain | "." | tld; // 0 allocations
sz::string email = name | "@" | domain | "." | tld; // 1 allocation
Random Generation
Software developers often need to generate random strings for testing purposes.
The STL provides `std::generate` and `std::random_device` that can be used with StringZilla.
sz::string random_string(std::size_t length, char const *alphabet, std::size_t cardinality) {
sz::string result(length, '\0');
static std::random_device seed_source; // Expensive to construct - due to system calls
static std::mt19937 generator(seed_source()); // Also expensive - due to the state size
std::uniform_int_distribution<std::size_t> distribution(0, cardinality - 1); // upper bound is inclusive
std::generate(result.begin(), result.end(), [&]() { return alphabet[distribution(generator)]; });
return result;
}
Mouthful and slow.
StringZilla provides a native C method - `sz_fill_random` - and a convenient C++ wrapper - `sz::generate`.
Similar to Python, it also defines the commonly used character sets.
auto protein = sz::string::random(300, "ARNDCQEGHILKMFPSTWYV"); // static method
auto dna = sz::basic_string<custom_allocator>::random(3'000'000'000, "ACGT");
dna.fill_random("ACGT"); // `noexcept` pre-allocated version
dna.fill_random(&std::rand, "ACGT"); // pass any generator, like `std::mt19937`
char uuid[36];
sz::fill_random(sz::string_span(uuid, 36), "0123456789abcdef-"); // Overwrite any buffer
Bulk Replacements
In text processing, it's often necessary to replace all occurrences of a specific substring or set of characters within a string. Standard library functions may not offer the most efficient or convenient methods for performing bulk replacements, especially when dealing with large strings or performance-critical applications.
haystack.replace_all(needle_string, replacement_string)
haystack.replace_all(sz::byteset(""), replacement_string)
haystack.try_replace_all(needle_string, replacement_string)
haystack.try_replace_all(sz::byteset(""), replacement_string)
haystack.lookup(sz::look_up_table::identity())
haystack.lookup(sz::look_up_table::identity(), haystack.data())
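For instance, normalizing line endings and masking digits could look like the sketch below; the exact replacement semantics for byteset needles (each matching byte replaced by the whole replacement string) are an assumption here:

```cpp
#include <stringzilla/stringzilla.hpp>

namespace sz = ashvardanian::stringzilla;

int main() {
    sz::string haystack = "card: 1234\r\npin: 0000\r\n";
    haystack.replace_all("\r\n", "\n");                   // normalize line endings
    haystack.replace_all(sz::byteset("0123456789"), "*"); // mask every digit
    // `try_replace_all` returns `false` instead of throwing if an allocation fails
    return 0;
}
```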
Sorting in C and C++
LibC provides `qsort` and STL provides `std::sort`.
Both have their quirks.
The LibC standard has no way to pass a context to the comparison function; that's only possible with platform-specific extensions, and those have a different argument order on every OS.
// Linux: https://linux.die.net/man/3/qsort_r
void qsort_r(void *elements, size_t count, size_t element_width,
int (*compare)(void const *left, void const *right, void *context),
void *context);
// macOS and FreeBSD: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/qsort_r.3.html
void qsort_r(void *elements, size_t count, size_t element_width,
void *context,
int (*compare)(void *context, void const *left, void const *right));
// Windows conflicts with ISO `qsort_s`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/qsort-s?view=msvc-170
void qsort_s(void *elements, size_t count, size_t element_width,
int (*compare)(void *context, void const *left, void const *right),
void *context);
The C++ generic algorithm is not perfect either.
There is no guarantee in the standard that `std::sort` won't allocate any memory.
If you are running on embedded hardware, in real-time, or on 100+ CPU cores per node, you may want to avoid that.
StringZilla doesn't solve the general case, but hopes to improve the performance for strings.
Use `sz_sequence_argsort`, or the high-level `sz::argsort`, which can be used to sort any collection of elements convertible to `sz::string_view`.
std::vector<std::string> data({"c", "b", "a"});
std::vector<std::size_t> order = sz::argsort(data); //< Simple shortcut
// Or, taking care of memory allocation:
sz::argsort(data.begin(), data.end(), order.data(), [](auto const &x) -> sz::string_view { return x; });
Standard C++ Containers with String Keys
The C++ Standard Templates Library provides several associative containers, often used with string keys.
std::map<std::string, int, std::less<std::string>> sorted_words;
std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>> words;
The performance of those containers is often limited by the performance of the string keys, especially on reads.
StringZilla can be used to accelerate containers with `std::string` keys, by overriding the default comparator and hash functions.
std::map<std::string, int, sz::less> sorted_words;
std::unordered_map<std::string, int, sz::hash, sz::equal_to> words;
Alternatively, a better approach would be to use the `sz::string` class as a key.
The right hash function and comparator will be automatically selected, and the performance gains will be more noticeable if the keys are short.
std::map<sz::string, int> sorted_words;
std::unordered_map<sz::string, int> words;
Compilation Settings and Debugging
`SZ_DEBUG`: For maximal performance, the C library does not perform any bounds checking in Release builds. In C++, bounds checking happens only in places where the STL `std::string` would do it. If you want to enable more aggressive bounds-checking, define `SZ_DEBUG` before including the header. If not explicitly set, it will be inferred from the build type.
`SZ_USE_WESTMERE`, `SZ_USE_HASWELL`, `SZ_USE_SKYLAKE`, `SZ_USE_ICE`, `SZ_USE_NEON`, `SZ_USE_NEON_AES`, `SZ_USE_SVE`, `SZ_USE_SVE2`, `SZ_USE_SVE2_AES`: One can explicitly disable certain families of SIMD instructions for compatibility purposes.