$ cat ./blog/2016-05-06-regex-in-c.md

Regex in C: POSIX Pattern Matching Without a Safety Net

Using POSIX regex in C — no garbage collector, no exceptions, just regcomp, regexec, and careful memory management.

Most languages make regex easy. Python gives you re.compile. JavaScript gives you /pattern/flags. C gives you POSIX headers, raw structs, and the full weight of manual memory management.

That’s not a complaint. Working close to the metal clarifies what regex actually is — compiled finite automata operating on byte sequences.

The POSIX Regex API

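POSIX regex in C lives in <regex.h>. The whole API is four functions: regcomp compiles a pattern, regexec runs it against a string, regerror turns an error code into a message, and regfree releases the compiled state: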

#include <stdio.h>
#include <regex.h>
#include <stdlib.h>

int match(const char *pattern, const char *string) {
    regex_t regex;
    int result;

    /* Compile the pattern */
    result = regcomp(&regex, pattern, REG_EXTENDED);
    if (result != 0) {
        char errbuf[128];
        regerror(result, &regex, errbuf, sizeof(errbuf));
        fprintf(stderr, "regcomp error: %s\n", errbuf);
        return -1;
    }

    /* Execute the match */
    result = regexec(&regex, string, 0, NULL, 0);

    /* Always free the compiled regex */
    regfree(&regex);

    return result == 0 ? 1 : 0;
}

int main(void) {
    printf("%d\n", match("^[0-9]+$", "12345"));   /* 1 */
    printf("%d\n", match("^[0-9]+$", "123ab"));   /* 0 */
    return 0;
}
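
The match function above recompiles the pattern on every call. When one pattern is tested against many strings, it is usually cheaper to compile once and reuse the regex_t. A minimal sketch, assuming a hypothetical helper named count_matches (not part of POSIX) and using REG_NOSUB because only a yes/no answer is needed:

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: count how many of the n strings match the pattern.
 * Returns -1 if the pattern fails to compile. */
int count_matches(const char *pattern, const char *const *strings, size_t n) {
    regex_t regex;
    int count = 0;

    /* REG_NOSUB: we only need match / no match, not offsets */
    int rc = regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        char errbuf[128];
        regerror(rc, &regex, errbuf, sizeof(errbuf));
        fprintf(stderr, "regcomp error: %s\n", errbuf);
        return -1;
    }

    /* One compile, many executions */
    for (size_t i = 0; i < n; i++) {
        if (regexec(&regex, strings[i], 0, NULL, 0) == 0)
            count++;
    }

    /* One regfree for the one regcomp */
    regfree(&regex);
    return count;
}
```

Called with "^[0-9]+$" and the two inputs from main above, it should count one match.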

Capturing Groups

To capture substrings, pass regexec an array of regmatch_t structs. Each one holds the start (rm_so) and end (rm_eo) byte offsets of one group, and element 0 is always the full match:

#include <stdio.h>
#include <regex.h>
#include <string.h>

void extract_groups(const char *pattern, const char *string, size_t nmatch) {
    regex_t regex;
    regmatch_t matches[nmatch];

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        fprintf(stderr, "Invalid pattern\n");
        return;
    }

    if (regexec(&regex, string, nmatch, matches, 0) == 0) {
        for (size_t i = 0; i < nmatch; i++) {
            /* -1 means this group did not participate; later groups still can */
            if (matches[i].rm_so == -1) continue;

            /* rm_so / rm_eo are byte offsets into the input string */
            int len = (int)(matches[i].rm_eo - matches[i].rm_so);
            printf("Group %zu: %.*s\n", i, len, string + matches[i].rm_so);
        }
    }

    regfree(&regex);
}

int main(void) {
    /* Extract year, month, day from an ISO date */
    extract_groups(
        "([0-9]{4})-([0-9]{2})-([0-9]{2})",
        "Today is 2016-05-06 and tomorrow is 2016-05-07.",
        4   /* full match + 3 groups */
    );
    return 0;
}

Output:

Group 0: 2016-05-06
Group 1: 2016
Group 2: 05
Group 3: 06
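
Note that the second date in the input never shows up: regexec reports only the first match. To find them all, you advance a pointer past each match and call regexec again. A rough sketch, assuming a hypothetical helper named print_all_matches; the empty-match handling and the REG_NOTBOL flag are my choices here, not something POSIX does for you:

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: print every non-overlapping match of pattern in
 * string. Returns the number of matches, or -1 on a bad pattern. */
int print_all_matches(const char *pattern, const char *string) {
    regex_t regex;
    regmatch_t m;
    const char *cursor = string;
    int count = 0;
    int eflags = 0;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&regex, cursor, 1, &m, eflags) == 0) {
        /* Offsets are relative to cursor, not to the original string */
        printf("Match at byte %ld: %.*s\n",
               (long)(cursor - string) + (long)m.rm_so,
               (int)(m.rm_eo - m.rm_so), cursor + m.rm_so);
        count++;

        if (m.rm_eo == m.rm_so) {
            /* Empty match: step one byte forward so the loop terminates */
            if (cursor[m.rm_eo] == '\0') break;
            cursor += m.rm_eo + 1;
        } else {
            cursor += m.rm_eo;
        }

        /* cursor is no longer the true start of the line, so ^ must not match */
        eflags = REG_NOTBOL;
    }

    regfree(&regex);
    return count;
}
```

With the date pattern and input string from the example above, this should return 2, one per date.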

What Can Go Wrong

Forgetting regfree. Each regcomp allocates internal state. Missing regfree leaks memory — silently, in long-running processes.

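Stack-allocating large regmatch_t arrays. The example above sizes matches[] as a variable-length array on the stack, which is fine for a handful of groups. For patterns with many groups, or when nmatch comes from outside your control, allocate on the heap. After a successful regcomp, regex.re_nsub tells you how many groups the pattern has.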
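
A minimal sketch of the heap-allocated version, assuming a hypothetical helper named print_groups_heap. It sizes the array from re_nsub, the group count that regcomp stores in the regex_t, instead of trusting the caller to pass the right number:

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: match once and print every group that participated.
 * Returns the number of groups printed, or -1 on error or no match. */
int print_groups_heap(const char *pattern, const char *string) {
    regex_t regex;
    int printed = -1;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    /* re_nsub counts parenthesized groups; slot 0 is the full match */
    size_t nmatch = regex.re_nsub + 1;
    regmatch_t *matches = malloc(nmatch * sizeof(*matches));
    if (matches == NULL) {
        regfree(&regex);
        return -1;
    }

    if (regexec(&regex, string, nmatch, matches, 0) == 0) {
        printed = 0;
        for (size_t i = 0; i < nmatch; i++) {
            if (matches[i].rm_so == -1)
                continue;   /* group did not participate in the match */
            printf("Group %zu: %.*s\n", i,
                   (int)(matches[i].rm_eo - matches[i].rm_so),
                   string + matches[i].rm_so);
            printed++;
        }
    }

    /* Two resources, two releases */
    free(matches);
    regfree(&regex);
    return printed;
}
```

The cost is a second resource to release on every exit path, which is exactly the kind of bookkeeping this section is about.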

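REG_EXTENDED vs basic regex. Without REG_EXTENDED you get POSIX basic regular expressions (BRE): +, ?, and | are ordinary characters, and grouping and intervals must be written \( \) and \{ \}. Some implementations, glibc among them, accept \+, \?, and \| as extensions, but that is not portable. Always use REG_EXTENDED unless you have a specific reason not to.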
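
The difference is easy to see by compiling the same pattern both ways. A small sketch, assuming a hypothetical helper named matches_with_flags that takes the regcomp flags as a parameter:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if pattern matches string under the given cflags,
 * 0 if it does not, -1 if the pattern fails to compile. */
int matches_with_flags(const char *pattern, const char *string, int cflags) {
    regex_t regex;

    if (regcomp(&regex, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&regex, string, 0, NULL, 0);
    regfree(&regex);
    return rc == 0 ? 1 : 0;
}
```

matches_with_flags("a+", "aaa", REG_EXTENDED) is a match, because + means "one or more". matches_with_flags("a+", "aaa", 0) is not: in basic mode the pattern is the two literal characters a and +, which do match the string "a+".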

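Thread safety. regcomp and regfree write to the regex_t, so they must never run while another thread is using the same struct. regexec takes a const regex_t *, and glibc documents concurrent regexec calls on one compiled pattern as safe, but as far as I can tell it serializes them on an internal lock, so sharing buys you no parallelism there. If you need portability or throughput, compile one regex_t per thread.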

When C Regex Makes Sense

Mostly when you’re already writing C and need lightweight pattern matching without pulling in PCRE or another library. For anything complex, use a language with a richer regex API and garbage collection.

But understanding the C API clarifies regex semantics that higher-level wrappers hide. regmatch_t.rm_so and rm_eo are byte offsets, not character indices — a distinction that matters the moment your input contains multibyte UTF-8 sequences.
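
To make the byte-versus-character distinction concrete, here is a small sketch. Both helper names (utf8_chars_before, find_offsets) are hypothetical, and the code-point counting assumes the input is valid UTF-8:

```c
#include <regex.h>
#include <stddef.h>

/* Count UTF-8 code points in the first nbytes of s by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_chars_before(const char *s, size_t nbytes) {
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: find the first match and report where it starts,
 * both as a byte offset and as a code-point index.
 * Returns 0 on success, -1 on a bad pattern or no match. */
int find_offsets(const char *pattern, const char *string,
                 size_t *byte_off, size_t *char_idx) {
    regex_t regex;
    regmatch_t m;
    int found = -1;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&regex, string, 1, &m, 0) == 0) {
        *byte_off = (size_t)m.rm_so;   /* what regexec reports */
        *char_idx = utf8_chars_before(string, (size_t)m.rm_so);
        found = 0;
    }

    regfree(&regex);
    return found;
}
```

For the pattern "[0-9]+" and the input "caf\xc3\xa9 42" (that is, "café 42" in UTF-8), the match starts at byte 6 but at character 5, because é takes two bytes. Use the byte offset for pointer arithmetic and the character index only for display.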
