We're open sourcing CRE and preq to make it easier for humans and agents to find and fix reliability problems

When was the last time you scrambled through logs, dashboards, and metrics trying to figure out if and why an application was broken?
We founded Prequel, the community-driven reliability company, to give engineers a first line of defense when monitoring their applications. Our product leverages collective knowledge of failure, curated detection methods, and proven mitigations to help you quickly find and fix problems.
Until now, only Prequel customers were able to take part in this vision.
Today, we’re excited to open source two projects that enable anyone to participate in community-driven problem detection.
- cre – an open, structured standard for sharing and operationalizing knowledge of reliability problems (github.com/prequel-dev/cre)
- preq (“preek”) – a free and open community-driven reliability problem detector that consumes CREs (github.com/prequel-dev/preq)
These projects are released under the Apache 2.0 license and enable any engineer to more rapidly find bugs, misconfigurations, and developer anti-patterns based on community knowledge.
In addition to reading this blog post, check out this article in The New Stack explaining why this is transformative for the industry.
"At a previous company, we built precise problem detectors to proactively uncover failure within our stack, but most engineering teams haven't historically had the resources, time, or technology to do this. What excites me about CREs and preq is that they make problem detection available to everyone." - Kelsey Hightower
Community-Driven Problem Detection
Despite advancements in observability, the application monitoring process hasn’t changed. We store data in new places and label it in different ways, but the task of figuring out what’s happening, why it’s happening, and whether it matters still rests on your shoulders.
You are the problem detector.
With growing application complexity and AI-accelerated software development, it is becoming impossible to keep up with all the ways applications can fail.
We want you to feel like you have thousands of engineers by your side before, during, or after an issue occurs.
We’ve already experienced firsthand the power that collective knowledge can bring to detection challenges. For example, in security, technologies like Nessus for vulnerabilities, YARA for malware, and Sublime Security for email threats apply community knowledge to take the burden off internal teams.
When it comes to reliability, it doesn’t make sense that every engineering team is tackling reliability problem detection in a manual and isolated way. This is especially true when over 81% of underlying problems (according to our most recent analysis) have already been seen by an engineer at another company.
Prequel is credited with introducing community-driven problem detection, a new approach to monitoring and troubleshooting.
We enable engineering teams to build, run, and exchange vetted problem detectors. These are composable, extensible rules that run where your applications live. Each rule is the product of hard-earned knowledge about bugs, misconfigurations, and developer anti-patterns.
Instead of noisy alerts that hint at symptoms (e.g. high latency, memory growth, error count), these rules are designed to deterministically tell you exactly where a problem is and how to fix it.
You can proactively scan for hundreds of known problems instead of waiting for an incident to happen and retroactively adding a one-off alert.
Today we are opening the Common Reliability Enumeration (CRE) standard and preq so that everyone is able to participate.
Let’s dive deeper into how preq and CREs work.
CRE – Common Reliability Enumeration
A CRE is essentially a specification of a known problem: it includes a unique ID (e.g. “CRE-2024-0007”), a description of the issue, why it happens (cause), what could result (impact), and how to fix or mitigate it.
CREs are to reliability what Common Vulnerabilities and Exposures (CVEs) are to security.
Just as CVEs let us share and detect known vulnerabilities, CREs provide a standardized way to describe reliability issues (complete with their causes, impacts, and fixes) so that teams can recognize recurring failure modes across systems. This works whether a problem is frequent or relatively obscure.
CREs cover everything from known bugs in popular open-source frameworks that manifest with nondescript log messages, to misconfiguration pitfalls (e.g. a Kafka topic replication factor higher than the number of brokers), to performance anti-patterns (like an N+1 database query issue in an ORM).
Each CRE has a structured YAML definition that captures all this information. For example, a CRE for a specific RabbitMQ failure would include:
- title (RabbitMQ Mnesia overloaded recovering persistent queues)
- cause (explaining that the Erlang Mnesia database is overloading during queue recovery)
- impact (RabbitMQ can’t process new messages, causing outages downstream), and
- mitigation steps (e.g. adjust queue mirroring policies, use lazy queues)
These fields, along with others, are included in the snippet below:
- cre:
    id: CRE-2024-0007
    severity: 0
    title: RabbitMQ Mnesia overloaded recovering persistent queues
    category: message-queue-problem
    author: Prequel
    description: |
      The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. The underlying Erlang process, Mnesia, is overloaded (`** WARNING ** Mnesia is overloaded`).
    cause: |
      - There are so many that the underlying Erlang process, Mnesia, is reporting that it is overloaded while recovering these queues on boot.
    impact: |
      - RabbitMQ is unable to process any new messages, which can lead to outages in consumers and producers.
    impactScore: 9
    tags:
      - known-problem
      - rabbitmq
    mitigation: |
      - Increase the size of the cluster
      - Increase the Kubernetes CPU limits for the RabbitMQ brokers
      - Consider adjusting mirroring policies to limit the number of mirrored queues
      - Remove high-availability policies from queues where it is not needed
      - Consider using [lazy queues](https://www.rabbitmq.com/docs/lazy-queues) to avoid incurring the costs of writing data to disk
    mitigationScore: 8
    references:
      - https://groups.google.com/g/rabbitmq-users/search?q=discarding%20message
    applications:
      - name: "rabbitmq"
        version: "3.9.x"
  metadata:
    kind: prequel
    id: 5UD1RZxGC5LJQnVpAkV11A
    gen: 1
While there is no obligation to share a CRE you create, the public CRE repo is open source and community-maintained.
As of today’s launch, public CREs exist for a variety of systems – databases, message queues, cloud infrastructure components, and more – covering many common causes of outages. Using these out-of-the-box rules, you can recognize known failure patterns before they bite you. And the library is constantly expanding: if you’ve encountered a tricky bug or failure and solved it, you can contribute a new CRE so others can benefit.
The CRE index is a map of software failure modes, much like how public vulnerability databases cover known exploitable bugs. By adopting CREs today, you give your team a head start on problems that have already been seen and solved elsewhere, rather than reinventing the wheel each time.
Prequel Rules - A Powerful Syntax for Detecting Failure
Importantly, CREs also define the conditions to detect the problem they describe. This is done by embedding the Prequel rule syntax directly in the CRE YAML. Each rule is designed to be readable by humans and machines.
The rule syntax is purpose-built – serving as a domain-specific language tailored to express reliability detection patterns against event data. This rule syntax is uniquely designed with distributed systems and asynchronous failure in mind, and it’s implemented in YAML to be easy to read and write.
Let’s break down the main concepts:
1. Sequences and Sets of Events: At the core, a rule describes a sequence of events (A then B then C in order) or a set of events (A, B, and C in any order) that must occur to satisfy the conditions. You can also nest sequences and sets, allowing complex patterns (e.g. “a set of two sequences” if multiple ordering groups are needed).
2. Positive and Negative Conditions: Rules can specify events that must happen (positive conditions) as well as events that must NOT happen (negative conditions) during a given window. The ability to include negative conditions is a key feature – it lets you encode things like “if error X happens without subsequent recovery event Y” or “if we saw no heartbeat for 5 minutes after event Z”. This helps reduce noise by ensuring the absence of expected normal events can be part of the detection logic.
3. Time Windows: Because order and timing matter, you can bound rules with time windows. If a rule has multiple steps, you specify a window (e.g. 30 seconds, 5 minutes, etc.) within which those events must occur. You can even apply windows to negative conditions (e.g. “no restart event for at least 1 minute after the error”). The rule engine uses these windows to know when to stop waiting for the next event in a sequence or to decide that a negative condition has been satisfied (if the forbidden event did not appear in the interval).
4. Data Sources and Correlation: Each condition in a rule is evaluated against a specific data source (for example, the Apache Kafka log, or a Kubernetes log stream). Conditions include a reference to the source (either explicitly or implicitly by context). The powerful thing is you can express cross-source relationships: a rule could say “an error in Service A’s log and a failure in Service B’s log” as a set of two conditions on two sources. By default, such a rule would fire if those events occur anywhere in the system within the window. But often we want them to be related – e.g. the same request ID, or same host.
This is where correlation keys come in: you can require that multiple conditions share some attribute to be considered a match. For instance, correlating on host would ensure that all events in the sequence happened on the same host machine. Correlation by a custom identifier (like a trace ID or user session) is also possible if the data provides it. This feature is crucial for describing distributed problems without centralizing data: it lets each node evaluate its own events, while the rule logic links them by common context.
5. Flexible Match Conditions: The content of an event can be matched in various ways. The rule syntax supports plain string matching, regular expressions (using RE2 syntax), and even JSON/YAML queries via jq for structured logs. This means you can target anything from a simple substring in a log line, to a specific field in a JSON log event (e.g. event.type == "ERROR"), or a regex pattern for more complex criteria. You can also specify a count if a certain message must appear N times to qualify. All these matching primitives are declarative – you just describe what to look for, and a CRE-driven problem detector takes care of the rest.
To make this discussion more concrete, here’s a simplified example rule (revisiting the real RabbitMQ CRE-2024-0007) illustrating the YAML syntax:
…
  metadata:
    kind: prequel
    id: 5UD1RZxGC5LJQnVpAkV11A
    gen: 1
  rule:
    sequence:
      window: 30s
      event:
        source: cre.log.rabbitmq
      order:
        - regex: Discarding message(.+)in an old incarnation(.+)of this node
        - value: Mnesia is overloaded
      negate:
        - value: "RabbitMQ is asked to stop"
          anchor: 0
          slide: 30s
        - value: "SIGTERM received - shutting down"
          anchor: 1
          window: 10s
In plain language, this rule says:
Within 30 seconds, in the RabbitMQ log, if we observe a “Discarding message in an old incarnation...” warning followed by “Mnesia is overloaded”, and we did not see any “SIGTERM… shutting down” message during that period, then trigger CRE-2024-0007.
This corresponds to a known RabbitMQ bug where the node is struggling to recover mirrored queues on startup without actually shutting down. The rule encodes both the telltale signs and the absence of a normal shutdown sequence. When the rule runs, the problem detector will continuously look for this sequence of events in the log. If it finds a match, it reports the problem and provides all the context from the CRE (so you’d immediately see the cause and mitigation suggestions for this issue).
Prequel hosts a full development playground that enables you to write, test, and share CRE rules. The playground is WebAssembly-based, and your data stays in your browser.
You can run CRE-2024-0007 in the playground here.
Creating Your CRE Rules
Writing CRE rules is meant to be straightforward for developers and SREs. You don’t need to write code or complex queries – the pattern is declarative. Moreover, having a custom syntax for reliability rules allows you to add specialized features (like the negative conditions and correlations above) that general log query languages or monitoring systems do not support.
There were some design trade-offs here. For example, we opted for YAML rather than creating a completely new language or repurposing an existing query language; no existing query language (e.g. PromQL, LogQL, SQL) supports these functions and enables this correlation across various signals. Writing rules in YAML also reduces the friction of learning a new syntax, and the CRE schema is fairly small and constrained. To help you write your own rules, the cre repo contains:
- a verification schema that you can load into your IDE
- a CRE compiler/validator (ruler) that checks contributors’ rules for correctness (e.g., no duplicate IDs, proper formatting)
One key benefit is that rules can be easily shared, versioned, and even automatically validated by tools.
We have a documented PR process where submissions are reviewed to ensure the quality of community contributions stays high.
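To give a feel for what a new contribution looks like, here is a minimal, hypothetical CRE skeleton modeled on the fields in the RabbitMQ example above. The field names mirror that example, but the required fields and allowed values are defined by the schema in the cre repo (and checked by ruler), so treat this as a sketch rather than an authoritative template.
- cre:
    id: CRE-2025-XXXX                # placeholder ID; ruler checks for duplicates
    severity: 2
    title: Short, searchable summary of the failure
    category: example-category       # hypothetical category name
    author: Your Name
    description: |
      What the affected system is doing when the problem appears.
    cause: |
      Why the problem happens.
    impact: |
      What breaks downstream if the problem goes undetected.
    impactScore: 5
    tags:
      - known-problem
    mitigation: |
      - Concrete steps to fix or work around the problem
    mitigationScore: 5
    references:
      - https://link-to-the-upstream-issue-or-discussion
    applications:
      - name: "affected-application"
        version: "x.y.z"
  metadata:
    kind: prequel
    id: <generated-id>               # placeholder for the rule's unique identifier
    gen: 1
  rule:
    sequence:
      window: 30s
      event:
        source: cre.log.example      # hypothetical data source name
      order:
        - value: first telltale log line
        - regex: second telltale line with (.+) variability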
preq – A Free, Open, CRE-Powered Problem Detector
preq (pronounced “preek”) is a lightweight engine that runs CREs against your applications.
We wanted running preq to be as simple as running grep, while supporting the powerful capabilities embedded in CRE rules.
You can install preq in a number of ways.
Download a standalone binary for Linux, macOS, or Windows. You can pipe data to it or configure data sources.
cat /var/log/syslog | preq
Install the Kubernetes kubectl plugin via krew for easy in-cluster use. If you have the Krew package manager installed, just run: kubectl krew install preq
kubectl preq pg17-postgresql-0
You can schedule preq to run as a job and have it push Slack notifications.
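As a rough sketch of that pattern, a Kubernetes CronJob could run preq on a schedule. The image name, log path, and invocation below are illustrative assumptions rather than an official manifest, and the Slack notification wiring is omitted; see docs.prequel.dev for the supported configuration options.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preq-nightly-scan
spec:
  schedule: "0 2 * * *"                          # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: preq
              image: your-registry/preq:latest   # hypothetical image containing the preq binary
              command: ["/bin/sh", "-c"]
              args:
                # Pipe the logs you care about into preq, just as you would interactively.
                - cat /var/log/app/*.log | preq
              volumeMounts:
                - name: app-logs
                  mountPath: /var/log/app
                  readOnly: true
          volumes:
            - name: app-logs
              hostPath:
                path: /var/log/app               # adjust to wherever your application logs live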
preq ingests streams of events and looks for sequences or combinations that match the patterns defined in the CRE rules. Impressively, preq can crunch through huge volumes of log data very efficiently. It’s written in Go and optimized for streaming throughput, so it can keep up even with high-volume production logs.
When a problem is detected, preq reports a detailed finding including the CRE ID, a human-readable title of the issue, severity level, and timestamps of occurrences. You can output results to the console (in plain or JSON format) or integrate it into your incident pipeline. Each detection links back to the CRE that explains the problem and suggested remediation.
For example, running preq and detecting the RabbitMQ CRE mentioned above would look like this:
Parsing rules done! [1 rules in 0s; 649 rules/s]
Problems detected done! [1 in 1ms; 643/s]
Reading stdin done! [2.88KB in 0s; 2.50MB/s]
Matching lines done! [14 lines in 0s; 12.20K lines/s]
CRE-2024-0007 critical [2 hits @ 2025-03-11T09:00:19-05:00]
Wrote report to preq-report-1743654939.json
Data Privacy
preq is built to be run where your application lives, whether that is on a VM, in a Kubernetes cluster, or on your laptop. In practice, this architecture aligns with Prequel’s philosophy of bringing the detector to the data, not vice versa.
There are certainly benefits to centralizing logging, but we didn’t want that to be a requirement for preq, since many teams are struggling to overcome punitive cloud egress and observability costs.
With CREs and the architecture of preq, reliability intelligence flows into your environment. Application data stays where you want it.
Hasn’t AI Solved This Problem?
In 2025, there is a lot of focus on building the agentic SRE. We are extremely bullish on an AI-assisted future and believe the results will be better than the prior AIOps era, which generally produced more noise than value.
But engineering teams have some obstacles on the AI-adoption journey:
- LLMs excel when they’re fed the right, high‑signal inputs—not unfiltered torrents of logs, metrics, and traces.
- Piping all of your telemetry through a foundation model is cost-prohibitive. Teams need a way to slash token usage and speed up responses.
- Without guidance, models are left to make inferences from noisy data, which often leads to generic or unverified advice, at best, and hallucinations, at worst.
- Data privacy remains a concern as AI adoption grows.
CREs and AI are better together: the combination unlocks powerful possibilities.
CREs are designed both to support engineering teams today and to enable a more accurate and powerful AI-assisted future by encapsulating knowledge, logic, and results in a human- and machine-readable way.
- CREs are designed to give you immediate, high‑confidence detections. They use 100% transparent logic you can audit, tune, version, and unit test.
- CREs become a perfect “training record” for fine‑tuning or RAG pipelines, dramatically improving precision when you do bring LLMs or reasoning models into the loop.
- The results of a CRE can be handed to downstream agents—auto‑file a Jira ticket, trigger a Playbook, or ask the LLM to draft a post mortem.
We’re pioneering these use cases at Prequel.
You don’t have to choose between CREs and AI. By front‑loading detection with CREs, you get value today and lay down the structured ground truth your AI-assisted workflows will thrive on tomorrow.
What It All Means
CREs and preq introduce a few notable innovations to the reliability space, along with conscious trade-offs made for the sake of usability and precision:
Community-Curated Intelligence
Perhaps the biggest innovation is treating reliability issues as shared problems, not something every team must solve in isolation. By providing a common language and mechanism for sharing knowledge and detection rules, CRE enables a crowdsourced approach to reliability.
This stands in contrast to traditional monitoring, which leaves every team to either A) conceive and create problem detectors from scratch or B) underinvest in problem detection and focus on monitoring high-level symptoms.
The trade-off here is that effectiveness grows with community engagement – it shines best when many contribute and update the rule library. We’re betting on the community. This doesn’t work without you.
Initially, coverage might be skewed toward more common systems or well-known issues, which is fine. Over time, the goal is comprehensive coverage that addresses long-tail issues. We’ve already seen contributions spanning various technologies.
Rules vs Anomaly Detection
CREs focus on rule-based detection as opposed to anomaly detection. This means if a CRE triggers, it’s because a specific known bad pattern was observed – yielding high confidence and actionable output (complete with explanation).
The upside is far fewer false positives and less noise, since each rule encodes expert knowledge about a real issue.
What if we can’t catch truly novel, never-before-seen issues (where no rule exists yet)?
In practice, this is acceptable because CREs are meant to complement, not entirely replace, traditional monitoring. But our research shows that over 80% of problems engineers encounter have already been seen, and often solved, by other teams.
By handling the known failure modes, CREs actually reduce noise in your other monitoring systems and you’re less likely to chase symptoms since the root cause is identified deterministically.
But that’s not the end of the story. Looser detection rules can also be written, versioned, and refined over time as new knowledge becomes available.
Custom Rule Syntax
As discussed, we created a new rule syntax instead of using something like SQL or an existing observability query language.
The innovation here is a language that cleanly supports ordering of events, correlation, nested conditions, and negative patterns in a way that’s directly meaningful for reliability scenarios.
By keeping the rule syntax in YAML and relatively constrained, Prequel makes it accessible for any developer to read a rule and understand the failure scenario it represents.
The trade-off is that some complex logic might not be expressible if it falls outside sequences/sets and correlated events. In those cases, we may need to expand the syntax. But so far the approach has proven flexible enough for a wide range of issues, and the simplicity of the syntax is a big reason why contributors can easily add rules.
It’s also tool-friendly: because the rules are declarative, they can be analyzed by a supporting detection engine’s optimizer for improvements (for example, optimizing the evaluation order, or ensuring that only relevant logs are read for a given rule).
Open Source
preq and CRE are 100% open source, Apache-2.0 licensed.
We mention this to be clear – the rules engine, schema, community rules, and tooling are 100% open source and open to all. We hope this fosters a vibrant community and ecosystem around problem detection.
preq is powerful enough for individuals or teams to run periodically or continuously.
We could have offered these as free but proprietary products, but we wanted to give you the full freedom and transparency to trust and extend them.
At the same time, these projects benefit from the continuous contributions and financial support of Prequel.
How Prequel Uses CRE and preq
CREs and preq underpin our commercial product, Prequel, but we want to make them available to all.
Companies with at-scale needs can opt for a commercially supported solution with extra capabilities, such as:
- a distributed detection engine that runs across many nodes and clusters
- a web UI with guided workflows for investigation and collaboration
- deeper integrations (for incident tracking, etc.)
- a control plane for managing the distributed engine
- a larger, proprietary set of CRE rules maintained by the Prequel Reliability Research team (PRRT).
Get Started and Contribute
We invite you to try preq today and become part of the community-driven reliability movement.
Getting started is simple:
- Download preq here: docs.prequel.dev/
- View and star the repos: github.com/prequel-dev/cre and github.com/prequel-dev/preq
The community Slack and GitHub discussions are active if you need help or want to discuss your use cases.
We’re particularly excited to see contributions that expand CRE coverage to new problems – encoding a problem as a CRE means nobody else has to be blindsided by it again. We encourage you to share your feedback and ideas.
We’re excited to see what you do with these projects.