I've finally fixed a mysterious bug that was reported while I was traveling. I thought it would be interesting to share the investigation and resolution. Plus, it's an excuse to do some writing...

Rules-based logic has been part of ktistec for a while, and the rules engine has been running successfully in production for me and others, so I was surprised to learn that the recent introduction of a new rule resulted in spurious notifications. The rule in question is supposed to create a notification when someone replies to one of your posts. Instead, the rule was creating a notification when anyone replied to anything.

rule "create/reply"
  condition activity, IsAddressedTo, actor
  condition CreateActivity, activity, object: object
  condition Object, object, in_reply_to: other
  condition Object, other, attributed_to: actor
  none Notification, owner: actor, activity: activity
  assert Notification, owner: actor, activity: activity
end

I had unit tests for the rule's logic, and the logic seemed correct when I visually inspected the rule again. To top it off, I wasn't able to find a set of steps that reproduced the problem locally.

For various poor reasons, I hadn't tried the rule in production myself. With no other obvious path forward, I deployed it and waited. Sure enough, the bug surfaced—along with a stack trace helpfully correlated with every occurrence of the bug. Jackpot—or so I thought!

Exception: relationship: already exists (Ktistec::Model::Invalid)
  from /workspace/src/framework/model.cr:650:9 in 'save'
  from /workspace/src/framework/model.cr:649:7 in 'save'
  from /workspace/src/rules/content_rules.cr:49:3 in 'assert'
  from /workspace/src/utils/compiler.cr:118:5 in 'assert'
  from /workspace/src/utils/compiler.cr:39:19 in '->'
  from /workspace/lib/school/src/school/rule/rule.cr:38:23 in 'call'
  from /workspace/lib/school/src/school/domain/domain.cr:158:9 in 'run'
  from /workspace/src/rules/content_rules.cr:89:5 in 'run'
  from /workspace/src/controllers/inboxes.cr:248:5 in '->'
  ...

The stack trace was curious for two reasons. The error occurred when creating (asserting) the notification because a notification already existed, which 1) shouldn't be possible because the assertion is preceded by a guard condition that ensures that the notification does not exist! And of course, 2) the notifications were spurious—rule evaluation shouldn't have resulted in a notification in the first place...! Luckily for me, it soon got even weirder.

It's possible to trace rule evaluation. Turning tracing on revealed a surprising mystery: 3) thousands of successful activations of the "create/reply" rule for any given reply. Thousands! That did explain something, though. The first successful activation created the spurious notification. The second activation raised the error (because the notification had just been created). When evaluating rules, all matches are first found, and then all actions are taken, therefore, in this case, the guard condition couldn't have had any effect. The error also terminated any further action processing, so there was only one stack trace.

So, I swapped one curiosity for a mystery—but why were there thousands of matches for that rule? There should have been only one (really, none).

A trace prints information about conditions and the facts that match, along with information about the values that are bound to variables in the process.

Rule create/reply
  Condition School::BinaryPattern(ContentRules::IsAddressedTo, School::Expression, School::Expression)
    Match ContentRules::IsAddressedTo, bindings: activity=#<ActivityPub::Activity::Create iri=...> actor=#<ActivityPub::Actor::Person iri=...> []
  Condition ContentRules::CreateActivity
    Match #<ActivityPub::Activity::Create iri=...>, bindings: object=#<ActivityPub::Object::Note iri=...> [activity=#<ActivityPub::Activity::Create iri=...> actor=#<ActivityPub::Actor::Person iri=...>]
  Condition ContentRules::Object
    Match #<ActivityPub::Object::Note iri=...>, bindings: [activity=#<ActivityPub::Activity::Create iri=...> actor=#<ActivityPub::Actor::Person iri=...> object=#<ActivityPub::Object::Note iri=...>]
  Condition ContentRules::Object
    Match #<ActivityPub::Object::Note iri=...>, bindings: other=#<ActivityPub::Object::Note iri=...> [activity=#<ActivityPub::Activity::Create iri=...> actor=#<ActivityPub::Actor::Person iri=...> object=#<ActivityPub::Object::Note iri=...>]
  ...

Hmmm, that third condition... Nothing is bound to the variable other, as expected, yet the condition is treated as having successfully matched a fact. That's obviously incorrect—other should be the post being replied to. Or the condition should fail and rule evaluation terminate. Instead, since it is not bound but evaluation continues, other is free to be bound as necessary to satisfy the fourth condition. Which is exactly what happened.

The fourth condition matches posts that are attributed to me. Unless this condition is otherwise constrained—say, by a variable bound in the previous condition—there are going to be thousands of matches. Sokath, his eyes opened!

So, here's the fix. It's in the code that binds variables to values. in_reply_to is an association on a post (object) that links it to the post (object) it's a reply to. This code is inside a block that processes every potential match.

  {% for association in associations %}
    if @options.has_key?({{association.id.stringify}})
      if (target = @options[{{association.id.stringify}}]) && (name = target.name?) && !temporary.has_key?(name)
-       if (value = model.{{association.id}}?(include_deleted: true, include_undone: true))
-         temporary[name] = value
-       end
+       next unless (value = model.{{association.id}}?(include_deleted: true, include_undone: true))
+       temporary[name] = value
      end
    end
  {% end %}

The post that's being replied to isn't always cached locally. When it's not, the association returns nil and nothing is bound. That's okay. But previously, the match was also considered to be successful! The solution... treat it as a failure and proceed to the next potential match.

#ktistec #troubleshooting