Likely error in command cache sync in self-updating code?

Question

Likely error in command cache sync in self-updating code?

Lots of related questions < How does x86 cache command sync? > to mention that x86 handles i-cache sync properly in self-editing code. I wrote the following piece of code that turns on and off calling a function from different threads alternating with its execution. I use the compare and replace operation as additional protection, so the modification is atomic. But I get intermittent crashes (SIGSEGV, SIGILL) and the kernel dump analysis makes me suspicious if the processor is trying to execute partially updated instructions. Code and analysis below. Maybe I missed something. Let me know if this happens.

toggle.c

#include <stdio.h>
#include <inttypes.h>
#include <time.h>
#include <pthread.h>
#include <sys/mman.h>
#include <errno.h>
#include <unistd.h>

int active = 1; // Whether the function is toggled on or off
uint8_t* funcAddr = 0; // Address where function call happens which we need to toggle on/off
uint64_t activeSequence = 0; // Byte sequence for toggling on the function CALL
uint64_t deactiveSequence = 0; // NOP byte sequence for toggling off the function CALL

inline int modify_page_permissions(uint8_t* addr) {

  long page_size = sysconf(_SC_PAGESIZE);
  int code = mprotect((void*)(addr - (((uint64_t)addr)%page_size)), page_size,
    PROT_READ | PROT_WRITE | PROT_EXEC);

  if (code) {
    fprintf(stderr, "mprotect was not successfull! code %d\n", code);
    fprintf(stderr, "errno value is : %d\n", errno);
    return 0;
  }

  // If the 8 bytes we need to modify straddles a page boundary make the next page writable too
  if (page_size - ((uint64_t)addr)%page_size < 8) {
    code = mprotect((void*)(addr-((uint64_t)addr)%page_size+ page_size) , page_size,
      PROT_READ | PROT_WRITE | PROT_EXEC);
    if (code) {
      fprintf(stderr, "mprotect was not successfull! code %d\n", code);
      fprintf(stderr, "errno value is : %d\n", errno);
      return 0;;
    }
  }

  return 1;
}

void* add_call(void* param) {

  struct timespec ts;
  ts.tv_sec = 0;
  ts.tv_nsec = 50000;

  while (1) {
    if (!active) {
      if (activeSequence != 0) {
        int status = modify_page_permissions(funcAddr);
        if (!status) {
          return 0;
        }

        uint8_t* start_addr = funcAddr - 8;

        fprintf(stderr, "Activating foo..\n");
        uint64_t res = __sync_val_compare_and_swap((uint64_t*) start_addr,
                                    *((uint64_t*)start_addr), activeSequence);
        active = 1;
      } else {
        fprintf(stderr, "Active sequence not initialized..\n");
      }
    }

    nanosleep(&ts, NULL);
  }

}

int remove_call(uint8_t* addr) {

  if (active) {
    // Remove gets called first before add so we initialize active and deactive state byte sequences during the first call the remove
    if (deactiveSequence == 0) {
      uint64_t sequence =  *((uint64_t*)(addr-8));
      uint64_t mask = 0x0000000000FFFFFF;
      uint64_t deactive = (uint64_t) (sequence & mask);
      mask = 0x9090909090000000; // We NOP 5 bytes of CALL instruction and leave rest of the 3 bytes as it is

      activeSequence = sequence;
      deactiveSequence = deactive |  mask;
      funcAddr = addr;
    }

    int status = modify_page_permissions(addr);
    if (!status) {
      return -1;
    }

    uint8_t* start_addr = addr - 8;

    fprintf(stderr, "Deactivating foo..\n");
    uint64_t res = __sync_val_compare_and_swap((uint64_t*)start_addr,
                                  *((uint64_t*)start_addr), deactiveSequence);
    active = 0;
    // fprintf(stderr, "Result : %p\n", res);
  }
}

int counter = 0;

void foo(int i) {

  // Use the return address to determine where we need to patch foo CALL instruction (5 bytes)
  uint64_t* addr = (uint64_t*)__builtin_extract_return_addr(__builtin_return_address(0));

  fprintf(stderr, "Foo counter : %d\n", counter++);
  remove_call((uint8_t*)addr);
}

// This thread periodically checks if the method is inactive and if so reactivates it
void spawn_add_call_thread() {
  pthread_t tid;
  pthread_create(&tid, NULL, add_call, (void*)NULL);
}

int main() {

  spawn_add_call_thread();

  int i=0;
  for (i=0; i<1000000; i++) {
    // fprintf(stderr, "i : %d..\n", i);
   foo(i);
  }

  fprintf(stderr, "Final count : %d..\n\n\n", counter);
}

Kernel dump analysis

Program terminated with signal 4, Illegal instruction.
#0  0x0000000000400a28 in main () at toggle.c:123
(gdb) info frame
 Stack level 0, frame at 0x7fff7c8ee360:
   rip = 0x400a28 in main (toggle.c:123); saved rip 0x310521ed5d
 source language c.
 Arglist at 0x7fff7c8ee350, args:
 Locals at 0x7fff7c8ee350, Previous frame sp is 0x7fff7c8ee360
 Saved registers:
 rbp at 0x7fff7c8ee350, rip at 0x7fff7c8ee358
(gdb) disas /r 0x400a28,+30
 Dump of assembler code from 0x400a28 to 0x400a46:
  => 0x0000000000400a28 <main+64>:   ff (bad)
     0x0000000000400a29 <main+65>:   ff (bad)
     0x0000000000400a2a <main+66>:   ff eb  ljmpq  *<internal disassembler error>
     0x0000000000400a2c <main+68>:   e7 48  out    %eax,$0x48
 (gdb) disas /r main
  Dump of assembler code for function main:
     0x00000000004009e8 <+0>:    55 push   %rbp
     ...
     0x0000000000400a24 <+60>:   89 c7  mov    %eax,%edi
     0x0000000000400a26 <+62>:   e8 11 ff ff ff callq  0x40093c <foo>
     0x0000000000400a2b <+67>:   eb e7  jmp    0x400a14 <main+44>

So, as you can see, the instruction pointer appears to be located inside the address within the CALL instruction, and the processor appears to be trying to execute this inconsistent instruction, causing an illegal instruction error.

+3

c assembly x86-64 cpu-cache

chamibuddhika 06 nov. 14 at 4:36 am

source to share

2 answers

In 80x86 most calls use a relative offset rather than an absolute address. Basically, this is a "code call here + <offset", not a "code call in <address>".

For 64-bit code, the offset can be 8 bits or 32 bits. It's never 64-bit.

For example, for a two-byte "call with 8-bit move" instruction, you must destroy 6 bytes before the call instruction, the opcode itself, call

and the instruction operand (offset).

In another example, for an instruction with a 5-byte "call with 32-bit move", you must discard 3 bytes before the call instruction, the opcode itself, call

and the instruction operand (offset).

However...

This is not the only way to call. For example, you can call a function using a pointer, where the address of the called code is not included in the instruction at all (but can be in a register or be a variable in memory). Also, the optimization is called "tail call optimization" where call

followed by is ret

replaced with jmp

(probably with some extra stack for passing parameters, clearing caller local variables, etc.).

Essentially; your code is badly broken, you can't cover all possible corner cases, you shouldn't be doing this to start with, and you should probably use a function pointer instead of self-modifying code (which will be faster and easier and portable).

+3

Brendan 06 nov. '14 at 6:49

source to share

JS1 · Accepted Answer · 2014-11-06T06:56:43+0000

I think your problem is that you replaced the 5 byte CALL instruction with 5 1 byte NOPs. Think about what happens when your thread has done 3 of the NOPs and then your main thread decides to replace the CALL command. Your PC stream will be three bytes in the middle of a CALL instruction and therefore will execute an unexpected and probably illegal instruction.

What you need to do is exchange a 5 byte CALL instruction with a 5 byte NOP. You just have to find a multibyte command that does nothing (for example, or does register against itself), and if you need extra bytes, add some prefix bytes like the gs override prefix and the address size override prefix (both which will do nothing) ... By using a 5 byte NOP, your stream is guaranteed to be either in a CALL instruction or after a CALL instruction, but never inside it.

Likely error in command cache sync in self-updating code?

More articles: