Fibonacci in x86 Assembly

Leave a comment

On the 31st of October 2009, sometime in the afternoon, the topic of Fibonacci came up, on the computer science channel for the University of Adelaide. I think the weeks coding activity for first years was to write a recursive Fibonacci algorithm. That said, future post will be why recursive Fibonacci is the wrong way of thinking about the problem. I was in my third year and for fun I decided that instead I would write an iterative one, in p in C. After that was done was still not happy with that as being an end point so I decided it was time to learn a little bit of X86 assembly. So the logical step here was to translate the algorithm I wrote to x86 assembly which goes a little as follows:


.globl _fib

# Fibonacci in x86 assembly by Sean Donno <> 
# Paramter n is in  0x8(esp) store in ebx
# Return via eax (last)
# ecx = n  (loaded from 0x(esp)
# ebx = secondlast
# eax = last  (and the value returned)
    pushl %ebp
    movl  %esp, %ebp
    pushl %ebx
    pushl %ecx
    # Saved Registers
    movl 8(%ebp), %ecx          # ecx = n
    movl $0, %ebx               # ebx = secondlast = 1
    movl $1, %eax               # eax = last = 0
    jmp    fib2
    add  %ebx, %eax             # last = last + secondlast
    neg  %ebx
    add  %eax, %ebx             # secondlast = secondlast + last
    dec  %ecx                   # n = n  1
    cmp  $0, %ecx
    jne  fib1                   # if n != 0 goto fib1
    # Restore registers and return
    popl %ecx
    popl %ebx
    popl %ebp 

Code available on Google Docs at, I really wish WordPress had support this JavaScript syntax highlighter so it provides a a plain version, and can do line numbers, and colouring on the fly, it might and I just haven’t looked hard enough. At the time of writing this i am using hosting at so to the best of my knowledge installing any plugins are not possible.

The comments make more sense when you can compare it to the C version, a while later I followed this up with a version for DLX, for those who haven’t gone to the University of Adelaide (UofA), DLX is a a RISC processor designed by Hennessy and Patterson, who worked on the MIPS architecture, and its a slightly slimmer version of MIPS, that is taught in the Computer Systems course at UofA.  For some comparison the DLX version is 10 instructions, but passes arguments via registers instead of a stack, which is what 7 of the instructions is the x86 version does. There are 15  lines of comments in the traditional ‘function’ commenting style used in the course, which outnumbers the actual lines of code.

SEGP – Fourth Milestone – 1

1 Comment

This post documents the work done for the fourth milestone on the Software Engineering Group Project done at the University of Adelaide in association with ASTC as follow up to the Summer Project I worked on. 

The first part of the follow up work involved improving the tools that I had written during summer for collecting results such that more information could be extracted and formulated. Another thing that was done was the addition of using oprofile to determine which blocks the QEMU spends the most time in for a given test program. This means its easier to determine where time is spent the most and can point people into the right direction of what instructions and blocks to look at getting translatable next. The final thing was the addition of new instructions, which were reverse subtract and exclusive or. The addressing modes for using ASR and LSL were looked at and implementation provided, along with a implementation for the specialised zero case but none of the tests we run use that so those cases are disabled. The big one was the push instruction, out of two variations that were implemented only one works to date.

The plan was to remove the memcpy of LLVM Blocks by allowing qemu to jump out to the location were LLVM generates the native code for a block and then returning to the epilogue for that block, with the hope to remove the locking/blocking. For extended detail on this I will write the post at a later date. (*IF NOT PUT IT HERE AND SEPERATE THIS)

The first task to do was to first of all calculating the total time spent in the critical section

This was added in ext-blk-cmpl.c

#include <time.h>
	uint64_t ext_blk_time_spent_processing = 0;


After the lock is successful in ext_blk_cmpl_replace_block then the current time is recorded.

	struct timeval tv;
	gettimeofday(&tv, NULL);
	uint64_t start_time = (tv.tv_sec * 1000000) + tv.tv_usec;

After the lock is released update the total time spent by adding on the difference in current time with start time along with a print out of the time so far.

	gettimeofday(&tv, NULL);
	ext_blk_time_spent_processing += (tv.tv_sec * 1000000) + tv.tv_usec - start_time;
	printf("Acc Time spent blocked processing the jobs: %llu\n",ext_blk_time_spent_processing);

This has a higher precision, i first did it with just time which is second precision and it just read as zero. The result on my machine from that is the final was 36673 microseconds which is 0.036673 seconds. That is a lot smaller then expected,

Acc Time spent blocked processing the jobs: 36673

Research – Exploring the TranslationBlock structure and epilogue.The structure has a field unit8_t *tc_ptr, this is a pointer to the translated code. (The Address in the code cache). What we are after is weather or not it is read from and stored elsewhere, such as in the epilogue of another block. An quick check with ack-grep for tc_ptr reveals 43 lines in which tc_ptr appears. 28 are pointer references (->). 15 out of the 43 of them were in the ext-blk-cmpl which is created by our work so easily to handle, leaving 28 references in translate-all.c, exec-all.h, exec_c, cpu-exec.c, two of which are in comments.


So we have 28 occurrences of tc_ptr, and of those there are 13 that use pointer referencing. Three of them are uint8_t *tc_ptr so eliminating that brings us to 25, eliminating the two .tc_ptr which are in comments is down to 23.Remove another 2, one for a function with argument tc_ptr and unsigned long tc_ptr; and one is doing the disassembly of the code to the log file so that brings us to 20 possible usages of tc_ptr that may result in a problem. For the exact command I did, the result after wards is not exactly but i just modified it so it was blog friendly.

ack-grep tc_ptr | grep "\.[ch]" | grep -v ext-blk-cmpl | grep -v \*tc_ptr | grep -v "]\.tc" | grep -v "log_disas" | grep -v "unsigned long tc_ptr" | wc -l 

translate-all.c: 102, 173, 174, 185
exec-all.h: 242, 245, 265
exec.c: 735,875, 876, 1179, 1234, 1235, 1243, 1244, 1246
cpu-exec.c: 109, 581, 607, 614


  • translate-all.c
    • 102 – Located in cpu_gen_code, it is read into gen_code_buf, which is passed to tcg_gen_code so this is SAFE, as this is before TCG code is generated, and is only used there.
    • 173/174 – Located in cpu_restore_state and tc_ptr is read from and checked against searched_pc  (Potential problem comment says finds opc index corresponding to search pc)
    • 185 – Located in cpu_restore_state –> tcg_gen_code_search_pc is called with the tc_ptr and searched_pc – tc_ptr (Specially here is the problem)
      • This is the only location where tcg_gen_code_search_pc is called from, this calls tcg_gen_code_common.
  • exec-all.h
    • 242/245 – Insidetb_set_jmp_target, calls tb_set_jmp_target1 with the tc_ptr.
      • This is located in an #elif defined(__arm__) and requires USE_DIRECT_JUMP, which is also defined when arm is used
      • Now  following it down the rabbit hole: static inline void tb_set_jmp_target1(unsigned long jmp_addr, unsigned long addr)
      • Called from
        • tb_reset_jump in exec.c where by it is used to “reset the jump entry ‘n’ of a TB so taht it is not chained to another TB
          • the address is tc_ptr + tb->tb_next_offset[n]
        • tb_add_jump 264  where it is used to patch a native jump address. (Assuming it is used to chain)
      • tb_jmp_offset is 4 elements and is the offset of jump instruction.
    • 265 – Located in tb_add_jump, this is used to patch the native jump address
      • Called in cpu-exec.c line 593
  • exec.c
    • 735 – Used in calculation for tb_set_jmp_target see 242/245 in exec-all.h (above)
    • 875/876 – Located inside tb_gen_code this is where the tc_ptr in the TranslationBlock is first set so no problem.
    • 1179 – Located in tb_free, it frees a translation block, if its the last one used.
    • 1234/1235/1234/1244/1246 – Located in a function called tb_find_pc – Similar problem as the search_pc. This returns the translation block of a given PC. PROBLEM it uses the tc_ptr the thing for searching.
  • cpu-exec.c
    • 109 – Located in cpu_exec_nocache, tcg_qemu_tb_exec is called on the translated code. This executes the code after it is generated, so safe.
    • 581 – Used for the trace debug statement output
    • 607/614 – This reads in the tc_ptr from the translation block and executes it.


Summary of problem functions

  • tb_find_pc – It uses the the Translated Code Pointer to find a Translation Block. Located in exec.c.
  • cpu_restore_state – It uses the Translated Code Pointer. Requires more details, and research.
  • Chaining
    • tb_reset_jump – Called/Defined in exec.c
    • tb_add_jump – Called in cpu-exec.c and defined in exec-all.h