It is a brand new week, and hopefully a very productive one, with some promising progress. Last night I spoke with Andrew, who helped clear up a few things I wasn’t quite sure about. The main one was why, when I print out a block, there are only a few lines (instructions), versus a lot of lines for what I assumed was the same block in the qemu.log for the corresponding out_asm. The reason for the difference is that qemu.log includes the generated code that handles the direct block chaining, whose purpose is to handle the branches at the end of a block. Andrew chose to let QEMU continue to handle this, rather than writing the code needed for the LLVM side to handle it. I was still confused at this point, since I had assumed the order was translated code, then direct block chaining, then the epilogue. After some more explaining, it became clear that direct block chaining is the fancier (more accurate) name for what has been called the epilogue of a block.
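To make sure I have the idea straight: the point of direct block chaining is that a translated block initially exits back to the dispatcher, and once its successor is also translated, the exit jump is patched to go straight there. A toy Python sketch of the concept (not QEMU's actual C implementation; all names here are mine):

```python
# Conceptual sketch of direct block chaining, not QEMU's real code.
# A translated block (TB) ends in a jump that initially falls back to
# the epilogue/dispatcher; once the successor is translated, that exit
# is patched to jump directly to it.

class TranslationBlock:
    def __init__(self, pc):
        self.pc = pc
        self.jmp_target = None  # None = exit via the epilogue

    def chain_to(self, successor):
        # Patch this block's exit jump to the successor block,
        # skipping the round trip through the dispatch loop.
        self.jmp_target = successor

def execute(tb):
    trace = []
    while tb is not None:
        trace.append(tb.pc)
        tb = tb.jmp_target  # None means exit via the epilogue
    return trace
```

So before chaining, executing a block returns control after one block; after `chain_to`, execution flows directly into the successor.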

There are two things I intend to get working on this morning. The first is adding an ext_asm option to QEMU that prints the externally generated block. Currently, the code I wrote on Friday only prints the actual translated code, excluding the no-ops and the epilogue, so I am going to try to make it capable of printing those as well. The second, which Andrew suggested could help improve performance, is to move the suitability tests for a block (the initial tests run on a block to see if the external compiler, LLVM, can translate it) into the LLVM thread. This would mean every block is queued onto the ring buffer, rather than only the ones that pass the suitability test. However, the suitability tests would still be used to allow for early exits before the LLVM conversion takes place. In theory this should make it faster, since it offloads some work from the main QEMU thread onto the LLVM thread. Putting every block on the buffer rather than only the suitable ones does not place much extra strain on it, considering that once almost all the instructions are implemented, nearly all blocks would end up on the buffer anyway. Andrew did warn me that replacing his implementation of the ring buffer with a nicer one would be advisable.
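The proposed split can be sketched as a simple producer/consumer, using Python's `queue.Queue` as a stand-in for the ring buffer and a trivial placeholder for the suitability test (all names here are hypothetical, not QEMU's):

```python
# Sketch of moving the suitability test into the LLVM thread.
# queue.Queue stands in for the ring buffer; is_suitable() is a
# placeholder for the real per-instruction checks.

import queue
import threading

def is_suitable(block):
    # Placeholder: the real tests check whether every instruction in
    # the block is one the LLVM backend knows how to translate.
    return all(insn != "unsupported" for insn in block)

def llvm_worker(ring, compiled, stop):
    while not stop.is_set() or not ring.empty():
        try:
            block = ring.get(timeout=0.1)
        except queue.Empty:
            continue
        if is_suitable(block):  # early exit before LLVM conversion
            compiled.append(block)
        ring.task_done()

ring = queue.Queue()
compiled = []
stop = threading.Event()
t = threading.Thread(target=llvm_worker, args=(ring, compiled, stop))
t.start()

# Main (QEMU) thread: queue every block, suitable or not.
ring.put(["add", "ldr"])
ring.put(["unsupported"])
ring.join()
stop.set()
t.join()
```

The filtering cost now sits entirely in the worker thread; the main thread only pays for the enqueue.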

Another speed-up would be inserting a jump to skip over the no-op slide.
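The idea is that when the shorter LLVM code is patched into the slot the TCG code occupied, the leftover space is padded with NOPs, and the CPU currently has to execute all of them. A short jump placed right after the real code would skip the slide. A hedged sketch, assuming single-byte x86-64 NOPs (0x90) and the two-byte `jmp rel8` encoding (0xEB, imm8):

```python
# Sketch: pad LLVM code out to the slot size, but lead the padding
# with a short jump (jmp rel8) so the NOP slide is never executed.

NOP = 0x90

def pad_with_jump(code, slot_size):
    pad = slot_size - len(code)
    if pad == 0:
        return bytes(code)
    if pad >= 2 and pad - 2 <= 127:
        # jmp rel8 skips the remaining pad-2 NOP bytes.
        return bytes(code) + bytes([0xEB, pad - 2]) + bytes([NOP]) * (pad - 2)
    # Slide too short (or too long for rel8): fall back to plain NOPs.
    return bytes(code) + bytes([NOP]) * pad
```

For the 11-byte block in a 33-byte slot below, that would replace 22 executed NOPs with one two-byte jump.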

When outputting the LLVM versus the ARM, I decided to have a quick look into correctness, because the TCG output looked crazy compared to the LLVM output, which was really a one-liner (a single instruction).

The following code is from h264ref in the SPEC benchmarks, replacing input assembly block 0x8c20, where the original TCG code is 33 bytes and the LLVM code is 11 bytes.

ARM Input

0x00008c20:  add    ip, pc, #0    ; 0x0
0x00008c24:  add    ip, ip, #950272    ; 0xe8000
0x00008c28:  ldr    pc, [ip, #1028]!


TCG Output

xor r12d, r12d
mov r15d, 0x8c28
add r15d, r12d
mov [r14+0x30], r15d
mov r12d, 0xe8000
mov r15d, [r14+0x30]
add r15d, r12d
mov [r14+0x30], r15d


LLVM Output

mov rdi, r14
mov dword [rdi+0x30], 0xf0c28
nop
nop


Now here’s what I determined it was doing. I will say “line” instead of “instruction”, as it is shorter and makes it easier to follow the output above. The first line of the TCG output XORs a register with itself, which has the effect of zeroing it, so r12d holds zero. The next line puts the value 0x8c28 into r15d. That value is then saved to memory. Next, 0xe8000 is put into r12d, the value saved earlier is loaded back into r15d, and the two are added and saved, storing the result of 0xe8000 + 0x8c28, which is 0xf0c28.

The LLVM output simply moves r14 into rdi first, rather than just using r14 in the next instruction, and then saves 0xf0c28 to the memory location. So it is clearly faster: the add and mov instructions, including a redundant memory load and store, are folded into one instruction. So that block is correct. The only improvements to the above would be if the second line used r14 to begin with, and if the no-op slide were jumped over, since it is rather long in this case. The next part is to produce statistics about the number of bytes saved or wasted, as well as instruction usage, and to analyse the potential for new blocks being picked up when a new instruction is implemented.
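The constant folding is easy to sanity-check: the TCG sequence computes the sum at run time, while LLVM has folded it into a single store of the result.

```python
# The TCG sequence computes the branch target at run time:
#   r15d = 0x8c28; r15d += 0xe8000; store r15d at [r14+0x30]
# LLVM constant-folds the whole thing into one store of the result.
target = 0x8C28 + 0xE8000
print(hex(target))  # → 0xf0c28
```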

To summarise the results, the number for each run is the percentage of blocks generated by LLVM that are smaller than their TCG versions. It should be noted, as a reminder, that even when LLVM produces a larger block, that block is thrown away, as it is too large to fit into the space where the TCG code already is.
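That replacement rule amounts to a simple size check; a minimal sketch, assuming hypothetical byte buffers for the two versions of a block and single-byte x86-64 NOP padding:

```python
# Hypothetical sketch of the replacement rule: LLVM code can only be
# patched in place of the TCG code if it fits in the existing slot.
# Shorter code is padded with NOPs; larger code is discarded.

def patch_block(tcg_bytes, llvm_bytes, nop=b"\x90"):
    if len(llvm_bytes) > len(tcg_bytes):
        return None  # too large: throw the LLVM block away
    pad = len(tcg_bytes) - len(llvm_bytes)
    return llvm_bytes + nop * pad
```

For the example above, the 11-byte LLVM block fits the 33-byte TCG slot with 22 bytes of NOP padding.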

The top 10 instructions used per case are as follows, listed by frequency: h264ref uses ldr, cmp, add, mov, str, beq, bne, sub, bl, bx; the sjeng benchmark (chess) makes use of ldr, str, add, cmp, mov, beq, bne, lsl, b.
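These frequency lists can be produced by tallying the mnemonic column of the input assembly log. A minimal sketch, assuming log lines shaped like the ARM listing above (real in_asm output may differ slightly):

```python
# Minimal sketch: count instruction mnemonics in lines like
#   0x00008c20:  add    ip, pc, #0
# i.e. the format of the ARM listing above.

from collections import Counter

def top_mnemonics(log_lines, n=10):
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0].endswith(":"):
            counts[parts[1]] += 1
    return [m for m, _ in counts.most_common(n)]

sample = [
    "0x00008c20:  add    ip, pc, #0",
    "0x00008c24:  add    ip, ip, #950272",
    "0x00008c28:  ldr    pc, [ip, #1028]!",
]
print(top_mnemonics(sample))  # → ['add', 'ldr']
```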

Results


            sjeng using llvm          sjeng using qemu 0.10.6
Run    Real     User    Sys       Real     User    Sys
1      2:24.07  143.14  1.10      2:01.29  119.76  1.03
2      2:03.62  122.76  1.04      2:01.29  119.76  1.03
3      2:03.64  122.79  1.03      2:00.97  119.85  1.05
4      2:03.79  122.79  1.07      2:00.80  119.74  0.99
5      2:03.44  122.58  1.03      2:00.94  119.75  1.03
Avg    2:07.71  126.81  1.05      2:01.06  119.77  1.03
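The Avg row can be reproduced directly from the five runs; for example, the llvm-enabled column:

```python
# Check the averages in the table: the llvm "User" column, and the
# llvm "Real" column with m:ss.cc times converted to seconds.

llvm_user = [143.14, 122.76, 122.79, 122.79, 122.58]
avg_user = sum(llvm_user) / len(llvm_user)
print(round(avg_user, 2))  # → 126.81

# Real times in seconds: 2:24.07 -> 144.07, and so on.
llvm_real = [144.07, 123.62, 123.64, 123.79, 123.44]
avg_real = sum(llvm_real) / len(llvm_real)
minutes, seconds = divmod(avg_real, 60)
print(f"{int(minutes)}:{seconds:05.2f}")  # → 2:07.71
```

Note the first llvm run is a clear outlier (2:24.07 versus roughly 2:03 for the rest), which drags the average up; the warm runs are much closer to the plain qemu 0.10.6 times.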