Summer Research – Day 12

Leave a comment

Start of a new day, had a chat with Brad. He wants me to really get my head around how the block chaining and execution works. He mentioned the big key word, object loading, the idea of which is you compile a library and so thought from arm to x86\_64 then load it in and use the already optimized version. So what would need to be done is writing the output.

  • How does Block Chaining Work?
  • Can a block link to itself if it is a loop?
  • Are there many blocks that can jump to another?
  • Are there concurrency limitations to this? Once the chaining is setup the first time is it modified?
  • What is the structure of the cache?

An aside to finish up some of the benchmarks and tests from yesterday. There is one other result I haven’t calculated and documented yet, which is what is the raw saving is in bytes, so in total for the two tests, how many bytes for the code are reduced by when using llvm. I also believe it is possible that I could find the total saved, by finding out how many times each of the llvm replaced blocks are executed. First i had to finish working on assembling the results from yesterday for presentation, which are presented in yesterdays entry. I also decided to start a separate document for holding all the results. I intend to look into an easy way for backing up all the original source of the results, (the numbers themselves not the whole framework).

That is enough work on that for now, I will get back to producing the new data I want which is the actual total size of the externally translated blocks verses the old size.

For the h264ref test, the generated functions are 526 bytes smaller (number is averaged over 3 runs) in total, that is for all generated blocks where the llvm are smaller than the tcg, the amount of which they are smaller by is summed. The same method for the sjeng benchmark. It should be noted that this is purely based on the generated functions and not based on the number of times a block is executed, and if the block has been replaced before it’s execution lifetime as expired.

Bytes saved
Test Run 1 Run 2 Run 3 Average Corrected
h264ref 496 501 581 526 1312
Sjeng 2824 2587 3015 2808 6417

I started a test run of qemu with sjeng and h264ref tests with the options to output the execution log and the out\_asm. Which brings up something for some spare time, is to compile another test or two from the spec benchmarks, and setup them up for my testing system.

One thing that just came across my mind is what happens in the case where, you have a translated block (native code) that is based on the ARM code (target code). What happen when you jump midway into a ‘block’, so say the original arm code was 5 instructions long, and you have just reached something that wants to jump to the 3rd instruction in that block. I would assume that it can’t just do it because the native code may not even match to the arm, especially since stuff is performed in such a way that it is fast rather than exact preservation. The most plausible idea I can think of which would be reasonable simple is that you then jump to that address and and do the procedure of creating a new translation block with the start being that block.

Brad stopped by to see how things were going, and I showed him the tables of results i had been working on all morning and he pointed out that the bytes saved and translated code size per run should be deterministic, which prompted me to remember that I overlooked catching all the generated and replacing statements, so spent the next twenty five minutes fixing the script to handle all cases as well as perform the fix ups so that the replacing/generated stuff would be removed as far as the rest of the parser was concerned.

This the table above with bytes saved is null and void, the correct value is 6417 bytes for sjeng benchmark and 1312 for the h264ref. So wasted quite a bit of time preparing those results and presenting them, I shall leave the originals there and have amended the tables with a corrected column.

Brad had a very good suggestion on what additional measures I should perform, that is to calculate the total number of blocks seen, the number that are eligible for replacing, the number of blocks replaced and the hit rate/history. Hit rate means the total of times each block was executed, if replaced the number of times it was hit before vs number after. A extra thing would be compare the number hit after for those that were eligible but didn’t get replaced due to larger code size.

Today I found out why it is best to use se xreadlines instead of readlines, specially when your files are known to grow very. In the version of python I am using readlines, reads all the lines in a file and compiles a list, where as xreadline is more of a fake list for iterating. As a result it attempted to load the 22GB log file into memory first, on a machine with only 4GiB. Hopefully this information is sane enough that I will be able to minimize it down to just give me smaller and useable.

The next step along with calculating the information Brad suggested is to improve the number of arm instructions picked up by my script as implemented in LLVM and then run tests using it, to experiment, how many new blocks can be handled by LLVM if we implement a given instruction.

Generated all the benchmarks in CPU2006 that use C, next is to setting up a few of them (about 2) with the test runner.

Summer Project – Day 11

Leave a comment

It is a brand new week, and hopefully a very productive one at that, with some promising progress. Last night, I spoke with Andrew, who helped clear up a few things that I wasn’t quite sure about. The main thing was, how come when I do my thing print out a a block that there is it that there is only a few lines (instructions) verses a lot of lines for what I assumed was a the block in the qemu.log for the corresponding out_asm. The reason for the difference is the is the qemu.log includes the generated code to handle the direct block chaining stuff, which purpose is to handles the branches at the end of a block. He chose to let QEMU handle to continue to handle this, rather then writing the necessary code to allow LLVM part to be able to handle it. I was still confused, at this point since, I had assumed it goes from translated code, direct block chaining then the epilogue. But after some more explaining it was clearer that the thing with the direct block chaining is the fancier name (more accurate name) for what has been called the epilogue of a block.

Two things that I intend to get working on this morning is adding a ext_asm, option to qemu, that will print the externally generated block. Currently, the code I wrote on Friday only prints the actual translated code, excluding the no-ops and epilogue, so first, I am going to try to make it capable, of doing that. The next thing, Andrew suggested that could possibly be done to help improve performance, is to move the suitability tests for a block (the initial tests that are, run on a block to see if the external compiler (LLVM) can translate it), are moved into the LLVM thread. This would mean every block is queued onto the ring buffer, rather than the ones the suitability test picked up. However the suitability tests will still be used to allow for earlier exits, before the LLVM conversion takes place. This should in theory, make it faster, since your offloading some work from the qemu (main) thread, to the LLVM thread. This is not placing any strain on the buffer, by having every block verses only ones suitable, since considering if almost all the instructions are implemented, then all the blocks would be, on the buffer anyway. Andrew did warn me that replacing his implementation of the ring buffer with a nicer one, would be advisable.

Another speed up would be inserting a jump to skip over the no-op slide.

When outputting the llvm verses arm, I decided to have a quick look into the correctness because the TCG looked crazy compared to the llvm which was really a one liner (instruction).

The following code is for the h264ref in the SPEC benchmarks, for replacing input assembly block 0x8c20, where the original tcg is 33 bytes, and the llvm is 11 bytes of code.

ARM Input

0x00008c20:  add    ip, pc, #0    ; 0x0
0x00008c24:  add    ip, ip, #950272    ; 0xe8000
0x00008c28:  ldr    pc, [ip, #1028]!


TCG Output

xor r12d, r12d
mov r15d, 0x8c28
add r15d, r12d
mov [r14+0x30], r15d
mov r12d, 0xe8000
mov r15d, [r14+0x30]
add r15d, r12d
mov [r14+0x30], r15d


LLVM Output

mov rdi, r14
mov dword [rdi+0x30], 0xf0c28


Now here’s what I determined it was doing and I will say line instead of instruction as it is shorter to say and easier to get an idea of what I am referring to in the output above. First line of tcg, is a xor a register with itself, this has the outcome of zeroing the register, so r12d has a value of zero. The next line, is putting the value 0x8c28 into r15d. The value is then saved to memory, Next 0xe8000 is put into r12d. the value saved before is loaded back into r15d, the two are then added and saved. Saving the result of 0xe8000 + 0x8c28. Which the result of is 0xf0c28. The LLVM is simply that first it moves r14 into rdi, rather than just using r14 in the next, instruction. It then saves 0xf0c28 to the memory location. So is clearly faster, your making add and mov instructions including a redundant memory load and save into one instruction. So this that block is correct. The only improvement with the above is if the second line used r14 to begin with, and then if the no-op slide was jumped over since it is rather long in this case. The next part is to produce statistics about the number of bytes, saved or wasted as well as the instruction usage, and analysis for the potential of new blocks added due to implementing a new instruction.

To summarise the results above the numbers for each run is the percentage, for the number of blocks generated by LLVM that are smaller than TCG. It should be noted as a reminder that even tho it produces larger blocks, that block will be thrown away as it is too large to fit into the space where TCG code already is.

The top 10 instructions used per case is as follows and listed by frequency, h26ref uses ldr, cmp add, mov, str, beq, bne, sub, bl, bx. The sjeng benchmark (chess) makes use of ldr, str,add,cmp, mov, beq, bne, lsl, b.



sjeng using llvm

sjeng using qemu 0.10.6

Run Real User Sys Real User Sys
1 2:24.07 143.14 1.10 2:01.29 119.76 1.03
2 2:03.62 122.76 1.04 2:01.29 119.76 1.03
3 2:03.64 122.79 1.03 2:00.97 119.85 1.05
4 2:03.79 122.79 1.07 2:00.80 119.74 0.99
5 2:03.44 122.58 1.03 2:00.94 119.75 1.03
Avg 2:07.71 126.81 1.05 2:01.06 119.77 1.03

Summer Project – Day 10

Leave a comment

15th of January

Forgot my power pack for my laptop, so quickly booted it up and transferred the SPEC benchmark to it, and committed any outstanding work.

Talk to brad, and mentioned about the pointer switching, for being able to hot swap the code, by generating the code in a separate memory location, then automatically change a jump to that memory location and presto it would now be in that loop. First it would need to be explored weather this is actually case and that an scenario where it keeps running the same block, and does not get a chance to replace it, is occurring.

I looked over the results of the h264ref benchmark briefly, the times where

	Running test case: 1 h264ref.arm qemu-0.10.6 results in Real: 9:41.46 | User: 576.08 | Sys: 4.91
	Running test case: 1 h264ref.arm llvm-org results in Real: 9:56.49 | User: 591.69 | Sys: 4.73

I will run the instruction usage script over it to find the most used ARM instructions.

I proceeded to extend the instrument script to help with identifying, blocks that have been replaced. It really needs more test and to handle and fix the concurrency case, where by the replacing statement gets printed in the middle of the output of an instruction.

Next I added a simple way to link blocks together in the script, to get the previous block. so an OUT block can get the IN block, meaning the in block would get the out block that ran before. This needs some improvement, since it is one way. Wish to extend it so you can traversal just in and outs, and then only in to out, out to in, but not in to another out that does not belong.

Got the analysis working for going through in blocks and doing the check to see if it is only using instructions that are implemented, followed, by checking if the output block, is at the address of a block that was replaced. This is where I needed the double link, because at first I was running the address comparison on the in block but LLVM stuff replaces out blocks, next I used the link but forgot it points to the out block before the in not the one that generated from itself.

Ideas expand it to compare blocks, so size, first and last address used.

The current list of things to do was, look into how the qemu cache works, is it just a farm of pointers, or it it a graph structure, and the other was get Qemu to output the the code generated by LLVM. This is related to above, as if you could access the cache system you could print out all the blocks in there.

The format for Generated block for "(00011) generated block for 0x40095540 @ 0x60943cc0; 17:7" is a count or index in to the ring buffer, the address of the ARM code @ address of the AMD64. The remainder two numbers I am still not sure about the first is the size from the top of the translation block to the epilogue, the other the size of the external translated ‘block’.

Successfully managed to get what I am quite certain to be the LLVM version of a block. The thing that was defeating me was I was I was changing code but no effect took place since I was running Andrew’s version of it rather then my own. I have looked into getting it to print on replacement the TCG version before its overwritten then then LLVM. I currently have some of the numbers off so it seems to produce more then necessary, however some of the fields make more sense now after I started typing this up and reading/looking into some other things for the exact details.

Brad stopped by and suggested looking at the noop slide, and see what calculations and magic numbers are used for that along with other variables in the area. Granted I had tried that but quite sure I was just using the wrong combination and using them incorrectly. His main point was to try to get the overall look at the big pciture, and not just a small little block. Which is what I am heading towards, once i can print LLVM vs. TCG blocks, with the size can produce stats such as out of all the replacements how many are better, worst and about even.


End of week 2.

Summer Project – Day 9

Leave a comment

14th of January

Discovered the download for the benchmarks had lost the connection, at 1 am, so won’t be completed until tomorrow. Stared by recovering it and making a start of tackling how all this llvm ties in with qemu.

To aid with this I decided to write down the questions i had (typed out) in my case, and proceed to answer them myself by looking through the code, as well as looking at Andrew’s diary, then if I couldn’t really look at the code or if it was too obscure like something that Andrew did a different way, ask him why.

saw Brad who also liked this idea and really wanted to push that I get my head around how the low level machine stuff works as well as qemu, so for the upcoming meeting we really have something to push for.


The following contains questions that I asked myself mainly today, and then filled in the answer if I already had looked at it, followed by extra research to ensure it was correct.

Q: Where does the transformation to LLVM take place?

A: The approximately named llm_translate, inside llvm_disas_arm_ans. The instruction is decoded, then the LLVM Instructions are Built, using LLVMBuild(Add|Sub), LLVMBuildStore, Additional: As ARM instructions are 32-bit the instruction are passed as an argument to the function.

Q: Where does the external compiler jobs get queued?

A: Well the actual queuing takes place in ext_blk_cmpl_new_jobs, which has the appropriate mutex/concurrency locks in place, and deals with the queue. This is not however where the determination of if we will use llvm takes place. That will be answered in the next question instead.

Q: Where does the determination for are we going to use LLVM take place?

A: Tracing where ext_blk_cmpl_new_jobs is called from shows its used in exec.c in tb_gen_code. The call of this function is determined if the translation block has the llvmable field true, which is set in target-arm/translate.c. llvm_qemu_insn_test – This takes a instruction (32-bit) and decodes the condition, category, flags, opcodes, similar to how the actual translate stuff then shifts it and sets a bit.

Q: What is the disassembler (udis86) used for?

A: It is used to calcuate the size of a function by being pointed to the start of a block, and counting until, it gets to the return instruction. llvm_translate@x86_64_function_size, as well as a helper function, unused at the moment designed for printing out the LLVM version of the block

Q: Where does the replacement of LLVM with TCG generated code take place?

A: This is found in ext-blk-cmpl,c in ext_blk_cmpl function right at the bottom. It does this by putting a mov %14, %rdi in the start of the function. This function is called by external block complier process job, function in the same file, which is what setups the LLVM Moudle. Passes and JIT Compiler and performs, the optimizations, it is also where the mutex clocks are. To sum up this this function is created as a thread in the linux-user/main.c Notes: Instruction buffer size is 512 bytes.

Q: What is the "Enable this if theres trouble with LLVM generating blocks", code do?

A: Performs a LLVMVerifyModule, prints an error then DisposeMessage (frees the error string) LLVMVerifyMoudle – From the sounds of the Analysis.h, it performs checking on a module, then gives the caller, a string that contains a description of the problems that a human can understand.

Q: Why aren’t the replaced blocks being printed out via the out_asm option?

Need to asked Andrew specifically, because he did write the method block_dias in llvm_translate to get the blocks. so will more then likely know why they not getting printed.

Q: What are the instructions already ARM -> LLVM instructions that are already implemented?

A: From translate.c and checked with Andrew. DATA_PROC with ADD,AND, ORR, SUB, MOV, and LD_ST_IM/LD_ST_REG: STUW and LDUW with PRE and POST.

Shift with register – pitfall

A special point to note, that Andrew made special note to since, I found some of the code related to it commented out, is that arm, has a special case on what shifting by 0 means, for arm it means shift by 31. So this means it needs conditional logic for when the shift is from a register to check if the operand is 0.

Statistic Script Improvements

Worked on improving the instruct statistic script by adding a new method, plan to improve it further to make life easier by introducing a kinda pipe line system, so the Reading is a pass, the Instruction Usage Count is another pass, then some analysis on that is another

Summer Project – Day 8

Leave a comment

13th of January

Due to the previous heat wave, I slept past my alarm and didn’t get in to uni until about 11 am. There are a couple of hours missing, since I didn’t want to take a break writing it up, and never got around to it that night or the next day.

An aside, I got close to having a working gentoo arm system setup on the machine, could run busybox and bzip2, I produced a 1.9GiB file from /dev/urandom, so the compressor would be put through its paces trying its best to compress random data. At first i mucked up and accidentally used /dev/zero by accident which results in, not a lot of work load. The test lasted for approximately 70 minutes, with the Augmented Qemu taking about 5 minutes slower. Both were run against the same data.

The specific results for this run was:

	[donno@qcore1 Tests]$ time qemu-arm -L /opt/gentoo64/usr/armv7a-softfloat-linux-gnueabi/ bzip2 -9 compressimage.bin
	real	73m35.599s,	user	72m57.410s	sys	0m9.324s
	[donno@qcore1 Tests]$ time ~/Research/qemu-async-llvm/arm-linux-user/qemu-arm -d in_asm \
		-L /opt/gentoo64/usr/armv7a-softfloat-linux-gnueabi/ bzip2 -9  compressimage.llvm.bin
	cpu_loop called, instantiating LLVM thread...
	real	78m31.286s	user	77m41.695s	sys	0m8.332s

Andrew recommended cscope which is kinda like grep but with C/C++ specifics, so I installed that on both machines, but didn’t set it up.

Next I setup a branch of ‘qemu-async-llvm’ on, where Joel, had already created a project, and uploaded Andrew’s work so far.

Ended the day with installing three different C/C++ for Linux, to see which will provide a better environment then terminal for make and scripts and gedit and started the download for the spec benchmarks going only to have it fail at 1 am, whilst i was sleeping.

Summer Project – Day 7

Leave a comment

12th of January

Started the day by working on cleaning up the runTests script, to remove the old for loop per (test, qemu-version) since testRunN takes that. Granted currently there is almost more lines of comments to lines of actual code so it kind of looks messy still.

Thinking for the future, think I will extend it to support, running comparison tests (well both llvm version and 0.10.6), and then single version, as well as run tests individually. As well as support for a post test action, such as invoking an analysis script.

Some ideas for analysis scripts are:

  1. Instruction Usage – count the number of each instruction used in the input asm
  2. Compare the Instruction Count – compare number of resulting instructions in qemu and qemu with async
  3. Produce Time Plot – produce a graph of the runtime of the tests with the different versions of qemu.

Made a start on the instruction usage script, made a simple finite state machine, with some added improvements to get the lines of code/ duplicated, so can successfully parse a Qemu log file. Next will get it to strip out the instructions mnemonics.

Brad stopped in to the labs for a brief chat, for ideas and to check on the progress. He mentioned that for by the up coming meeting I should have been able to replicate the results from Andrew, and learn about how Qemu instruction cache work and how it is stored, as well as an understanding of how the chaining process work, and raised the question of does it only work on unconditional statements or can it do handle conditional branches.

Instruction Usage

11535 ldr
2912 str
2892 add
2815 cmp
2793 mov
1627 sub
1139 beq
1086 bne
1025 lsl

Ends up the lab machine, that I had my eye on turned out to only have 32-bit edition of Fedora installed, which makes it unsuitable for this project, Brad and Yuval was going to sorting out getting it re-imaged with 64-bit. In the meantime, its still down to using my laptop for now, and it has got me this far already.

Looking over the code, and discussing it with Andrew, who explained over a few of the finer points. The first thing, that sparked me was what was the requirement for the the x86_64 disassembler used for, traced down a use for it in llvm_translate for x86_64_function_size and x86_64_block_disas. Looking over the function_size as Andrew first prompted what it is doing is quite simple and ingenious.

I have the machine, I decided to spare Yuval the time, and decided since I would be using it I would install Fedora 12 on it myself. Once up created, an account, and set-up the proxy. To do this I choose first set-up a proxy server on my laptop which then forwards to the university proxy. The main reason for this is it means my user name and password for the university proxy, is stored on my machine. Then set-up yum and Gnome to use my machine as the proxy. Success, I could then install packages and upgrade the system.

At the same time as setting up the test and development environment for AQ, I setup the Gentoo portage.

	# For /etc/make.conf so it uses the internode mirrors for free traffic at uni

	mkdir /opt/gentoo64
	cd /opt/gentoo64
	tar xvfj stage3-amd64-20091231.tar.bz2

	chroot .
	# modify the /etc/make.conf to be correct mirrors
	emerge-webrsync # since this one works with http proxy by default emerge --sync will use rsync so a pain
	emerge -av crossdev layman git subversion lzma-utils
	emerge -av crossdev-wrappers # this on amd64 is masked

	crossdev -t armv7a-softfloat-linux-gnueabi

	Not the guide I used the first time on my laptop but close
	I used the guide on the gentoo website.

qemu-async-llvm Compiling

    # prereq: llvm 2.5
    tar xvfz llvm-2.5.tar.gz
    cd llvm-2.5
    ./configure --enable-optimized
    make -j5
    sudo make install

    tar xvfz udis86-1.7.tar.gz
    cd udis86-1.7
    sudo make install

    # checkout sources and configure/make
    git clone
    cd qemu-async-llvm
    ./configure --target-list=arm-linux-user

For future – should only require llvm on x86_64 Linux (that is what its been developed on), two the dependencies for llvm and udis should be checked and able to disable our llvm support.

Summer Project – Day 6

Leave a comment

Week 2

11th of January

Started the day with a brief chat with Brad, who mentioned that the if all goes well the money will be sorted by this week, and also reminded to chase up Yuval again with respect to access to the quad core machine.

I spent the morning reading over the output assembly for the qemu and qemu with llvm, and realised that the assembly output from the llvm version was a lot longer then the standard qemu version. I looked into that for a while at first I thought I had labelled to files wrong way, then saw that the appropriate one ad the generated and replacing statements that only Andrew’s LLVM version has. Next I decided to check what the benchmark timing, and yes, there was that relation of instruction count to run time, the standard qemu one ran faster. I was confused as I was sure, Andrew had seen benefits. At this time I could not contact him, and wasn’t until late in the afternoon he confirmed that this was typical, and that the table in the thesis confirmed this.

In the meantime, there were two other tests I wanted to run (well one new test and one observation from the results of ones I had perform. The first was if you run qemu-arm multiple times will it indeed always produce the same output assembly, similarly with the llvm version. My thoughts would be yes for standard, but maybe not for llvm. Since qemu would always see the same code and apply the same template, where as llvm it depends on when the optimization thread is run, and when its replaced.

The results from this is standard qemu was how I thought, the only difference in assembly output was that the memory addresses were different. The llvm one was the same, the output code was the same, the added difference was where the lines saying generated and replacing were. The Other test I performed was to make sure the reason i was getting the results above (llvm version being slower), was not due to improvements in qemu-arm since 0.10.6 which qemu-async-llvm is based. So grabbed the source code for 0.10.6 compiled it and conducted the same tests. The results from this was the same as if you run the tests multiple times with qemu (the memory addresses were the only thing that different).

As additional step for further exploring and possibly more test cases, I setup a stage3 Gentoo for arm (extracted the system) and attempted to use that as the base path (-L) and binaries from that and the existing test binaries. Ran into problems with that it would just sit there with no output and seem to be doing something but not show anything. I used debootstrap (which is similar to above) but it pulls packages from the the internet. Same problem with that. I then stumbled on the useful strace tool, strace is a tool for tracing system calls and signals, so it can show you what system calls a given program is making.

Using this, it showed that qemu-arm was trying to open /opt/arm/dev/fd/3/dev/fd/3/dev/fd/3/dev/fd/3/….., and recursively going up, searching for files. Brief search on the internet, found this was a common problem, one that had since apparently been fixed, since 0.10.x, however I think this scenario was not 100 percent the same to the one described in the bug report (, to solve i just deleted the dev directory, and presto it worked. It does however seem weird behaviour that its searching like that, but anyway, it worked. That fixed both the gentoo and Debian version.

debpackget – Script

The last thing I worked on was more a side-project of mine, that had an application for this project. A script for grabbing a package for Debian, that didn’t rely fully on dpkg/apt. Quite simply it grabs the page\$DISTRO/\$ARCH/\$NAME/download then finds the download link from the mirror we want, so since this is free quota at the University. I then set it up to wget it. Thinking of it as I write this a bash script could of done, by wgeting the download page to tmp file, greping/awking the temp file for the download link, followed by another link. However I wrote it in python so it can be expanded, will probably drop the wget dependency (uses system(.)). Then I coupled this with a call to run dpkg-deb -x \$filename \$outputpath, so it can extract the deb file into the appropriate location. This is not the right place for further discussing this but, it the best place for me to write it down and have it tied. New features:

  • Grab the sources for packages, and extract them (like apt-get source packagename), but not limited to a Debian host
  • Command line options for specifying the destination path, arch and distrbution
  • Switch wget usage to use pythons file downloaded. (Done)

Older Entries