Start of a new day; had a chat with Brad. He wants me to really get my head around how block chaining and execution work. He mentioned the big key word, object loading, the idea of which is that you compile a library and so forth from ARM to x86\_64, then load it in and use the already optimized version. So what would need to be done is writing out the translated output.

  • How does block chaining work?
  • Can a block link to itself if it is a loop?
  • Are there many blocks that can jump to another?
  • Are there concurrency limitations to this? Once the chaining is set up the first time, is it modified?
  • What is the structure of the cache?

An aside to finish up some of the benchmarks and tests from yesterday. There is one other result I haven’t calculated and documented yet: the raw saving in bytes, that is, in total for the two tests, how many bytes the generated code is reduced by when using LLVM. I also believe I could find the total saved at run time by finding out how many times each of the LLVM-replaced blocks is executed. First I had to finish assembling the results from yesterday for presentation, which are presented in yesterday’s entry. I also decided to start a separate document for holding all the results. I intend to look into an easy way of backing up the original source of the results (the numbers themselves, not the whole framework).

That is enough work on that for now; I will get back to producing the new data I want, which is the actual total size of the externally translated blocks versus the old size.

For the h264ref test, the generated functions are 526 bytes smaller in total (averaged over 3 runs): for every generated block where the LLVM code is smaller than the TCG code, the amount by which it is smaller is summed. The same method was used for the sjeng benchmark. It should be noted that this is purely based on the generated functions; it takes no account of the number of times a block is executed, or of whether the block was replaced before its execution lifetime had expired.

Bytes saved

| Test | Run 1 | Run 2 | Run 3 | Average | Corrected |
|---|---|---|---|---|---|
| h264ref | 496 | 501 | 581 | 526 | 1312 |
| Sjeng | 2824 | 2587 | 3015 | 2808 | 6417 |
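The summation behind the bytes-saved figures can be sketched roughly as follows. This is only an illustration of the method described above, not the actual script: the `bytes_saved` function and the list of `(tcg_size, llvm_size)` pairs are hypothetical names, and the sizes in the example are made up rather than taken from the logs.

```python
def bytes_saved(blocks):
    """Sum the size reduction over blocks where the LLVM code is smaller.

    `blocks` is a hypothetical list of (tcg_size, llvm_size) pairs in bytes,
    one pair per generated block.
    """
    total = 0
    for tcg_size, llvm_size in blocks:
        # Only blocks where LLVM wins contribute; larger or equal ones are skipped.
        if llvm_size < tcg_size:
            total += tcg_size - llvm_size
    return total


# Made-up sizes, purely for illustration (not benchmark data).
example_blocks = [(120, 96), (64, 80), (200, 150), (32, 32)]
saved = bytes_saved(example_blocks)  # 24 + 50
```

Because the sum only looks at generated sizes, it should come out the same on every run of the same benchmark.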

I started a test run of qemu with the sjeng and h264ref tests, with the options to output the execution log and the out\_asm. Which brings up something for some spare time: compiling another test or two from the SPEC benchmarks and setting them up for my testing system.

One thing that just came across my mind: suppose you have a translated block (native code) that is based on the ARM code (target code). What happens when you jump midway into a ‘block’? Say the original ARM code was 5 instructions long, and you have just reached something that wants to jump to the 3rd instruction in that block. I would assume that it can’t just jump into the middle of the native code, because the native code may not map one-to-one onto the ARM instructions, especially since translation favours speed over exact preservation. The most plausible idea I can think of, which would be reasonably simple, is that you treat that address as the start of a new block and go through the procedure of creating a new translation block beginning there.
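To make the guess concrete, here is a toy model of it. It assumes translated blocks are looked up purely by the guest address they start at, so a jump into the middle of an existing block simply misses the cache and triggers a fresh translation from that address. The `BlockCache` class, its methods, and the instruction strings are all hypothetical and are not QEMU's real data structures.

```python
class BlockCache:
    """Toy model: translated blocks are keyed by their guest start address."""

    BRANCHES = ("b", "bl", "bx")  # mnemonics that end a block in this toy

    def __init__(self, guest_code):
        self.guest_code = guest_code  # guest address -> instruction text
        self.blocks = {}              # start address -> translated block

    def translate(self, start):
        # Walk forward from `start` until a branch ends the block.
        block = []
        pc = start
        while pc in self.guest_code:
            insn = self.guest_code[pc]
            block.append(insn)
            pc += 4  # fixed-width ARM instructions
            if insn.split()[0] in self.BRANCHES:
                break
        return block

    def lookup(self, pc):
        # A miss (including a mid-block target) creates a new block at `pc`.
        if pc not in self.blocks:
            self.blocks[pc] = self.translate(pc)
        return self.blocks[pc]


guest = {
    0x1000: "mov r0, #0",
    0x1004: "add r0, r0, #1",
    0x1008: "cmp r0, #10",
    0x100C: "add r1, r1, r0",
    0x1010: "b 0x1000",
}
cache = BlockCache(guest)
whole = cache.lookup(0x1000)  # the original 5-instruction block
tail = cache.lookup(0x1008)   # jump to the 3rd instruction: a second block
```

In this model the two blocks overlap in guest code but are entirely separate pieces of native code, which sidesteps the problem of the native code not lining up with the ARM instructions.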

Brad stopped by to see how things were going, and I showed him the tables of results I had been working on all morning. He pointed out that the bytes saved and translated code size per run should be deterministic, which prompted me to remember that I had overlooked catching all the generated and replacing statements. So I spent the next twenty-five minutes fixing the script to handle all cases, as well as performing the fix-ups so that the replacing/generated lines are removed as far as the rest of the parser is concerned.

Thus the table above with bytes saved is null and void; the correct value is 6417 bytes for the sjeng benchmark and 1312 bytes for h264ref. So I wasted quite a bit of time preparing those results and presenting them. I shall leave the originals there and have amended the tables with a corrected column.

Brad had a very good suggestion on what additional measures I should perform: calculate the total number of blocks seen, the number that are eligible for replacing, the number of blocks replaced, and the hit rate/history. Hit rate means the total number of times each block was executed and, if it was replaced, the number of times it was hit before versus after. An extra thing would be to compare the number of hits for those blocks that were eligible but didn’t get replaced due to larger code size.
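A rough sketch of how these measures might be tallied is below. It assumes the log has already been reduced to a stream of `(kind, block)` events, where `kind` is either `"exec"` or `"replace"`, and that the set of eligible blocks is known separately. That event format and the `block_stats` function are hypothetical, not the real log layout.

```python
from collections import defaultdict


def block_stats(events, eligible):
    """Tally blocks seen, eligible, replaced, and hits before/after replacement."""
    hits_before = defaultdict(int)
    hits_after = defaultdict(int)
    replaced = set()
    seen = set()
    for kind, block in events:
        seen.add(block)
        if kind == "replace":
            replaced.add(block)
        elif kind == "exec":
            # Executions are split by whether the block has been replaced yet.
            if block in replaced:
                hits_after[block] += 1
            else:
                hits_before[block] += 1
    return {
        "seen": len(seen),
        "eligible": len(eligible & seen),
        "replaced": len(replaced),
        "hits_before": dict(hits_before),
        "hits_after": dict(hits_after),
    }


# Made-up event stream for illustration only.
events = [
    ("exec", "A"), ("exec", "A"), ("replace", "A"), ("exec", "A"),
    ("exec", "B"), ("exec", "C"), ("exec", "A"), ("exec", "B"),
]
stats = block_stats(events, eligible={"A", "B"})
```

For a block that was eligible but never replaced (`"B"` here), every execution lands in `hits_before`, so that figure is the one to compare against the `hits_after` counts of the replaced blocks.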

Today I found out why it is best to use xreadlines instead of readlines, especially when your files are known to grow very large. In the version of Python I am using, readlines reads all the lines in a file and builds a list, whereas xreadlines is more of a fake list for iterating. As a result my script attempted to load the 22GB log file into memory first, on a machine with only 4GiB. Hopefully this information is sane enough that I will be able to minimize it down to something smaller and usable.
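A minimal sketch of the streaming approach: iterating over the file object itself reads one line at a time, which is the same lazy behaviour xreadlines gives in Python 2 and is the only spelling available in Python 3, where xreadlines no longer exists. The `count_matching_lines` helper and the log line text are hypothetical stand-ins, not the real script or the real log format.

```python
import os
import tempfile


def count_matching_lines(path, marker):
    """Count lines containing `marker` without loading the whole file."""
    count = 0
    with open(path) as log:
        for line in log:  # the file object yields one line at a time
            if marker in line:
                count += 1
    return count


# Tiny stand-in for the real log; the line contents are made up.
fd, demo_path = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "w") as demo:
    demo.write("Generated block 1\nReplacing block 1\n"
               "Generated block 2\nReplacing block 2\n")

replaced_count = count_matching_lines(demo_path, "Replacing")
os.remove(demo_path)
```

Memory use stays roughly constant regardless of file size, since only the current line and the running count are held at any time.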

The next step, along with calculating the information Brad suggested, is to improve the number of ARM instructions my script recognizes as implemented in LLVM, and then run tests using it to experiment with how many new blocks could be handled by LLVM if we implement a given instruction.
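One simple way to estimate this from the script side is sketched below. It assumes each block can be reduced to a list of ARM mnemonics and that the set of mnemonics already handled by LLVM is known. The `blocks_unlocked_by` function and the example data are hypothetical.

```python
from collections import Counter


def blocks_unlocked_by(blocks, supported):
    """For each unsupported mnemonic, count the blocks it alone is blocking."""
    unlocked = Counter()
    for mnemonics in blocks:
        missing = set(mnemonics) - supported
        # A block becomes handleable by implementing one instruction only
        # if that instruction is the single thing it is missing.
        if len(missing) == 1:
            unlocked[missing.pop()] += 1
    return unlocked


# Made-up blocks and support set, for illustration only.
supported = {"mov", "add", "cmp"}
blocks = [
    ["mov", "add"],         # already fully supported
    ["mov", "ldr"],         # blocked only by ldr
    ["ldr", "add", "cmp"],  # blocked only by ldr
    ["ldr", "str"],         # blocked by two instructions
    ["sub", "mov"],         # blocked only by sub
]
unlocked = blocks_unlocked_by(blocks, supported)
```

This is only a first-order estimate: a block missing two instructions is not credited to either one, so implementing several instructions together could unlock more blocks than the individual counts suggest.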

Generated all the benchmarks in CPU2006 that use C; next is setting up a few of them (about 2) with the test runner.
