yes, I am insane

Tired of waiting many milliseconds for linux_logo to run?
Tired of wasting 35k of disk space?
Upset that to run linux_logo you need huge GLIBC?

Well your worries are over!  With "ll" [even the name was shortened
to save space!] you get all of the benefits of linux_logo in a
smaller, faster package!

"ll" is written entirely in native Linux assembly language!

Some Statistics
---------------

*NOTE* not all architectures implement the same feature set
       (i.e., not all have MHz in /proc/cpuinfo), so this is
       only a rough comparison.

Processor    lzss executable
---------    ---------------
ia64         2874 bytes (as 2.17)
alpha        1957 bytes (as 2.17.50)
parisc       1400 bytes (as 2.17.50)
SPARC        1397 bytes (as 2.17.50)
6502         1394 bytes (ca65 2.7.1)
mips         1292 bytes (as 2.17.50)
arm          1218 bytes (as 2.17)
PPC          1206 bytes (as 2.17.50)
s390         1096 bytes (as 2.17)
x86_64       1033 bytes (as 2.17.50.0.5)
m68k         1014 bytes (as 2.17)
vax          1010 bytes (as 2.16.1)
sh3           994 bytes (as 2.17.50.0.5)
arm_thumb     989 bytes (as 2.17)
x86           969 bytes (as 2.17.50)
avr32         914 bytes (as 2.16.1)

The various implementations have varying functionality and often use
different methods to get system info.  Still, some gross comparisons
between the architectures can be made.

Individual architectural comments, in descending order of executable size:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ia64:
   ia64 is VLIW.
   No divide at all.
   You use the fp unit to do int multiply.
   No unaligned words.
   On the plus side, you basically have infinite (well, 128) registers.
   Does have an auto-incrementing load/store.

   The actual ia64 architecture is too bizarre for words.  It probably
   doesn't make any sense at all unless you are a computer architect.
   The instructions come in groups of 3 (41 bits each, plus a 5-bit
   template, for a total bundle size of 128 bits).  These run in
   parallel.  So if your instructions don't parallelize, you end up
   running lots of nops.  In any case it's no surprise the executable
   ends up being so huge.
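
   To make the bundle arithmetic above concrete, here is a rough
   sketch (not part of ll).  It assumes the layout from the Itanium
   architecture manual: a 5-bit template in the low bits followed by
   three 41-bit slots.  The helper names pack_bundle, unpack_bundle
   and padded_size are made up for illustration, and padded_size only
   models the "lots of nops" cost with an assumed number of useful
   instructions per bundle.

```python
# Sketch of the IA-64 bundle layout: a 5-bit template in bits 0-4,
# then three 41-bit instruction slots (5 + 3*41 = 128 bits).
# All function names here are hypothetical, chosen for illustration.

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
BUNDLE_BYTES = 16  # 128 bits


def pack_bundle(template, slots):
    """Pack a template and exactly three slot values into one integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


def padded_size(useful_instructions, useful_per_bundle):
    """Bytes of code if each bundle carries only `useful_per_bundle`
    real instructions (an assumed figure) and the rest are nops."""
    bundles = -(-useful_instructions // useful_per_bundle)  # ceiling division
    return bundles * BUNDLE_BYTES


if __name__ == "__main__":
    b = pack_bundle(0x10, [1, 2, 3])
    print(hex(b), unpack_bundle(b))
    print(padded_size(300, 3), "bytes if every slot is useful")
    print(padded_size(300, 1), "bytes if two of three slots are nops")
```

   In other words, code that fills only one slot per bundle costs 16
   bytes per useful instruction, which is where the size goes.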
   The ia64 code can very probably be optimized a lot from here.  If
   I really cared I'd turn off the automatic bundling and set all of
   the instruction bundles by hand.

   I found a bug in the assembler where it puts two call instructions
   in the same bundle (which can't work).  I wonder if this means I am
   one of the few people who write programs entirely in assembly for
   ia64....

alpha:
   Alpha is hurt because it implements some "optional" features, such
   as MHz (which added a lot of code) and also counting the number of
   cpus, with proper pluralization of "Processor".

   The *big* hurt though is the lack of byte-manipulating instructions.
   The original alpha architecture did not support operations on bytes,
   so you have to do a lot of shifting and masking.  Unfortunately ll
   uses a lot of byte-sized memory operations.

   On the x86 the instruction "lodsb" is 1 byte in size.
   On the PPC the equivalent is "lbzu", which is 4 bytes in size.
   On the alpha, the instruction is "ldb", which expands to a
   "lda","ldq_u","extbl","sll","sra" sequence, plus an add instruction
   that x86 and ppc do automatically, thus taking 24 bytes.  It's a bit
   better if you do an unsigned load, which is only 16 bytes, but still.

   A store byte actually does a 32-bit load, masks in the byte by
   hand, and then does the actual 32-bit store :(

   The "jump if bit one set" type instructions help the lzss.
   The immediate field for ALU ops of only 8 bits really hurts.
   There is no native integer division routine on alpha.

parisc:
   + really hurt by its short immediate field.  Most addresses require
     2 instructions to load, even if you try to use a relative-data add.
   + no integer divide, have to code it up... not so bad in loop form
   + delay slot that can be nulled out
   + some ALU instructions can also null out following instructions
     conditionally
   + compare immediate instruction can only handle a 5-bit immediate
   + loads/stores must be aligned
   + no AND immediate instruction

sparc:
   Condition codes make for tighter code.
   Register/register load address calcs also help.
   13-bit immediate hurts a bit.
   SPARC is unfair a bit, because my test machine is a 24-proc
   niagara, so it has extra code to handle that many chips properly.

6502:
   So obviously this isn't running on Linux.  I was curious how an
   8-bit processor would compare.

   The big problem is that the LZSS algorithm and the ll data set are
   very much 16 bits in size.  So there is a lot of overhead having to
   increment 16-bit values on an 8-bit processor.

   Only having 3 registers is a handicap, but the zero page (the first
   256 bytes of memory, which can be accessed with one less byte and in
   fewer cycles) acts almost as a set of virtual registers.

   Some potentially useful instructions that would have helped (and
   that are actually implemented in the later 65C02 version of the
   chip):
      phx/plx (push/pull X directly)
      ina/dea (increment/decrement A... otherwise need 2 instructions)
      bra     (branch always, relative jump.  otherwise need a 3-byte
               jump with a 16-bit address)
      stz     (store zero)

mips:
   Recent binutils has made mips come in line with the other
   architectures.  It is the most RISC of the RISC architectures.
   Thus it ends up having a very non-dense instruction set.  On the
   plus side, it has hardware support for unaligned loads, plus
   hardware integer divide, which help a lot.

arm:
   No integer division routine.
   Really painful to load constants > 8 bits that aren't powers of
   two or else 8-bit values shifted by a power of two.
   If we had integer divide, saner constant support, and unaligned
   loads we could probably beat x86 even with 32-bit instructions.

ppc:
   The PowerPC has very CISC-like opcodes as well.  Despite being
   load-store with 3-operand instructions, you almost wouldn't know it
   was considered RISC.  I also think I could optimize the code a bit
   more and challenge x86.  The big help is the auto-incrementing
   load/store byte instructions.

s390:
   This is the most CISC architecture I've ever seen.  If only it had
   a "load byte" opcode it would definitely beat out x86.  I am sure
   it can be optimized even smaller than x86 by a s390 expert.
   Being able to do "strcat" in 2 or 3 opcodes and strlen in not more
   than 5 is a big plus.
   + the fact that all opcodes are at least 16-bit and often 32-bit is
     annoying
   + not having 3-operand opcodes also hurts
   + crazy CISC operations are amazing, but often don't do what I need
   + would be nice if offsets could be negative
   + would be nice if there was a relative branch shorter than 32 bits

x86_64:
   A straight x86 -> x86_64 conversion (which involves making all of
   the push %e?? instructions into push %r??, as well as jmp *%edx
   into jmp *%rdx) makes the code 28 bytes longer, due to the "inc"
   instruction becoming 2 bytes, and extra addr32 prefixes being added
   to various move instructions.

   Switching the syscalls to native syscalls is about neutral.  You do
   have to make sure to save %ecx across syscalls then.

   The sad part is we have 8 extra regs, but can't use any of them
   because the extra byte prefix is a killer.

   Also added in a few bytes extra to print the name better
   (gratuitous spaces on some cpuinfos).  Also we have to handle 4GB
   of RAM so we lose a few bytes for a 64-bit load.

m68k:
   m68k is even more CISC than x86, if such a thing is possible (if
   you don't count Vector instructions).  In addition to BCD
   instructions it also has a wide variety of bit-field manipulation
   instructions, plus a full ALU complement.

   Bizarrely, m68k assembly is very similar to THUMB.
   Weird having separate address and data registers.
   Can't shift by an immediate more than 8?
   Can't add carry with immediate?
   Have to clear upper parts of words when doing byte math; no
   equivalent of the mips "lbu" instruction.

vax:
   vax is crazy CISC.
   Some of the CISC instructions:
   + can operate on variable sized bit-fields
   + an asm instruction that implements switch/case statements
   + a fp instruction that calculates polynomials
   + special instructions for handling queues
   + various opcodes to accelerate COBOL (edit, etc)
   + xfc - extended function call, create your own opcodes

   You can do strlen with essentially one instruction, though it's a
   long one.  vax could easily beat x86 if it had a few one-byte
   instructions.

sh3:
   auto-increment addressing for loads but not for stores?
     -> yes, auto-increment/decrement is set up for stack accesses,
        so it decrements on store (push) and incs on load (pop)
   branch delay slots make things difficult
   could really use a compare-with-zero instruction for regs other
   than r0
   pretty compact code, even with lots of wasted branch delay slots
   Really wish I could put the divide instructions in a loop (like
   parisc)

arm_thumb:
   I tried my hardest to beat x86, even though the arm port doesn't
   have to do things x86 does (SMP support for example).  I came
   close, but not close enough.  The lack of an integer divide
   instruction and the lack of unaligned memory reads killed it.

   I do like the thumb instruction set; it is in many ways more
   powerful than x86 while cleaner at the same time.

   There are powerful push/pop instructions that can push/pop any
   combination of registers in 16 bits.  The "blx" instruction to
   branch to a register (even a high one!) is great.

   I cheated a bit by using the Arm5 instruction subset.  The code
   wouldn't be anywhere near as small if I had to use generic arm4
   thumb.

x86:
   The x86 code is nearly the smallest (only avr32 beats it), mainly
   because I had a running contest for a while with Stephan Walter
   until we got it below 1k.

   It does help that there are a lot of useful 1-byte instructions in
   the x86 instruction set, which give it an instant advantage over
   all of the RISC chips.
   Lack of alignment restrictions makes string-manipulating programs
   (like ll) a lot easier, as you can store 16 and 32 bit values w/o
   having to worry if the string is properly aligned.

avr32:
   They specifically designed the arch to have compact assembly.

   The "ret" return instruction is the most useful ever.  It can
   handle returning a value, as well as having a special case to
   return 0 or -1, and also sets the status flags.

   The one weakness is that almost no instructions can take immediate
   values.

   It also has a great "load halfword and swap bytes" which would be
   great, only it has to be an aligned halfword so we can't use it :(

   Has the advantage that binaries start at a low address, so the
   addresses of functions fit in a small number of bits.

   The new champion for size ;)  And there are probably a few bytes
   lurking that can still be removed.

Features:
--------
+ Runs in 4 milliseconds, more than twice as fast as the 10
  linux_logo takes on a K6-2+ 450!
+ Takes up only 969 bytes when super-stripped on x86!

Amaze your enemies!  Impress your friends!

BUGS:
-----
No pretty-printing:
   This means that your computer is reported just as /proc/cpuinfo
   reports, ugly model-name, off MHz, and all.

Possibly kernel-dependent:
   I only tested this on 2.4 and 2.6 kernels.  The sysinfo() syscall
   changed between 2.2 and 2.4.

Custom Logo:
------------
Point the "ANSI_TO_USE" variable in the Makefile to any text or ansi
file you want when building.

HOW TO HELP:
------------
If you have a Linux box running on an unsupported architecture,
offer the author a shell account so he can create a version for your
type of machine!

Useful Resources:
-----------------
http://www.linuxassembly.org
http://www.deater.net/weave/vmwprod/asm
http://www.deater.net/weave/vmwprod/linux_logo

Thanks to:
----------
Shellcoders.  You seem to be the only useful resource for linux
assembly on the various platforms.

Special Thanks to:
------------------
my lovely wife

AUTHOR:
-------
Vince Weaver
http://www.deater.net/weave