yes, I am insane

Tired of waiting many miliseconds for linux_logo to run?
Tired of wasting 35k of disk space?
Upset that to run linux_logo you need huge GLIBC?

Well your worries are over!

With "ll"  [even the name was shortened to save space!] you get all of the
benefits of linux_logo in a smaller, faster package!

"ll" is written entirely in native Linux assembly language!


Some Statistics
---------------

     *NOTE*  not all architectures implement the same feature-set
       (IE, not all have MHz in /proc/cpuinfo) so this is only
       a rough comparison.


     Processor        lzss executable
     ---------        ---------------
     ia64		2826 bytes   (as 2.18.0.20080103)
     alpha		1821 bytes   (as 2.19.51.20090723)
     RiSC		1418 bytes   (RiSC-as 0.0.2)
     parisc		1400 bytes   (as 2.17.50)
     mips               1314 bytes   (as 2.28)
     microblaze		1298 bytes   (as 2.16)
     m88k		1240 bytes   (as 1.92.3)
     SPARC		1221 bytes   (as 2.22)
     arm.oabi		1186 bytes   (as 2.18.0.20080103)
     PPC		1165 bytes   (as 2.19.51.20090805)
     riscv64-im		1161 bytes   (as 2.28.0.20170505)
     arm.eabi		1161 bytes   (as 2.24.51.20141001)
     micromips		1147 bytes   (as 2.28)
     6502		1130*bytes   (ca64 2.12.0)
     riscv32-im		1125 bytes   (as 2.28.0.20170505)
     mips16		1104 bytes   (as 2.28)
     arm64		1094 bytes   (as 2.23.1)
     s390	        1064 bytes   (as 2.19.51.20090805)
     x86_64		1027 bytes   (as 2.29)
     riscv64-imc	1019 bytes   (as 2.28.0.20170505)
     x86_x32            1005 bytes   (as 2.28)
     sh3		 994 bytes   (as 2.17.50.0.5)
     x86                 968 bytes   (as 2.28)
     riscv32-imc	 961 bytes   (as 2.28.0.20170505)
     vax		 950 bytes   (as 2.16.1)
     arm_thumb		 920 bytes   (as 2.25)
     1802		 915 bytes   (asmx)
     avr32		 914 bytes   (as 2.16.1)
     arm_thumb2		 908 bytes   (as 2.25)
     crisv32		 905 bytes   (as 2.12.1)
     z80		 891 bytes   (as 2.20.1.20100303)
     pdp-11		 890 bytes   (as 2.19)
     m68k		 870 bytes   (as 2.28)
     8086		 780 bytes   (as 2.20.1-system.20100303)
     
     * the 6502 results were adjusted to match the code present in other
       architectures (i.e., not counting the graphical routines)
     
     The various implementations have varying functunality and often
     use different methods to get system info.  Still, some gross
     comparisons between the architectures can be made.


Individual architectural comments, in descending order of executable size:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ia64:   ia64 is VLIW.
     
	no divide at all.
	you use the fp unit to do int multiply
	no unaligned words.
	on the plus side, basically have infnite (well, 128) registers

	does have an auto-incrementing load/store.
	
	The actual ia64 architecture is too bizzarre for words.
	It probably doesn't make any sense at all unless you are a 
	   computer architect.
	The instructions come in groups of 3 (40-bits each, with
	   a total instruction size of 128bits).  These run in
	   parallel.  So if your instructions don't parallelize,
	   you end up running lots of nops.
	   
	In any case it's no suprise the executable ends up being
	   so huge.  It very probably can be optimized a lot from here.
	If I really cared I'd turn off the automatic bundling
	   and set all of the instruction bundles by hand.

	   I found a bug in the assembler where it puts two
	   call instructions in the same bundle (which can't work).
	   I wonder if thise means I am one of the few people who
	   writes programs entirely in assembly for ia64....

alpha:  
	Alpha is hurt because it implements some "optional" features,
	such as MHz (which added a lot of code) and also counting
	num of cpus, with proper pluralization of Processor.
	
	The *big* hurt though is lack of byte-manipulating instructions.

	The original alpha architecture did not support operations on bytes,
	you have to do a lot of shifting and masking.
	
	Unfortunately ll uses a lot of byte-sized memory operations.
     
        On the x86 the instruction "lodsb" is 1 byte in size.
        On the PPC the equivelant is "lbzu" which is 4 bytes in size.
        On the alpha, the instruction is "ldb" which expands to
           a "lda","ldq_u","extbl","sll","sra" sequence, plus
	   an add instruction that x86 and ppc do automatically.
	   Thus, taking 24 bytes.
	It's a bit better if you do an unsigned load, which is only 16
	   bytes, but still.

	A store byte actually does a 32-bit load, masks in the byte by
	  hand, and then does the actual 32-bit store :(

	The "jump if bit one set" type instructions help the lzss.

	The immediate field for ALU ops of only 8 bits really hurts
	
	There is no native integer division routine on alpha.
	
	The GOT section is a huge space-sink.  It has 64-bit constants,
	  which most of the time you shouldn't really need.

RiSC:
	This is a 16-bit architecture used in my undergrad
	computer courses I took at University of Maryland.
		 http://www.ece.umd.edu/~blj/RiSC/
		 
	No logic instructions besides nand.  This means
	masking with an "and" takes at least 2 instructions 
	(because there is no immediate form).
	
	16-bit memory acesses only.  It is quite complicated to load a byte.
	
	Only 7 registers, one of which is always 0, and one of which
	is the stack pointer.
	
	No shift instructions.  Left shift can be done with add, but
	right shift takes a fairly complicated routine.

        Only able to branch +/- 64 instructions, which can be a limit.
	
	Only a 6-bit immediate, though the lui instruction helps.


parisc:	+ really hurt by its short immediate field.  Most addresses
	  require 2 instructions to load, even if try to use relative-data add.
	+ no integer divide, have to code it up... not so bad in loop form
        + delay slot that can be nulled out
	+ some ALU instructions can also null out following instructions 
          conditionally
	+ compare immediate instruction only can handle 5-bit immediate	
        + loads/stores must be aligned
	+ no AND immediate instruction

sparc:  
	Condition codes make for tighter code.
	register/register load address calcs also help.
        13-bit immdediate hurts a bit.
        Crazy register windows a bit hard to understand

	SPARC is unfair a bit, because my test machine is a 24-proc
        niagara, so it has extra code to handle that many chips properly.

65c816:
        Another non-Linux addition.
	
	The method of switching the A and X, Y registers into and out
	of 16 bit mode is a PAIN.  A non-intuitive opcode, and if you
	aren't careful your routines break if you're in an unexpected mode.
	Would have been much better if somehow there were separate opcodes
	for 8 vs 16-bit instructions.

riscv:
	This is obviously an academic setup.  Things changing all the time,
	some parts of the arch extremely over-analyzed while others
	just handwaving.

	It's more or less the same as MIPS though.

	Lack of addressing modes hurts.  Also lack of short increment.

	Also annoying the assembler won't let you do pointer math.

riscv-r64c:
	This is the compressed RISCv.

	No short byte/half load/store instructions which hurts this benchmark.
	Also no logical immediate ones either.

	Also no 3-operation small adds.

	Couldn't figure out C.JAL

	The assembler does a good job of auto-using the small instructions,
	so it was mostly a matter of register choice.

6502:
        So obviously this isn't running on Linux.  I was curious
	how an 8-bit processor would compare.
	
	The big problem is that the LZSS algorithm and the ll data
	set are very much 16-bits in size.  So there is a lot of
	overhead having to increment 16-bit values on an 8-bit processor.
	
	Only having 3 registers is a handicap, but the zero page (the
	first 256 bytes of memory which can be accesses in one less byte
	and in fewer cycles) act as almost virtual registers.
	
	Some potential useful instructions that would have helped (and
	that are actually implemented in the later 65C02 version of the
	chip):
	   phx/plx (push/pull X directly)
	   ina/dea (increment/decrement A... otherwise need 2 instructions)
	   bra (branch always, relative jump.  otherwise need 3-byte 16-bit)
	   stz (store zero)

        The 6502 code uses high-res graphics to approximate the color
	ascii art.  When calculating total image size, we don't count
	the overly-large graphics code and instead only count the code
	that would be used to print the text, say, out over a serial port
	to an ANSI capable term client.
	
microblaze:
        Similar to MIPS.
	
	Branch delay slot is optional, which helps.
	
	Only register+register and register+immediate address modes.
	
	The branch+link instruction returns to the same instruction
	  that left from, so for the return you have to specify
	  to add 4 or 8?  Very weird.
	  
	The signed 16-bit immediate makes it impossible (I think) to load
	  0xff00 into a register w/o two instructions.

        Can't get the assembler to do pointer math into an immediate.

mips:   Recent binutils has made mips come in line with the
        other architectures.
	
	It is the most RISC of the RISC architectures.  Thus it ends
	up having a very non-dense instruction set.
	
	On the plus side, it has hardware support for unaligned loads,
	plus hardware integer divide, which help a lot.

mips16:	Reduced size instructions should help, but as always the
	limitations make it hard.

	Mostly the constant size, too eay to kick into extended
	instructions and those won't fit in delay slots.

	Also was fighting the assembler the whole way.

micromips:
	I thought this would be a straightforward port of mips16, but no.
	Way larger, mostly because loading addresses with "la" regressed.

	They dropped a lot of useful 16-bit instructions.

	The constants are all weird too, and odd sized things like 5 or 10
	can no longer be added in 16-bits

	Fighting the assembler all the way.  In theory ADDIUPC could do
	smaller loads but couldn't convince it to do so.

	It is helpful having easy access to the higher regs.

	Jals with short delay slot helped.

m88k:	This chip came after m68k and was motorola's first RISC chip.
	Didn't go well, they moved on to PPC.
	
	It's similar to a cross between MIPS and PPC.
	
	OpenBSD because there is no Linux port to m88k.
	Using gxemul.
	
	OpenBSD forces syscall error handling by skipping a 
	jump-to-error-handler instruction on return.  This
	makes each syscall in effect take 8 bytes.
	
	Branch delay slot is optional (for performance only?)
	
	Starts at low address space.  This is a huge win as it
	makes each pointer load 4-bytes instead of 8 assuming
	we stay smaller than 64k.

	bcnd instruction saves a cmp usually
	
	cmp quite elaborate, setting 16-bits of comparison info in
	a reg, not in flags.  a more powerful version of the equivelant
	Alpha instruction.

arm64:
	New ISA different from arm32/thumb/thumb2
	Fixed-width 32-bit instructions

	Wider constants help *a lot*
	Unaligned loads help too
	New instructions, such as tbz also help
	The conditional instructions like csel might, but at least in the
		lzss code doesn't make a difference.	
	The compare/branch/zero instructions also help.

	Lack of conditional execution hurts

	What we really need is an "increment multiple" instruction.
		inc {r1,r2,r4}
	And also an equivelent of x86 loop	
	
	No open syscall, have to waste an instruction setting
	AT_FDCWD and use openat() instead.

arm:	no integer division routine
	Really painful to load constants > 8 bits that aren't powers of two or
	  else 8-bit values shifted by power of two.
	If we had integer divide, saner constant support, and  unaligned loads 
	  we could probably beat x86 even with 32 bit instructions.
	  
	OABI smaller than EABI because don't need to load r7 with
	syscall number.

ppc:	The PowerPC has very CISC-like opcodes as well.
        Despite being load-store with 3 operand instructions, you almost
	wouldn't know it was considered RISC.  I also think I could
	optimize the code a bit more and challenge x86.
	
	The big help is auto-incrementing load/store byte instructions.

s390:   This is the most CISC architecture I've ever seen.  If only it had
	   a "load byte" opcode it would definitely beat out x86.  I am sure
	   it can be optimized even smaller than x86 by a s390 expert.
	   Being able to do "strcat" in 2 or 3 op-codes and strlen in not
	   more than 5 is a big plus.
	+ fact all opcodes are often 16-bit and often 32-bit is annoying
	+ not having 3-operand opcodes also hurts
	+ crazy CISC operations are amazing, but often don't do what I need
	+ would be nice if offsets could be negative
	+ would be nice if there was a relative branch shorter than 32-bits

x86_64: When doing a straight x86 -> x86_64 conversion (which involves making
        all of the push %e?? instructions into push %r??, as well as jmp *%edx
        into jmp *%rdx) makes the code 28 bytes longer, due to the "inc" 
	instruction becoming 2 bytes, and extra addr32 prefixes being added to
	various move instructions.

	Switching the syscalls to native syscalls is about neutral.
	You do have to make sure to save %ecx across syscalls then.

	The sad part is we have 8 extra regs, but can't use any of them
	because the extra byte prefix is a killer.

	Also added in a few bytes extra to print the name better
        (gratuitous spaces on some cpuinfos).  Also we have to
	handle 4GB of RAM so we lose a few bytes for a 64-bit load.

x86_x32:	This code is more or less the same as x86_64

	The x32 changes don't really help us much, as we weren't
	using 64-bit pointers before, and we try not to use
	the R8-R15 because they take extra bytes.

	In fact, almost all of the size saving comes from the fact
	that a 32-bit ELF header is smaller than a 64-bit one.

	The move to having all syscalls with bit 30 set hurts us
	by about 10 bytes or so.

1802:	RCA cosmac 1802

	This is an interesting architecture.

	No dedicated PC (you can set any of the 16 index registers to be PC)
	No stack (though there are auto-inc/dec insns to help).

	This makes function calls interesting.  The traditional way
	is to have dedicated index registers for each function, then
	jump to just before the beginning before returning
	(to "reset" that particular PC).

	This is only workable if you have < 8 or so functions and
	no leaf functions.  Otherwise you have to emulate a stack
	with fairly high overhead instruction count wise.

	Otherwise nothing unusual when optimizing, except for the
	always troublesome problem of doing 16-bit math when you
	can only do math in the 8-bit accumulator.


vax:
	vax is crazy CISC.  
	Some of the CISC instructions:
	   + can operate on variable sized bit-fields
	   + an asm instruction that implements switch/case statements
	   + a fp instruction that calculates polynomials
	   + special instructionss for handling queues
	   + various opcodes to accelerate COBOL (edit, etc)
	   + xfc - extended function call, create your own opcodes
        You can do strlen with essentially one instruction, though it's
	 a long one.  vax could easily beat x86 if it had a few one-byte
	 instructions.

sh3:
	auto-increment addressing for loads but not for stores?
	-> yes, auto-incrememnt/decrement set up for stack
	   accesses so it decrements on store (push) and incs on load (pop)
	branch delay slots make things difficult
	could really use a compare-with-zero instruction for reg other than r0
	pretty compact code, even with lots of wasted branch delay slots
	Really wish could put the divide instructions in a loop (like parisc)


m68k:   is even more CISC than x86 if such a thing is possible
           (if you don't count Vector instructions).  In addition
	   to BCD instructions it also has a wide variety of
	   bit-field manipulation instructions, plus full ALU complement.

	   bizzarrely, m68k assembly is very similar to THUMB.

	   weird having separate address and data registers.

	   can't shift by an immediate more than 8?
	   can't add carry with immediate?
	   have to clear upper parts of words when doing byte math;
	      no equivelant of the mips "lbu" instruction.
	      
	      
x86:    The x86 code is currently the smallest, mainly because I had a
        running contest for a while with Stephan Walter until
	we got it below 1k.  It does help that there are a lot of
	useful 1-byte instructions in the x86 command set, which give
	it an instant advantage over all of the RISC chips.
	
	Lack of alignment makes string manipulating programs (like ll) a lot
	easier, as you can store 16 and 32 bit values w/o having to worry
	if the string is properly aligned.


arm_thumb:
        I tried by hardest to beat x86, even though the arm port
	doesn't have to do things x86 does (SMP support for example).
	
	I came close, but not close enough.  The lack of an integer
	divide instruction and the lack of unaligned memory reads
	killed it.
	
	I do like the thumb instruction set, it is in many ways more
	powerful than x86 while cleaner at the same time.
	
	There is a powerful push/pop instructions that can push/pop 
	 any combination of registers in 16 bits.

        The "blx" instruction to branch to a register (even a high one!)
	 is great.  I cheated a bit by using the Arm5 instruction subset.
	 The code wouldn't be anywhere near as small if I had to use
	 generic arm4 thumb.

arm_thumb2:

	Is same size as thumb if you make sure to use the narrow
        forms of all the opcodes.

	The "cbz" instruction saves space, but oddly only works
        for forward branches.

	The "itt" conditional instruction does help, as does
	"movw" and "movt" to load 32-bit constants.


z80:
		could really use a 16-bit dec that updated flags
		could also use more useful 16-bit arithmatic insns
		could use reg+reg addressing mode		
		nice if we could do ALU ops on other than 8-bit accumulator

		z80 is nice for pascal-type strings, not C-type
		
		Need 16-bit shifts.  And shifts by more than 1
		Need better way to set regs to zero.

		Some extra bytes taken to handle CR/LF instead of just LF

		The string copy routines are almost useless, having almost
		as many bytes to set up as doing things discretely.

		Really hurting not having just one more register that
		cat be accessed with one-byte opcodes (or else
		an IX+REG addressing mode)

avr32:  They specifically designed the arch to have compact assemley.

        The "ret" return instruction is the most useful ever.  It 
	 can handle returning a value, as well as having a special
	 case to return 0 or -1, and also sets the status flags.
	 
	The one weakness is that almost no instructions can take
	 immediate values.
	 
	It also has a great "load halfword and swap bytes" which
	  would be great, only it has to be an aligned halfword
	  so we can't use it :(
	  
	Has the advantage that binaries start at a low address,
	  so the addresses of functions fit in a small number of bits.

        The new champion for size ;)  And there's probably a few
	  bytes lurking that can be removed still.


crisv32:
        mostly 16-bit instructions
	
	If constants are > 6bits need longer encoding
	
	reg+offset addressing mode only available for acr (r15)
	register, which increases code size.  This is an improvement
	on original cris where r15 was the PC.
	
	VM usage starts at 0
	
	No hw divide instruction
	Branch delay slot

	branch instructions always 3bytes.  Use jump (register val) instead?


pdp-11:	       Despite wealth of addressing modes, some critical ones
	       are missing, such as being able to add two regs to
	       equal address, or reg and value *before* dereferencing.
	       
	       really could use logical shifts
	       really could use a real AND instruction
	       
	       severe register pressure
	       
	       UGH it's a pain working in octal
	       
	       extensive use of "adb"	         
	       
	       the a.out executable format makes for small binaries

8086:	       This port targets a DOS COM file.
               
	       A COM file basically has zero overhead, which is why
	         things are very small.
		 
	       A COM file only has one 16-bit segment, so pointers are
	         all less than 16-bits which helps size.
		 
	       8086 support is simpler than the x86 support which has
	         to handle more complicated cpu_info files, plus some
		 things DOS doesn't support like hostname.
		 
	       8086 was a direct port of the 386 code, with 32-bit
	         registers switched to 16.  Some changes had to be
		 made, as the 16-bit instruction set isn't as orthogonal
		 as the 32-bit one so some register combinations
		 (especially in effective address calculation) aren't allowed.
		 

Features:
--------
    + Runs in 4 miliseconds, more than twice as fast as the 10 linux_logo 
      takes on a K6-2+ 450!
    + Takes up only 969 bytes when super-stripped on x86!
  
     Amaze your enemies!  Impress your friends!


BUGS:
-----
	No pretty-printing: This means that your computer is reported just
                            as /proc/cpuinfo reports, ugly model-name, off 
			    MHz, and all.

	Possibly kernel-dependent:  I only tested this on 2.4 and 2.6 kernels.
                            The sysinfo() syscall changed between 2.2 and 2.4

Custom Logo:
------------

	Point the "ANSI_TO_USE" variable in the Makefile to any text
	or ansi file you want when building.


HOW TO HELP:
------------
	If you have a Linux box running on an unsupported architecture,
	offer the author a shell-account so he can create a version
	for your type of machine!

Useful Resources:
-----------------
	http://deater.net/weave/vmwprod/asm/ll/ll.html
        http://www.linuxassembly.org
	http://www.deater.net/weave/vmwprod/asm
	http://www.deater.net/weave/vmwprod/linux_logo

Publications using ll:
----------------------
    * V.M. Weaver, S.A. McKee. "Code Density Concerns for New Architectures", 
      27th IEEE International Conference on Computer Design (ICCD 2009), 
      Lake Tahoe, California, October 2009.
    * V.M. Weaver, S.A. McKee. "Optimizing for Size: Exploring the Limits of
      Code Density" (Poster), Architectural Support for Programming Languages 
      and Operating Systems (ASPLOS '09), Washington DC, March 2009.

Thanks to:
----------
  Shellcoders.  You seem to be the only useful resource for
    linux assembly on the various platforms.

Special Thanks to:
------------------
	my lovely wife
	my beautiful daughter
	my two sons
	my three squeaking guinea pigs

AUTHOR:
-------
        Vince Weaver <vince _at_ deater.net>  http://www.deater.net/weave