Left work early, drove to Ithaca, NY to meet up with Vince. We drove from
there to Canada. Weird harassment by a Canadian border person trying to be
funny. Checked into the hotel, which is -very- nice, except broadband doesn't
work in the room.
Watched weird cartoons on TV, and the Ottawa Lynx baseball game rebroadcast.
Ottawa won, I think; I fell asleep before the end.
Found conference center. No food, no vendor showcase for swag :(. Did have
Telsa Cox hand me my name tag. Ran into former co-worker from MCLX.
2.7 might be further off... 2.6 much more stuff going in. Overview of what
went into 2.6 that would normally have gone into a development kernel.
Lots of code churn in 2.6 is from Stanford checker/etc. annotations.
Scheduler domains.. awesome. Talk of object-based reverse mapping.
What overhead was added by adding the anon_vma?
How did that hurt malloc performance?
Embedded linux... nothing much there except the main thing is getting embedded people to play nice with the community.
Went downstairs for vendor talk.. found a nice wireless access area where
everyone else was congregating, including Linus and Alan. Very odd feeling :).
I'm looking for a copy of the proceedings.
Access point talk:
Taking a few moments to relax in a nice cushioned seat with wireless access in
an area that apparently AMD is sponsoring, because there's a whole wall full
of Athlon64 machines for public use. Should probably go to another talk, may
walk in late. Uploading this blog and reloading my web page.
Block IO talk
Walked in late here in lieu of the access point talk. Jens is a very monotone
speaker ;). Hard to pick up this talk in the middle.. seems to be just
explaining things. Should be able to get the info from the proceedings,
which are still proving elusive.
Went and had lunch.. the conference center's attached to a mall, which has a
decently stocked food court. Ate at an A&W, had a root beer float,
life is good :).
Saw a few more famous people. At events like this, there's
always a bit of stargazing, but it's somehow different for us geeks.. we all
know that everyone's just a normal person, but some of these people really
have changed the way the world works, so doing a double take when you read a
nametag that says "Alan Cox", I think that's perfectly justifiable, just no
more than that.
Ran into another former-MCLX person.
Anyway, waiting to hear a talk by Keith Packard, one of the fathers of X.
The things freedesktop.org is doing are freaky but cool.
X talk, pushing X away from the hardware.
History: User mode stuff caused by Sun, and proprietary driver and
kernels, forcing things into user space. bad idea.
DRI, fbdev, input layer help. So what next...
Take advantage of above.
Lots of talk of moving mode switching out of the server proper into a user space
library. Good discussion followed involving many X developers and Alan Cox
about early boot messages, consoles, fbdev, opengl, and X, etc.
Boot-time minimizing talk:
Geared for embedded devices. Talks about kernel XIP for firmware (execute
directly from flash) for cutting down firmware load/decompress time.
Used the cool gcc -finstrument-functions option to profile the kernel. IDE probing
accounted for the most time. Using "quiet" helped. For the delay loop
calibration, they added an "lpj=XXX" (loops-per-jiffy) option to avoid that calc.
Interesting, statically compiled drivers are loaded sequentially with BKL held.
So use modules, and replace busy-waits with yields in the driver.
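For a feel of what an lpj value looks like: the boot log prints BogoMIPS next to lpj, and the two are tied together by the kernel tick rate HZ. A minimal sketch, assuming the usual relation BogoMIPS = lpj * HZ / 500000; the function names and the HZ=1000 default are my own choices for illustration, not something from the talk.

```python
def bogomips_from_lpj(lpj: int, hz: int = 1000) -> float:
    """Convert a loops-per-jiffy value to the BogoMIPS figure shown at boot.

    Assumes BogoMIPS = lpj * HZ / 500000; hz=1000 is an assumed default.
    """
    return lpj * hz / 500_000


def lpj_from_bogomips(bogomips: float, hz: int = 1000) -> int:
    """Inverse of the above: estimate an lpj value to pass as lpj=XXX."""
    return round(bogomips * 500_000 / hz)


if __name__ == "__main__":
    # Example: a machine that reported 4000 BogoMIPS with HZ=1000.
    print("lpj=%d" % lpj_from_bogomips(4000.0))
```

The value only makes sense for the same hardware and the same HZ it was measured on, so it should be taken from that machine's own boot log rather than guessed.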
BOFS: Linux Scalability
Wonderful discussion. Work on NFS scalability. NUMA APIs from Andi are ok.
15% improvements. CKRM discussions, a better version than my little procgroups
patch. Lots of discussion of overhead. Dcache and LDAP work discussed.
This was wonderful and depressing. I'll get into that more later, though.
BOFS: Hardware error detection
More of a QE type of presentation, rapidly losing interest. They want to make NMIs
automatically boot the machine. I disagree. NMIs are used as, and defined
to be, non-maskable interrupts. While they're useful for indicating errors,
there's nothing that states that they -must- be used in that sense, and some
things are using them for non-fatal errors. Therefore, we shouldn't just
panic the machine across the board on what could be a non-fatal NMI.
Went out and had dinner at a pub. having trouble meeting new people because
I keep running into people I vaguely know, and I feel slightly out of place
because I've been on the outskirts of the community for so long. People here
are either serious kernel hackers or here because their company sent them.
Vince and I are neither. Speaking of Sun....
The Linux Scalability BOF was very, very depressing to me. In this BOF, I saw
EVERY major large system builder, combining their resources and working
together to make Linux scale even further. SGI. Cray. IBM. NEC. Oracle... and
Sun is nowhere to be seen. Linux already outperforms Solaris in so many ways.
It even scales better in most cases. The only thing Solaris has that Linux
doesn't yet is memory and CPU DR, and Linux is already well on its way to
having these as well. Sun should be here. Sun should be a leading member
of the scalability initiative. Being in a room full of senior engineers for all
sorts of HPC companies, all working towards a common goal, I just saw again
how far Sun is behind, and will continue to fall behind. S10 will help, but
will not be the saving grace. Sun and Solaris are getting left behind, and
there's nothing I can do about it. We should be embracing Linux on our servers
and working with the community to make it robust. Instead, we're still
struggling with NUMA support in Solaris. It's just sad.
Ok, I'm off my high horse now.
AMD sponsored a little party. Rich Brunner gave a short speech talking about
AMD's dual core plans. Linux work is already underway. Big thing: they'll be
reporting dual core CPUs with the "HT" cpuid bit set to help with detection.
Weird freaky author gave a weird talk, I won't touch that one. I've tuned him
out, compiling a new kernel for my machine, and writing this paragraph. I'm
waiting for it to end, because at the end I have a 1/500 chance of winning
a full-blown Alienware Athlon64 PC... which would be interesting getting
across customs if I -did- actually win, but hey, it'd be fun to try. They'd
probably make me pay taxes on it ;).
Sigh, no new computer for me. Ohh well. Back in the hotel.. relaxing for a
bit and heading to bed.
Hey, the Orioles have won 3 in a row! But what the hell happened and how did
we get an asshole like Karim Garcia??
Proceedings are available here:
Volume 1
Volume 2
SGI Scaling talk.
One of the main reasons I came to the conference. How to scale linux to
512+ processors and 4+ terabytes of memory.
Normal 2.6 runs just fine on small to medium configs (64-128 proc). These
are all the IA64/Altix 3000 systems. Each brick has 4 procs (2 procs/
motherboard, 2 motherboards/brick). Interconnect is fat-tree with express
links. Can scale up to 2048 procs, but not cache coherent. Shiz, you can
use their numaflex for wildcat-like things. (RSM). Their numaflex interconnect
is wildcat SSM -and- RSM. And they've been shipping it for years since the
Origin 2000 days.
Changes to Linux 2.4: O(1) scheduler, backported to 2.4. Improves affinity,
removed global runqueue lock. 6x performance gain on some benchmarks.
dplace and cpuset. Allows soft partitioning of the machine, the equivalent of
Solaris projects. Page cache was a big problem. Allowed round-robin allocation
across nodes for the page cache and the slab cache, which balanced the page cache
load over the system so one local node doesn't get overloaded with page cache.
Added per-node active/inactive lists. Also limited the page cache to keep
it from growing out of hand. Shrunk some kernel structures. XFS. etc..
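To picture the round-robin page cache placement described above, here is a toy sketch in Python. It is not SGI's code and not how the kernel allocator is written; the function names and the node numbering are invented for illustration. It only shows why spreading pages across nodes keeps one node from filling up.

```python
from collections import Counter
from itertools import cycle


def allocate_local(num_pages, local_node):
    """Node-local policy: every page lands on the node that asked for it."""
    return Counter({local_node: num_pages})


def allocate_round_robin(num_pages, nodes):
    """Round-robin policy: hand out pages to each node in turn."""
    placement = Counter()
    node_iter = cycle(nodes)
    for _ in range(num_pages):
        placement[next(node_iter)] += 1
    return placement


if __name__ == "__main__":
    nodes = [0, 1, 2, 3]
    print("local:      ", dict(allocate_local(1000, 0)))
    print("round-robin:", dict(allocate_round_robin(1000, nodes)))
```

With the local policy, a single node reading a huge file ends up holding all of the cached pages; with round-robin, each of the four nodes holds a quarter of them.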
64->512 procs: TLB conflicts become a problem. Fixed by moving to per-proc
counters and aggregating to global values only when needed, sometimes..
Also, stop using cmpxchg on multiproc systems, doesn't scale.. 10x slower
on 4p, 100x slower on 32p. Minimised page table lock. Fixed /dev/zero mmap().
SGI knows scalability.
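The per-proc counter idea is easy to sketch: each processor bumps its own slot, and the global value is only summed up when somebody actually asks for it. A minimal Python illustration follows; the class and method names are mine, and Python obviously can't show the contention the kernel is avoiding, only the shape of the data structure.

```python
class PerCpuCounter:
    """Toy per-processor counter: cheap local updates, summed on demand."""

    def __init__(self, num_cpus):
        # One slot per processor, so an update touches only that CPU's slot.
        self._slots = [0] * num_cpus

    def add(self, cpu, amount=1):
        """Update the slot owned by `cpu`; no shared global is written."""
        self._slots[cpu] += amount

    def total(self):
        """Aggregate to a global value only when it is actually needed."""
        return sum(self._slots)


if __name__ == "__main__":
    counter = PerCpuCounter(num_cpus=4)
    for cpu in range(4):
        counter.add(cpu, amount=cpu + 1)
    print(counter.total())
```

The trade-off is that reading the total costs a walk over every slot, which is fine as long as reads are rare compared to updates.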
Vanilla 2.6 boots and runs fine for ~64p-128p systems. Still some work for
>128 proc systems.
I/O: supports PCI and PCI-X today. Supports several I/O buses, posted
writes, etc. Linux PCI drivers "all just work", if they're written correctly.
Added read_relaxed API if you don't need a PIO to complete before a DMA.
Hardware can make the above not scale very well.
Interesting discussion afterward, mainly people asking about the hardware.
And yes, they still use spinlocks, and yet somehow manage to get better
scaling than Sun and its slow adaptive mutexes. Put that in your pipe and
smoke it, Solaris people.
Wow, what a talk. I'm happy. Now on to a talk by the other founder of X,
The re-architecture of X.
Again, moving to newer backends and technologies. OpenGL. Cairo.
Moving to Render/Xft and client-side fonts, big win. fontconfig is -the- font
library for everything, and is not X specific.
Eye Candy. how to support various features. Magnifiers, real translucency.
4 new X extensions. A new "compositing manager" is where you render, no
longer writing directly to the screen buffer and clipping directly.
(ed. overhead?). Compositing manager is the only thing that draws to the
real screen. Compositing manager could be associated with the window manager.
XEVIE from Sun could be used to help improve the input layer.
new extensions: Xfixes: little stuff that should have been in X in the first
place. XDamage, XComposite, and XEVIE.
An application can control parts of its own translucency.
Keith Packard gave a nice demo showing 2 transparent movies, XDamage
extensions and a compositing manager, etc.. very very cool and very well
(ed. XDamage/Composite mean OSX-type Expose immediately.)
Jim's talking again. Again, make X a GL application. Talking about Croquet,
similar to looking-glass.
Now: Improving build, breaking up monolithic X tarball. Working on removing
latencies, over the network and over startup. Moving to Cairo. Need help
working on security, and shared resources.
Lots of calls for help.
XFree86 is nowhere to be seen at this conference.. moving to X.org, which
is working with freedesktop.org.
Off to lunch.
They have a Hypervisor, of course. therefore virtualization requires changes
to the OS.
Seems to be a pretty normal virtualization talk.
Talked about the code changes and optimisations they made to support shared
systems. Things like spinlock rewriting, etc.. except it adds overhead, I
bet, and should be removed for non-shared-processor systems.
Some talk of virtual devices/adapters/etc.. but all the cool stuff happens in
the hypervisor, which is not Linux, and thus we can't see.
Nothing too interesting to me.
Rusty Russell. Interesting, one use is turning hyperthreading on and off.
Good talk. CPU hotswap works for ix86, ia64, ppc64, and s390. Interface
has now stabilised and other archs are encouraged to use it. It is stable,
IBM talk, they've got hot-add working, just starting on hot-remove. Another
group has hot-add and preliminary hot-remove support, but their aim is
primarily in adding/subtracting whole NUMA nodes, not single DIMMs.
Probably a 2.7 feature, but given that 2.6 is going to allow more invasive
changes, could actually go into the 2.6 tree, but hot-remove is nowhere near
Broke for dinner. Wandered around the city for a little bit.. wound up eating
at a Subway of all places. Met people working there from the US who didn't
like Canada and were happy to talk to Americans. Weird, that doesn't
normally happen ;).
Not too interested in the BOFS tonight, because all the ones I'm interested
in are the hotplugging ones, and I've already sat through 5 hours of
IBM hotplugging discussion today. May instead take the time to relax and do
some school work.. or read up on mesa-solo and some of the X work people are
Well, the Orioles split a double-header with Boston yesterday. Pretty cool,
still have a winning record against Boston. Thank you, Rodrigo Lopez.
DBUS, new things to replace CORBA/DCOP. "messages" are the things sent
across. Simple DBUS is an application to application IPC. Message routing
daemon creates a bus topology.
Major efforts put into security. All user space. 2 busses, one global bus,
and a per user session bus. Hotplug messages can be put on the dbus. Almost
stable, but the API might change.
Kernel information is transferred to and from the kernel via netlink.
-sidenote-, netlink is a generic way to use the sockets interface to get
stuff from the kernel to userspace. The issue is that you don't want to put
lots of notifications in the kernel for all the various things that happen.
Basically, need buy-in from most applications, which it may or may not get.
Already got KDE and Gnome buy-in.
No priorities. concern about what happens under load.
I'm torn, the next hour has 2 interesting talks. One on lockless reference
counters for the kernel, and one on directions for tiny devices. I'm leaning
towards going to the lockless reference counting one, even though the tiny
linux one might be more relevant to work.
Krefs / New Kernel Patch.
He isn't doing a krefs talk, he's doing a talk on the new patch process.
For 2.6.0 development, 1.66 patches per hour for 2 years. Pretty good
productivity. From 2.6.0-2.6.7, averaging 2.2 patches per hour.
MM tree has essentially become new devel tree, staging area for linus.
Then talked about krefs. Not really lockless, poorly named. Only lock needed
is around the last release.
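For anyone who hasn't looked at krefs, the idea is plain reference counting with a release callback that fires when the last reference goes away. Here is a toy Python sketch loosely modelled on the kernel's kref_init/kref_get/kref_put. The class is my own illustration, and since Python has no atomic integers, the small internal lock is only a stand-in for the kernel's atomic increment and decrement.

```python
import threading


class Kref:
    """Toy reference counter loosely modelled on the kernel's kref."""

    def __init__(self, release):
        self._count = 1                  # the creator holds the first reference
        self._release = release          # called once, when the count hits zero
        self._atomic = threading.Lock()  # stand-in for atomic inc/dec

    def get(self):
        """Take another reference on a live object."""
        with self._atomic:
            if self._count <= 0:
                raise RuntimeError("get() on an already released object")
            self._count += 1

    def put(self):
        """Drop a reference; run the release callback on the last one."""
        with self._atomic:
            self._count -= 1
            last = self._count == 0
        if last:
            self._release()
        return last


if __name__ == "__main__":
    ref = Kref(release=lambda: print("freed"))
    ref.get()   # second user
    ref.put()   # first user done, nothing happens
    ref.put()   # last user done, prints "freed"
```

The caller still has to make sure nobody can look the object up and call get() while the final put() is in flight, which is where the one remaining lock comes in.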
Lunch with some old MCLX-now-Red Hat people. Ran into -another- one today.
It's amazing how we proliferate.
Decided to go to this embedded sensor talk.
Linux on an embedded sensor device:
PC-104 (1-2W per module) -> PDAs, etc (~1W) -> tiny
microcontrollers, 10-100mW active. Want flexibility of PC-104, usability
of PDA, power budget of motes.
Move away from CPU-centric design. They did a complicated thing involving
multiple modules that are powered down when not needed, managed over an i2c bus.
Met up with Jim Gettys afterward and asked about 2.6 kernel support for the
iPAQ 3800, he directed me to the kernel maintainer, who gave me the mailing
lists for discussion and IRC.
cluster communication protocol. small message latency is 80% better than
TCP locally, 35% better inter-node. Large messages are the same.
Layered on top of ethernet.
Mentioned something called kernel port interface, which he claimed is sort of
like netlink.. man, any other possible way to communicate to user space?
I zoned out for the rest of this presentation and did school work.
And now, the other big talk i really wanted to go to here:
SMP and CPU frequency scaling
Opteron and Linux cpufreq support working with two processors at two different
speeds. powernow-k8 driver in place to support it, CG + rev opterons
support running in that state. Processor runs at 89W max TDP, normally
at 1.5v and 2.2G, slowing it down to 1.0G at 1.1v, drops that number to
29W. Think of a cluster of 1000 processors, that's a savings of 50,000W.
-IF- you're not idle
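A quick sanity check of the cluster arithmetic, as a sketch. Taking the two TDP figures above at face value, the difference is 60W per processor, which for 1000 processors works out to about 60,000W, so the 50,000W quoted is presumably a rounded-down or conservative number. The function name is made up for illustration.

```python
def cluster_power_savings(full_watts, scaled_watts, num_cpus):
    """Watts saved when every CPU runs at the scaled-down power level."""
    return (full_watts - scaled_watts) * num_cpus


if __name__ == "__main__":
    # 89W at 2.2G/1.5v versus 29W at 1.0G/1.1v, across 1000 processors.
    print(cluster_power_savings(89, 29, 1000), "W")
```

This is a best case: it assumes every processor is sitting at max TDP before scaling and can stay in the low state the whole time.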
Problems: ACPI doesn't allow specifying CPUs with different power states
(or speeds), because Windows doesn't support it. Linux runs just fine,
but no BIOS vendor will release a BIOS to support that.
Opterons (current revs) change the northbridge speed as well as the core
frequency. Which means other processors' accesses to that processor's memory
get slower :(. Remote latency goes from 1.7x to 2.5x worst case.
Current transitions cannot happen all at once (step down). HT is actually
paused during the transition. Latency is 0.015s for a full state change.
Transitions are serial, so it takes 0.080s to change all 4 processors.
powernow-k8 supports scaling_available_frequencies. need to modify powernowd
to use that and a table. also need to submit patches for cpu's that don't
AMD will not supply a way to override the BIOS table.
Talked about myriad of userspace daemons, and cpufreqd in particular, no
mention of my baby, powernowd.
Went out and wandered around downtown Ottawa and looked at the locks,
the Parliament buildings, etc. Grabbed a quick sandwich for dinner.
I'm attending a BOF about removing jiffies as a time mechanism, and another BOF
about the X Window System.
Interesting BOFs, lots to do. Came back to the hotel, doing some schoolwork.
The Orioles lost last night, 7-3. Sigh.. 1.5 games behind the Devil Rays, 2.5 in
front of Toronto. Damnit.
Got up, went to Cinnabon for breakfast since I've been threatening to do that
since I got here.
Kind of just a rehash of everything I heard at the BOF. Linux has good NUMA
support. A userspace NUMA API library is available for apps, and "numactl"
can change the policy for unmodified apps to get an idea of how the
improvements can help. Sched Domains support was key.
Now looking into areas of expanding support, allowing clusters to run as an
SSI, IO-awareness in the NUMA code, etc. Basically about a year ahead of
Solaris and increasing.
Decided to blow off the next two sessions (which were not-very-interesting
BOFs), and work on powernowd. However, the next step for powernowd is to
support scaling_available_frequencies and move to a table based approach.
PPC cpufreq driver didn't have that. Added support to the driver, and
submitted a patch for inclusion. Can now write the rest of the code when
I should be doing homework.
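Roughly what the table based approach could look like, sketched in Python rather than powernowd's own C. This is not the actual powernowd code: the helper names are hypothetical, and I'm assuming the sysfs file holds a whitespace-separated list of frequencies in kHz, which is what cpufreq drivers that export scaling_available_frequencies provide.

```python
from pathlib import Path

# Standard cpufreq sysfs location for cpu0; only present if the driver exports it.
SYSFS_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies"


def parse_frequency_table(text):
    """Turn the file contents into a sorted, de-duplicated list of kHz values."""
    return sorted({int(token) for token in text.split()})


def read_frequency_table(path=SYSFS_PATH):
    """Read and parse the table from sysfs (raises OSError if unavailable)."""
    return parse_frequency_table(Path(path).read_text())


def next_step(table, current_khz, direction):
    """Move one entry up (+1) or down (-1) in the table, clamped at the ends.

    Raises ValueError if current_khz is not one of the table entries.
    """
    index = table.index(current_khz) + direction
    index = max(0, min(len(table) - 1, index))
    return table[index]


if __name__ == "__main__":
    table = parse_frequency_table("2200000 2000000 1800000 1000000\n")
    print(table)
    print(next_step(table, 1800000, +1))
```

Stepping through a table like this means the daemon only ever asks for frequencies the driver says it supports, instead of guessing at values between the minimum and maximum.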
Played some networked bzflag against some people, who turned out to be the
authors of the software. Oddly, I didn't fare too well. Eavesdropped
on some cool X conversations. Played "guess the OS and architecture panic
message" using xscreensaver with Alan Cox and friends.
Now off to hear Andrew Morton's keynote. I'm sitting here and it's about to
start, and someone put an inflatable penguin in the row behind me so
everyone's taking pictures of it. Hell, I might make Slashdot.
Andrew's keynote was very good, mainly about Linux and its relation to
vendors and large companies wanting to contribute. Nothing really of too
much note that won't be covered elsewhere, I'm sure.
I'm now sitting in the hotel room. There's a party later tonight I may go
to, or I may just sit around and get some work done. It's been a really
cool trip. I'm very glad I came..
I'll be spending most of the day in a car tomorrow getting back to MA via
a detour through Ithaca. That means I should get some sleep anyway.