Until recently, you could reasonably expect this year's software to run faster on next year's machines, but that's not necessarily true going forward. For the foreseeable future, significant performance improvements are likely to be achieved only through arduous reprogramming.
Some time ago, computer vendors passed the point of diminishing returns on processor clock speeds and could no longer keep raising frequencies. To keep performance improving, suppliers turned to installing multiple instances of the processor, multiple cores, on a single chip, and as a result multicore processors are now mainstream for desktops. But to realize any performance improvement, the software has to be able to use those multiple cores.
And to do that, most software will need to be rewritten.
“We have to reinvent computing, and get away from the fundamental premises we inherited from von Neumann,” says Burton Smith, technical fellow at Microsoft Corp., referring to the theories of computer science pioneer John von Neumann (1903–1957). “He assumed one instruction would be executed at a time, and we are no longer even maintaining the appearance of one instruction at a time.”
But software cannot always keep up with the advances in hardware, says Tom Halfhill, senior analyst for the Microprocessor Report newsletter in Scottsdale, Ariz. “If you have a task that cannot be parallelized and you are currently on a plateau of performance in a single-processor environment, you will not see that task getting significantly faster in the future.”
New law in town
For four decades, computer performance progress was defined by Moore's Law, which said that the number of devices that could economically be placed on a chip would double every other year. A side effect was that the smaller circuits allowed faster clock speeds, meaning software would run faster without any effort from programmers. But overheating problems on CPU chips have changed everything.
“The industry has hit the wall when it comes to increasing clock frequency and power consumption,” says Halfhill. There are some chips edging above 4GHz, “but those are extreme cases,” he says. The mainstream is still below 3GHz. “The main way forward is through multiple processors.”
By adding more cores to the CPU, vendors offer the possibility of higher performance. But realizing higher performance through multiple cores assumes that the software knows about those cores, and will use them to run code segments in parallel.
Even when the software does that, the results are gated by Amdahl's Law. Sometimes called Amdahl's Curse, and named for computer pioneer Gene Amdahl, it lacks the upbeat outlook of Moore's Law. It says that the overall speedup from parallelization is 1 divided by the sum of the fraction of the task that cannot be parallelized and the run time of the parallelized portion after it has been sped up: for a serial fraction S of the work spread across N processors, the speedup is 1 / (S + (1 - S)/N).
In other words, “It says that the serial portion of a computation limits the total speedup you can get through parallelization,” says Russell Williams, chief architect for Photoshop at Adobe Systems in San Jose, Calif. “If 10% of a computation is serial and can't be parallelized, then even if you have an infinite number of infinitely fast processors, you could only get the computation to run 10 times faster.”
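Plugging Williams' numbers into the formula makes the ceiling visible. The short C++ sketch below (an illustration for this article, not code from Adobe) computes the Amdahl speedup for a 10% serial fraction at various core counts; no matter how many cores are added, the result never exceeds 10x.

```cpp
#include <cstdio>
#include <initializer_list>

// Amdahl's Law: with serial fraction s and n processors,
// speedup = 1 / (s + (1 - s) / n).
double amdahl_speedup(double serial_fraction, double processors) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors);
}

int main() {
    const double s = 0.10;  // 10% of the computation cannot be parallelized
    for (double n : {2.0, 4.0, 8.0, 16.0, 1024.0}) {
        std::printf("%6.0f cores -> %.2fx speedup\n", n, amdahl_speedup(s, n));
    }
    // Even with an unlimited number of cores, the limit is 1 / 0.10 = 10x.
    return 0;
}
```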
People in the know often refer to Photoshop as a model desktop application in terms of multi-core support and parallelization. Williams says that Photoshop has been supporting multi-processor operations since about 1995, but adds that, even so, much of Photoshop's code is devoted to opening and closing dialog boxes, and therefore is not subject to parallelization.
Lots of algorithms have “significant chunks” of serial code, Williams notes. “People with PhDs have been working on this problem for 20 years — it is not a matter of solving it by sitting at your desk and thinking hard for a few minutes. Typically, the way around Amdahl's Law is to simply find embarrassingly parallel problems, but you can't escape the fact you're limited by the serial portion of your calculations.”
Even with parallelization, Williams explains, performance does not scale linearly: two cores can give nearly 2X acceleration, but four cores give less than 4X acceleration. This is due to memory bandwidth issues (i.e., the RAM being slower than the processor) and delays imposed by inter-processor communications.
“We can't take advantage of eight cores without improved memory bandwidth, and I know of no application that could take advantage of 16 cores,” he says. “Memory bandwidth is a huge issue because after a while you are just waiting for the memory.” New processors with on-board memory controllers are offering some help, he adds; such controllers speed up RAM access, though they also lock the CPU into using a specific type of memory.
When Microsoft first shipped Windows, most programs were still written for DOS, and it was a good 10 years until the industry saw more Windows than DOS software. Similarly, “most of the software on the shelf now is not parallel and some, like word processors, never will be,” says Halfhill.
On the other hand, “We are talking about a similar thing here, but the presence of parallelization APIs in Windows 7 and in the Macintosh Snow Leopard operating systems will speed up the process, and the low-hanging fruit may be done in three to five years,” Halfhill says. Further, not every program needs to be rewritten.
Microsoft's Smith agrees. “Not all software will be converted in five years, but we will have made significant progress. This is a more profound change than has ever been seen before in computing.”
Microsoft's current desktop operating systems, Windows XP and Vista, “like most other systems,” use the kernel to schedule threads on the multiple cores of the system, Smith explains. A thread is an independently schedulable sequence of instructions within a program; the kernel decides when and on which core each thread runs, and the threads competing for cores may come from entirely different applications.
“When a thread needs to wait for something, like I/O or another thread's output, the kernel runs some other ready-to-go thread on the freed-up core,” Smith explains. “When the first thread's wait is over and it becomes eligible to run again, it will eventually get a core assigned to it.”
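Smith's description maps onto what ordinary threaded code looks like. The sketch below is a generic illustration in C++ (not Microsoft's code): it hands several worker threads to the operating system, and whenever one of them blocks, simulated here with a sleep, the kernel is free to run another ready thread on the vacated core.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Each worker alternates between "computing" and waiting. While one
// thread waits, the kernel can run another ready thread on that core.
void worker(int id) {
    for (int step = 0; step < 3; ++step) {
        std::printf("thread %d: computing step %d\n", id, step);
        std::this_thread::sleep_for(std::chrono::milliseconds(50)); // simulated I/O wait
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id)
        workers.emplace_back(worker, id);  // the kernel decides which core runs each one
    for (auto& t : workers)
        t.join();                          // wait for all workers to finish
    return 0;
}
```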
In general, consumer operating systems “don't do anything very smart” with multiple cores, says Jim Turley, head analyst with Silicon Insider, a consulting service and newsletter in Pacific Grove, Calif. Vista is “reasonably aware” of multiple cores, and is “fairly smart about dividing up background tasks and foreground tasks.” Vista can run games on one or two cores while housekeeping tasks run the other cores.
Rob Enderle, principal at the Enderle Group in San Jose, Calif., says that Windows 7 does an even better job of it. “Windows 7 is designed to use as many cores as the machine has, and will partition an application among the multiple cores — but that does not give as much benefit as if the application used the cores directly.”
Windows 7 has an alternative mechanism called User Mode Scheduling (UMS), which lets thread multiplexing onto cores take place within the application itself instead of in the kernel. Multiplexing of threads is the process of deciding which thread is executed next. Handling this multiplexing within the application instead of in the operating system kernel “makes thread scheduling more efficient,” Smith says.
A Microsoft blog link he supplied indicates that programmer access to UMS is possible through Visual Studio 2010, currently in beta, and involves use of the operating system's Concurrency Runtime facility. Windows 7 will also be able to use up to 256 cores, arranged in four groups of 64.
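For application programmers, the visible surface of the Concurrency Runtime is a set of libraries such as the Parallel Patterns Library. The fragment below is a minimal sketch of that style of call, assuming Visual C++ 2010's ppl.h and its Concurrency::parallel_for; it illustrates handing loop iterations to the runtime's scheduler, not UMS itself, which the runtime can use under the covers on supported systems.

```cpp
#include <cstddef>
#include <ppl.h>     // Parallel Patterns Library, built on the Concurrency Runtime
#include <vector>

int main() {
    std::vector<double> image(1000000, 1.0);

    // The runtime's scheduler decides how to spread these independent
    // iterations across the available cores.
    Concurrency::parallel_for(static_cast<std::size_t>(0), image.size(),
        [&](std::size_t i) {
            image[i] = image[i] * 0.5 + 0.25;  // per-element work, no shared state
        });
    return 0;
}
```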
Meanwhile, most applications will run on only one core, “so you get the benefit of having multiple cores only when running multiple applications,” Enderle says. Virus checkers and utilities that run in the background “tend to not visibly drag down your machine, whereas on a single-core processor they definitely do,” he says. Two cores seem to be optimum and a third “gives you headroom.” When watching the performance meter in Windows “you can light up two cores really easily, three occasionally and four hardly ever. Four cores are for video games, heavily threaded applications or DNA analysis.”
Some Intel processors additionally offer a form of on-chip dual processing, called Hyper-Threading Technology, where each core can run two threads in parallel, so the software sees twice as many cores as there really are. It is not as good as having two separate cores, and the boost you get varies greatly, but most people get a 20% to 30% boost through Hyper-Threading, according to George Alfs, an Intel spokesman.
Enderle notes that the Windows performance meter displays each core with Hyper-Threading as if it were two cores.
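What an application actually sees is the logical processor count. A few lines of standard C++ (a generic illustration, unrelated to the Windows performance meter) report that number; on a Hyper-Threaded chip it is typically twice the number of physical cores.

```cpp
#include <cstdio>
#include <thread>

int main() {
    // On a dual-core CPU with Hyper-Threading this typically prints 4:
    // the operating system exposes two logical processors per physical core.
    // (The call can return 0 if the count cannot be determined.)
    unsigned logical = std::thread::hardware_concurrency();
    std::printf("logical processors visible to software: %u\n", logical);
    return 0;
}
```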
Re-coding
“What we'd all like is a magic compiler that takes yesterday's source code and spreads it across multiple cores, and that is just not happening,” Turley says. “There are C compilers that make a modest dent, but a lot of research indicates that C will never take you very far since the fundamental problem is C itself — it is inherently serial. There is no easy way to program in parallel; it's like writing poetry in Klingon.”
Turley says that the world does not need yet another programming language. “Any third-year student worth his salt has invented one, but the trouble is getting people to adopt it — no one wants to learn a new language.” Since there are so many alternative approaches, “no one wants to commit,” he says. If some authority would declare for one approach people would rally around it, but in the meantime there is widespread confusion and competing claims. “We may have to wait for the current generation of programmers to die off and be replaced by programmers brought up on a new paradigm,” Turley laments.
The easiest way to add parallelism is to call code that is already parallelized, from a library, says Williams at Adobe. The next easiest is to use bottleneck routines, or separate little routines that only know about specific pixels. “That is the way we did it for a long time,” he says. A third way is to write a parallel version of a complicated algorithm. But “that can easily take twice as much work [as writing a non-parallel version]. We're not talking about 10% more work here.”
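The first of those options, calling into code that is already parallel, can be a one-line change. The sketch below is a hedged, modern-day equivalent using the C++17 parallel algorithms rather than anything Adobe ships: the library, not the application, decides how to split the per-pixel work across cores.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> pixels(4000000, 0.5f);

    // The parallelism lives inside the library call: std::execution::par
    // asks the standard library to spread the work across available cores.
    std::for_each(std::execution::par, pixels.begin(), pixels.end(),
                  [](float& p) { p = p * 0.8f + 0.1f; });  // simple per-pixel adjustment
    return 0;
}
```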
A fourth approach is functional parallelism, “where you let the user do different things simultaneously, such as getting thumbnail images while changing meta-images,” Williams explains. “Photoshop was written before system software supported that, so we don't do a lot of that. Modern operating system facilities let you do functional threading without a huge amount of effort — maybe 50% more — but converting a large algorithm written before such stuff was available is a big effort,” he says.
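Functional parallelism of the kind Williams describes can be sketched with a background task. The hypothetical example below (not Photoshop code) launches thumbnail generation on another thread and only waits for it when the result is actually needed, leaving the foreground free for other work in the meantime.

```cpp
#include <cstdio>
#include <future>
#include <vector>

// Stand-in for a slow job the user should not have to wait on.
std::vector<int> build_thumbnails(int count) {
    std::vector<int> thumbs;
    for (int i = 0; i < count; ++i)
        thumbs.push_back(i);  // placeholder for real image scaling
    return thumbs;
}

int main() {
    // Kick off thumbnail generation on another thread...
    auto pending = std::async(std::launch::async, build_thumbnails, 100);

    // ...while the foreground keeps responding to the user.
    std::printf("foreground: still editing while thumbnails are built\n");

    // Collect the result only when it is needed.
    std::vector<int> thumbs = pending.get();
    std::printf("generated %zu thumbnails\n", thumbs.size());
    return 0;
}
```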
What is needed is not more code but different code — and a different way to organize the application, adds Smith at Microsoft. “You must understand parallelism and that is not always obvious.”
A first step is to minimize the use of variables. “Variables are artifacts of sequential execution,” Smith says. “If it is always true that A+B=C, what if someone gets in the middle of that and adds something to B so that the equation no longer holds true? You must have a consistent state where that is prevented.” Traditionally this prevention is done by locking the variables, but he advocates the use of transactional memory, which does much the same thing automatically by isolating the variables from other code that is running at the same time.
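The traditional locking Smith refers to looks roughly like the C++ sketch below (a generic illustration; transactional memory would aim to provide the same isolation automatically). The mutex ensures that B and C are always updated together, so no other thread can observe a state in which A + B no longer equals C.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct State {
    int a = 1, b = 2, c = 3;   // invariant: a + b == c
    std::mutex m;

    // Update b and c together so the invariant never appears broken.
    void add_to_b(int delta) {
        std::lock_guard<std::mutex> lock(m);
        b += delta;
        c = a + b;
    }

    int read_c() {
        std::lock_guard<std::mutex> lock(m);
        assert(a + b == c);    // always holds while the lock is held
        return c;
    }
};

int main() {
    State s;
    std::thread writer([&] { for (int i = 0; i < 1000; ++i) s.add_to_b(1); });
    std::thread reader([&] { for (int i = 0; i < 1000; ++i) s.read_c(); });
    writer.join();
    reader.join();
    return 0;
}
```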