On September 14th, the project group met with Stanford University President and MIPS founder John Hennessy to discuss the foundations of the RISC architecture and
his views on the directions of processor technology. Below is a transcript of this interview.
US: When we were researching, we found out when different things started, etc. What was the specific timeline? What were the project's original intentions?
HENNESSY: Well, the start was, we had research funding to do work in the general area of VLSI and VLSI systems and with the notion from the funding agencies that the introduction of very large scale integration was going to change some of the traditional trade-offs that have been made (hardware design, and the way systems could be built). So the first thing we built was the geometry engine, which became the heart of Silicon Graphics. That was our first real big research project. And then we had finished the chip and we were in the throes of building the actual prototype for it. We then said, "let's think about what we can do next."
Since I came to Stanford ('78 or so), I had been running a group in compiler technology and optimizing compilers and things and we said, "Well we have this expertise in compiler technology. Why don't we think about doing a processor?" So that was the origin of how we kind of established the concept of doing a processor. And then we started the process with sort of a one quarter graduate level brainstorming class. Let's just read a bunch of papers, think about crazy ideas, different ways to think about processors that we might use as the basis. So really trying to come up with some key insights that would form the core of what became MIPS. I think those key insights were really from our perspective that we knew how to build compilers.
At that time there was a lot of work going on in computer architecture; the VAX had come out a few years before. Much of that work was focused on the notion of raising hardware up to the level of software. There were even these famous papers on the semantic gap between hardware and compilers.
Well, having worked on optimizing compilers, our view was quite different - that we could take the compiler right down to the level of hardware. So what you wanted to do was optimize that boundary so that the overall performance of the hardware and software system was optimized. So that was sort of the conceptual notion from day one. We were almost thinking about compiling down to the level of microcode, which reflected background that I'd had, because what I'd done for the geometry engine was basically build a microcode compiler for it. And then I'd worked on a project as a consultant for Silicon Compilers that became the first microprocessor version of the VAX, and it was a microcoded pipelined engine - it was pipelined at the microcode level actually, not at the VAX instruction set level.
US: So at this point, did you know about the other projects that were being worked on?
HENNESSY: We didn't know anything about the Berkeley project because this was 1-2 quarters later. We were almost overlapped in terms of their starting point. We knew that IBM had been diddling around with some ideas. We didn't know how far they'd gotten or anything else, but we knew that they had some concepts that involved some of the things we also believed in: lots of registers, simple architecture...we decided quite early on that pipelining was important and was going to become a key technology so the notion of having a compiler schedule a pipeline and compiling down to the microcode level and putting two operations in an instruction - things like that were concepts that really came from the compiler/microcode kind of view. We didn't know that much more about the IBM project. One of the students on it had spent a summer in Yorktown so he knew a little bit about what was going on, but that was about it.
You know, both the Yorktown group that worked on the 801 and our group were heavily influenced by compiler technology, and I think by a belief that modern register allocators could work quite well, and that it was actually easier to compile for simpler machines than it was to compile for more complex machines. With more complex machines, you actually end up in a situation where the code representation before you generate the final code is below the level of machine instructions, so you actually have to somehow accrete these individual operations together so that you pack them into an instruction.
Actually, work was going into compiler technology at that time. A group that Susan Graham had at Berkeley was actually doing exactly that. They were actually taking the final representation of the compiler and parsing it to try to pattern-match VAX instructions out of this, so that you actually were almost reverse compiling. So we said, "Well, that's a really dumb idea...why not just put the hardware down at this level and get rid of this kind of going-backwards stuff?" So that was stuff that we built into the idea of the project early on...
Then you know, we did the course in the spring, I believe, and then really started working hard in the summer and by the fall we were designing silicon and people were doing layout and stuff like that, and working hard on getting a chip done.
US: So were there any sacrifices or decisions to be made in the formation of the instruction set?
HENNESSY: Yeah, well, we decided early on that the machine should be pipelined and that it should aim for single-cycle execution. We quickly saw all the advantages of single-cycle execution in terms of not having to deal with instructions which get page faults in the middle of them, or interrupts in the middle of them...having a simple pipeline structure, or exposing parallelism in the data path.
Some of the more controversial decisions we had to make: We decided not to implement byte addressing. That was kind of an experimental try to say, "What could you do if you didn't support it? Could you make a machine that really worked well, and how much advantage did that gain you in terms of simplicity?" We didn't put in pipeline interlocks at all, because we believed the compiler could do that function just as well as the hardware could. And so there were trade-offs like that that got made early on. But once you say this is going to be a load/store instruction set, every instruction is going to make one pass through the pipeline, you've already narrowed the scope of things pretty significantly. There were a few things to figure out in terms of shift commands and things like that. A long discussion on how to support multiply and divide, which we ended up doing with...some instructions did a piece of a multiply rather than an entire integer multiply, and then similarly on the divide. But once you make the big decisions, a lot of the landscape is laid out. And you can see that if you go and look at the RISC machines out there and take the core of the instruction set, it's 90% identical.
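The multiply-step idea Hennessy mentions can be sketched in software. This is an illustrative model, not the actual MIPS instruction semantics: each "step" performs one shift-and-add of a classic unsigned multiply, and a full multiply chains one step per bit of the operand.

```python
def multiply_step(hi, lo, multiplicand, width=32):
    """One shift-and-add step of an unsigned multiply.

    If the low bit of the running product is set, add the multiplicand
    into the high half, then shift the combined hi:lo pair right one bit.
    """
    mask = (1 << width) - 1
    if lo & 1:
        hi = (hi + multiplicand) & ((1 << (width + 1)) - 1)
    # move hi's low bit into lo's top bit as the pair shifts right
    lo = ((lo >> 1) | ((hi & 1) << (width - 1))) & mask
    hi >>= 1
    return hi, lo

def multiply(a, b, width=32):
    """A full unsigned multiply built from `width` step instructions."""
    hi, lo = 0, a
    for _ in range(width):
        hi, lo = multiply_step(hi, lo, b, width)
    return (hi << width) | lo
```

The hardware appeal is that each step is just an adder and a shifter, so the datapath stays simple while the compiler (or a short library routine) strings the steps together.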
US: Are there any differences between the different RISC chips, then? Something about the MIPS technology that the others didn't incorporate?
HENNESSY: Simpler. You know, I think a little more simplicity, probably, than the others, and I think that's been its primary advantage from day one. That it has been a little simpler than most of the other machines. Designed by a handful of people rather than a committee. Anytime you do a design by committee, it suffers.
You know, there are some things in the differences between the Stanford project and Berkeley project that I think come from one key thing, and that's whether or not you had a strong compiler background. We came from a compiler group, as did the 801 group, and the Berkeley people didn't. And so the reason they ended up with register windows is because they saw that the procedure call overhead was high. Well, from our view, that's because register allocation wasn't done carefully, and so we took the view that the right answer is to do register allocation properly. And one of the things I worked on with one of my graduate students before we started that project was this new register allocation algorithm that took into account, for example, things like there's a procedure call here, and therefore if I have a value in a register that hasn't been saved, then I have to save the register, and what's the cost of that? And so it was really a global scheme that accounted for all those costs, and I think that made us all more confident that we could get the cost of the procedure call way down without having to put in special hardware support for it.
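The cost-driven allocation Hennessy describes can be caricatured in a few lines. Everything here is hypothetical - the names and the charge of two memory references for a save/restore pair are invented - but it shows the key point: a value whose live range crosses a call is charged for the save and restore, so a rarely-used value that crosses calls may not earn a register at all.

```python
def allocate(candidates, num_regs, call_crossing_cost=2):
    """Greedy sketch of cost-driven register allocation.

    candidates: list of (name, uses, crosses_call) tuples, where `uses`
    counts the memory references saved by keeping the value in a
    register, and `crosses_call` says whether its live range spans a
    procedure call (forcing a save/restore of that register).
    Returns the set of values that earn a register.
    """
    scored = []
    for name, uses, crosses_call in candidates:
        benefit = uses - (call_crossing_cost if crosses_call else 0)
        scored.append((benefit, name))
    scored.sort(reverse=True)
    # hand out registers to the highest net-benefit values only
    return {name for benefit, name in scored[:num_regs] if benefit > 0}
```

With two registers available, a hot loop counter and a frequently-used accumulator win out, while a once-used temporary that crosses a call would cost more to save and restore than it saves, and stays in memory.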
US: Another component of our project is covering developments in RISC chips and other processors. One of the articles we were looking at claimed we are now in a "post-RISC" era - with most CISC chips performing one instruction per cycle and most RISC chips including more commands - so that the differences between the chips aren't as significant. Do you agree with this, or do you have a different perspective?
HENNESSY: Well, I have a different point of view. What I would point out is that, first of all, the only remaining CISC architecture, per se, is the Intel X86 architecture. All the rest have sort of vaporized. That alone says something. I think that market presence is primarily what's kept the X86 alive.
But even if you look at the modern implementations - the Pentium Pro, Pentium II, Pentium III implementations of the X86 - the way they work is they take the X86 instruction set and translate it into micro-instructions, these things they call micro-ops, which are basically RISC instructions - they do load, store, add, things like that - and then they pipeline that structure. So they don't pipeline X86 instructions; they actually pipeline the micro-ops and detect all the interlocks among those.
This actually gets back to an earlier idea we had. When we did MIPS, one of the reasons that we said, "Well, we shouldn't have microcoding in there" is that I believed that microcode did work at run time that could be done at compile time. What it really does is interpret the instruction set from one level to another level: to this micro level. Well, that's stupid, because it's stupid to do work at run time that you can do at compile time. So that's really the way a modern X86 works.
Now what has happened, there's no doubt about it, is that the RISC machines have become extremely complicated in their implementations. To some extent, yeah, there have been instructions added and things like that. But it was never about instruction count; it was about designing the instruction set so that a high-performance implementation was made easier. What's happened is really interesting. We've designed these out-of-order, dynamically-speculative machines. Well, I think if you told people 10, 12, or 13 years ago that this was the kind of machine they were going to be designing, they would have said, "You're crazy."
In fact, the only machine that had ever been designed with that philosophy was the 360/91, which shipped six processors and was never fully debugged. They never got all of the bugs out of the hardware, because at that time (this was in the 1970s) the machine was so complicated for the state of the art of the technology that they couldn't get it right. I remember Mike Flynn telling me, "You know, nobody will ever build a machine like that again" because it was too complicated.
Well, several things happened. The RISC machines, at one level, made it easier to think about designing a piece of hardware that was that complicated, because things are a lot simpler if the instructions only take one cycle and only use simple formats - you can actually contemplate something that is logically that complicated and get it right. That's the ongoing issue: the simplicity of the instruction set, or the care in the design of the instruction set, makes it easier to do the implementation. And I think that's an ongoing advantage that translates into how big the design team needs to be and how long it takes to do the design.
Whether or not that translates into a market advantage depends on how big the market is - how many people are buying your processor. The fact is that Intel can afford to spend ten times as much on a chip design as anybody else because they have ten times the volume. That's certainly made it an interesting game, you know? What will happen with the whole IA-64 and Merced? I don't think that's necessarily the right design plan, but I think it's a smart idea for Intel to try to get a new architecture. They took the X86, which was designed to be a 16-bit machine, and they cobbled it to be a 32-bit machine. It's another hack to make it be a 64-bit machine, which is where they need to move to.
So there are good reasons for them to think about a new architecture. Whether or not this is the right choice, well, that's another debate. It's certainly a...
let's say they didn't hesitate to add things to this new architecture. They added a lot of stuff. It's a pretty big architecture with a lot of...
...and my view has always been: start small. Because if you have an architecture that's going to live 15 or 20 years - the commercial MIPS architecture is now 16 years old - you're going to end up adding things along the way, because new markets are going to develop. Graphics is going to become important - you're going to want to put some support for graphics in there. Or security's going to become important...
So start small, so that when you have to add those things you don't end up with something that looks completely like a kitchen sink. And that's hard to do. When you start out with an architecture that tries to do everything on day one, I guarantee that you won't get it right, and you'll end up with something that looks terrible, especially after it evolves over time.
US: What kinds of bottlenecks are there today in processor technology?
HENNESSY: That's a good question. I think there are a variety of things. If you want to fetch and dispatch multiple instructions per clock, then the first bottleneck is just how many instructions you have to keep fetching if you expect to issue four a clock, especially if you account for instruction cache misses and branch mispredictions and everything else. The biggest problem right now is that...there are two problems: one is that the peak-to-sustained ratio for these machines that are all trying to do four to six instructions per clock is terrible.
Now they've gotten better in this second generation of machines than they were in the first generation - the engineers just got smarter. The first time you design this speculative, out-of-order machine, you get a lot of performance bugs just because you didn't understand, as you did the design, where the bottlenecks were. And what happens in any of these designs is you're going along, you're simulating it, you're designing it, and you figure, "I can't quite do it that way - the high-level architecture says do it this way, but it ain't gonna work, so I end up doing it this way" - and then you end up introducing bottlenecks into the machine. And if you could afford to do it all over again, or extend the design cycle by another six months, then you could fix it. But by that time you're already late, the schedule's already behind, and you're racing to try and get it done.
So, this peak-to-sustained ratio is a real problem, and the way to think about this is - well, you watch what happens. They gobble a bunch of instructions in the front end, and occasionally they get a cache miss, so instead of their fetching - let's say they're fetching four instructions per clock and trying to dispatch four - occasionally they get a cache miss, so they end up with an average of, let's say, three instructions per clock actually fetched. Then sometimes they get branch mispredictions, so they lose a little more. And then the instructions aren't aligned properly in some way, or there are three loads in a row (and they can't dispatch more than two loads in a row), so a little more gets lost. Then you get, say, a secondary cache miss, and all of the reservation stations and all of the buffers get filled up with things that can't quite issue until that miss gets back, so you lose a little more, lose a little more. Start at one end - four instructions per clock, or six instructions per clock - and you end up with one and a half or two, just a kind of <chopping noise> along the way.
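The whittling-away Hennessy describes compounds multiplicatively. A toy calculation - the loss fractions below are invented for illustration, not measured - shows how a 4-wide machine ends up sustaining well under two instructions per clock:

```python
def sustained_ipc(peak, loss_factors):
    """Multiply out the fractional losses along the pipeline.

    Each entry is (cause, fraction_of_throughput_kept); the causes
    compound, so the sustained rate is the peak times the product of
    all the kept fractions.
    """
    ipc = peak
    for _cause, fraction_kept in loss_factors:
        ipc *= fraction_kept
    return ipc

# Illustrative numbers only: each stage keeps some fraction of throughput.
losses = [
    ("i-cache misses", 0.75),          # 4 fetched -> ~3 on average
    ("branch mispredictions", 0.85),
    ("dispatch restrictions", 0.9),    # alignment, load-port limits
    ("secondary-cache stalls", 0.8),   # buffers fill while a miss waits
]
print(sustained_ipc(4, losses))  # ~1.8 sustained out of a peak of 4
```

No single loss looks fatal - it is the product of many modest ones that drags a 4-wide machine down to under two instructions per clock.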
So trying to understand that, optimize that process, figure out what the right balance between compiler technology and automated dynamic hardware techniques is. We don't yet have a graceful combination of those two. We have a set of dynamic techniques; we have a set of static techniques. Each of them works okay on its own, but each has something to contribute that the other side isn't quite taking advantage of. How to bring those two together is, I think, a key question.
The other bottleneck is that as you try to boost these issue rates, and go from four to six to eight to sixteen eventually, it's just incredibly complex. The amount of hardware you have to throw in, the number of simultaneous decisions that you have to make in a single clock, goes up very rapidly. In these machines, in a single clock cycle, you have to determine all of the interdependencies among those sets of instructions. So if each instruction has a result operand and each instruction reads two operands, you've got to do a comparison, in order, between all of the instructions in the issue packet to look for those, and then adjust the renaming and reorder buffer numbers so that you can issue that whole packet of instructions.
And that has to be done in, basically, a single clock cycle, effectively. Because that's your issue bottleneck. So there was a thing - Mike Flynn actually wrote a paper about this many years ago, where he called it the Flynn Bottleneck - where he sort of said if you want to execute more instructions, you can't execute more than - well, at that time, he said more than one per clock - because you have to look at each instruction, get it through that little hole. Well, now that number is up to four or so, but the problem is each time you raise it, there's a piece of that that's quadratic, where the number of things you have to look at is the square of the number you want to raise it by. And that's the real bottleneck that makes it difficult.
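The quadratic piece is easy to see by counting comparisons. In this sketch - a simplification of real rename logic - each instruction's two source operands must be checked against the destination of every earlier instruction in the same issue packet:

```python
def dependence_checks(width, reads_per_instr=2):
    """Same-cycle operand comparisons needed for one issue packet.

    Instruction i (0-indexed) must compare each of its source operands
    against the destinations of all i earlier instructions in the
    packet, so the total grows quadratically with issue width.
    """
    return sum(i * reads_per_instr for i in range(width))

# Doubling the issue width roughly quadruples the comparator count.
for w in (1, 2, 4, 8, 16):
    print(w, dependence_checks(w))
```

All of those comparators, plus the rename and reorder-buffer updates they feed, have to settle within one clock cycle, which is why widening the issue window gets so expensive so fast.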
So, add to that the fact that you get diminishing returns. Let's suppose now you're at four issues per clock; let's say, on things that don't have a lot of cache misses, you might sustain one and a half to two instructions per clock, so you're slightly less than 50%. When you go to eight, you're not going to get four. You'll probably be a little lower than four - maybe three. And now the absolute difference, though, in terms of hardware and the number of transistors that are sitting there doing nothing, has gone way up.
So these things are all becoming challenges. I gave this analogy in a talk I gave last year to the Federated Computing Research Conference. I said it's like trying to push a giant rock up a hill. And we've been pushing this rock up the hill. First it was very flat - just pushing the rock along flatland. And eventually you get near the top of that hill, and that rock, with so much gravity trying to push back on you, occasionally the rock actually rolls backwards, squishing the design team. And you get these situations where, after spending three years and, who knows, maybe fifty engineers a year, you conclude you can't finish - you've missed the window. And all of a sudden you cancel a project on which, at that point, you've probably spent thirty million, forty million dollars. And then you just <"poof" sound>. Well, that's really bad. So that should be avoided.
So I think we need to think long-term. My view is that we need to get a better handle on parallelism at a higher level, and we need to figure out how to get the software systems there, and that's a really long-term transition. But if we don't start kind of moving in that direction, then all of the rocks will get too big to push up the hill and it won't work anymore.
US: Are there any recent advances in RISC processing, or anything like that, that we should know about - any rocks that have gone all the way up the hill?
HENNESSY: Yeah, yeah, I think there are a few ideas that are floating around. I think the idea of doing speculative execution is now pretty well understood (five years ago people would have said, "Well, you're crazy," but now it's pretty well understood). I think that some of this thread-level speculation, which is still on the research side, is pretty interesting.
The simultaneous multi-threading idea is a very interesting idea: it enables you, if you're going to build a dynamically scheduled machine, to get a lot more throughput out of the back end of the machine with very little extra hardware, by basically running multiple threads and interweaving threads on the hardware.
And then there's this idea being kicked around called "value prediction," where you predict the result of an instruction - the value that an instruction will return. It's interesting for two different kinds of applications. One of the things you want in these machines is to reorder loads and stores in the flow, and that's important to do because if you have a store here and you have a load here, that load is often at the beginning of a calculation that you really want to get started early, because it has a bunch of things that depend on that load. So what you want to do is yank the load up across the store, but the problem is that the load and store might touch the same memory location. So the way that this problem is solved is that you break the loads and stores into two pieces. In the first piece you calculate the effective address - the memory address it's touching. And then what you do is you build a little buffer which keeps the effective addresses, and then you look at the effective address that the load wants to touch and the effective address the store wants to touch, and before you have the store result (you don't actually need the value that you're going to store yet), you can move the load across the store.
The problem with this is, if you have programs with a lot of pointer chasing, then what's actually happening is that with the store you're storing through a pointer value, and with the load you're loading through a pointer value, so you may not know what the address is early on. So then what happens is you can't get these two things to pass by one another.
But suppose you run the program, and you went through this loop fifty times, and you found that in those fifty times the load and store never touched the same address. Well, then you guess that in fact they won't touch the same address again, pull the load above the store, and then use all of these hardware mechanisms that are in these speculative machines. And if the guess was wrong, you undo it. So that actually could work okay. That's one use of it.
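The two mechanisms Hennessy walks through - an exact effective-address check, and a history-based guess for when addresses aren't known early - can be sketched as follows. The function names and the fifty-iteration window are illustrative, not any real microarchitecture's:

```python
def may_hoist(load_addr, pending_store_addrs):
    """Exact disambiguation: the load may move above the pending
    stores only if its effective address matches none of theirs."""
    return load_addr not in pending_store_addrs

def speculate_hoist(collision_history, window=50):
    """Speculative disambiguation: if this load/store pair never
    touched the same address in the recent past, guess that they
    won't now. collision_history is a list of booleans (True means
    the addresses matched on that execution); a wrong guess is
    undone by the machine's existing speculation-recovery hardware."""
    return not any(collision_history[-window:])
```

A wrong speculative guess costs a pipeline flush, which is why real designs also track confidence before hoisting.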
The other use of it is: suppose you have this load down here, and that load loads a memory value, and you notice that, in the last ten executions, it always loaded the value five. Then you can guess that the next load will be the value five, so you do the same thing. You guess that it's going to be five, and then you start running the dependent code, and then again you can use the speculative mechanisms to back it up. These are all ideas that try to increase the sustained-to-peak ratio so that you get it closer, so that you get some improvement in it. I like the SMT idea. Have you seen this? Simultaneous multi-threading? So you know what multi-threading is?
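A minimal last-value predictor along the lines Hennessy sketches might look like this. The table layout and the confidence threshold of two are assumptions, loosely following published value-prediction proposals rather than any shipped design:

```python
class LastValuePredictor:
    """Predict that a load returns the same value it returned last
    time, gated by a per-load confidence counter."""

    def __init__(self, threshold=2):
        self.table = {}          # load PC -> (last_value, confidence)
        self.threshold = threshold

    def predict(self, pc):
        """Return a predicted value, or None if confidence is too low."""
        value, conf = self.table.get(pc, (None, 0))
        return value if conf >= self.threshold else None

    def update(self, pc, actual):
        """After the load completes, train on the actual value."""
        value, conf = self.table.get(pc, (None, 0))
        if value == actual:
            self.table[pc] = (value, conf + 1)   # streak continues
        else:
            self.table[pc] = (actual, 0)         # reset on a miss
```

After a load at one PC returns the same value repeatedly, the predictor supplies it early and the dependent code runs speculatively; one differing value resets the confidence and suppresses prediction until the streak rebuilds.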
US: Um hm.
HENNESSY: So multi-threading is you just run multiple threads on the same hardware. So multi-threading has been around for a while, and it's an interesting idea. Somebody made the observation that, if you build a back end of a machine that is dynamically scheduled, speculative, and all that kind of thing, all it really is is a structure that fetches instructions, throws them into a buffer, and then this buffer basically schedules the instructions in data-flow order.
So, you can take those instructions from multiple threads and throw them into the buffer and let the back end of the hardware do all of the work of deciding which instruction will execute next. And the thing that's advantageous about this is the same thing that's advantageous about multi-threading in general. If you have a single thread, and let's say it gets a cache miss, that cache miss may be long enough that it stalls the entire thread. And even though you have the next twenty or thirty or forty instructions buffered from that thread, eventually they will all depend on that cache miss, and you can't go forward, because there's an add that depends on that, and there's a load that depends on the add, and so on. But by doing multi-threading, you get around the problem that you may have dependencies or clustering of cache misses, and you pull in the different threads and execute them. So it provides you a way to boost the throughput of these things.
The other thing I like about it is that it moves you gracefully from this notion of using just instruction-level parallelism to trying to use thread-level parallelism. It's just a graceful movement in that direction. So, I'm optimistic that it may help in trying to think about how to design the next-generation machine.
US: What do you think about Moore's Law? Do you think that processors are going to continue to double every eighteen months, or are we going to eventually reach a point where we can't do that anymore?
HENNESSY: I think we are on the verge of reaching a point, for integer code, where we can't do it, or it would be very hard to do it. The floating point guys are doing quite well, I think - you could do it for floating point code. The problem is that floating point is not growing as fast as the rest of the market. The market that's really growing is sort of the Internet explosion market, and it's not floating point intensive. So floating point and scientific calculations as a fraction of the overall market are shrinking. So it wouldn't be a smart move to just concentrate on that part of the market. You'd find yourself in trouble.
So, I think Moore's Law will continue for a while. Whether or not that can be translated into processor speed is the tricky thing. The thing to remember about increasing the transistor density is that the number of transistors increases quadratically as you decrease the feature size, but the transistors only get faster linearly. So you want things which use more transistors to get speed, not just relying on faster transistors.
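The arithmetic behind that remark: halving the linear feature size roughly quadruples the transistor budget but only doubles an individual transistor's speed. These are first-order scaling rules of thumb, ignoring wire delay and other effects:

```python
def scaling(shrink_factor):
    """First-order process-scaling rules of thumb.

    Shrinking the linear feature size by `shrink_factor` multiplies
    the transistor count (an area effect) by its square, but the
    speed of an individual transistor only by the factor itself.
    """
    return {
        "transistor_budget": shrink_factor ** 2,
        "transistor_speed": shrink_factor,
    }

print(scaling(2))  # budget x4, speed only x2
```

That widening gap is why architectures that spend transistors on performance (caches, wide issue, speculation) win over designs that just wait for faster transistors.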
Now, that all said, there have been a number of introductions into the process capability which are providing at least one-time jumps: copper, silicon-on-insulator technologies, new channel designs that control the short-channel effects, low-capacitance technologies. So they're providing, I think, a very ambitious growth in clock rate. One of the reasons why you see so much hype on clock rate is that right now, clock rate is easier to get than a lower CPI, so everybody focuses on clock rate for a while. But I think it's going to be hard to sustain that rate of growth on the integer side. We need a really killer idea.
And the other interesting thing that I've observed is that ten years ago there were lots of ideas which were sitting on the warehouse shelves in the universities and research labs that hadn't been deployed in machines. No one had built a speculative machine; no one had returned to dynamic scheduling in many years. Lots of things had been explored by research teams, simulated, but nobody was really using them yet. Now, the gap between university research and industry has shrunk dramatically. The industry is just pulling ideas out, and they're moving so quickly into use. I think the reason is that they're starved for new ideas to try to boost performance.
US: We just have one more question. Can you point us in the direction of any resources or articles we might want to look at that would help us for this topic?
HENNESSY: Uh, yeah...
Well, what do you want to know? Do you want to know where it's going? Are you more interested in the historical development?
US: Well, it's kind of both.
HENNESSY: I think there are a couple of things you might find interesting to read about from a historical perspective. One of them would be an extremely fascinating machine that was never brought to fruition at IBM, but played a tremendous role in feeding some of the ideas: a machine called ACS - the Advanced Computing System. It was never developed, but it's a machine that John Cocke was a lead architect on. It was the first machine to propose issuing more than one instruction per clock. It never quite got there; later on, they worked on a machine called America, which became the prototype for, eventually, the IBM POWER series. But that's an interesting machine to look at, and there are some references on the web - if you look up ACS on Google you'll find it. Mark Smotherman at Clemson has also done some background on that machine.
But it's interesting because not very much is known about the machine. That's true of the early days of the 801, too. There are a bunch of other people around that have interesting perspectives. John Moussouris, who was one of the co-founders of MIPS and was at IBM before that, can give some interesting history of what happened at IBM, and probably even tell you why they cancelled the 801 and decided instead to productize what was called the micro/370, which was a fiasco. He lives in Palo Alto, and you can get a very interesting perspective on what's happened on the project from him.
What's been happening lately - there are some slides on my website from the talk I gave at SCRC. The first half of it is kind of a "What's happening in processor architecture" talk. There's also a talk there that I did for the 25th anniversary of the microprocessor that is kind of a historical development of the microprocessor - more focused on the microprocessor in general than on RISC, but interesting still from that perspective. Those are all good things.
There's a guy who just retired from HP - but you probably can still find him there - named Joel Birnbaum, who was the head of the IBM Watson research lab at the time that the 801 was being done, and then came to HP and started the HP RISC efforts, so he has an interesting perspective. Of course, Patterson and the Sun guys can give you an interesting perspective on their view as well. He's at Berkeley. Dave Ditzel, who worked on the first RISC paper with Dave Patterson, is now the CEO of Transmeta, which is a local company doing low-power stuff. He has an interesting perspective on some of those early developments as well, and where things are going. Those are all good people. Good resources.
US: Um, we're going to present next Wednesday, so...
HENNESSY: Oh! <laughs> Look on the net! I think one of the interesting things about architectures in general is that there's a very interesting balance between an idea, a concept, an approach, and the underlying technology. This is very much a discipline where the technology shapes the ideas that are happening at any time.
Gordon Bell said something really interesting to me: when memory was very expensive - which it was until DRAMs really took over - so in the core days, if you were designing from the perspective of somebody building with core memory, memory was more expensive than logic. What you did was minimize the use of memory. But when DRAMs came along and took over, all of a sudden that flipped. Initially they were maybe the same price, but then memory plummeted in price compared to logic, partly because its design cost was so low - you designed one DRAM and just replicated it, right? Cache became very inexpensive. Gordon Bell said when that change occurred, moving from microcoded machines like the VAX to non-microcoded machines that relied on cache and had a larger memory footprint was exactly the right thing to do. So it's interesting to watch that occur.
There's a really excellent - you know, one of the long battles that went on here, it's really hard to imagine, but when Patterson and I first started writing papers, people were mad - they were really mad. I remember going to a panel one time, and one guy got up and said this is fraud. Really, they did. They just said it was fraud. Really - this technology was a university thing; in the real world it wouldn't work. If you had to ship it in a product, it wouldn't work. This was one of the reasons to start a company - to prove our point. But, you know, he said, this is fraud. Well, this guy just said: they started this company, they took money from investors - what should they do? A panelist said, well, if I were them, I'd take the money and move to Barbados.
One of the reasons was that we had these benchmarks that said it was faster, but we didn't have a good scientific explanation why. So finally, there was a landmark paper published on the VAX that showed that the machine was taking ten cycles to do every instruction, which was a factor of two higher than even the VAX's designers believed it to be. And we could show these RISC machines were doing one and a half to two cycles per instruction. So now we knew where the performance advantage was. And that really clarified it. But everybody said, you know, well, maybe everything's not equal. Finally, though, the guys wrote a very famous, carefully documented paper, by Dileep Bhandarkar and Douglas Clark, that's cited in our book - it's something like a comparison of a RISC machine against a CISC approach. It really documents exactly this and demonstrates the fact that there really is something fundamental here, that it is a more effective way to build machines. So that's a great paper. That was sort of a capstone. It said, "All right guys, the argument's over."
But, it was very threatening. Here we said you can build these microprocessors, and because it's microprocessor technology, maybe it costs half as much to begin with, and by the way it's two or three or four times faster. And this is why IBM cancelled their project. IBM cancelled the project because the technology was too disruptive, and it just threw all of this - the entire market - into upheaval. So that's a good paper to look at. Interesting perspective.