IBM's Chatty Revolution

@brentK: so you're suggesting that supercomputers don't try to push the envelope? That's kind of missing the point.

Supercomputers are pretty much designed around the assumption that 'this is it. This is what it looks like, and you'll have to code specifically against this hardware to get peak performance. Next year's supercomputer will look radically different.'

Of course it's expensive, but the people who have access to these supercomputers are not 'hesitant to spend a lot of time programming it', because they need the performance that this hardware can offer. They're not interested in whether their code will run five years from now on a different supercomputer. They're writing code to compute what they need computed now. Specialization and programming against unique architectures is the name of the game, basically.

That all depends on the target application and the user involved. My criticism above is partly based on a conversation I had with a scientist at Los Alamos who couldn't care less about what computer his code was running on, just that he got reliable results and that he spent as little of his time as possible re-optimizing his code for the new architecture. He felt Roadrunner was a waste of money and time and made his job harder. (He has since started optimizing for CUDA instead.) At the end of the day a supercomputer needs to be used to justify its price tag and the price of paying people to write code for it. For computer scientists it may be cool to work on the new architecture, but for many of the users, engineers and scientists, the output result, including a reusable code, is much more important than the specifics of the architecture.

If your target supercomputer changes drastically every few years, you'll end up eating a lot more man-hours rewriting code, while the computer itself won't be used to its potential. We're better off slowing down a little bit to figure out what the best approaches should be in order to move forward on a sustainable path.

I'm somewhat skeptical of HTM. The main thing that bothers me is that you can only really use it in low-contention situations (i.e., a few sporadic transactional writes). If transactional writes occur often, all affected threads are effectively busy-waiting, which is highly inefficient (I'm guessing this is one of the first problems IBM will look to address).

High-contention situations are better served by mutexes (blocked threads sleep while unblocked threads can progress), deadlock risks notwithstanding. This seems to me to be a fundamental weakness of the entire scheme, and it makes me question whether HTM will be worth the effort. Maybe in niche applications.
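To make the contention trade-off concrete, here's a small sketch; the CAS retry loop stands in for an optimistic, transaction-style update (it is not IBM's HTM interface), while the mutex version shows the blocking alternative:

```cpp
// Sketch only: a CAS retry loop standing in for optimistic/transactional
// updates, versus a mutex. Under heavy write contention the optimistic
// version keeps failing and retrying (effectively busy-waiting), while
// the mutex version blocks and sleeps until the lock is free.
#include <atomic>
#include <mutex>

std::atomic<long> opt_counter{0};
long              locked_counter = 0;
std::mutex        counter_lock;

void increment_optimistic() {
    long expected = opt_counter.load();
    // On conflict the CAS fails, 'expected' is refreshed, and we retry.
    while (!opt_counter.compare_exchange_weak(expected, expected + 1)) {
        // burning CPU each time another thread wins the race
    }
}

void increment_with_mutex() {
    std::lock_guard<std::mutex> guard(counter_lock);  // contended threads sleep here
    ++locked_counter;
}
```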

A possible problem with transactional memory might come up when the transaction would not have been made at all if the new value had already been seen. Imagine using transactional memory to buy shares (to me this seems the best example to use, as shares and transactions go hand in glove). At the time the buy order was placed, the share in question was just under the buy-under price, but by the time the buy order went through (I imagine a queue), the price was above it. This would cause user frustration when the system did not work as they hoped it would. I understand this is not as bad as deadlock, but when computers seem incapable of operating at peak efficiency, this could cause problems all of its own.

Imagine that after your transaction the next thousand were buy orders by automated response computers, and they all had buy orders similar to yours that were also frustrated. Imagine if you were a trading company who couldn't buy anything all day because all the prices kept rising faster than your buy orders, whereas the trader next door had no problem, as many of his buy orders beat yours, boosting the price above the buy-under. You'd be knocking on the door demanding an equal place on the crazy floor where folk wiggle fingers at each other again, and damning the whole computer-system idea.

But you only use locking and multithreading because you care about performance. If you don't care about performance, you don't use multiple threads at all; TM may be relatively easy, but single-threaded is easier still. The very fact that someone is using multiple threads in the first place means that they care about performance.

I don't know if the following conflicts with or complements your statements, so I'll just toss it in and see if anyone can make sense of it. In my business we use multithreading because it is easier than writing a custom multi-tasking scheduler in a single-threaded environment.

(At least, that's our perception). We program robotic equipment with numerous motors, solenoids, sensors, etc. In a single instrument, these components are grouped together into various subsystems.

Each subsystem is typically under the control of a single thread. So, when a sequence of hardware commands needs to execute on the different components within a subsystem, a single thread takes care of executing those commands sequentially. Multiple hardware subsystems can therefore run in parallel via multiple threads without us needing to write a very low-level sequencing or scheduling system. (We do have higher-level scheduling systems, but I don't think they're relevant here.)

Where two or more subsystems can interact in the physical world, such that there may be a physical collision or serialization point between subsystems, we use locks and signals in the software to ensure that the critical hardware sequences on one subsystem thread are completed before the relevant sequences on the other subsystem thread are executed. This ensures that the first subsystem's hardware is in a safe, non-colliding position before the second subsystem attempts to enter that space.

I say all that to say that even when the hardware is very busy, the CPU is still very idle, since most of its time is spent waiting for hardware commands to complete.
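A rough sketch of that pattern, with invented subsystem names; the real hardware command sequences would go where the comments are:

```cpp
// One thread per hardware subsystem; a lock serializes access to the
// physical zone where the two subsystems could collide. Subsystem names
// are made up for illustration.
#include <mutex>
#include <thread>

std::mutex shared_zone;  // guards the space both subsystems can enter

void arm_subsystem() {
    // ... commands that stay inside the arm's own space ...
    {
        std::lock_guard<std::mutex> in_zone(shared_zone);
        // ... command sequence that moves the arm through the shared zone ...
    }  // arm is parked clear of the zone before the lock is released
    // ... more arm-only commands ...
}

void gripper_subsystem() {
    // ... gripper-only commands ...
    {
        std::lock_guard<std::mutex> in_zone(shared_zone);
        // ... gripper sequence that enters the shared zone only once the
        //     arm thread has left it ...
    }
}

int main() {
    std::thread arm(arm_subsystem);
    std::thread gripper(gripper_subsystem);
    arm.join();
    gripper.join();
}
```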

So we don't use multithreading to gain computational performance. We instead use it to leverage the OS's scheduler to gain programmer productivity. We still have to be careful of many typical multithreading issues though, since the subsystem threads will typically access shared memory, etc. So, that counts against the productivity gains elsewhere, but I don't know by how much. In general, we believe it to be easier and more efficient than writing our own sequencing engine. I don't know if all that falls under your use of 'performance' or not.

Interestingly, all that being said, I'd like to have efficient transactional memory because it may help in the implementation of some of our custom lock and signalling systems. E.g., to make available the equivalent of compare-and-swap-N.

I haven't looked into it that deeply, though, so I don't know if it would help. But go figure: using TM to implement locks and signals.
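For what it's worth, here is a minimal sketch of how a compare-and-swap-N could be layered on a transactional block. The CasDescriptor/casn names are invented, and the code assumes GCC's experimental -fgnu-tm extension (__transaction_atomic) rather than anything IBM has announced:

```cpp
// Hypothetical compare-and-swap-N built on a transactional block.
// Compile with: g++ -fgnu-tm (GCC's experimental TM support).
#include <cstddef>

struct CasDescriptor {
    long *addr;      // word to update
    long  expected;  // value it must still hold
    long  desired;   // value to store if every expectation holds
};

// Atomically apply N compare-and-swaps: either every word matched its
// expected value and every write happened, or nothing changed at all.
bool casn(CasDescriptor *ops, std::size_t n) {
    bool ok = true;
    __transaction_atomic {
        for (std::size_t i = 0; i < n; ++i) {
            if (*ops[i].addr != ops[i].expected) { ok = false; break; }
        }
        if (ok) {
            for (std::size_t i = 0; i < n; ++i) {
                *ops[i].addr = ops[i].desired;
            }
        }
    }
    return ok;
}
```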

Transactional memory still sounds like it's best for certain problems. Large multi-process solvers, for example, would perpetually trigger transaction rollbacks at every boundary, it would seem. And the comment about using on-chip memory...

Transactional memory is not something you use to wrap your entire program. It's not magic. The comment in the article about 'maybe you sent something over the network' was very misleading. You still design your program with a conceptual separation between shared variables (a few, to be treated carefully) and non-shared variables (the vast bulk of your data). You still THINK about when and under what circumstances shared variables need to be touched.

What TM does for you is allow more of the low-level grunt work involving the shared variables to be handled for you. Instead of creating and destroying locks (or the equivalent), you can indicate a block of code that is tied to the modification of shared variables, and have that happen as a single unit. But you still have to figure out what that block of code is, and for maximal performance you still want that block of code to be as small as possible.
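As a rough illustration of that difference (not IBM's interface; the transactional half again assumes GCC's -fgnu-tm __transaction_atomic extension):

```cpp
// The same two-account transfer written with an explicit lock and with
// an atomic block. Sketch only.
#include <mutex>

long balance_a = 100, balance_b = 0;  // the few shared variables, treated carefully
std::mutex balances_lock;

void transfer_with_lock(long amount) {
    std::lock_guard<std::mutex> guard(balances_lock);  // create/hold/release a lock
    balance_a -= amount;
    balance_b += amount;
}

void transfer_with_tm(long amount) {
    // The block commits as a single unit or not at all; there is no lock
    // object to acquire, order, or release, but it still pays to keep the
    // block as small as possible.
    __transaction_atomic {
        balance_a -= amount;
        balance_b += amount;
    }
}
```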

You would certainly expect the compiler, for example, to automatically allocate shared variables in the appropriate place on the chip without you worrying about it. I would hope that compilers (even C compilers) would check and warn about basic possible errors, like using a variable that has not been annotated as shared by two different threads, or reading/writing a shared variable outside an atomic block.

So, to take an analogy, it's like Objective-C ARC (automatic reference counting) as opposed to garbage collection: it doesn't solve EVERY problem in the world automatically, and it requires the programmer to think a little, and to understand the tricky parts of their code; but it cuts out the low-level donkey work, and it solves most problems with little effort.

'Remarkably, it worked correctly first time.' As a professional hardware developer, I would have loved to see their faces when they powered it on the first time and it worked.

Shit never works the first time. I was also skeptical of this statement. If they said 'the first n transactions worked' (where n is some relatively low number), I'd believe it. If they said 'the first implementation had no bugs,' the only way I'd believe it is if it was either very simple, or they spent the time to do extensive formal verification.

OF COURSE they spent the time to do extensive formal verification. This is not amateur hour.

IBM (and I assume Intel, but I know less there) do formal verification on things like bus/cache protocols. That's why the POWER6 bus protocol could blossom from the four MERSI states to something like 39 states, tracking all manner of weird issues of how VM pages are being used differently on different processors, and the whole thing still works, and is, of course, more efficient.

I would hope that compilers (even C compilers) would check and warn about basic possible errors, like using a variable that has not been annotated as shared by two different threads, or reading/writing a shared variable outside an atomic block.

I think you're presuming rather a lot about the actual implementation. The reality is, we don't really know how it works, and there are many software transactional memory systems that don't work the way you described; they don't require annotation of shared variables, for example, instead transacting every access. Some even offer the possibility of being integrated with database and filesystem transactions, so your atomic operations don't even need to be limited in scope to in-memory data.

In my business we use multithreading because it is easier than writing a custom multi-tasking scheduler in a single-threaded environment. (At least, that's our perception.)

It's true that this kind of multithreading is quite common, but opinions vary as to whether it's a good way to solve a problem.

I think quite a few people would argue that a single-threaded event-driven system is easier to develop.

HTM has been shipped in hardware two other times that I know of. One of the more interesting products I came across at Nortel used mostly hardware transactional memory.

They have, err, had, a patent dated 1996, and I think the product (XA-Core) shipped around 2001. Anyway, I didn't work on it myself, but this blog entry talks about it a bit. Also, Azul Systems had a 1000-core custom hardware system back in 2008 or so that was supposedly capable of hardware TM. I believe that Cliff Click, Azul's tech voice, claimed that for implementing the JVM, or at least for solving the JVM scalability problems that their customers care about, HTM wasn't a big win.

There are probably other examples of HTM that predate this one by IBM and that are even more mainstream. It would certainly be interesting to put such systems in the hands of a mass of geeks.

Now different potential users of these facilities have different interests. The OS people tend to be very down on HW TM precisely because they are interested in locks that protect NON-MEMORY, including random pieces of non-idempotent hardware. I'm coming from a different viewpoint, the viewpoint of someone interested in CPU limited code who sees wrapping OS, file system, and networking calls inside lock/transaction code as absolutely insane, and it was from that point of view that I wrote my comment.

I believe this also reflects IBM's point of view. This computer is a test-bed for better ways to write petaflop code; it is NOT a test-bed for alternative ways to write an SMP OS.

I would say it's already useful for the operating system to offer, for example, a transacted file system, to allow for things like atomic software installations. There are widely-used commercial operating systems with widely-used commercial file systems that already have this capability. Given that the capability already exists, why not hook it into the memory transaction manager?

I've seen some vague claims about the Azul hardware, but a cursory search didn't find much that was definitive. It's perfectly possible that they did; Azul does lots of weird stuff. The Nortel case is interesting; I had not heard of that, although it sounds like it was embedded into telephony hardware and not readily programmable. Obviously, Sequoia isn't going to be something that any old person will be able to run programs on, but it should be a little more diverse.

Nonetheless, I think that programming using LL/SC or TM is basically the same thing, but TM is slightly more complicated because you have to specify an 'atomic group' of variables instead of a single atomic variable.

There is a slight similarity in that both deal with making changes atomically. But where LL/SC and similar instructions allow you to atomically update word-sized data (or two words, at best), the entire point of TM is to extend a similar scheme to allow arbitrary sets of changes to be applied atomically.

Using LL/SC to update a single variable is easy. Where they get complicated to use is when you want to use them to update larger data structures. And that is exactly the case that TM solves. You're not (depending on implementation) specifying an 'atomic group' of variables. You're simply specifying that 'anything I do between now and now' should be done atomically. You typically don't need to list the variables that you plan to modify.
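A small sketch of that contrast: a single word is trivial to update atomically, while a multi-word structure update is exactly what a transaction buys you (the transactional block again assumes GCC's -fgnu-tm extension, not IBM's hardware interface):

```cpp
// Single-word versus multi-word atomic updates. Sketch only.
#include <atomic>

std::atomic<long> hits{0};

// Single word: a plain atomic RMW (or an LL/SC / CAS loop) is enough.
void record_hit() {
    hits.fetch_add(1);
}

// Multi-word: move the head node from one list to another. With bare CAS
// this needs a delicate multi-step protocol; inside a transaction it
// reads like ordinary code, and the read plus both writes become visible
// as one atomic step.
struct Node { long value; Node *next; };
Node *free_list = nullptr;
Node *work_list = nullptr;

void move_head_to_work() {
    __transaction_atomic {
        Node *n = free_list;
        if (n != nullptr) {
            free_list = n->next;
            n->next   = work_list;
            work_list = n;
        }
    }
}
```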

What is Sequoia specialized for? What sort of workload? What are they going to be computing with it? Nuke testing?

About multi-processor support: could this not be achieved by making a new kind of memory?

Have a few bits or something with each memory word that tell if it's being used or has been modified by someone, and by whom? Of course this means more use of the memory bus and wasted power and time, I guess? But IBM is already working on bringing memory a lot closer to the processor (just look at their 'cognitive' chip). So maybe that would help, especially if we use 3D circuits so the memory could be right below the processor? Maybe my idea is wrong; I could be misunderstanding the algo.

Anyone care to comment? EDIT: On second thought, perhaps modifying memory would not even be required for multiprocessor support; just some more local memory and circuitry in the CPU could be used to implement that? EDITED for clarity and to reduce mistakes. EDIT 2: Please keep in mind that I am a total noob at hardware as well as the problems involved in multi-CPU/multi-core. Sorry if I am posting nonsense. Am reading and trying to get it.

@ibad: are we talking about TM in general, or the specific HTM implementation described here? In general terms, there's nothing stopping TM from working with multi-processor or multi-socket systems. It assumes a shared-memory system, but that's about all it requires. Of course the performance characteristics will vary across different hardware configurations, but there's nothing about TM that's incompatible with multi-socket systems.

I am talking about hardware TM, and this implementation in particular. I guess since no one knows the specifics of IBM's system we should restrict ourselves to Hardware TM in general. So why did IBM not bother with multi-socket in this case?

Surely such a system would be beneficial to them in the long run? Was this just an application-specific technology, and hence they did not bother? What are they simulating, btw?

Cache coherence, probably. It's an old problem: synchronizing CPU caches between 'far apart' CPUs is expensive, and HTM is basically a glorified cache coherence protocol. It would likely cripple performance if they had to implement this across multiple processors in separate sockets.

I suspect this is a case where STM might be better able to cope, as it operates in terms of language-level objects rather than cache lines and CPU writes. But that's just a gut feeling.

OK, another question, possibly a stupid one: when does cache coherence come into play? The core will check the memory itself to see if it has changed or not since the atomic execution began, right? Or perhaps, for speed, we are doing things purely from cache if possible. And hence if we have multiple CPUs, cache coherence becomes a bottleneck? In a system where the memory was very close to the CPU (hypothetical here, like a 3D circuit or something), our CPUs could check if the memory was changed directly from RAM with relatively low cost, and this issue would be mitigated? What cache coherence policy would IBM be using?

Updating cache values based on caches in other cores on the same die? Again, sorry if this is obvious. Am just checking my understanding.

I'm not sure that would necessarily be true; LL/SC often operates on a cache-line basis anyway.

It does, but STM gives the programmer more control over which data should be exposed to the STM system, which could conceivably minimize the number of shared-data writes. In STM, you typically use LL/SC to manipulate a small header storing the STM metadata for each object, while the object itself can be accessed normally.

With HTM, every memory write within a transaction would have to be dealt with 'transactionally', as I understand it. But you're right about the additional bookkeeping (although HTM needs some of that as well, because it requires a software component). I'm not really sure which way it'd go, and I've always focused on STM, so my knowledge of HTM is pretty sketchy. But I wouldn't count STM out just yet.
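To make that header idea concrete, here is a generic sketch of the kind of versioned-lock object header many STM designs use; the names and layout are invented, not taken from any particular product, and a real implementation needs far more care about memory ordering and validation:

```cpp
// Per-object STM metadata: a small header word updated with CAS, while
// the payload itself is read and written normally once the header says
// it is safe. Sketch of a generic versioned-lock scheme.
#include <atomic>
#include <cstdint>

struct StmObject {
    std::atomic<std::uint64_t> header{0};  // low bit = write-locked, rest = version
    long payload[4];                       // ordinary data, accessed normally
};

// Writers take the object's write lock by CAS on the header only.
bool try_lock_for_write(StmObject &obj, std::uint64_t &observed_version) {
    std::uint64_t h = obj.header.load();
    if (h & 1) return false;               // someone else is writing
    observed_version = h;
    return obj.header.compare_exchange_strong(h, h | 1);
}

// Publish the write: bump the version and clear the lock bit.
void unlock_after_write(StmObject &obj, std::uint64_t observed_version) {
    obj.header.store(observed_version + 2);
}

// Readers re-check the header around a normal read of the payload.
bool read_payload(StmObject &obj, long out[4]) {
    std::uint64_t before = obj.header.load();
    if (before & 1) return false;          // writer in progress; caller retries
    for (int i = 0; i < 4; ++i) out[i] = obj.payload[i];
    return obj.header.load() == before;    // unchanged => consistent snapshot
}
```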

'Checking the memory itself' is a very expensive operation. Memory is sloooow. The entire point of CPU caches is that you don't want to push every write all the way to memory, which means that without some kind of cache coherence protocol in place, one core can perform a write that will be invisible to other cores. Memory is cached because going through RAM is expensive, but at the same time we need to ensure that all our CPUs agree on what memory looks like.

Remember that a memory access can take hundreds of cycles, so one core can issue a write to a memory address, another core can read from the same location, and for the next many clock cycles, your entire memory subsystem is in an inconsistent state. One core may see the first write as having completed, while other cores won't, because, well, every core is seeing memory as it looked a few hundred cycles ago. Add in that the 'distance' (and thus, latency) between CPU and memory will vary in multi-socket systems. Some sockets will be closer to memory, and have lower latency, than others.

And depending on the architecture, other factors like bus congestion could further influence latency. Overall, for all your cores to agree on what memory looks like, you need some form of cache coherence.

In a system where the memory was very close to the CPU (hypothetical here, like a 3D circuit or something), our CPUs could check if the memory was changed directly from RAM with relatively low cost, and this issue would be mitigated?

RAM is still slow, even if you eliminate the latency caused by sending a signal across the motherboard from CPU to RAM.

But remember that if we're talking about multi-processor systems then, pretty much by definition, it is impossible for every CPU to be that close to memory. And so we get the same old problem: the more cores you have, the longer the mean distance to memory becomes, and the more latency and timing issues get in the way.

The core will check the memory itself to see if it has changed or not since the atomic execution began, right?

I could be wrong, but I was under the impression that the core doesn't check the cache above it to see if something has changed; it checks the 'flag' on the cache line.

The cache coherence protocol is supposed to 'push' every change to any memory location to every core. This means that for every core on your machine that writes to a location in memory, an update must get pushed to every other core. This gets very chatty very fast.

This is also why Intel and AMD are both looking at removing cache coherency in current designs: it does not scale with core counts. They want to replace it with a selective, push-based system where a given core pushes changes only to the cores it wants to update. More management overhead for the OS/software, but much, much, much better scalability.

Interesting stuff. In 'The Art of Multiprocessor Programming,' Herlihy/Shavit contend we don't really know how to design and maintain complex systems that depend on locking.

Sure, one can hire experts to work through all the tricky cases, but they point out that this can be an expensive and increasingly impractical proposition, especially as our applications grow in complexity. And, of course, it's still locking; as the number of cores/processors increases, locking will not in general scale very well. I'm not really sure about all of that; I suspect they may be underselling the effect of good software engineering practices, but there does appear to be at least some merit to their claims. Wait-free/lock-free methods that use nonblocking operations (like compareAndSet) can overcome many of the bottlenecks of using locks (and nice libraries have been and will be developed that efficiently and correctly implement things like wait-free stacks, queues, etc.), but since atomic operations like compareAndSet only operate on a single word, Herlihy/Shavit observe that this constraint imposes really ugly, unnatural-looking algorithms.

My limited experience developing wait/lock-free algorithms seems to have a similar prognosis. And, hell, it was difficult for me to simply decipher some of the intricate and clever logic used by some of the more efficient algorithms.
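For a flavor of that single-word-CAS style, here is a sketch (not from the book) of the push half of a Treiber-style lock-free stack; even this tiny structure needs a retry loop, and a correct pop already needs extra machinery against the ABA problem, which is roughly the 'ugly, unnatural' shape being described:

```cpp
// Push half of a Treiber-style lock-free stack, built on a single-word CAS.
#include <atomic>

struct Node {
    int   value;
    Node *next;
};

std::atomic<Node *> top{nullptr};

void push(int v) {
    Node *n = new Node{v, top.load()};
    // Re-link and retry until the single-word CAS on 'top' succeeds;
    // compare_exchange_weak refreshes n->next with the current top on failure.
    while (!top.compare_exchange_weak(n->next, n)) {
    }
}
```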

Equally importantly, Herlihy/Shavit point out that all the synchronization methods I've mentioned can't be easily composed and almost always require unpleasant ad hoc solutions. For instance, these ad hoc solutions may require one to tinker with an otherwise self-contained monitor, since additional conditions may be needed to make it work correctly, thus somewhat defeating principles of information hiding and encapsulation. (Btw, Universal wait-free constructors are interesting from a theoretical point of view, but they seem to impose much greater overhead than TM.

On the other hand, they wouldn't have to do rollbacks so they may ultimately scale better than TM. I don't know. However, due to the significant overhead of the Universal constructor, it doesn't seem anyone is seriously entertaining the idea that they will someday be used to solve real multiprocessor synchronization problems. Rather, it's reserved as a theoretical device to illustrate how any algorithm can be converted into a nonblocking algorithm by using this wrapper.) So, that brings us to TM. Assuming little in the way of performance penalty, it seems to be almost always better than a lock (performance-wise) since a lock imposes mutually exclusive access to its critical section (except when rollbacks are not feasible, e.g., when some operations are not reversible).

Of course, there are various ways to make locks more efficient, e.g., deferring locking until absolutely necessary (lazy synchronization), but it still comes down to, at some point, needing exclusive access to a critical section. In light of Amdahl's law, this may spell doom. I'm curious to see people's results with this.