Thursday, 29 December 2016

Site updates

Some years ago, my wife and I were both quite overweight. We joined Weight Watchers, and with their support, we made a major lifestyle change. For my wife, though, it went even further than that.

In June 2016 she joined the Weight Watchers company as one of their team of weight loss coaches. Now she hosts three meetings in the York area. I am pleased to be able to link to Jilly's web page on Facebook. If you live in York and struggle with weight loss issues, then you may find that the help and support of one of Jilly's Weight Watchers groups is valuable. Jilly is naturally good at supporting people, has plenty of enthusiasm and a lot of personal experience.

I sometimes think of writing about the process of this lifestyle change, what led us to it, and also what it led to. But it's all a bit personal. I put on weight because things were bad. My job had become a complete waste of time. The research projects that paid my salary were obviously going to achieve nothing, my own research had hit a brick wall, and the next round of projects showed the same signs of complete worthlessness. I realised I had taken a wrong turn, staying at the University after finishing my PhD, but I was trapped, and it seemed like too much effort - and too much risk - to leave.

However, that was just at work. At home, things were looking up. We got married and this motivated us to lose weight. Weight Watchers was a great help. Good results arrived quickly. We both felt better. My job was still going nowhere, but the improvements in my personal life gave me more confidence. I found myself spending more time at the gym. The depression cleared. For a while I tried to concentrate on the aspects of my work that I enjoyed, with limited success, and even considered changing research topic, but I found I had no enthusiasm left.

When a great opportunity to leave arose, I took it without hesitation and never looked back. When it came down to it, leaving was not nearly as hard as I'd thought.

I may not have discovered anything scientific as a researcher, but I did find this: If many things are making you unhappy, then your physical health is the first thing you should try to change. Health, happiness and self confidence are closely related. Once you've got your health back, then you're in a position to make other changes. Deal with this first, and other things will follow.

Friday, 17 June 2016

Supporting Duff's Device in RVS

Recently I made some improvements to the C/C++ code instrumenter used by Rapita's timing and coverage tools. The result of these improvements is that the new version of RVS is now going to support some really obscure features of C, such as "switch" statements in which the case labels are nested inside of other structures. The famous example of code like that is called Duff's Device.

I wrote a short article about the work which you can find here:

Tuesday, 29 March 2016

A detailed timing trace, with video demonstration

My last article discussed the differences between tracing and profiling. Compared to tracing, profiling is much easier to set up, but it only reveals information about the average execution of your program. It won't tell you anything about rare events - the "long tail" - which could cause high latency in certain cases.

I found this very interesting article by Dan Luu in which he discusses performance problems which cannot be analysed using a sampling profiler - because the rate of sampling is just not fast enough. The article is all about online services, but the problem described here is actually the "worst-case response time" problem. To me, this is an embedded systems problem. It applies to real-time systems where safety requires the software to respond within a known time interval - like the computer controlling a jet engine, for instance. It is known to be a very hard problem, and the solutions described in the article (tracing) are exactly the same solutions applied at the company where I work! But we're doing it for embedded systems. It's very interesting to read that the exact same problems exist in online services, and at sufficient scale, they can be extremely costly. 

I thought it might be interesting to share an example of RapiTime tracing in action. The following video is taken from Doom. The game source code has been instrumented with RapiTime. The video shows the game rendering certain scenes in real time, as if the CPU speed were slowed from 800MHz to just 48kHz. At this low speed, you can see what the game is doing as it draws each scene. 

This video is possible because a trace was used to capture detailed timing information. I couldn't actually reduce the CPU speed to 48kHz, so I ran the game twice. The first pass created the timing trace. The second pass created the video. A video frame is captured every 1600 clock cycles (2 microseconds) as determined using the timestamps recorded in the trace.

A profiler would just show that certain rendering functions are called very frequently - this is helpful for reducing the average time required to draw a scene, but not for reducing the longest possible times required for certain scenes. A trace has enough detail to reconstruct everything. We can see almost exactly what the game is doing with each CPU clock cycle. If you have the RapiTime project file, you can match each "ipoint" number to a specific place in the source code: generally, down to a C statement. If anything takes longer than expected, even only once, there are analysis tools to help you understand why.

You can see that the "long tail" has been captured by the tracing software in the following RapiTime graph. It shows the total time needed to render each of the frames produced by the game while playing a demo:

Almost all game frames take less than 4 million clock cycles to draw, and the average is 2.3 million. But there are some which take much longer - 39 million in one case! Fortunately, the explanation is straightforward. These frames represent transitions between levels, and the extra time is the time required to load the new level from disk. The game is not playable while the next level is loading, so the extra latency doesn't matter to the player.

Though it's more difficult to capture a detailed timing trace than to capture a profile, you get much more information with a trace, and that extra information can reveal things about the program that profiling cannot show, such as the long tail of latency. This information may be used to estimate and reduce the worst-case response time of the code.

Sunday, 28 February 2016

Profiling versus tracing

Profiling and tracing may be used to analyse the dynamic behaviour of a program. A profile (or trace) reveals performance issues, telling you which parts of a program require the most time. You can then focus your efforts on investigating those parts and improving them. If your program takes longer than you expect, then profiling is one of the first resorts to understand the issue.

Profilers generally operate by sampling ("statistical profiling"). Your program is repeatedly interrupted as it runs, with a fixed interval between interruptions (e.g. every millisecond). The purpose of each interruption is to take a sample, which is done by visiting each running thread, and then examining the stack to discover which functions are running. The effect is similar to stopping execution in a debugger and asking for a backtrace, e.g. using the "bt" command in GDB. The profiler records each sample and all of the samples are aggregated into a report or turned into a graph: flame graphs are a nice example of that. On Linux platforms, the best profiler is perf: do not waste time with the archaic gprof tool.

Tracers do not operate by sampling. Instead, the trace is a log of events within to the program. This log may be detailed enough to report function calls and returns, and execution of other statements. Tracing may require the program to be instrumented, i.e. modified to include code to log events. This instrumentation may be added to the source code (RapiTime works like that) or it may be added dynamically to the machine code (Intel's PIN tool works like that). Alternatively, tracing may rely on hardware such as the branch trace store in recent Intel CPUs, or the trace may be generated by an instruction set simulator. There are also less detailed methods of tracing code - for instance, strace or Process Monitor - which only record certain types of event, e.g. system calls.

A detailed trace may be used to reconstruct a profile. As the trace is a timestamped record of calls and returns, a trace parsing tool can track the state of the stack at each point in time. If the state of the stack is sampled periodically, the result is the same as a profile.

However, a profile cannot be used to reconstruct a trace. This is because the profile omits information about everything that happened between two samples. Many calls and returns may occur between samples: these are invisible in the profile report. In many cases, this missing information is not necessary in order to see where the program is spending most of its time, and it's possible to understand the performance problem without it.

Traces are necessary for other forms of timing analysis. Sometimes it isn't enough to rely on sampling to obtain the timing information required, because you need all of the information, not just the cases that are most likely to occur. Worst-case execution time (WCET) analysis is one example: this analysis indicates the longest possible execution time required for part of the program. WCET analysis requires very detailed information about the program and the hardware it runs on, usually in order to certify a safety-critical system such as engine control software for an aircraft. A profile is neither reliable enough nor detailed enough for this. But that sort of analysis is only required for embedded real-time software (i.e. what RapiTime is intended for).

Traces can also be useful for discovering the chain of events that led to a problem: in this sense, a tracing tool is a bit like a reversible debugger such as Undodb or RR, but with the advantage of running on platforms that reversible debugging can't support, like exotic embedded systems hardware or Microsoft Windows. Traces don't let you see absolutely all information about the program state, but you can at least see the path taken to reach a problem, which may provide enough clues. (I have used this debugging method on Windows for a program that crashed during startup on some systems and not others, and could not be debugged in any other way.)

The unfortunate fact is that tracing is often more difficult to set up, compared to profiling, so in many cases it does not make sense to use a tracing method when you can use a profiler. Here are some of the problems:
  • Compiling instrumenting code is never quite as easy as compiling the original code,
  • Detailed traces are very large,
  • The overhead of writing detailed trace information to disk may well be greater than the execution time of your program,
  • If any information is missing from the trace, it may be impossible to accurately parse it.
This last point requires a little extra information. If the trace is generated by instrumenting the code, you run into a problem with code that can't be instrumented, such as code that's already been compiled into the C library. Ignoring calls to uninstrumented code is risky, because uninstrumented code may call back into instrumented code. For instance, the C library function "qsort" is passed a function pointer for a comparison function. If "qsort" is not instrumented, but the comparison function is, then parsing the trace poses a problem: the transition from the call to qsort to the comparison is unexpected, and the trace parser cannot know whether the trace is a correct representation of program execution or not. Signal handlers and interrupt handlers are also an issue for trace parsing for a similar reason.

There are ways of dealing with all of these problems, and they're pretty straightforward, e.g. marking the functions that call back, indicating which functions represent signal handlers, or excluding certain functions from instrumentation. But this is extra effort for the programmer, not necessary to set up statistical profiling.

Tracing can be useful, but there are additional difficulties when it is used. In many cases a profile will be sufficient for analysing the performance of a program, understanding which parts are slow, and figuring out how to resolve bottlenecks, and these are the problems of interest to most programmers. However, statistical profiles are not suitable for certain problems, such as those in safety-critical systems, and in those cases, tracing may be required.

Sunday, 7 February 2016

RISC instruction sets I have known and disliked

Nobody writes assembly code any more, except when they do, and in the last week my work has taken me into an environment where you can't use C, because it's just too high level. C requires a stack (which isn't available) and the compiler makes its own choices about register allocation, which is really unhelpful because only a few registers are actually available.

The platform in this case was PowerPC. The instruction set is quite good. If you're forced to drop to instruction level, PowerPC is generally straightforward, sensible and comprehensible. You can take a good guess at what the instructions mean, and you'll normally get it right. There aren't many instructions, and doubts can be rapidly resolved by looking in the architecture manual or this nice reference.

But something did catch me out. I wrote some inline assembly code, within C source:
asm volatile ("stw %0, 0(%1)" : "" : "r"(id), "r"(ptr));
Now, this is broadly equivalent to the C code "*ptr = id", but for reasons which I won't go into, I needed to explicitly specify the machine instruction to be used ("stw"). I found that this code would sometimes produce a segmentation fault, but I couldn't figure out why. Instead of writing to the location in "ptr", the store instruction had attempted to write to location 0!

I thought at first that "ptr" must have become corrupted. But it wasn't that. In fact the problem was introduced at compile time. The "r" directive gives the compiler the freedom to choose any register to represent "ptr". Sometimes it picks register r0. And on PowerPC, r0 has special properties. Usually, r0 means general-purpose register (GPR) 0. But for some instructions it means a literal zero. For "stw", if the compiler chooses r0 for %1, the instruction becomes a store to address 0. Consequently, this code will work or not work, depending on the choices made by the register allocator. You have to tell the compiler to NOT use r0. Here's what IBM has to say on the subject:
...if a general register is required by the instruction, you could use either the "r" or "b" constraint. The "b" constraint is of interest, because many instructions use the designation of register 0 specially – a designation of register 0 does not mean that r0 is used, but instead a literal value of 0. For these instructions, it is wise to use "b" to denote the input operands to prevent the compiler from choosing r0. If the compiler chooses r0, and the instruction takes that to mean a literal 0, the instruction would produce incorrect results.
I think this is the only big surprise I encountered in the PowerPC instruction set. It's like an irregular verb in an otherwise completely regular language. r0 is not zero all the time, as in some RISC architectures, but it behaves as a zero register in some special cases, which you just have to know. Like the plural of "sheep" or "mouse".

RISC instruction sets like PowerPC are usually expected to be highly regular, like a spoken language designed from scratch with the intention of being simple to use, whereas the CISC style of instruction set is expected to be highly irregular and full of oddities, like a spoken language that evolved over thousands of years, borrowing words from other languages at any opportunity.

This should mean that RISC assembly languages are easy to understand, while CISC assembly languages are baroque and full of surprises. But actually this is not a general rule at all. CISC assembly languages are often quite nice to work with, because they were designed with the expectation that programmers would use assembly. Whereas RISC assembly languages can be absolutely horrible to use, because they're designed with the expectation that programmers will use C and higher-level languages, assembly coding is the compiler's job, and machine code should be optimised for efficient execution on a simple pipeline.

This applies particularly to the two pioneering RISC instruction sets, namely SPARC and MIPS. MIPS is the worst offender. It deliberately omits a feature which is so fundamental to CPU architecture that software people don't even think about it.  The architecture leaves out the mechanism in the CPU pipeline which would otherwise stall execution until the data was ready. A register access which would have created a minor inefficiency on any other architecture instead creates a "hazard" on MIPS. You can read from a register before that register is ready. If you are writing or debugging MIPS code, you have to know how this works. If you do a serious degree-level computer architecture class, it'll probably involve MIPS, mainly so that your professor can set some really tough exam questions about the pipeline.

This "feature" is based on assumptions about the CPU implementation which don't hold if the implementation is scaled up, becoming like the "recent" (i.e. last two decades) x86 CPUs with parallel execution of instructions ("superscalar" execution) and reordering of instructions. At that level, it is a legacy feature which makes the architecture awkward to use. It is much worse than any of the x86 legacy features, because those aren't visible all the time. Who cares if the CPU supports some weird compatibility mode for running 16-bit DOS software, if you're writing 32-bit or 64-bit code? The "real mode" support doesn't affect you at all. A legacy feature which affects every single instruction is not avoidable.

Both SPARC and MIPS share another horrid feature - delayed branches. These create a dependency between instructions, in which the branch takes effect after the next instruction, rather than immediately. When using assembly code, you have to know which instructions have a delayed effect, and what rules apply to the instruction (or, sometimes, instructions) in the "delay slot" following it. The delay slot is restricted in various ways: for instance, you can't put another delayed branch there.

The purpose is to simplify the pipeline, improving efficiency, but like the other MIPS "feature" described above, delayed branches are based on assumptions about the CPU implementation. If you scale up, or even just add a simple branch predictor, those assumptions don't apply any more, and the "delay" feature is just an annoying legacy which the CPU designer has to support, because it affects all of the code ever made for the platform.

SPARC also has a crazy feature all of its own, the "rotating register file", which makes code incredibly hard to understand because it's very hard to follow the movement of data across function calls. On SPARC, the registers are not numbered r0 to r31, they're assigned other names like o1 and g1, and some are remapped at call/return. This legacy feature makes some big assumptions about the typical number of registers that need to be preserved across a call, and forces the programmer to remember which registers get remapped and how they are remapped. I particularly disliked working on a SPARC simulator because you only see the values 0 through 31 in the machine code: you then have to translate these to the SPARC name and then figure out what they mean. (If you do have the misfortune to have to work with SPARC, page 26 of this document explains how the register file rotates, while page 193 shows how to translate between numbers and names. Have fun.).

Later generations of RISC instruction set - PowerPC, ARM and Alpha - don't have any of this nonsense. There are no delayed branches, no rotating register files, and as an assembly programmer, the only reason to understand the details of the pipeline implementation would be to maximise performance. These instruction sets are much nicer to use. Design errors, like having r15 as the program counter or making every instruction conditional, are problems for CPU architects rather than programmers, and it's no surprise that they disappeared in the 64-bit version of the ARM architecture. They must have made it awfully hard to implement superscalar, out-of-order execution.

However, many of the later RISC architectures do share one annoying flaw. Immediate values are constants embedded within the instructions. Sometimes these are used for small values within expressions, but often they're used for addresses, which are 32-bit or 64-bit values. On PowerPC, as on SPARC and MIPS, the mechanism for storing 32-bit immediates can only encode a 32-bit value by splitting it across two instructions: a "load high" followed by an "add". This is a pain. Sometimes the two instructions containing the value are some distance apart. Often you have to decode the address by hand, because the disassembler can't automatically recognise that it is an address. This design pattern turns up on almost all RISC architectures, though ARM does it differently (large immediates are accessed by PC-relative loads). When I worked on an object code analyser for another sort of RISC machine, I gave up on the idea of statically resolving the target of a call instructions, because the target address was split across two instructions, one of which could appear anywhere in the surrounding code.

The x86 system for storing 32-bit/64-bit immediates is much nicer. They just follow the instruction, which is possible because the instruction length is variable. Variable-length instructions are not usually seen in RISC, the Thumb-2 instruction set being the only exception that I know of.

What frustrates me about the whole RISC versus CISC debate is the unfairness of it all. It seems really biased against x86. If you're going to criticise x86 for its legacy features, shouldn't you also mention the legacy features of RISC architectures? The debate seems to be based on 1980s "talking points". There's an Edison/Tesla narrative of "evil empire" versus small competitor, where the small competitor uses good science to do it better, but gets squashed anyway by the evil empire's desire to preserve its monopoly. And there's the question of how much better CPUs might have been, if the "bad guys" hadn't won.

There are elements of truth here, but also major misrepresentations. Actually, once a CPU design scales up, it hardly matters if the instructions can map efficiently onto a 5-stage design like the one in the textbook. At this point, it's probably better to have an efficient instruction encoding, save on memory bandwidth and instruction cache space, and have a comprehensible instruction set. Hence x86.

After 2000, a new meme joined the RISC/CISC debate. RISC had won, as x86 CPUs were now "RISC on the inside". The people who repeated this bizarre claim meant that the newer Intel and AMD designs converted x86 instructions to "RISC-like" micro-instructions before they entered the out-of-order execution unit. But RISC hadn't won at all. CISC had always involved decoding instructions into micro-instructions. RISC was an attempt to abolish the process of converting instructions to micro-instructions, by having simple instructions which fit the hardware well and are easy to decode. But the CISC approach came back because it worked better at scale. Nowadays the out-of-order superscalar PowerPC and ARM CPUs use the x86 approach, not the other way around.

Unfortunately, CPU architecture is ripe for snake-oil. There is hype, there are bandwagons, and there is even flim-flam when someone proposes a new sort of architecture and invites you to subscribe to their Kickstarter. The subject is so complex, CPUs are so hard to compare to each other, and it's very hard to distinguish between good marketing and good engineering. Benchmark results can be selective and may even be totally meaningless. As yet, there is no all-encompassing science of CPU design which would lead to the perfect architecture, because the CPU is just one part of a system comprising compilers, applications, legacy software, input data and user requirements. This system is too large and complex to be understood by any scientific method currently available to humanity. At best we can understand parts of it - how to make a better cache, for instance, or a better branch predictor, for some limited definition of "better", measured against some small subset of software.

As a university researcher I once looked into the briefly-fashionable subject of "hardware/software codesign", where the idea is to come up with the perfect mixture of hardware and software to solve a specific problem. Hardware description languages make it possible to define hardware much like software, so in principle, you can take any algorithmic problem and produce an ideal combination that solves it, minimising time or energy consumption. But in reality the "codesign problem" is nearly impossible to examine scientifically because of the number of degrees of freedom available. And this is merely a small part of the larger problem of building an ideal general-purpose computer architecture.

In conclusion, RISC instruction sets are not always very nice, and x86 is not particularly nasty. A good initial design can create a legacy problem years later - and no CPU architecture is free of that. Be wary of anyone who says they can do better.

Sunday, 31 January 2016

Looking at the Linux kernel with "reverse maps"

I was asked to add some functionality to the Linux kernel to help capture timing information. The target platform is an embedded systems development board with a PowerPC CPU. This runs a (basically) vanilla version of Linux version 3.12 for PowerPC. In "user space" the distribution is Debian version 8.

The problem facing me here was my limited familiarity with the kernel. Like most people with CS degrees, I have a basic understanding of what it does, but I don't know the implementation details - especially not for PowerPC. There was much to learn. In the "arch/powerpc" directory, I was faced with 37000 lines of assembly code and nearly 152000 lines of C. A daunting prospect, and that is just the architecture-specific part of the kernel! How could I make sense of it quickly?

The aim was to add a feature which would log all transitions between user space (an application) and the kernel. This would form part of a RapiTime integration for one of Rapita's customers. Application measurements had to exclude the time spent in the kernel.

Faced with the complexity of the code, I began by a sort of informed guesswork. I thought of events that would cause kernel <--> user transitions: system calls, task scheduling and external interrupts triggered by hardware devices. I added tracing code for those cases, using the elegant "tracepoint" mechanism that is already in the kernel. I got this working. The log was captured by a kernel module and could be "downloaded" through a device in /dev.

Then I found that there were plenty of other ways to enter the kernel. These didn't show up in my log. Instead they showed up in timing measurements as long unexplained latencies. On this platform, the causes included floating-point emulation, page faults, alignment exceptions, illegal instruction exceptions and timer interrupts. All of these events needed to be detected. I could not miss anything out: the application measurements had to be accurate.

I realised I did not know how to detect these events. There was a lot of code in "arch/powerpc" which appeared to be related to them, but there is also a lot of replication to support different siblings in the PowerPC family: the code for 40x isn't all the same as 440, and different again from 603e. E200 and E500 have their own code too. Though all these CPUs share the same basic instruction set, there are crucial low-level differences in the hardware which require different drivers. These differences are rare in the x86 world where compatibility is crucial, being limited to workarounds for buggy chipsets, but they're completely normal in the embedded systems world.

Preprocessed assembly in entry_32.S.

"arch/powerpc/kernel" is filled with preprocessor macros (#ifdef, etc.) to enable different code paths depending on the exact variant of PowerPC as specified in the kernel configuration: CONFIG_40x, CONFIG_44x, etc. This is extremely complex. It is very hard to see which code will actually be used (live code), and which is omitted for a particular platform (dead code). Just looking at the C and assembly code was not enough for me to understand how the kernel was actually handling interrupts. I saw many different interrupt handlers for different PowerPC variants. I could not unpick which ones were in use.

The solution was to look at the code through a "reverse map".

In low-level programming, a "map" is a file showing the relationship from machine code addresses to symbols. Sometimes a map also reveals the relationship to source code lines. You can obtain a map for a Linux program by running the "nm" tool. You can go even further with the "addr2line" tool, translating addresses to source code lines, if the program includes debugging information. But the "objdump" tool beats both of them for many purposes, because it provides addresses, symbols, source code and disassembled machine code all in the same place! Here's an example:

objdump of the kernel, showing the (bootup) entry point.
The three columns are (left to right) the address, the
machine code and the (decoded) machine instruction.

A "reverse map" is the opposite of a map: it goes from source code line to code address. scans the output of "objdump" when applied to the "vmlinux" (containing the kernel) It reads code addresses and source code lines. For each source file, it creates a map from line numbers to addresses. This reverse mapping is saved in a data file alongside the source. combines a source file with the reverse map data, annotating the source code to show corresponding addresses:
Preprocessed assembly code in entry_32.S, viewed with reverse map information (left)
The left-hand side of the source code shows the addresses, within C-style comments (/**/). You can see that the code for CONFIG_44x and CONFIG_40x is not present in "vmlinux", because it has no addresses.

This reverse map technique can unravel the complexity introduced by a preprocessor or compiler. It is language-independent, requiring only that debugging symbols are present. It could be applied to any program, not just a kernel, though it is probably most appropriate to programs which make heavy use of preprocessing and have many compile-time configuration options to control the preprocessor. Simpler programs are amenable to simpler techniques such as an identifier search within an IDE.

By using it, I was able to exclude many source files, knowing that they were not used in my kernel's build configuration. Within the remaining files, I could ignore large sections of code, again knowing that they are unused. This simplified the work enormously. I did not have to determine which of the many different MMU drivers actually mattered - the answer was already revealed. And I could immediately see where the interrupt vectors were:
 Interrupt vector code in head_fsl_booke.S (not somewhere I would have guessed)
I could also have approached the problem using a debugger, but I'd need a debugger capable of stepping through the kernel's setup code. This would require debugging hardware: an "in circuit emulator" capable of controlling the CPU directly. These are very valuable for low-level kernel development, but they're extremely platform-specific and very expensive, and I don't have one for this particular board. So I needed to find another way. If I were working with x86 code, I could use a system-level simulator such as QEMU in place of debugging hardware, but that won't work in this case because QEMU's PowerPC simulator is too generic, i.e. it doesn't simulate this specific sort of PowerPC platform. A reverse map, from "vmlinux" to source code, was the easiest way approach.

Sunday, 24 January 2016

Ada's influence on my programming style

"A language that doesn't affect the way you think about programming, is not worth knowing."

This is one of Alan Perlis' Epigrams on Programming. On the surface, Ada doesn't seem like the sort of language that might affect your thinking in any sort of positive way. Structurally it is like C, but with more constraints on what you can do, and with quite verbose syntax. Where is the thoughtfulness in a language that's like C, but is more difficult to use?

The answer is that the difficulties of writing Ada are mostly avoidable through better programming practice. This language punishes you for writing "bad" code, by requiring more elaborate approaches and more verbose syntax: basically, forcing more typing.

There are things that can be done easily in C, which are also possible in Ada, but require a lot more typing. Conversions between types are an obvious example. In C, type conversion is often implicit, perhaps provoking a warning if the conversion involves a dangerous assumption (e.g. "pointers are 32 bit words"). In Ada, type conversion is explicit. You spend more time stating exactly what should happen.

The whole point is, I think, to make you consider: "Am I doing this wrong?" Maybe you can avoid type conversions by creating more types, or by explicitly defining what certain operators do when applied to particular combinations of types. This will be better, because you're being clear about what you want. You're providing more information to the compiler, which can be checked for consistency, turning potential bugs into compile-time errors. But it's outside of the C mindset, where type names tend to be mere aliases, implicitly converted to/from "int" as necessary. As an Ada programmer I make heavy use of arrays and abstract data types, because they're easier to use than pointers. In C/C++ I would use pointers.

This aspect of the design of Ada goes back to the beginnings of the language in the early 1980s, and is a clearly intentional attempt to improve on C. The issues with C were well-known back then! And Ada is a sort of compromise. It doesn't stop you writing C-like code. It just makes it more difficult to write C-like code, so you'll think "how can I do this better?"

Another apparently thoughtful aspect of the design of Ada probably wasn't intentional. In 1983, object-oriented programming (OOP) was still a few years from mass popularity. Perhaps that's why the 1983 edition of Ada only supports it in the same way that C does: that is, it isn't a first-class feature of the language.

Most commonly, OOP is used to hide implementation details behind an interface, and associate a data type with some closely-related functionality. But OOP is not the only way to do that. A modular language - like Ada - provides an alternative. Implementation details can also be hidden inside a "package", which also associates data types and functionality. The main difference is that you can't extend the functionality by inheritance. There's no dynamic dispatching.

Later on (for Ada 2005), dynamic dispatching was added. Now you can do all of the same things that C++/Java can do, without needing to introduce your own "vtable" of function pointers. But the syntax is weird and verbose. I usually have to refer to the textbook in order to use it.

That forces me to consider: "Is OOP really what I want?" And the answer is very often "no". Usually all I really want is to encapsulate an implementation. I don't want a class hierarchy and inheritance. Supporting those things would not make the code better. It would make it less clear.

I think it is quite by accident that Ada ends up challenging OOP, providing an alternative that is simpler and therefore preferable in many cases.

My preferred design uses procedures to define the operations that are unique to this program, and packages to provide "library" functionality that is reusable between programs or shared between parts of the program. OOP is a last resort, used only when a library needs to be extensible in a way that is only conveniently arranged by inheritance.

Even something as "obviously OOP" as a database interface is much better expressed as a package. Classes only represent one type. That covers the database interface, but not associated types, such as identifiers for objects within the database. Packages, however, can contain any number of related types. So the database interface in the prototype tool I've been writing at Rapita has a different type for the identifier of each of the kinds of object it can deal with inside the database. I can't mix up the identifiers by mistake, because the program won't compile if I do.

There are many other examples I could mention here. I also like the Ada mechanism for supporting threads and locks, and the mechanism for templates ("generics") which is not as powerful as C++'s template system, but nevertheless keeps the parts that actually matter.

Ada has had a positive influence on my programming style. When used badly, it's a hassle to write, because you're forced to type more. When used well, the result is better code.