October 30, 2006

Software Core Values

What core values do you use when you design software?  For VxWorks, we defined the core values to be small footprint, scalable and modular functionality, high performance, low latency and determinism, and reliability.  Enumerating the core values for software is pretty useful – it allows you to evaluate designs or technology at a high level to see if there is a good fit for a specific problem (OK, opportunity) domain.

When Wind River needs to leverage some software technology from one product into another, one of the things we consider is the set of core values to see if there is a match.  This helps us identify any integration problems or technology re-work that might arise from combining the software.  For example, a network stack from an open source “best-effort” operating system that was designed for optimal throughput might not align well with respect to the low latency or footprint core values of a real-time operating system like VxWorks.

Most software component management infrastructures, such as Yum, seem to focus on managing interface dependencies and resolving design-time and run-time dependencies.  I agree this is very useful stuff but in the device software space we need to consider core value alignment as well.

Core values are interesting pieces of meta-data: they are properties of the software that are not immediately visible upon inspection of the software.  It would be interesting if there were a way of tagging software components with meta-data that described the software core values and other software attributes which are sometimes hard to detect (such as intellectual property rights).  Then, when we have a domain-specific set of requirements for software, we could find and leverage software components using queries that are much more sophisticated than simply considering interfaces.  Is anyone aware of any open technology initiatives which would help in this area?
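
To make the tagging idea a little more concrete, here is a minimal sketch in Python.  Everything in it is an assumption made up for illustration: the tag fields (footprint_kb, deterministic, license), the component names and the find_components helper do not come from any existing packaging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentTag:
    """Hypothetical meta-data attached to one software component."""
    name: str
    provides: frozenset      # interfaces the component implements
    footprint_kb: int        # assumed tag: static footprint in KB
    deterministic: bool      # assumed tag: bounded worst-case latency
    license: str             # assumed tag: intellectual property attribute


def find_components(catalog, interface, max_footprint_kb,
                    need_deterministic, allowed_licenses):
    """Return components that match the interface AND the core values."""
    return [
        c for c in catalog
        if interface in c.provides
        and c.footprint_kb <= max_footprint_kb
        and (c.deterministic or not need_deterministic)
        and c.license in allowed_licenses
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    ComponentTag("throughput_stack", frozenset({"tcpip"}), 900, False, "BSD"),
    ComponentTag("rt_stack", frozenset({"tcpip"}), 120, True, "commercial"),
    ComponentTag("tiny_fs", frozenset({"filesystem"}), 40, True, "BSD"),
]

matches = find_components(
    catalog,
    interface="tcpip",
    max_footprint_kb=256,
    need_deterministic=True,
    allowed_licenses={"BSD", "commercial"},
)
print([c.name for c in matches])
```

Both network stacks satisfy the interface query, but only one also satisfies the footprint and determinism requirements, which is the kind of mismatch an interface-only dependency resolver would not catch.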

October 06, 2006

Manageable Devices

The more we rely on intelligent devices in our lives, the more they need to be easily managed.  It just doesn't scale for us to look at the LCD panel of dozens of devices in our homes just to see if they are all still working properly.  Who has time to do that once a week, let alone once a day?  And if my thermostat decides that the furnace isn't doing what it's been told, I'd like to know before the aquarium freezes solid.  The devices we rely on need to be manageable - and connected.

Enterprise and carrier networking products are devices we rely on that have been doing remote software management, diagnostics and control for decades.  They call it OA&M for Operations, Administration and Maintenance.  My PVR already has some remote OA&M capabilities and I know it is occasionally upgraded by someone who is definitely not in my house - I like that.

I think connectivity is an important part of the equation but I don't want to just enable a whole pile of low-value emails that are directed at me from devices.  I don't want to know about every little event that happens to all the devices I care about.  I'd like some sort of filter between them and me.  If the phone line goes dead I'd rather just be notified of that than have every device connected to the phone line tell me the phone line is dead.   To solve that one you need some overseeing entity looking for event patterns and aggregating them into related bundles of events.  This could be done with standardization on event notifications and classifications and an expert system helping along the way.
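
As a rough sketch of that overseeing entity, the Python below groups device events by a suspected shared cause and produces one bundle per cause.  The symptom names, the ROOT_CAUSE table and the bundle_events function are all hypothetical.  A real system would need the standardized event classifications mentioned above, and probably something smarter than a lookup table.

```python
from collections import defaultdict

# Hypothetical classification table: device-level symptom -> suspected cause.
ROOT_CAUSE = {
    "no_dial_tone": "phone_line_down",
    "modem_no_carrier": "phone_line_down",
}


def bundle_events(events):
    """Group (device, symptom) events into one bundle per suspected cause.

    Symptoms with no known shared cause are bundled under their own name.
    """
    bundles = defaultdict(list)
    for device, symptom in events:
        cause = ROOT_CAUSE.get(symptom, symptom)
        bundles[cause].append(device)
    return {cause: sorted(devices) for cause, devices in bundles.items()}


# Made-up events: three devices notice the same dead phone line.
events = [
    ("fax", "no_dial_tone"),
    ("alarm_panel", "no_dial_tone"),
    ("pvr", "modem_no_carrier"),
    ("thermostat", "low_battery"),
]

bundles = bundle_events(events)
for cause, devices in bundles.items():
    print(f"{cause}: reported by {', '.join(devices)}")
```

Four raw events collapse into two notifications, and the phone line problem is reported once rather than once per attached device.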

Even when events are important, sometimes it's better to get recent events than major events that are old.  When the nuclear generator at Three Mile Island had an event, the printer outputting the alarms was backlogged by 2.5 hours.  The system was working as designed but the humans reading the alarms probably wanted the most recent data.  Alarm bundles may have helped there.
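
One simple way to favor recent data over a growing backlog is a bounded buffer that discards the oldest alarm when it is full.  The sketch below is only an illustration of that policy, not a description of how any real alarm system works.  The RecentAlarmBuffer class and its capacity are made up.

```python
from collections import deque


class RecentAlarmBuffer:
    """Keep only the newest `capacity` alarms instead of queueing forever."""

    def __init__(self, capacity):
        self._alarms = deque(maxlen=capacity)  # full deque drops the oldest
        self.dropped = 0                       # how many old alarms were lost

    def post(self, alarm):
        if len(self._alarms) == self._alarms.maxlen:
            self.dropped += 1
        self._alarms.append(alarm)

    def drain_newest_first(self):
        """Return buffered alarms, newest first, and empty the buffer."""
        alarms = list(reversed(self._alarms))
        self._alarms.clear()
        return alarms


buf = RecentAlarmBuffer(capacity=3)
for i in range(1, 6):
    buf.post(f"alarm-{i}")

recent = buf.drain_newest_first()
print(recent, "dropped:", buf.dropped)
```

The trade-off is explicit: old alarms are lost, so the dropped count should itself be reported to the operator.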

My friend's brother's wife's sister's husband is a radiologist in Ontario, Canada.  Their CAT scan machines are centrally monitored so if a machine has a problem they can remotely diagnose the problem to make sure it is not a dangerous issue.  The monitoring is triple-redundant around the world.  That's pretty cool.

More connected devices drive a need for better device management software.  That same device management software could be programmable to react to events (or bundles of events or alarms) in some user customizable way.  For example, if my sump pump starts running at a weird time of year maybe the water pump should just be shut off (could be a leak in a pipe somewhere in my house).  But someone else might want their house to do something else in that case - maybe it's a normal event in their house.
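
Here is a minimal sketch of what user-customizable reactions could look like, using the sump pump example.  The RuleEngine class, the event fields and the choice of which months count as a weird time of year are all assumptions for illustration.

```python
class RuleEngine:
    """Tiny hypothetical rule table: (predicate, action) pairs per household."""

    def __init__(self):
        self._rules = []

    def when(self, predicate, action):
        self._rules.append((predicate, action))

    def handle(self, event):
        """Run every action whose predicate matches; return their results."""
        return [action(event) for predicate, action in self._rules
                if predicate(event)]


def sump_pump_off_season(event):
    # Assumption: in this house the pump should not run in the winter months.
    return (event["source"] == "sump_pump"
            and event["state"] == "running"
            and event["month"] in (12, 1, 2))


# My house: treat an off-season sump pump as a possible leak.
my_house = RuleEngine()
my_house.when(sump_pump_off_season, lambda e: "shut_off_water_pump")

# Someone else's house: the same event is normal, so only log it.
other_house = RuleEngine()
other_house.when(sump_pump_off_season, lambda e: "log_only")

event = {"source": "sump_pump", "state": "running", "month": 1}
print(my_house.handle(event), other_house.handle(event))
```

The same event produces different reactions in the two houses because the rules, not the devices, carry the household-specific policy.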

Anyone have any interesting ideas for features that could be achieved by simply connecting remotely manageable devices together via some custom programming?

September 29, 2006

Software Componentization

Componentization is a seductive software development practice.  Simply put, software componentization is when you break your software system down into smaller easily identifiable pieces that have well-defined interfaces – and you do this in a specific way (i.e. by following a component model).

The promise of software componentization is five-fold:

  1. Complexity Management: having a manageable unit of software functionality enables design time tools and run time tools to help developers divide and conquer their increasingly unwieldy device software.
  2. Product Flexibility: the ability to easily replace a component when product requirements change provides product flexibility.
  3. Product Reliability: matured components that meet well-defined specifications and get re-used increase product reliability.
  4. Time-to-Market: re-use of components in new products accelerates product development and lowers cost.
  5. Collaboration: a common format for software implementations helps multiple companies collaborate more easily.

A software component must satisfy these conditions: 1) it is a unit of functionality with a well-defined interface; 2) it is a unit of composition in a larger system; and 3) it conforms to some known component model.   A component model must prescribe a) how a component is described and what the description contains; and b) how components bind to each other at design time and/or at runtime.
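
The conditions above can be sketched as a toy component model in Python.  This is only an illustration under made-up assumptions: a component description consists of a name plus the interfaces it provides and requires, and the bind function does design-time binding by matching interface names.  It is not any particular real component model.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical component description: what it offers and what it needs."""
    name: str
    provides: set
    requires: set


def bind(components):
    """Resolve every required interface to a provider, or fail loudly.

    Returns a dict mapping (consumer name, interface) -> provider name.
    """
    providers = {}
    for c in components:
        for iface in c.provides:
            providers.setdefault(iface, c.name)  # first provider wins

    bindings = {}
    for c in components:
        for iface in sorted(c.requires):
            if iface not in providers:
                raise LookupError(
                    f"{c.name} requires {iface!r}, which no component provides")
            bindings[(c.name, iface)] = providers[iface]
    return bindings


logger = Component("logger", provides={"log"}, requires={"fs"})
ramfs = Component("ramfs", provides={"fs"}, requires=set())
flashfs = Component("flashfs", provides={"fs"}, requires=set())

print(bind([logger, ramfs]))    # logger binds to ramfs
print(bind([logger, flashfs]))  # swap the implementation, same interface
```

Because the logger only names the interface it requires, replacing ramfs with flashfs needs no change to the logger, which is the product flexibility promised in the list above.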

Most software developers already create products or subsystems out of multiple pieces of previously written code - often augmented with something new - and they usually know how to replace those pieces with alternate implementations. Sometimes it is easy to replace an implementation, sometimes not. I have found that when a piece of software enforces a strict interface (preferably as small as possible) and keeps as much of the implementation as possible private then it gets easier (cheaper) to maintain that piece of software over time. And, as a bonus it often gets easier to reuse that piece of software since you can look at the interface and understand what you are getting. If you know what components it uses directly or indirectly, you can easily know if it can be quickly ported to a new environment or not.

So if we have been doing this all along, why don’t we have a huge Chinese menu of well-defined components to choose from when we build reliable device software? I think this is because the traditional development process is to break down a large problem into sub-problems and solve those.  We call that decomposition. To benefit from software componentization we need to have a paradigm shift from decomposition-based thinking to composition-based thinking. Design by composition starts by putting existing components together to meet the product requirements. And by investing in successive refinement of the components (through component parameterization, or component reimplementation that keeps the same interface) we will in the long run end up with an arsenal of componentry that can be used to build products more cheaply and faster.

This is a short-term vs. long-term issue in many cases. By implementing exactly what is needed we can make that short-term product and then evolve its implementation. However, by investing in a component-based approach to build components that are well-defined and flexible, we get the payoff when we re-use them – the next time around - and component evolution (including bug fixes) will pay off multiplicatively.

September 22, 2006

Fault Propagation Paths

I like to use the term fault propagation a lot.  It has so many uses.  When we talk about multiprocessor systems, we think about fault propagation across a bus or other interconnect at the electrical level, the protocol level and even at the service level.  It is bad enough if a fault in a piece of code manages to take down the processor it runs on.  It is multiplicatively worse if the fault propagates to take down other processors directly or indirectly.

Fault propagation is a problem even within a single processor.  We put code out in processes to help isolate the code.  But there are often fault propagation paths that confuse other parts of the system if the process fails.  These fault propagation paths can be fairly predictable (e.g. a bad pointer overwrites a data structure in memory shared by another process) or they can be quite subtle (e.g. one process does a lot of IO on a shared disk causing another process to slow down even though lots of CPU is available for both). Figuring out what caused the fault in the first place can be really hard - especially if the symptom shows up elsewhere via a subtle fault propagation path.  OSs have implemented all kinds of abstractions and entities to help reduce fault propagation paths. Processes are one useful mechanism we can use. Processes can help reduce the scope of what we have to consider when we detect that a fault has occurred. 
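
As a small illustration of a process acting as a fault barrier, the Python sketch below runs a deliberately faulty piece of code in a child process.  The FAULTY_WORKER snippet and the run_isolated helper are made up for this example.  Note that this only blocks the predictable paths: it does nothing about the subtle ones, such as contention for a shared disk.

```python
import subprocess
import sys

# Hypothetical faulty code: indexes past the end of a list and crashes.
FAULTY_WORKER = "data = [1, 2, 3]; print(data[10])"


def run_isolated(code, timeout_s=30):
    """Run code in a separate Python process; report how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode, result.stderr


parent_state = {"healthy": True}  # lives only in the parent's memory

returncode, stderr = run_isolated(FAULTY_WORKER)

# The child died, but the parent keeps running with its state intact
# and knows exactly which process to blame.
print("child exit code:", returncode)
print("parent still healthy:", parent_state["healthy"])
```

The fault shows up as a non-zero exit code and an error trace from the child, which narrows the scope of what we have to consider when diagnosing it.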

I suspect that a large part of the code in any system is related to managing faults in some way.  We could include resource management subsystems (so one task doesn’t consume all the CPU, files or memory resources), network management (so you can see what is going on and react), debug infrastructure, software upgrade, diagnostics, permissions, error reporting, and so on – it is a lot of technology that we need to keep control of our applications.

The good news is that the technologies we are using to fight fault propagation paths are getting more powerful.  Multicore processors are one opportunity to insert a barrier for fault propagation paths – simply put applications on separate cores so that they can’t steal CPU from each other directly.  Multicore processors are like having time and space partition technology in hardware.  Processor virtualization technology is the ultimate software-based time and space partitioning technology.  If we virtualize an entire processor and control what resources it gets and when it gets them (especially CPU and memory) we can protect other applications from those that run in the virtualized environment.

My concern is that as these tools are getting more powerful, those of us that are runtime developers will have a harder and harder time figuring out what is going on inside our systems.  For a while back in the late 1980s I was developing GUIs on X-Windows version 10.  I remember this nightmare I had about an errant pixel in one of my widgets.  The layers of software I had in that system spanned application processes (X client to X server), protocols (they communicated over TCP/IP), OS primitives (sockets and other system calls), device drivers (the framebuffer), compilers and the operating system.  My nightmare was that the bug that was setting this bit was so far down I could not get to it without an in-circuit emulator and OS source code.  Luckily I woke up from that nightmare.

But the nightmare could come back.  With multiple operating systems on a multicore processor or even in a single processor with multiple virtualized processors, the layers of software in our devices get thicker and even harder to fathom.  We need to trust that the environment provided is stable.  Java developers have been doing this for a while.  But I like to "trust but verify" so I want great tools that illuminate the answers to queries like “how much” or “when” or “what”.  I think there is a lot of value in having smarter tools for the run time – not just the design time – that will help us manage, visualize and control these more complex runtime environments we are increasingly using to manage fault propagation paths.

I am not sure what these tools exactly are but I am pretty sure I will need them.  Anyone have any ideas in this area?

Maarten Koning

  • Maarten Koning is a Senior Principal Technologist at Wind River with 20 years of device software experience. Maarten got hooked on device software development back in 1984 when he designed a MIDI interface for his Commodore SuperPET computer and wrote a music sequencer for it in 6809 assembler.