http://computingcases.org/case_materials/therac/th…

The Therac-25 case is an example of a project in which whatever could go wrong did go wrong. The mistakes included:

Simple programming errors (buffer overflow)
Reuse of code from earlier systems (it was erroneously assumed that the reused software had no errors)
Lack of technical and user documentation (including explanations of error messages)
Removing hardware safety features (relying entirely on software controls)
Compressed testing (expert users were not part of the testing)
Lack of communication between the divisions and departments of the manufacturer (the Canadian and US divisions did not know that accidents had been reported to the other division)
Lack of timely communication between hospitals using the product (they met only once a year, and in the pre-Web days little communication occurred in between)
Technicians ignoring error messages (older systems had errors that technicians could safely ignore)
Technicians placing too much reliance on the technology (they failed to believe patients' complaints)

Programming/Software Mistakes:

A buffer overflow, as the term is used here, occurs when a value is too large to be stored in the space set aside for it. (Strictly speaking, a number that is too big for its variable is an integer overflow; a buffer overflow in the narrow sense means writing past the end of a block of memory.) A common practice is to set a flag in a program to record the state of some occurrence. For instance, a 1 can be stored if some test is true and a 0 if it is false. The variable that stores this flag only needs to be big enough to hold the 1 or 0. A single bit would be enough, but the smallest unit a computer normally stores is a byte of 8 bits. An 8-bit variable can actually store any number from 0 to 255. An overflow would occur if you tried to store the value 256, which needs a ninth bit and therefore does not fit.

The accepted practice is to initialize the variable (give it a starting value) to 0. Then, as the program runs, it can test whether the flag is no longer 0, or whether the flag equals 1. Until something in the program changes the value, the flag is 0. When something changes so that the flag should no longer be 0, the software should set the flag to 1.
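The 0 to 255 limit described above can be illustrated with a short sketch. Python integers do not overflow on their own, so the sketch simulates a one-byte variable by keeping only the low 8 bits of each result. The names `store_8bit`, `increment_flag`, and `increments_until_zero` are made up for this illustration; this is not the Therac-25 code.

```python
BITS = 8
MAX_VALUE = 2**BITS - 1  # 255, the largest value an 8-bit variable can hold


def store_8bit(value: int) -> int:
    """Simulate storing a value in one byte: only the low 8 bits survive."""
    return value & MAX_VALUE


def increment_flag(flag: int) -> int:
    """Add 1 to a simulated 8-bit flag, wrapping around on overflow."""
    return store_8bit(flag + 1)


def increments_until_zero() -> int:
    """Count how many increments it takes for a flag starting at 0 to read 0 again."""
    flag = increment_flag(0)  # after the first increment the flag is 1
    count = 1
    while flag != 0:
        flag = increment_flag(flag)
        count += 1
    return count


if __name__ == "__main__":
    print(store_8bit(255))          # the largest value still fits
    print(increment_flag(255))      # one more and the stored value wraps around
    print(increments_until_zero())  # how often a repeatedly incremented flag reads 0
```

In this simulation a flag that has been changed many times can read 0 again, which a program would interpret as "nothing has happened".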
A common mistake that inexperienced programmers make is to add 1 to the flag (example a) instead of setting the flag equal to 1 (example b).

a) flag1 = flag1 + 1
b) flag1 = 1

Why would someone use formula a instead of b? In many programming languages formula a can be shortened to a compact increment form such as:

flag1 += 1

This increments flag1 by 1. Each time it runs the flag grows by 1, so the flag can eventually exceed the largest value its variable can hold. This is what was used in the reused code from the older Therac models.

Code is reused for a number of reasons. It saves time: writing everything from scratch is time consuming and may introduce new errors. Programming languages come with libraries of code that can be easily used; however, the real savings in time and cost come from reusing the code from previous versions. This is standard practice. Usually this technique means that the programmers are starting with well tested and (relatively) error free code (see the section on removing hardware safety features for more on this).

Software that is complex or reused over time is often developed by teams of programmers. The individual sections must follow prescribed rules in order to work together properly. When a problem occurs, it may be due to a mistake by a single programmer, or it may be due to two sections not working together properly. Technical documentation is extremely important for troubleshooting such problems. Without the documentation, the job of looking for these mistakes is more difficult. In the Therac-25 case the technical documentation had not been kept up to date, so it contained many errors and inaccuracies. Memory was also a big issue. Remember, this was before the personal computer and before the breakthroughs in technology that give us lots of memory in very small devices. To save memory, errors and other messages were given a numeric code that was defined in the documentation. Some of the error messages were not properly documented.
Since documentation is an extra step that does not directly add to the software, it is sometimes missed or intentionally skipped if a project is running behind schedule.

Removing hardware safety features

The previous Therac devices had exceptional safety records. They used a combination of hardware and software controls to ensure safety. Because the hardware and software controls were redundant, and the hardware controls were very expensive, the manufacturer decided to remove the hardware controls and rely on the software controls alone. If the relatively inexpensive software controls worked, then the hardware controls added unnecessary cost to the machine. If the manufacturer could reduce the cost of manufacture, then more hospitals could afford the machine, more lives could be saved, and the company would see higher profits. A win-win situation. Unfortunately, errors in the reused software code had not been revealed previously, because the redundant hardware controls had masked them. When the hardware controls were removed, the software failed to catch errors in the settings made by the technicians, resulting in patients receiving radiation overdoses.

Compressed testing

During the testing stage, two types of testing should occur. The first involves the quality assurance team at the manufacturer. This team should include developers who were not part of the team that worked on the project. The second should involve actual professionals who would be using the product on the job. The manufacturer appears to have cut out the second stage of testing. If real-world professionals had been used, the buffer overflow (one of the errors that was ultimately found responsible for the overdoses) would likely have been caught. In simple terms, the buffer overflow occurred because the technicians entered the information (settings) faster than the software could process it.
Error messages may have popped up. Some required a reset, so the technician had to start over; others allowed the technician to continue. The software lost track of some of the settings the technicians entered, which caused it to put the wrong values in some of the flags. This resulted in the buffer overflows (when the value in a flag should have gone from 255 to 256, the flag was reset to 0 instead) that allowed the radiation treatment to proceed with the wrong settings. The people who did the testing had only basic training on the equipment and did not enter the settings at the speed an expert would. Therefore, this error was not detected during testing. In other words, the testers used the equipment exactly as the developers expected them to.

Communication

The Therac-25 incidents took place before the communication revolution of the Internet and the World Wide Web. Communication between professionals within an industry, and even within a single business, was much more cumbersome. Email was unheard of outside of the military and academia. Communication was through memos (sheets of paper sent through interdepartmental mail or the Post Office). These sheets of paper were often read only by top or middle level management, and were disseminated to rank-and-file employees only if deemed necessary. In this light it is not hard to imagine how different divisions (located in different countries) and individual hospitals failed to communicate effectively with each other.

Another problem related to communication is what to communicate. In hindsight it is easy to point out the errors that caused the accidents. At the time they were being reported, the cause wasn't known. It was initially believed that the accidents were caused by operator error, which is only partially right (see the section on Human Factor). It wasn't until the FDA became involved that a thorough investigation revealed the multiple causes.

Human Factor

Technicians ignored error messages and continued treatments.
The older systems had errors that technicians could safely ignore, usually by reentering some data and then continuing. Some of the errors in the older systems shut the machine down, forcing the technicians to restart. Without the hardware safety controls, the system was less likely to shut down. Technicians were making decisions based on their experience with the older technology. In some cases technicians placed too much reliance on the technology. They had confidence in the equipment (based on the safety records of prior technology) and therefore discounted patients' complaints.

Weekly Forum Discussion Questions

Remember that these incidents happened nearly 30 years ago, before the World Wide Web and the instant communication society we live in today. The Therac-25 case often leads one to believe that reuse of code is a bad thing, that businesses are evil and only care about making money, and that technology kills people. The truth is that the reuse of code is necessary for advancement. Businesses have not only a right to make money but an obligation to make money to support their employees and investors. While technology can kill people (intentionally or unintentionally), it has also saved millions, if not billions, of lives.

Where do you think the medical industry would be today if machines like the Therac-25 were discontinued because of these types of problems?
Where would the transportation industry be, including air, ground and sea transportation, if computers were not used because of the early problems?
Can you think of other industries that initially had problems implementing technological advancements?
How did they overcome the initial setbacks?
What improvements have these industries experienced due to technology?