Data Processing at One Center
For forty years, John Vilandre was associated with or directed the data-processing and analysis operations at the Laboratory of Physiological Hygiene, which pioneered the field of CVD prevention research, and then at the Division of Epidemiology, University of Minnesota School of Public Health. His depiction here of historic transitions in data management at that center reflects the technical developments common to many North American institutions involved with epidemiological studies over the same period, from the 1960s to the 2000s. In a 2005 interview, he began by describing the mechanical calculators used from the outset and well into the 1960s:
“These calculators were able to accumulate sums and calculate sums of squares and cross-products of the two variables, values used to compute descriptive statistics such as means, standard deviations, and correlation coefficients. The Monroes and Friedens (brands of mechanical and later electrical calculators) were at first hand-operated but by the time I arrived they were electric. Such mechanical machines placed practical limitations on the sizes of studies because someone had to sit there and add up all those data by hand and do the squares. And, of course, if you have a column of 100 or more numbers and punch it in twice, you are often going to come up with different answers. So you must do it a third time to get the thing rectified.”
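The bookkeeping Vilandre describes (accumulating the count, the sums, the sums of squares, and the cross-product in a single pass) is exactly what yields means, standard deviations, and the correlation coefficient. A minimal sketch of that one-pass arithmetic, with illustrative names not taken from the original:

```python
import math

def one_pass_stats(xs, ys):
    # Running totals a calculator operator would have accumulated by hand:
    # n, sum(x), sum(y), sum(x^2), sum(y^2), sum(x*y).
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    mean_x, mean_y = sx / n, sy / n
    # Sample standard deviations from the accumulated sums of squares.
    sd_x = math.sqrt((sxx - sx * sx / n) / (n - 1))
    sd_y = math.sqrt((syy - sy * sy / n) / (n - 1))
    # Pearson correlation from the cross-product sum.
    r = (sxy - sx * sy / n) / math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return mean_x, mean_y, sd_x, sd_y, r
```

Note that nothing beyond the six running totals need be retained, which is why a machine that could only accumulate sums sufficed for these statistics.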
Eventually a common technique for handling large samples was to use Hollerith card-sorting machinery to order the cards on a particular variable, then determine quintile, decile, or centile cut points to use as entry values for the calculating machines. (Vilandre 2005)
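That sort-then-cut procedure, ordering the deck on one variable and reading off the boundary values, takes only a few lines to sketch in modern terms; the names here are illustrative, not from the original:

```python
def quantile_cut_points(values, k):
    # Sort the "deck" on the variable of interest, as the card sorter did,
    # then read off the k-1 boundary values that split it into k
    # roughly equal groups (k=5 gives quintiles, k=10 deciles).
    deck = sorted(values)
    n = len(deck)
    return [deck[(i * n) // k] for i in range(1, k)]
```

For a deck of 100 cards and k=5, the cut points are simply the values at the 20th, 40th, 60th, and 80th positions of the sorted deck.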
Rose Hilk, another long-term LPH data analyst, tells of the time “there was a huge card sort going on in the Laboratory and we had a large bin on the wall that held the cards as they came out in order. Dr. Keys was eager to get this analysis done, so he came in late one evening and loaded the card stacker with his finished work. It fell off the wall later that night and I discovered it in the morning. Needless to say, I had to pick up all the cards and resort them!” (Hilk 2005).
Such near-disasters were a common experience of the punch-card era. Kalevi Pyörälä recounted a harrowing episode in the life of the Helsinki Policemen Study:
“We had a frightening accident in 1970, when all the data files of the study, including the paper archives and large packages of punch cards, had to be transferred from the Institute of Occupational Health, where the first round of the study was done, to the new location provided by the Finnish Heart Association. Pirkko Parviainen (later Siltanen), who was our study nurse, ordered a lorry from a transport enterprise to take them to the new location, but they did not arrive at the given address.
For some weeks nothing was heard about them, but then Pirkko got a phone call that the lost packages had been found among other items in the storeroom of the Helsinki main railway station. Thereafter we were more careful and always kept copies of the data in several places. This became easier with the advent of magnetic tapes” (Pyörälä, pers.comm.).
As cumbersome as the punch-card system was, it represented a giant step forward in data analysis–a leap into the computer age. Old-timers compare notes about the excitement generated by each technological advance that seemed revolutionary at the time but then was viewed as laughably primitive only a short time later. Vilandre recalled how precious drawers-full of Minnesota data on punched cards were at first transported to the site of large Univac and Control Data computers on and near the Minneapolis campus. This continued for some years, even after the LPH got its first very limited-capacity computer, the Digital Equipment Corporation’s “PDP 8” (“PDP” for “Programmed Data Processor”)–introduced in 1963 by Bill Parlin, actuary and data manager at a Twin Cities life insurance company, who became the Lab’s first computer programmer. That machine began the transition to computer autonomy:
“Everyone has a computer these days and talks about megabytes and gigabytes of memory. That machine [the PDP 8] had 4 kilobytes of memory! It had no storage device of any kind except paper tape, which isn’t really internal. There was no disk drive, no tape drive, no display CRT. Each time you ran a program you had to load it in from the paper tape, and the numbers were punched out on a teletype machine of the kind used by the news media in those days . . . With its attached punched paper tape, it allowed [small] programs and data sets to be stored for later use.”
“The first thing [Parlin] did was to write a little piece of code called a ‘handler’ (today we call them ‘drivers’) that would interface a card-reading machine with the computer. So now we could actually read the numbers from the punch cards into our computer. You couldn’t store anything. All the statistics were done with numbers stored in memory for that run only. Once you turned the machine off you had to start all over.
“So, although the PDP 8 allowed us to perform some data analysis in-house, its limitations (small memory, no internal data storage) meant that larger problems still needed to be taken outside” (Vilandre 2005).
Programmers at the time who were accustomed to working with high-level programming languages had to learn to do machine-language programming, Vilandre said, because trying to use Fortran with the limited memory available on the new, smaller computers resulted in inefficient code. “If you really wanted to get the maximum use of those 4,000 units of PDP memory, you had to write using the computer’s basic instruction set. It was great fun, much like doing puzzles” (ibid.).
The PDP 8 was replaced in 1973 by a PDP 11, which Vilandre called “a nice little machine.” Its auxiliary floating-point processor, now an integral part of any PC’s processor chip, was then a “6-foot-tall cabinet full of cards with resistors and capacitors–just to do basic multiplication and division.” Its magnetic tape storage allowed data exchanges, backups, and more sophisticated programming, though it was still a single-user machine. The department was growing rapidly, and the analytical staff had to work in shifts.
The next technological leap, five years later, was to the Lab’s first multi-user system, the PDP 11/34. “Each disk drive was a separate, floor-standing unit,” Vilandre said, “about the size of a small dishwasher. When you think of what’s in your little laptop now it seems crazy. But the nice thing about the PDP 11 was its capacity for multiple, concurrent users.” And direct key-to-disk data entry now allowed bypassing punch cards. Vilandre described the next transition:
“The second massive data media conversion, this time from punch cards to magnetic tape, was performed during this period. We had literally millions of punched cards by the mid-1970s. Rose Hilk, over a period of months, fed the cards through a reader attached to a computer and then we stored the data on magnetic tape. We had big bins in the hallway to dump the cards into once they had been read and when the card stock was sold to a recycling company, we got enough money to buy a microwave oven for the kitchen!” (ibid.)
Subsequent versions of the PDP 11, each faster and with greater capacity, formed the nucleus of departmental networks of thirty to forty users, in which the machines would perform sluggishly at peak-use hours. Then, with gigantic trials, surveillance, and community projects coming along in the 1970s, the data needs were accommodated by a new kind of DEC computer, the VAX series, still in use in many places. Macintosh computers began to be used as VAX terminals, and progress since then in operating systems and networks has been steady.
Before the VAX, it was still necessary to put data on the PDP 11 magnetic tape and take them to the University’s CDC 6600 for batch processing and for-a-fee storage. “Some of the jobs that we were running otherwise would have run for days,” Vilandre said, “while on the CDC machine they could run in a matter of hours. Today, they run on our own desktops in a few minutes” (ibid.).
About 1985, with computing demand increasing, the first VAX computer, a model 8600, was purchased. This new 32-bit virtual-memory machine removed most of the limitations on program size that were inherent in the PDP series. The new processor had sufficient capacity to run a relational database in software, allowing retirement of the Britton-Lee database machine and supporting the SAS analysis system, which quickly supplanted the use of BMDP.
The VMS (Virtual Memory System) operating system is now supported on newer, high-speed, 64-bit RISC processors. While the longevity of VMS has meant less retraining of programmers and other staff, today most users in the division at Minnesota use personal computers for the bulk of their work. The VMS systems are service providers, or servers, to the desktop and laptop systems, providing data storage, database capabilities, e-mail service, etc., to the end users.
For all that has been gained in the fast-moving history of high-tech computation, Vilandre noted that something has been lost, as well. With all its limitations, an old-fashioned mechanical calculator at least assured that “you had to understand the process. Now you put it in SAS and get a number out. Sometimes I think people don’t even know if their chosen statistic is the appropriate one to use. Anyway, that’s why we have statisticians to check the work. Basically, still true today, we human beings are the single weakest point–because we make errors” (ibid.).
Minnesota Epidemiology is now, like most such academic centers, a large, computer-capable, demanding family of faculty and staff requiring servers and intimate support services for computation and communication among many PCs and Macs.