Design of a general hierarchical storage system

(1)

(2)

LIBRARY

OFTHE

MASSACHUSETTS

INSTITUTE

OF

TECHNOLOGY

(3)

(4)

(5)

(6)

(7)

)28

1414

m~

Z

APR

3

5

1975

Center

for

Information

Systems Research

MassachusettsInstituteofTechnology

Alfred P.Sloan SchoolofManagement

50 Memorial Drive

(8)

(9)

DESIGN OF A GENERAL HIERARCHICAL STORAGE SYSTEM

Stuart

E.

Madnick

REPORT

CISR-6

SLOAN WP- 769-75

March

3,

1975

*MR fflttr

(10)

HT3

(11)

DESIGN OF A GENERAL HIERARCHICAL STORAGE SYSTEM

Stuart

E.

Madnick

Center

for

Information Systems Research

Alfred

P.

Sloan School of Management

Massachusetts Institute of Technology

Cambridge,

Massachusetts

02139

ABSTRACT

By

extending the concepts used in contemporary two-level hierarchical

storage

systems,

such as cache or paging systems,

it

is

possible to develop

an orderly

strategy for the design of large-scale automated hierarchical

storage systems.

In

this

paper systems encompassing up to six levels of

storage

technology are studied.

Specific techniques,

such as page

splitting,

shadow storage,

direct transfer,

read through,

store behind,

and distributed

control,

are

discussed and are shown to provide considerable

advantages

for

increased systems performance and reliability.

(12)

(13)

DESIGN OF A GENERAL HIERARCHICAL STORAGE SYSTEM

StuartE.Madnick

Centerfor Information Systems Research

AlfredP. Sloan School of Management

Massachusetts Institute of Technology

Cambridge, Massachusetts 02139

INTRODUCTION and

The evolution of computer systems has been marked

byacontinually increasingdemand for faster, larger,

andmoreeconomicalstorage facilities. Clearly, if

wehadthe technologytoproduce ultra-fast

limitless-capacitystorage devicesfor miniscule cost, the

demand could be easily satisfied. Due tothecurrent

unavailability of suchawonderous device, the

requirements of high-performance yet low-cost are

bestaatisfied byamixture of technologies combining

expensive high-performance devices withinexpensive

lowar-performance devices. Suchastrategy is often

calleda hlernrchicalstorage system or multilevel

storagesystem.

CONTEMPORARY HIERARCHICAL STORAGE SYSTEMS Range of Storage Technologies

TableIindicatesthe range of performance and

costcharacteristics for typicalcurrent-day storage

technologies dividedinto6cost-performance levels.

Although this tableisonly a simplified summary,it

does illustratethe factthat thereexists aspectrum

of devicesthat spanover6 orders of magnitude

U00,000,000\) inboth cost and performance. Locality of Reference

Ifallreferences toonline storage were equally

likely, the use ofhierarchicalstorage systemswould

be ofmarginal valueatbest. Forexample, consider

ahierarchical system, M',consisting of two types of

devices.Mil) andM(2), with access times A (l)-.05xA(2)

andcosts C(D-20xC(2), respectively. If thetotal

storage capacityS*were equally divided between M(l)

and M(2), i.e., S(l)-S(2), then we can determine an

overall cost per byte,

C,

and effective access time.

C

-.5xC(l)+ .5xC(2)

C* -10.5 x C(2) - .5045x C(l)

A' - .5xAll) + .5 xA(2)

A* - .525xA(2) - 10.5 x A(l)

From these figures,wesee thattheeffective

system, M', costs about half as much as M(l) butis

more than 10 times as slow. On theotherhand, M*

is almosttwice as fast asM(2), butitcostsmore

than 10tirn^sasmuch per byte.

Fortunately, most actual programs and

applica-tions cluster their references so that, during any

interval of time, onlya small subset ofthe

infor-mation is actually used. Thisphenomenonisknown

aslocality ofreference. Ifwe couldkeep the

"current" information ina smallamount of M(l), we

could produce hierarchical storage systems with an effective access time close to M(l) but an overall cost per byte close to M(2).

The actual performance attainable by such

hier-archical storage systems isaddressed by theother

papers in this session12<14<2°. 23and thegeneral

literature2'3<4-6•7«8'10.11.13,15,18,19,21, 22. In

many actualsystems ithasbeenpossible to find

over 90% of all references inM(l) even though M(l)

wasmuchsmaller than M(2)13-l 5.

Automated Hierarchical Storage Systems

There are at leastthree ways in whichthe

locality of reference can be exploited in a

hier-archical storagesystem: static, manual, orautomatic.

Ifthepattern of reference isconstant and

predictable overa long period of time (e.g.,days,

weeks,ormonths), the information can be statically

allocated to specific storage devices so that the most frequently referenced information is assigned to

the highest performancedevices.

(14)

(15)

If the pattern ofreference isnotconstant but

ispredictable, theprogrammer can manually (i.e., by

explicit instructions in theprogram) causefrequently

referenced informationtobemoved onto high

perfor-mance devices while being used. Such manual control

places aheavy burden on the programmer to understand

the detailed behaviorof theapplication and to

determinethe optimum way to utilize the storage

hierarchy. This can be verydifficultif the

applica-tioniacomplex (e.g.,multiple access data base

•ystem) orisoperating ina multiprogramming

environ-nent with otherapplications.

Inan automated storage hierarchy, all aspects

ofthephysical information organization and

distribu-tion and movementofinformation are removed from

theprogrammer's responsibility. The algorithms that

managethe storagehierarchymay be implemented in

hardware, firmware, system software or some combination.

Contemporary Automated Hierarchical Storage Systems

To date, most implementations of automated

hierarchical storage systems have been large-scale

cache systems or paling systems. Typically, cache

systems,usinga combination of level1andlevel 2

storagedevicesmanaged by specialized hardware, have

been usedin large-scalehigh-performance computer

.4,10,23 storage

storagedevices managed mostly by system software with

limitedsupportfromspecialized hardware and/or

firmware*.11, 12.

Impact of HewTechnology

A* a result of new storage technologies,

especial-lyfor levels 3and6,as wellas substantially

reduced costsfor implementing the necessary hardware

control functions,nany new types of hierarchical

storagesystems haveevolved. Several of these

advances are discussed inthis session.

Bill Strecker23 explains how the cache system

concept can be extendedfor use inhigh-performance

minicomputer systems.

Bernard Greenberg and Steven Weber 12_{discuss the}

design andperformance of a three-level paging system

usedin theMulticsOperating System (levels 2, 4, and

5 intheearlierHoneywell 645 implementation; levels

2, 3, and5 inthemore recent Honeywell 68/80

imple-mentation).

Suran Ohnigian-" analyzesthehandling of data

baseapplicationsina storage hierarchy.

Clay Johnson*'* explains howa new level6

tech-nology isbeingused inconjunction with level5

devices toformavery largecapacity hierarchical

Storagesystem, the IBM 3850MassStorage System.

Thevariousautomatedhierarchical storage systems

discussed in this session aredepictedinFigure1.

Cache

©

Main

©

Block

©

Backing

©

Secondary

©

Mass

o

t

©_

Q

z%

i s i

2

c I S

?

)©

i?

{645 68/80J r i

O

i

©

o

Figure 1

Various Automated Hierarchical Storage Systems GENERAL HIERARCHICAL STORAGE SYSTEM All of the automated hierarchical stoarge systems

described thusfar have only dealt with two or three

levels of storage devices (the Honeywell68/80 Hultiea

system actually encompasses four levels since it

containsacache memory which ispartially controlled

by the system software). Furthermore,manyofthese

systems wereinitially conceived and designed several

years agowithout the benefit of ourcurrentknowledge

aboutmemory and processor capabilities and costs. In

thissection the structure ofageneralautomated

hierarchical storage systemisoutlined (seeFigure2)

.

Continuous and Corplete Hierarchy

Anautomatedstoragehierarchy must consist of a

continuous and complete hierarchy that encompasses the

fullrange of storage levels. Otherwise, theuser

would beforced to rely on manual or semi-automatic

storage management techniques to deal with the storage

levelsthat are not automatically managed.

PageSplitting and Shadow Storage

Theaveragetime, Tm, required to move apage,a

unit of information,between two levels of the

hier-archyconsists of the sum of (1) the average access

time,T, and (2) thetransfer time,which is the

product of the transfer rate,R, andthesize ofa

(16)

(17)

Level 0. Processor francfez kilt 5. Secondary figure2 GeneralAutomated

Hierarchical Storage Systea

Ocice

cfFaceS:ze. By exa:

tative devices indicated in T^cle

tinesvar;. audi-ore than transfer rates. Asaresult,

Taisvery sensitive to the transfer unit size, S, for

devices with s-sll values of T [e.g.. levels 1and 2).

Thus, for these devicessi-allpage sizessuch as N=16

or32bytes are typically used. Conversely, for

devices with large values forT, themarginal cost of

increasing thepacesize is quite ssaall, especially in

view of the benefits possible due to locality of

reference. Thus,page sizes such as N-4D96 bytes are

typicalfor levels ;a-d5. "uch larger units of

informationtransfer are used for level 6devices.

Since themarginal increase in Ta decreases none—

tonically asa function of storage level, ther.-ier cf

bytestransferred between levels should ^-crease

correspondingly. Inorder to simplify the

implementa-tion ofthe systea and tobeconsistentwith the

normal mappingsfroraprocessor address references to

page addresses, itisdesirable thatall:a;esices be

apower cftwo. 7-.echoice foreach pagesize depends

upon the characteristics of theprogramstoberunand

the effectiveness of the overall stcrace system.

Pre-liminary measurements indicate thata ratio ofu.:1 cr

8:1between levels is reasonable. Meade:?_{has reported}

inilarfindings. Figure 3 indicates possible transfer

unit sizes between the6 levels presented inTable I.

The particular choice of size will, of course, depend upon the particulars of the devices.

Figure3

Saaple Transfer Unit Sizes

PageSol

—

tiro, sowletus cc-.siderthe art.il

movement of

:::->t.:-

i- the stcrace -lerarr-v. At

time : the processor generatesa refere-re to*

logicaladdress a. kssone that the corresponding

Information _s not currently stored i- - . :.-

-but -s found in M<3!. For simplicity, assume that

pace sizesare doubled as-- :r iota the ttierarcbj

(i.e.,N(2.3)-2xN(1.2). S(3,4)-2x.N(2,3)-4xS(1,2).etc.

,

see Figure•). The zzzecf sizeB12.3 containing

ais copied from M(3) to M(2). n(2) now ro-ta.-s the

reeded ir.form-aticr., sowerep-eat theprocess. Ota

page of size

U.2]

containing a .scopied froe M(2)

toM(l). sow, finally,th.e pagecf sizeN(0,1)

containinga iscopied free - l and fcrwaroedtothe

processor. In t-eprocess thepage cf Information is

splitrepeatedly as it moves up the biexaa

ic:

i.r

--resultof the splitting

t--3t is reoe..-: b]

5fta s - 5d o .

consisting of itself

Indinall the

process, the page

theprocessor -is

and itsadjacent t

praa rappl

localityofreference.-a-, of t--- . igeswill

be refer- ;fter--ardand oe

owed

farther

bv i- the hierarchy also.

:--p

:i-rof Fares. I- t-estrateov presented.

capesareart-allycopied as theymove up the hierarchy]

a pageatlevel i -as ore copyof itself ;- ear- |

-levels]> i. :r *read' rep-.ests

' " wite- requests tor-est applications

• - rams t.-.e

thus

.fa

page

hasnot been :-i-:e: and -s selectedto oeremoved

Level to a Lower ivel, it reednot o-e

act. all- tu-i-sferredsince avalidcopyalready exists

in the lowerlevel. Theoo-ter.ts cf any level of t--e

hierarchyis always asucset of t_-.einformation

(18)

(19)

TT.

I IN(0,1)

I

•N(2,3)

V////////^///////A

-i I N(3,4)

1"

T

Page Splitting and Shadow Storage

Direct Transfer

In thedescriptionsaboveit isimplied that

information isdirectlytransferred between adjacent

levels. Bycomparison, most presentmultiplelevel

storagesystems are based upon an indirect transfer

(e.g., theMultics PageMultilevelsystem1^). in an

indirect transfersystem,all information is routed

through level 1or level2. Forexample, tomovea

page fromlevel n-1to leveln, thepage ismovedfrom

leveln-1 tolevel 1andthen fromlevel 1to leveln.

Clearly, anindirectapproachisundesirable

since itrequires extra pagemovementand its

associ-atedoverhead as well as consuminga portion of the

limited M(l) capacity in the process. In thepast

such indirect schemes were necessary since all

inter-connections were radial from themain memoryor

pro-cessorto thestoragedevices. Thus, alltransfers

hadtoberoutedthrough the central element.

Further-more, duetothedifferences intransfer rate between

devices, especiallyelectro-mechanicaldevices that

must operate at a fixedtransfer rate, direct transfer

was not possible.

Based upon current technology, these problems can

besolved. Many of the newer storage devices are

non-electromechanical (i.e., strictlyelectrical). In

these cases thetransfer rates can be synchronized to

allow direct transfer. Forelectromechanicaldevices.

directtransfer ispossible iftransferrates are

similar ora small scratchpadbuffer, such as low-cost

semiconductorshift registers, are usedtoenable

synchronization between the devices. Direct transfer

isprovided between the level 6 and level 5storage

devices in the IBM 3850 Mass Storage System14

.

Read Through

Apossible implementation of the page splitting

strategy could be accomplished by transferring

infor-mation from leveli to theprocessor (level0) ina

series of sequential steps: (1) transfer page of size

N(i-l,i) fromlevelito leveli-1, andthen (2)

extract theappropriate page subset of size N(i-2, i-1)

and transfer it from level l-l to level i-2, etc.

Undersuchascheme, atransferfromlevel itothe

processor would consist ofa series ofisteps. This

would result inaconsiderable delay, especially for

certain devices that require a large delay for

reading-recently written information (e.g., rotationaldelay

toreposition electromechanical devices).

The inefficiencies of the sequential transfers can be avoided by having the Information read through

toall the upper levels simultaneously. For example,

assume the processor references logical address a which

iscurrently stored at level i (andall lowerlevels,

ofcourse). The controller for M(i) outputs the

N(i-l,i) bytesthat contain a onto the data buses

along with their corresponding addresses. The

con-troller for M(i-l) willaccept all of these N(i-l,i)

bytes and store them inM(i-l). At thesame time,

thecontroller for M(i-2) willextract the particular

N(i-2, i-1) bytes thatitwants and storethemin

M(i-2). Inalikemanner, all ofthe controllers can

simultaneously extract the portion that is needed.

Using thismechanism, theprocessor (level0)extracts

theN(0,1) bytesthat itneeds without any further

delays

—

thusthe information from level iisread

through to the processor without any delays. This

processis illustrated inFigure 5.

The read through mechanism also offers some

important reliability, availability, andserviceability

advantages. Since allstorage levels communicate

anonymously over thedata buses, if astorage level

must be removed from thesystem, there are nochanges

needed. In thiscase, the information is "read

through"thislevel as if itdidn't exist. Nochanges

are needed toany of theother storage levels or the

storagemanagement algorithms although we would expect

theperformance to decrease asa result of themissing

storagelevel. This reliabilitystrategyisemployed

inmostcurrent-day cache memorysystems^.

Store Behind

Under steady-state operation, we would expect all

the levels of the storage hierarchy to be full (with

theexception of the lowest level, L). Thus, whenever

a pageis to bemoved intoa level, it isnecessary

toremove a current page. If thepage selectedfor

removalhas not been changed by means ofaprocessor

write, thenew page can beimmediately stored into

the levelsincea copy of the page to be removed

already exists inthe nextlower level ofthe

hier-archy. But, if theprocessorperforms a write

opera-tion,all levelsthat containa copy of the

informa-tion beingmodified must be updated. This can be

accomplished ineither of at least three basic ways:

(1) storethrough, (2) store replacement, or (3)

(20)

(21)

rr€

processor

Figure 5

Read-Through Example

(Datatransferred from

M(3) to M(2), M(l),

andprocessor

simultaneously)

StoreThrough. Undera store through strategy,

alllevels are updated simultaneously. This is the

logical inverse of theread through strategy but it

hasacrucial distinction. The store through is

limited by the SDeed of the lowest level, L, ofthe

hierarchy, whereas the readthrough islimited by the

speed ofthe highestlevelcontainingthedesired

information. Storethrough isefficientonlyif the

accesstime oflevel Lis comparableto theaccess

time of level1, such as ina two-level cache system

(e.g., itisused inthe IBMSystem/370Models 158

and 168).

.torereplacement

StoreReplacement. Under

strategy, the processor only updates M(l)andmarks

thepage as "changed." If suchachanged page is

later selected for removalfrom M(l), it isthenmoved

tothenext lowerlevel, fl(2), immediately prior to

being replaced. Thisprocessoccurs at every level

and, eventually, level L willbeupdated but only

after thepage hasbeen selectedfor removal from all

the higherlevels. Duetotheextradelays caused by

updating changed pages before replacement, the

tffective access tinefor reads is increased. Various

versions of store replacement are used inmost

two-levelpaging systems since itofferssubstantially

Betterperformance than store through for slow storage

levices (e.g., disks and drums).

StoreBehind. In bothstrategies above, the

storagesystem was required toperform the update

operation at some specifictime, either atthe timeof

write or replacement. Oncethe information has been

storedinto M(l), theprocessor doesn'treally care

how or when the other copies are updated. Store

behindtakes advantage of this degree of freedom.

The maximum transfer capability betweenlevelsis

rarelymaintained since,at anyinstantof time,a

storage levelmay not have anyoutstanding requests

for service oritmay be waiting for proper

position-ingtoservicea pending request. During these

"idle-periods,data can betransferreddown to the next

lower levelofthe storage hierarchywithoutadding

anysignificant delays to the read or storeoperations.

Since these"idle"periods areusually quitefrequent

under normal operation, therecan beacontinual flow

ofchangedinformationdown through the storage

hier-archy.

Avariation of this store behind strategy i3 used

in the MulticsPageMultilevel system12whereby a .

certainnumber of pages at each level are kept

avail-ableforimmediate replacement. Ifthe number of

replaceable pages dropsbelow the threshold, changed"

pages are selected to beupdated tothenextlower

level. Theactualwriting of these pagestothe lower

levelisscheduled totakeplace when thereisno

real requestto bedone.

The store behindstrategy can also be used to

provide high reliability inthestorage system.

Ordinarily, achanged pageis not purgedfroma level

untilthenext lower level acknowledges that the

corresponding page has been updated. Wecanextend

thisapproach to require two levels ofacknowledgement.

Forexample, achanged page isnotremoved from M(l)

until thecorresponding pages inboth M(2) and M(3)

have been updated. Inthis way, there will be at

leasttwocopies of each changed piece of information

at levels M(i) and M(i*l) inthehierarchy. Ifany

level malfunctions or must be serviced, itcan be

removedfrom the hierarchy without causing any

infor-mation to be lost. There are twoexceptions to this

process, levels M(l) andM(L). Tocompletely

safe-guard thereliability of the system, it may be

necessary to store duplicate copies ofinformation at

theselevels. Figure6illustrates this process.

DistributedControl

There isa substantialamount of work required

tomanage thestorage hierarchy. Itisdesirable to

remove asmuch as possible of the storage management

from the concern of theprocessor. In the

hier-archicalstorage system described inthis paper, all

ofthemanagement algorithms can be operated based

uponinformationthatis local toa level or, at most,

inconjunction with information from neighboring

levels. Thus, it ispossibleto distributecontrol

ofthe hierarchyinto the levels, thisalso

facili-tatesparallel and asychronous operation in the

hierarchy (e.g., the storebehind algorithm).

Theindependent control facilitiesforeach

level can be accomplished by extending the

function-ality ofconventional device controllers. Most

current-day sophisticated device controllers are

actually microprogrammed processors capable of

per-forming the storage management function. TheIBM

(22)

(23)

(24)

(25)

REFERENCES

1. Ahearn, G. R., Y.Dishon, and R.N. Snively,

"Design Innovations of the IBM 3830 and 2835

StorageControl Units,"IBM Journal of Research

andDevelopment,Vol. 16,No. 1, pp. 11-18,

January, 1972.

2. Anacker, W. andC. P. Wong, "Performance

Evaluation ofComputer Systems with Memory

Hier-archies," IEEE-TC, Vol. EC-16, No. 6, pp. 765-773,

December1967.

3. Arora,S. R.andA. Gallo, "OptimalSizing,

Loading andRe-loading in aMulti-level Memory

Hierarchy System," SJCC,Vol. 38, pp. 337-344,

1971.

4. Bell, GordenC. andDavid Casasent,

"Implementa-tion of aBufferMemory inMinicomputers,"

ComputerDesign, pp. 83-89, November 1971.

5. Benaoussan,A., C. T. Clingen,andR. C.Daley,

"The MulticsVirtual Memory," Second ACM Symposium

onOperating Systems Principles, Princeton

University, pp. 30-42, October 1969.

6. Brawn,B. S.,and F.G. Gustavson, "Program

Behavior ina Paging Environment," FJCC,Vol. 33,

Pt. 11,pp. 1019-1032, 1968.

7. Chu,W. W., andH.Opderbeck, "Performance of

Replacement Algorithms withDifferentPage Sizes,"

Computer,Vol. 7, No. 11, pp. 14-21,November 1974.

17. Madnick, S. E.,andJ. J.Donovan,Operating

Systems,McGraw-Hill, New York,1974.

18. Mattson, R., et al., "Evaluation Techniques for

Storage Hierarchies," IBMSystems Journal, Vol.9

No. 2, pp. 78-117,1970.

19. Meade,R. M., "OnMemory System Design," FJCC,

Vol.37, pp. 33-44, 1970.

20. Ohnigian, Suran, "RandomFile Processingina

Storage Hierarchy,"IEEEINTERCON, 1975.

21. Ramamoorthy,C. V. andK. M.Chandy,

"Optimiza-tion ofMemory Hierarchies inMultiprogrammed

Systems," JACM,Vol.17, No. 3, pp. 426-445,

July 1970.

22. Spirn,J. R.,andP. J.Denning, "Experiments

with Program Locality," FJCC, Vol.41, Pt. 1,

pp. 611-621. 1972.

23. Strecker, William, "CacheMemoriesfor

Mini-computer Systems," IEEEINTERCON, 1975.

24. Wilkes,M. v., "Slave MemoriesandDynamic

StorageAllocation." IEEE-TC,Vol. EC-14,

No. 2,pp. 270-271,April 1965.

25. Withlngton,F. G., "Beyond 1984i ATechnology

Forecast," Datamation, Vol.21, No. 1,pp. 54-73,

January 1975.

8. Coffman, E.B., Jr., andL.C. Varian, "Further

Experimental Data on the Behavior of Programs in

•PagingEnvironment," CACM, Vol. 11, No. 7,

pp. 471-474, July1968.

9. Considine, J. P., andA. H.Weis. "Establishment

and Maintenance ofa StorageHierarchyfor an

On-line Data Ba3e under TSS/360," FJCC, Vol. 35,

pp. 433-440, 1969.

10. Conti, C.J., "Concepts forBuffer Storage,"

IEEEComputerGroupNews, pp. 6-13,March1969.

11. Denning, P. J., "VirtualMemory,"ACMComputing

Surveys,Vol. 2,No. 3, pp. 153-190, September

1970.

12. Greenberg, Bernard S.,andStevenH. Webber,

"MulticsMultilevel Paging Hierarchy," IEEE

INTCRCON, 1975.

13. -Hatfield, D.J., "Experiments on Page Size,

Program AccessPatterns, and VirtualMemory

Performance," IBM Journal of Research and

Development,Vol. 16, No. 1, pp. 58-66,January

1972.

Johnson, Clay,

IEEEINTERCON,

'IBM 3850

—

MassStorage System,

L975.

Madnick, S. E., "StorageHierarchySystems,

MITProject MAC Report TR-105,Cambridge,

Massachusetts, 1973.

16. Madnick, S. E., "INFOPLEX

—

Hierarchical

Decom-position ofaLargeInformationManagement System

(26)

(27)

(28)

(29)

(30)